TpuTopology & TpuCoreLocation
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, ELF x86-64 DYN, not stripped; IDA-recovered C names and demangled C++ symbols quoted verbatim). .text VMA equals file offset. Other versions will differ.
Abstract
TpuTopology_* and TpuCoreLocation_* are the C-ABI accessor roster that exposes the physical torus geometry of a TPU pod to the open-source StreamExecutor backend. They are the thinnest layer in the topology stack: each is a one- or two-line extern "C" free function that reads a single field of a tpu::TpuTopology (a.k.a. TpuTopologyExternal) or a tpu::TpuCoreLocation object — chip-bounds extents, host/chip counts, per-core-type device counts, a chip-coordinate triple, a core index, a global id. Recovered from learning/45eac/tfrc/executor/stream_executor/tpu_util_c_api.cc references in the binary, they are the C face of the geometry that the on-device runtime discovers and the AOT compiler consumes.
These accessors sit beneath two richer topology surfaces this wiki documents elsewhere, and the contrast is the point. The PJRT TopologyDescription extension (type 16) is the modern, framework-facing topology object: a 272-byte struct of 31 function pointers (ChipBounds, CoreCountPerChip, ProcessBounds, slice configs) that wraps an xla::PjRtTopologyDescription and bounces through its vtable. The TpuTopology_* roster here is the legacy SE C-ABI underneath: no args-struct size negotiation, no PJRT_Error return, no vtable bounce — just return *(uint32*)(topology + offset). The PJRT extension and the SE roster expose overlapping geometry (both have a notion of chip bounds and cores-per-chip), but they are independent C surfaces reaching the same silicon facts by different paths. This page owns the SE roster and the geometry-field map; the PJRT page owns the extension.
The roster is reached through the ExecutorApiFn() function-pointer table exactly like every other SE C-ABI cluster (the accessor pattern is documented on the shim overview). The host-side C++ wrapper tensorflow::tpu::TpuTopologyExternal::LogicalDevicesPerHost (0x20818F80) is literally ExecutorApiFn()[+632](handle, core_type) — it loads the table, indexes the slot, and calls the plugin-side TpuTopology_LogicalDevicesPerHost (0xEABBFC0) which performs the field read. A reimplementer reproduces both halves: the wrapper that indexes a slot, and the C-ABI impl that reads an offset.
For reimplementation, the contract is:
- The roster: 17 TpuTopology_* and 4 TpuCoreLocation_* extern "C" functions, their addresses, and the C++ accessor or struct field each backs.
- The geometry-field map: which byte offset of tpu::TpuTopology / tpu::TpuCoreLocation each accessor reads (ChipBounds_X→+88, HostCount→+108, Index→+52, …), and the per-core-type strided array at +124/+128.
- The core-type remapping quirk: LogicalDevicesPerHost/PerChip fold the TpuCoreType enum through if t != 1: t = 2*(t==2) before dispatch; Version maps the internal codename ordinal {0,1,2,3}→{1,2,3,4} and returns 0 for anything else.
- The two output conventions: scalar return (return *(uint32*)(...)) versus the coordinate-triple out-param form (ChipCoordinates/HostCoordinates write three _DWORD* out-params).
| Property | Value |
|---|---|
| Roster prefixes | TpuTopology_* (17), TpuCoreLocation_* (4) |
| C-ABI block | TpuTopology_* @ 0xEABBFC0–0xEABC2A0; TpuCoreLocation_* @ 0xEABC2C0–0xEABC340 |
| Availability extras | TpuTopology_AvailableCoreCount/CoresPerChip/MaybeAvailableSparseCoresPerLogicalDevice @ 0xF6A1CA0–0xF6A1EA0 |
| Source file (recovered) | …/stream_executor/tpu_util_c_api.cc (from .rodata log strings) |
| Reached via | ExecutorApiFn() slots (e.g. LogicalDevicesPerHost = ExecutorApiFn()[+632]) |
| Backing C++ object | tpu::TpuTopology (== tensorflow::tpu::TpuTopologyExternal) / tpu::TpuCoreLocation |
| Output forms | scalar uint32 return; coordinate-triple _DWORD* out-params |
| Evidence grade | Reimplementation-grade / byte-confirmed against IDA decompile |
Scope: the *ApiFn() accessor pattern, the opaque-handle convention, and the roster map are on the shim overview (link, not re-derived). The TpuPlatform_* / TpuNodeContext_* lifecycle that constructs the topology is on TpuPlatform & TpuNodeContext. The modern PJRT topology object is on TopologyDescription Extension. On-device discovery of the torus shape is on ICI Topology Discovery. The full tpu::TpuTopology struct layout (the Target+0x3b8 per-codename geometry that these accessors index) is owned by the silicon pages (SparseCore Architecture) and is summarised here only as far as the offsets these accessors touch.
1. The Accessor Pattern, Specialised
Purpose
Every function on this page is the plugin-side half of one SE shim call. The host-side half is a TpuTopologyExternal::<Method> C++ wrapper that loads the ExecutorApiFn() table and calls a fixed slot. The two halves are linked at init (the bootstrap fills the slot with the address of the matching TpuTopology_* impl), and only the TfTpu_ExecutorApiFn* pointer crosses the ABI seam. The topology object itself never crosses as a C++ type — the host holds it as an opaque void* and passes it back on every call.
Entry Point
The worked path for LogicalDevicesPerHost, confirmed in the decompile:

    xla / SE TPU backend
     └─ tensorflow::tpu::TpuTopologyExternal::LogicalDevicesPerHost(core_type)   0x20818F80  (host-side C++ wrapper)
         └─ tbl = stream_executor::tpu::ExecutorApiFn()                          0x20819360  (singleton fn-ptr struct)
         └─ tbl[+632]( *(void**)this, core_type )                                            (call through the slot)
                 │  (slot populated at init to point at:)
                 ▼
    TpuTopology_LogicalDevicesPerHost                                            0xEABBFC0   (plugin-side C-ABI impl)
     └─ tpu::TpuTopology::LogicalDevicesPerHost(topo, core_type')                0x20AD3920  (after enum remap)

NOTE: the wrapper dereferences the handle once (*(void**)this) before passing it. The TpuTopologyExternal host object holds the real tpu::TpuTopology* at its +0, and that inner pointer is the a1 every TpuTopology_* impl receives. The accessors index offsets relative to that inner object, not the host wrapper.
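The wrapper-to-slot path can be modelled in a few lines. The following is a Python sketch of the dispatch shape, not code recovered from the binary: the table is modelled as a dict keyed by byte offset, and `FakeTopology`, `HostTopologyWrapper`, and the `devices_per_host` field are hypothetical stand-ins. Only the +632 slot offset and the single handle dereference come from the decompile; the enum remap the real impl performs is left out here to keep the dispatch visible.

```python
# Model of the two halves of one SE shim call. Names marked "hypothetical"
# exist only for illustration; the +632 slot offset is the one quoted above.

LOGICAL_DEVICES_PER_HOST_SLOT = 632  # byte offset into the ExecutorApiFn() table


class FakeTopology:
    """Hypothetical stand-in for the inner tpu::TpuTopology object."""

    def __init__(self, devices_per_host):
        # devices_per_host[core_type] -> count; this layout is an assumption.
        self.devices_per_host = devices_per_host


def tpu_topology_logical_devices_per_host(topo, core_type):
    """Plugin-side half: receives the inner object, never the host wrapper."""
    return topo.devices_per_host[core_type]


# The bootstrap fills each slot with the address of the matching impl.
EXECUTOR_API_FN = {
    LOGICAL_DEVICES_PER_HOST_SLOT: tpu_topology_logical_devices_per_host,
}


class HostTopologyWrapper:
    """Hypothetical model of the host-side TpuTopologyExternal wrapper."""

    def __init__(self, inner):
        self.inner = inner  # the real topology pointer, held at +0

    def logical_devices_per_host(self, core_type):
        slot = EXECUTOR_API_FN[LOGICAL_DEVICES_PER_HOST_SLOT]
        # Dereference the handle once and pass the inner object, not self.
        return slot(self.inner, core_type)


wrapper = HostTopologyWrapper(FakeTopology([8, 4, 2]))
```

Calling `wrapper.logical_devices_per_host(1)` goes through the slot and reads from the inner object, which is the property a reimplementation has to preserve: the impl never sees the wrapper itself.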
Two output conventions
The roster splits cleanly into two call shapes, and a reimplementer must match each exactly:

    // (A) scalar return: the majority. Reads one uint32 field and returns it.
    // e.g. TpuTopology_HostCount @ 0xEABC000
    uint32 TpuTopology_HostCount(void* topo):
        return *(uint32*)(topo + 108)

    // (B) coordinate-triple out-param: TpuCoreLocation_ChipCoordinates / _HostCoordinates.
    // Three int32* out-params plus a redundant scalar return: _HostCoordinates returns
    // the x read (loc+8), _ChipCoordinates returns the last-written triple element (z).
    // e.g. TpuCoreLocation_HostCoordinates @ 0xEABC300
    int32 TpuCoreLocation_HostCoordinates(void* loc, int32* x, int32* y, int32* z):
        *x = *(uint32*)(loc + 8); *y = *(uint32*)(loc + 12); *z = *(uint32*)(loc + 16)
        return *(uint32*)(loc + 8)

There is no PJRT_Error/absl::Status plumbing on this surface and no args-struct size check — those belong to the PJRT layer above. The SE C-ABI trusts its caller and reads the field unconditionally; the only validity gating is the BUG()/return 0 guards inside the availability and version accessors (§4).
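The two call shapes can be exercised against fake memory. The sketch below is a Python model under stated assumptions: objects are plain byte buffers, fields are little-endian uint32 values (the binary is x86-64), out-params are modelled with `ctypes.c_int32` objects, and the 128-byte topology buffer size is arbitrary. The offsets (+108, +8/+12/+16) and the 56-byte core-location size are the ones given on this page.

```python
import ctypes
import struct

HOST_COUNT_OFFSET = 108            # TpuTopology_HostCount reads topo + 108
HOST_COORD_OFFSETS = (8, 12, 16)   # TpuCoreLocation_HostCoordinates reads loc + 8/12/16


def read_u32(buf, offset):
    """Read one little-endian uint32 at a byte offset."""
    return struct.unpack_from("<I", buf, offset)[0]


def topology_host_count(topo):
    """Shape (A): a single field read returned as a scalar."""
    return read_u32(topo, HOST_COUNT_OFFSET)


def core_location_host_coordinates(loc, x, y, z):
    """Shape (B): fill three out-params, then return the x read again."""
    x.value = read_u32(loc, HOST_COORD_OFFSETS[0])
    y.value = read_u32(loc, HOST_COORD_OFFSETS[1])
    z.value = read_u32(loc, HOST_COORD_OFFSETS[2])
    return read_u32(loc, HOST_COORD_OFFSETS[0])


# Fake objects: a topology buffer with host_count = 16 (buffer size is an
# assumption) and a 56-byte core-location with host coordinates (1, 2, 3).
topo = bytearray(128)
struct.pack_into("<I", topo, HOST_COUNT_OFFSET, 16)
loc = bytearray(56)
struct.pack_into("<3I", loc, HOST_COORD_OFFSETS[0], 1, 2, 3)

x, y, z = ctypes.c_int32(), ctypes.c_int32(), ctypes.c_int32()
ret = core_location_host_coordinates(loc, x, y, z)
```

After the call, `x`, `y`, and `z` hold the triple and `ret` repeats the x value, which matches the redundant scalar return described above.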
2. The TpuTopology_ Roster
Function Map
The 14 canonical accessors form a contiguous block 0xEABBFC0–0xEABC2A0; the three "available" accessors live in a separate block at 0xF6A1CA0–0xF6A1EA0 because they first fetch the topology from the mesh/ops-util singleton rather than receiving it as a1. "Reads" names the field offset on tpu::TpuTopology (the inner object) or the C++ member it thunks to.
| Function | Addr | Reads / Backs | Output |
|---|---|---|---|
| TpuTopology_LogicalDevicesPerHost | 0xEABBFC0 | tpu::TpuTopology::LogicalDevicesPerHost (0x20AD3920) after enum remap | scalar |
| TpuTopology_LogicalDevicesPerChip | 0xEABBFE0 | tpu::TpuTopology::LogicalDevicesPerChip after enum remap | scalar |
| TpuTopology_HostCount | 0xEABC000 | *(uint32*)(topo + 108) | scalar |
| TpuTopology_ChipsPerHost | 0xEABC020 | *(uint32*)(topo + 116) | scalar |
| TpuTopology_ChipBounds_X | 0xEABC040 | *(uint32*)(topo + 88) | scalar |
| TpuTopology_ChipBounds_Y | 0xEABC060 | *(uint32*)(topo + 92) | scalar |
| TpuTopology_ChipBounds_Z | 0xEABC080 | *(uint32*)(topo + 96) | scalar |
| TpuTopology_HasChip | 0xEABC0A0 | tpu::TpuTopology::HasChip(topo, x, y, z) | bool |
| TpuTopology_CoreForId | 0xEABC0E0 | thunk → tpu::TpuTopology::LogicalDeviceForId | TpuCoreLocation |
| TpuTopology_Core | 0xEABC100 | tpu::TpuTopology::Core(topo, type, x, y, z, idx) | TpuCoreLocation |
| TpuTopology_NumCores | 0xEABC140 | tpu::TpuTopology::logical_devices() (count) | scalar |
| TpuTopology_Cores | 0xEABC160 | tpu::TpuTopology::logical_devices() → fills TpuCoreLocation*[] | array fill |
| TpuTopology_IdForHost | 0xEABC260 | tpu::TpuTopology::IdForHost(topo, x, y, z) | scalar |
| TpuTopology_Version | 0xEABC2A0 | **(uint32**)(topo + 8) → codename ordinal, remapped | scalar |
| TpuTopology_AvailableCoreCount | 0xF6A1CA0 | *(uint32*)(GetTpuTopology() + 12*type + 128) | scalar |
| TpuTopology_AvailableCoresPerChip | 0xF6A1DE0 | *(uint32*)(GetTpuTopology() + 12*type + 124) | scalar |
| TpuTopology_MaybeAvailableSparseCoresPerLogicalDevice | 0xF6A1EA0 | xla::jellyfish::NumEmbeddingDevices(...) (StatusOr) | StatusOr |
QUIRK: the roster mixes pure field reads with member thunks, and the distinction is critical for a reimplementer. ChipBounds_X/Y/Z, HostCount, ChipsPerHost are flat *(uint32*)(topo+off) reads: the bounds and counts are pre-materialised scalars on the topology object. But HasChip, Core, IdForHost, CoreForId call into tpu::TpuTopology member functions that walk the chip/core layout. Copying only the field offsets reproduces the cheap accessors but not the lookups; copying only the thunks misses that the common geometry is a flat read with no computation.
The geometry-field map
The offsets these accessors touch are a window into the tpu::TpuTopology layout (full struct on the silicon pages). What is directly confirmed by the C-ABI bodies:
| Offset | Field | Read by | Notes |
|---|---|---|---|
| +8 | Target* (embedded) | Version (double-deref **) | first uint32 of Target is the codename ordinal |
| +52 | core-on-chip index | TpuCoreLocation_Index (on the core-location, see §3) | n/a |
| +88 | chip_bounds.x | ChipBounds_X | torus extent, X axis |
| +92 | chip_bounds.y | ChipBounds_Y | torus extent, Y axis |
| +96 | chip_bounds.z | ChipBounds_Z | torus extent, Z axis |
| +108 | host_count | HostCount | total hosts in the pod |
| +116 | chips_per_host | ChipsPerHost | chips attached to one host |
| +124 + 12·type | available cores-per-chip[type] | AvailableCoresPerChip | per-core-type strided entry |
| +128 + 12·type | available core-count[type] | AvailableCoreCount | per-core-type strided entry |
NOTE: the +124/+128 reads with a 12·type stride are the same per-core-type geometry array the silicon pages describe at the Target+0x3b8 offset within the larger structure: each TpuCoreType (TensorCore / SparseCore / embedding) gets a 12-byte record holding its cores-per-chip and core-count. The C-ABI exposes two of the three record fields. The full record and the codename→geometry table are owned by SparseCore Architecture; this page confirms only that AvailableCoresPerChip and AvailableCoreCount index it by type with a 12-byte stride and a type < 3 bound (BUG() past it).
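The strided addressing works out to a small piece of arithmetic. The sketch below is a Python model, not recovered code: `record_offsets` and `read_record` are hypothetical helper names, fields are assumed to be little-endian uint32 values, and the out-of-range case raises `ValueError` where the binary hits `BUG()`. Rejecting negative values is also an assumption of the model.

```python
import struct

STRIDE = 12                # bytes per core-type record
CORES_PER_CHIP_BASE = 124  # read by AvailableCoresPerChip
CORE_COUNT_BASE = 128      # read by AvailableCoreCount
NUM_CORE_TYPES = 3         # the accessors require type < 3


def record_offsets(core_type):
    """Return (cores_per_chip_offset, core_count_offset) for one core type."""
    if not 0 <= core_type < NUM_CORE_TYPES:
        # The binary calls BUG() here; the model raises instead.
        raise ValueError("core_type out of range")
    return (CORES_PER_CHIP_BASE + STRIDE * core_type,
            CORE_COUNT_BASE + STRIDE * core_type)


def read_record(topo, core_type):
    """Read both exposed fields of one record from a topology byte buffer."""
    per_chip_off, count_off = record_offsets(core_type)
    per_chip = struct.unpack_from("<I", topo, per_chip_off)[0]
    count = struct.unpack_from("<I", topo, count_off)[0]
    return per_chip, count
```

For the three valid types the offset pairs are (124, 128), (136, 140), and (148, 152), so the two exposed fields of each record always sit four bytes apart.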
Algorithm — the two non-trivial bodies

    function TpuTopology_Version(topo):                               // 0xEABC2A0
        ordinal = **(uint32**)(topo + 8)                              // first u32 of the embedded Target
        if ordinal < 4:
            return ordinal + 1                                        // internal {0,1,2,3} -> public {1,2,3,4}
        return 0                                                      // unknown codename -> 0

    function TpuTopology_AvailableCoreCount(mesh_state, core_type):   // 0xF6A1CA0
        if mesh_state != NULL:
            topo = tensorflow::TpuMeshCommonState::tpu_topology(*mesh_state)   // VLOG "from mesh_state"
        else:
            topo = tensorflow::tpu_ops_util::GetTpuTopology()                  // VLOG "from tpu_ops_util"
        if core_type >= 3: BUG()                                      // only 3 core types
        return *(uint32*)(topo + 12*core_type + 128)                  // strided per-type read

GOTCHA: AvailableCoreCount takes the mesh state, not a topology handle, and resolves the topology two different ways depending on whether mesh_state is null (the else branch falls back to the process-global tpu_ops_util::GetTpuTopology()). The sibling AvailableCoresPerChip / MaybeAvailableSparseCoresPerLogicalDevice only take the ops-util path and return a default (4) or an absl::Status error ("TPU system is not available") when no topology is registered. A reimplementer who treats all three "available" accessors as pure field reads off a passed-in topology will mis-handle the no-device case these functions are specifically written to survive.
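The version remap and the two-way topology resolution can be sketched separately from the field read. This is a Python model under assumptions: `public_version` and `resolve_topology` are hypothetical names, the mesh state is modelled as any object with a `tpu_topology` attribute, the process-global getter is passed in as a callable, and the ordinal comparison is treated as unsigned.

```python
from types import SimpleNamespace


def public_version(ordinal):
    """Map the internal codename ordinal {0,1,2,3} to {1,2,3,4}, else 0."""
    if 0 <= ordinal < 4:
        return ordinal + 1
    return 0


def resolve_topology(mesh_state, global_getter):
    """Pick the topology source the way AvailableCoreCount is described to.

    Returns (topology, source_label). The labels mirror the two VLOG
    messages; the return shape itself is an illustration choice.
    """
    if mesh_state is not None:
        return mesh_state.tpu_topology, "mesh_state"
    return global_getter(), "tpu_ops_util"


# Hypothetical fixtures: strings stand in for topology objects.
mesh = SimpleNamespace(tpu_topology="topology-from-mesh")
from_mesh = resolve_topology(mesh, lambda: "topology-from-global")
from_global = resolve_topology(None, lambda: "topology-from-global")
```

Keeping the resolution step apart from the read makes the no-device case explicit: whatever `global_getter` returns when nothing is registered is what the sibling accessors have to handle.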
The core-type remap
LogicalDevicesPerHost and LogicalDevicesPerChip fold the incoming TpuCoreType enum before dispatch:

    function TpuTopology_LogicalDevicesPerHost(topo, core_type):         // 0xEABBFC0
        if core_type != 1:
            core_type = 2 * (core_type == 2)                             // {0->0, 1->1, 2->2, other->0}
        return tpu::TpuTopology::LogicalDevicesPerHost(topo, core_type)  // 0x20AD3920

The remap collapses every enum value other than 1 and 2 to 0, normalising an out-of-range or "default" core type to the first slot. This is a defensive clamp at the C-ABI boundary, where the host may pass a TpuCoreType from a newer/older enum the plugin does not recognise; the inner member function then indexes a fixed three-entry table without an out-of-bounds read.
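The remap is small enough to tabulate. The sketch below is a Python transcription of the expression quoted above; `remap_core_type` and `REMAP_TABLE` are hypothetical names, and the sample range of inputs is arbitrary.

```python
def remap_core_type(core_type):
    """Fold a TpuCoreType value into {0, 1, 2} as the C-ABI entry does."""
    if core_type != 1:
        # (core_type == 2) is 1 for the value 2 and 0 for everything else.
        core_type = 2 * int(core_type == 2)
    return core_type


# Tabulate the result for a few in-range and out-of-range inputs.
REMAP_TABLE = {t: remap_core_type(t) for t in range(-1, 5)}
```

Every input lands on a valid index of a three-entry table, which is the property the clamp exists to guarantee.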
Considerations
TpuTopology_Cores (0xEABC160) has the most involved body among the canonical accessors: it fetches the logical_devices() span and emits a TpuCoreLocation* for each element, materialising result + 56·i pointers (the TpuCoreLocation stride is 56 bytes) into a caller buffer. The decompile vectorises the pointer-stride write with AVX (vpmuludq/vpaddq), but the semantics are a simple for i in 0..n: out[i] = base + 56*i. The 56-byte stride is the confirmed sizeof(tpu::TpuCoreLocation) and is consistent with the field offsets §3 reads (+52 is the last 4-byte field). TpuTopology_NumCores returns that same span's element count and is the size a caller passes to _Cores, which makes the two a classic count-then-fill C-ABI pair.
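The count-then-fill pair reduces to the scalar loop below. This is a Python model with hypothetical names (`topology_num_cores`, `topology_cores`): addresses are plain integers, the span is any sequence, and the caller-allocated buffer is a preallocated list. Only the 56-byte stride comes from the decompile.

```python
CORE_LOCATION_STRIDE = 56  # sizeof(tpu::TpuCoreLocation) per the stride above


def topology_num_cores(devices):
    """Count half: the number of elements in the logical-devices span."""
    return len(devices)


def topology_cores(base_addr, n, out):
    """Fill half: write one pointer per element into the caller's buffer."""
    for i in range(n):
        out[i] = base_addr + CORE_LOCATION_STRIDE * i


# Caller side: ask for the count, allocate, then fill.
devices = ["core0", "core1", "core2"]   # stand-ins for 56-byte records
n = topology_num_cores(devices)
out = [0] * n                           # caller-owned buffer
topology_cores(0x1000, n, out)
```

With a base of 0x1000 and three elements the buffer ends up holding 0x1000, 0x1038, and 0x1070, each 56 bytes past the previous one.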
3. The TpuCoreLocation_ Roster
Purpose
A tpu::TpuCoreLocation is the per-core coordinate record minted by TpuTopology_Core / _CoreForId / _Cores: where one logical device sits in the torus. The four TpuCoreLocation_* accessors expose its fields to the host — two coordinate triples (chip and host), a core-on-chip index, and a flattened global id. The object is 56 bytes (confirmed by the _Cores stride); these accessors read its leading and trailing scalar fields.
Function Map
| Function | Addr | Reads / Backs | Output |
|---|---|---|---|
| TpuCoreLocation_ChipCoordinates | 0xEABC2C0 | tpu::TpuCoreLocation::chip_coordinates() → (x,y,z) | 3× int32* out + scalar |
| TpuCoreLocation_HostCoordinates | 0xEABC300 | *(uint32*)(loc + 8/12/16) → (x,y,z) | 3× int32* out + scalar |
| TpuCoreLocation_Index | 0xEABC320 | *(uint32*)(loc + 52) | scalar |
| TpuCoreLocation_Id | 0xEABC340 | thunk → tpu::TpuCoreLocation::LogicalDeviceId | scalar |
Field map
| Offset | Field | Read by |
|---|---|---|
| +0 | chip coords / id base (via member fn) | ChipCoordinates, Id |
| +8 | host_coord.x | HostCoordinates (loc[2]) |
| +12 | host_coord.y | HostCoordinates (loc[3]) |
| +16 | host_coord.z | HostCoordinates (loc[4]) |
| +52 | core-on-chip index | Index |
Algorithm

    function TpuCoreLocation_ChipCoordinates(loc, x_out, y_out, z_out):   // 0xEABC2C0
        int32 tmp[3]
        tpu::TpuCoreLocation::chip_coordinates(&tmp)                      // member computes the chip-coord triple
        *x_out = tmp[0]; *y_out = tmp[1]; *z_out = tmp[2]
        return tmp[2]                                                     // (redundant) scalar return = z

    function TpuCoreLocation_HostCoordinates(loc, x_out, y_out, z_out):   // 0xEABC300
        *x_out = *(uint32*)(loc + 8)                                      // direct field reads, no member call
        *y_out = *(uint32*)(loc + 12)
        *z_out = *(uint32*)(loc + 16)
        return *(uint32*)(loc + 8)

    function TpuCoreLocation_Index(loc):                                  // 0xEABC320
        return *(uint32*)(loc + 52)

QUIRK: ChipCoordinates routes through the chip_coordinates() member (a computation, not a stored triple) while HostCoordinates is three flat field reads at +8/+12/+16. The asymmetry tells a reimplementer the storage decision: host coordinates are stored verbatim on the core-location record, but chip coordinates are derived (likely from the flattened id plus the chip bounds) and must be recomputed on each call. Both expose the same (int32*, int32*, int32*) out-param shape and a redundant scalar return, so the host signature is uniform even though the implementations differ.
NOTE: TpuCoreLocation_Id thunks to LogicalDeviceId, the logical-device id, not a raw physical core id, consistent with TpuTopology_CoreForId thunking to LogicalDeviceForId. The "id" on this C surface is the flattened logical-device index that XLA's device-assignment uses, the same id space the PJRT device descriptions and the LogiDeviceIdFromChipCoordAndIdx extension method operate in. A reimplementer must keep "id" meaning logical-device id throughout this roster.
4. Validity Gating
Unlike the PJRT extension (which negotiates args-struct sizes and returns PJRT_Error), the SE C-ABI accessors trust their inputs and gate only on a few hard invariants:
| Accessor(s) | Guard | Failure behaviour |
|---|---|---|
| LogicalDevicesPerHost / PerChip | enum remap if t != 1: t = 2·(t==2) | clamps unknown core types to 0 |
| Version | ordinal < 4 | returns 0 for unrecognised codename |
| AvailableCoreCount | core_type < 3 | BUG() (abort) past the bound |
| AvailableCoresPerChip | topology non-null; type < 3 | returns 4 if no topology; BUG() past bound |
| MaybeAvailableSparseCoresPerLogicalDevice | topology non-null; type == 2 | absl::Status error: "TPU system is not available" / "Invalid core type queried" |
| pure field readers (HostCount, ChipBounds_*, …) | none | unconditional field read |
GOTCHA: the pure field readers perform no null check on the topology pointer. They assume the host passed a live, fully-constructed tpu::TpuTopology (built by the TpuPlatform/TpuNodeContext bring-up). The only accessors that defend against a missing topology are the three "available" functions, which resolve it lazily from the mesh/ops-util singleton and so must cope with "no TPU attached." A reimplementer must construct the topology before any pure-reader call; there is no graceful path for topo == NULL on ChipBounds_X and friends, which would dereference NULL + 88.
Related Components
| Name | Relationship |
|---|---|
| tensorflow::tpu::TpuTopologyExternal::* | host-side C++ wrappers that index ExecutorApiFn() slots into these C-ABI impls |
| tpu::TpuTopology / tpu::TpuCoreLocation | the C++ objects whose fields the roster reads |
| tensorflow::TpuMeshCommonState::tpu_topology / tpu_ops_util::GetTpuTopology | the singletons the "available" accessors resolve the topology from |
| xla::jellyfish::NumEmbeddingDevices | backs MaybeAvailableSparseCoresPerLogicalDevice |
| PJRT type-16 topology extension | the modern parallel surface exposing overlapping geometry via a different C ABI |
Cross-References
- The TfTpu C-API Shim: the *ApiFn() accessor pattern, opaque-handle convention, and the roster map this page is one entry of
- TpuPlatform & TpuNodeContext: the TpuPlatform_* / TpuNodeContext_* lifecycle that constructs the tpu::TpuTopology these accessors read
- TpuExecutor Roster: the sibling TpuExecutor_* per-device runtime C-ABI cluster reached through the same ExecutorApiFn() table
- TopologyDescription Extension (type 16): the modern PJRT topology object; by contrast, this page is the C-accessor layer beneath it, with no args-size negotiation and no PJRT_Error
- ICI Topology Discovery: the on-device runtime discovery of the torus geometry these accessors statically expose
- SparseCore Architecture: Part-IV silicon geometry, covering the full tpu::TpuTopology struct and the per-core-type record at +124/+128 (Target+0x3b8) that AvailableCoreCount / AvailableCoresPerChip index