On-Pod Collectives — Section Map

Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00; .text VMA == file offset 0xe63c000). Status: Reimplementation-grade map · Evidence grade: Confirmed (byte-anchored), substrate split / end-to-end flow / op-family dispatch all cross-checked against the IDA decompile · Part XIII: On-Pod Collectives & Barriers / Collective algorithms

Abstract

This page is the map of the TPU collective stack as reconstructed from the (unstripped, full-symbol) libtpu.so. A collective in an XLA/HLO module — all-reduce, all-gather, reduce-scatter, all-to-all, collective-permute and their async/ragged variants — is lowered to ICI (inter-chip-interconnect) ring traffic over the physical torus. The compiler does this on two distinct execution substrates: the dense TensorCore path that drives ICI DMA directly, and the SparseCore-offload path that hands embedding-class collectives to the SparseCore as a separate operating point. This page documents (1) the substrate split and what gates SC-offload, (2) the end-to-end flow from HLO collective op to emitted ICI ring schedule, and (3) the collective op family with its per-kind cost/emitter dispatch. Each algorithm, the routing/twist/barrier subsystems, and the SC-offload config builder are sibling pages — this page links them; it does not duplicate their byte-level derivations.

Contract of the collective stack as observed in the binary:

  • Every collective is reduced over the physical torus (3 dimensions X/Y/Z), not over the logical replica list directly: the partitioner and the cost model both query topology (ReplicaGroupsOnNDPlane, EstimatePhysicalLinksUsed) to derive how many torus dimensions the collective's replica-groups span.
  • Strategy selection is flag-and-shape driven, not cost-compared. BaseStrategyND::SelectNDStrategy @0x137c78e0 picks the emitter family (sub-plane / ND-ring / N-Way / twisted-torus / strided) from TpuCompEnv flags + torus extents; the cost model (CostModel::GetCollectiveCycles @0x130abfc0) only produces the scheduler's per-resource cycle deposit. The one true cost-vs-cost comparison is the SPMD partitioner's GetCommunicationTimeInMilliSec, used to choose sharding, not the emitter algorithm.
  • ICI bandwidth is modeled per-direction (bidirectional ring): the shared term is eff_Bps = IciGigabytesPerSecond() · 0.5 · 1e9; there is no additive latency term in any collective branch — bundle collective cost is pure bandwidth, in TensorCore cycles.
  • SC-offload is gated by a Target capability bit + a platform-type bool, with a per-generation hardware basis (TpuVersion == 5); when the gate holds, the embedding collective is emitted as a CollectiveIciStrategyConfig proto of per-color UNIDIR rings rather than as HLO ReplicaGroup device lists.
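The bandwidth bullet above can be read as the following minimal sketch. Only the `eff_Bps` expression comes from this page; the function names are hypothetical, and converting seconds to TensorCore cycles by multiplying with a clock rate in MHz is an assumption about how the pieces combine, not a transcription of the decompile.

```python
def effective_ici_bytes_per_second(ici_gigabytes_per_second: float) -> float:
    """Per-direction ICI bandwidth: eff_Bps = IciGigabytesPerSecond() * 0.5 * 1e9."""
    return ici_gigabytes_per_second * 0.5 * 1e9


def bandwidth_only_cycles(num_bytes: float,
                          ici_gigabytes_per_second: float,
                          tensorcore_mhz: float) -> float:
    """Pure-bandwidth cost in TensorCore cycles; no additive latency term.

    The seconds -> cycles conversion (seconds * MHz * 1e6) is an assumption.
    """
    seconds = num_bytes / effective_ici_bytes_per_second(ici_gigabytes_per_second)
    return seconds * tensorcore_mhz * 1e6
```

The input values used to exercise this sketch are illustrative only; no real link rate or clock frequency is implied.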

At a glance

| Aspect | TensorCore (dense) substrate | SparseCore-offload substrate |
|---|---|---|
| Emitter selector | BaseStrategyND::SelectNDStrategy @0x137c78e0 | ConstructConfigForCollectiveUniDirNDGroups<*> @0x133c82c0 / 0x133c2dc0 / 0x133cd800 |
| Output of selection | heap StrategyND* family → HLO ReplicaGroup device lists | CollectiveIciStrategyConfig proto (per-color UNIDIR rings) |
| Cost estimator | CostModel::GetCollectiveCycles @0x130abfc0 (TC cycles) | SC ring cost via GetCollectiveOffloadConfig @0x133e1740 probe |
| Async tracker / scheduler resources | jellyfish TpuAsyncTracker {13..46} | plain SparseCoreAsyncTracker (base {0..12}) → resource-aware {13..17} |
| Gate | always (dense XLA path) | Megachip ∧ CoresPerChip(SC)>0 ∧ (Target[+0x628]&4 ∨ Target[+0x540]) ∧ ModuleContainsLEM… ∧ FLAGS_xla_sc_enable_latency_hiding_scheduler |
| Op-type key | HLO opcode + SparseCoreConfig.offload enum | SparseCoreConfig.offload (field 2, xla::jellyfish::Offload) |

1. The two execution substrates

The binary realizes on-pod collectives through two parallel lowerings that share the same physical torus but differ in who drives the ICI DMA and how the schedule is built.

1.1 TensorCore-driven ICI collectives (dense)

The dense path is the default. The XLA collective op survives into the jellyfish backend, where BaseStrategyND::SelectNDStrategy selects an emitter strategy and the cost model charges ICI cycles into a ResourceVector. The strategy object (a StrategyND subclass) produces the per-color ring decomposition — sequences of HLO ReplicaGroup device lists — that the collective emitters turn into ICI DMA descriptors. The TensorCore issues the ring transfers; the cost the scheduler sees is the per-torus-dimension ICI ring cost.

Confirmed in the decompile: SelectNDStrategy constructs StrategySubgroupND, StrategyND (the umbrella 1D/ND-ring class, also used for the N-Way and strided variants), and TwistedTorusND, gated by IsGroupNDPlane, UseSpecialStrategyNDNWay, UseStridedStrategyND, and a single-ND-plane test via ReplicaGroupsOnNDPlane(…, plane=2, …). The terminal classes are detailed in SelectNDStrategy.

1.2 SparseCore-offloaded collectives

Embedding-class collectives (the gradient all-reduce / all-gather / reduce-scatter that arise from sparse embedding lookups) can be offloaded to the SparseCore. Instead of emitting HLO ReplicaGroup lists, the SC path builds a CollectiveIciStrategyConfig proto — a per-color set of UNIDIR rings (ICI_RING_TYPE_UNIDIR_CW / _CCW) over the same X/Y/Z torus extents — embedded inside an AllGatherOffloadConfig / AllReduceOffloadConfig / ReduceScatterOffloadConfig backend-config message (sizeof 0x48, byte-identical layout). The cost model, when it sees an offloaded collective, probes GetCollectiveOffloadConfig @0x133e1740 and charges the SC ring operating point rather than the dense TC one.

This substrate runs its own latency-hiding scheduler with two trackers in sequence: the plain SparseCoreAsyncTracker (vtable @0x2190da10) — which has no target-defined resources and only throttles the base XLA collective resources {0..12}, classifying by opcode + the custom-call target name "AllToAllDynamic" — followed by the resource-aware SparseCoreResourceAwareAsyncTracker (vtable @0x2190e1b0) carrying the {13..17} = {SCS, SCT, ICI, LocalReduction, 2DAllToAll} resource caps {1, 20, 5, 1, 1}.

The SC config builder is the SparseCore analog of the TC StrategyND::BuildStrategy; it is fully documented in SC-Offload Config Builder, its phase-split flag in HierarchicalKind, and its core selection in SC Core-Selection (Offload).

1.3 Shared substrate: the physical-torus mesh decomposition

Both substrates reduce over the same physical torus and share the topology-derivation machinery — this is the glue that keeps the dense and offload cost/dimension models consistent. A collective's replica-groups are mapped onto the torus and reduced to a per-dimension mesh descriptor:

  • ReplicaGroupsOnNDPlane @0x1c890960 (memoized on an NDPlaneCacheKey → vector<MeshNDInfo>) decomposes the replica-groups onto the torus via TensorCoreLocationForLogicalDeviceId → TpuCoreLocation::Chip() (physical chip coordinates) and reports how many torus mesh dimensions the groups span. Both the SPMD partitioner (the link-count divisor, §3) and the dense picker (the single-ND-plane test, §4) call it with plane = 2.
  • EstimatePhysicalLinksUsed @0x1c8939c0 walks the same chip coordinates to count the physical ICI links a collective uses — the divisor for the all-to-all / ragged / cross-module-all-reduce cost branches.
  • The torus extents X/Y/Z are read at the same chip-config offsets ([chip_cfg+0x58] / +0x5c / +0x60) by the dense picker, the cost model, and the SC GetDimensionRings @0x133df520 — so the dense StrategyND ring dims and the SC IciStrategyRingDim ring dims index the identical hardware geometry.
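As a rough illustration of what "how many torus mesh dimensions the groups span" means, the sketch below counts the axes along which any replica group's chip coordinates differ. It is a simplified stand-in written for this page: the function name and the input shape are hypothetical, and the real ReplicaGroupsOnNDPlane is memoized and returns MeshNDInfo records rather than a bare count.

```python
def count_spanned_torus_dims(groups):
    """Count torus axes (0=X, 1=Y, 2=Z) on which some group's members differ.

    groups: list of replica groups, each a list of (x, y, z) chip coordinates.
    """
    spanned = set()
    for group in groups:
        for axis in range(3):
            # An axis is spanned if the group occupies more than one coordinate on it.
            if len({chip[axis] for chip in group}) > 1:
                spanned.add(axis)
    return len(spanned)
```

A group confined to a single row spans one dimension; a group covering a plane spans two.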

The per-dimension ICI resource map is shared too: GetResourceFromIciResource @0x1c894c00 maps IciResource ∈ [1..6] to ResourceVector slots {0xd,0xe | 0xf,0x10 | 0x11,0x12} = 3 torus dimensions (Y, X, Z) × 2 ring directions (±). The degraded-axis remap demotes a failed axis's two slots out of the primary ring (see Degraded-Axis Ingest).
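The slot layout described above can be sketched as a small lookup. The grouping into three axis pairs and the slot range 0xd..0x12 come from this page; the strictly linear mapping (IciResource k lands on slot 0xc + k) and the helper name are assumptions, and which member of each pair is the "+" direction is left unspecified.

```python
ICI_SLOT_BASE = 0xD                 # first ICI slot in the ResourceVector
TORUS_DIM_ORDER = ("Y", "X", "Z")   # axis order of the three slot pairs


def resource_slot_for_ici_resource(ici_resource: int):
    """Map IciResource in [1..6] to (slot, torus axis, direction index).

    Assumes a linear mapping; the direction index is 0 or 1, with no claim
    about which one corresponds to the '+' ring direction.
    """
    if not 1 <= ici_resource <= 6:
        raise ValueError("IciResource must be in [1..6]")
    index = ici_resource - 1
    return ICI_SLOT_BASE + index, TORUS_DIM_ORDER[index // 2], index % 2
```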


2. End-to-end flow

The lowering of one collective op, from HLO to emitted ICI traffic, proceeds through the stages below. Stages 3–5 are specific to the dense TC path; the SC-offload path forks at the stage 2 gate and replaces them with the config builder of §1.2.

HLO collective op  (all-reduce / all-gather / reduce-scatter / all-to-all / collective-permute)
        │
   [1]  classify opcode → IsNonFusionCollective; read SparseCoreConfig.offload (field 2)
        │
   [2]  SUBSTRATE GATE
        │   SC-offload? (Megachip ∧ CoresPerChip(SC)>0 ∧ (Target[+0x628]&4 ∨ Target[+0x540])
        │                ∧ ModuleContainsLEMSparseCoreInstruction ∧ FLAGS_xla_sc_enable_lhs)
        ├── yes ─────────────► SC-OFFLOAD: ConstructConfigForCollectiveUniDirNDGroups<*>
        │                       → CollectiveIciStrategyConfig (per-color UNIDIR rings)
        │                       → SparseCore latency-hiding schedule
        │
        └── no  (dense TensorCore path)
                 │
            [3]  STRATEGY SELECTION   BaseStrategyND::SelectNDStrategy @0x137c78e0
                 │   entry fold: is_cross_module &= hlo->IsCrossReplicaAllReduce()
                 │   sub-plane? ND-plane? N-Way? twisted-torus? strided? → subgroup default
                 │   (degraded-axis remap folds in here via ComputeColorDimensions)
                 │
            [4]  ND-STRATEGY / TOPOLOGY
                 │   ReplicaGroupsOnNDPlane(plane=2) → mesh-dim count over physical torus
                 │   ComputeColorDimensions → [6][3] per-color ring-dimension table
                 │
            [5]  ROUTE-TABLE GENERATION  (per-color RingLocation neighbor schedule)
                 │
            [6]  BARRIER / SYNC          (replica/TensorCore barrier, SFLAG binding)
                 │
            [7]  ICI DMA EMISSION        (per-torus-dimension ring DMA descriptors)

The cost model (CostModel::GetCollectiveCycles @0x130abfc0) runs orthogonally to stages 3–7: it consumes the same topology dimension count that stage 4 derives, and deposits per-torus-dimension cycle estimates into the scheduler's ResourceVector. It does not select the emitter algorithm. Routing, twist geometry, and barriers are sibling sections — see Routing, Twisted Torus, Barriers, and the ICI fabric for the DMA layer.

2.1 Scheduler resource spaces

Stages 6–7 are governed by a latency-hiding scheduler whose async tracker classifies each collective into resource slots and enforces concurrency caps. Three trackers exist, installed on different substrates within the same compile:

| Tracker (vtable) | Substrate | GetResourcesFromInstructionImpl | Resource space | Classifier key |
|---|---|---|---|---|
| jellyfish TpuAsyncTracker | TensorCore LHS | @0x11001040 (own + 6 MayAdd* helpers) | {13..46} | opcode + SparseCoreConfig.offload + collective_id |
| SparseCoreAsyncTracker @0x2190da10 | plain SC LHS | @0x136122a0 (base opcode→rt only) | base {0..12} (no target resources) | opcode {0xc, 0x10/0x11, 0x31} + custom-call target "AllToAllDynamic" |
| SparseCoreResourceAwareAsyncTracker @0x2190e1b0 | cost-model SC LHS | @0x134a7580 (own jump table) | {13..17} = {1, 20, 5, 1, 1} | opcode−3 jump → SCS / SCT / ICI / LocalReduction / 2DAllToAll |

When the SC-offload gate (§5) holds, the plain SparseCoreAsyncTracker runs first (base-resource throttle + the FindNearestAllToAlls all-to-all post-process), then the resource-aware tracker refines with the {13..17} caps. The plain tracker is the surprising one: it defines no target resources and overrides only IsSupportedAsyncStart/Done (@0x134964c0 / 0x13496520) and PostProcessScheduleGraph — its async-schedulable set is opcode all-to-all (0xc), async start/done (0x11/0x10), and custom-call (0x31) iff the target name maps to SparseCoreOperationType == 8 ("AllToAllDynamic").
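The plain tracker's async-schedulable set and the resource-aware caps can be summarized in a few lines. This is a sketch under stated assumptions: the opcode integers, the operation-type value 8, and the caps are taken from this page, while the Python names, the dictionary lookup, and the function shape are hypothetical and do not mirror the binary's IsSupportedAsyncStart/Done implementation.

```python
ALL_TO_ALL, ASYNC_DONE, ASYNC_START, CUSTOM_CALL = 0xC, 0x10, 0x11, 0x31

# Only this target-name -> SparseCoreOperationType pair is stated on this page.
SC_OPERATION_TYPE_BY_TARGET = {"AllToAllDynamic": 8}

# Resource-aware tracker: slot -> (resource name, concurrency cap).
RESOURCE_AWARE_CAPS = {
    13: ("SCS", 1),
    14: ("SCT", 20),
    15: ("ICI", 5),
    16: ("LocalReduction", 1),
    17: ("2DAllToAll", 1),
}


def plain_tracker_is_async_schedulable(opcode, custom_call_target=None):
    """True for all-to-all, async start/done, or an AllToAllDynamic custom call."""
    if opcode in (ALL_TO_ALL, ASYNC_START, ASYNC_DONE):
        return True
    if opcode == CUSTOM_CALL:
        return SC_OPERATION_TYPE_BY_TARGET.get(custom_call_target) == 8
    return False
```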


3. The collective op family

The HLO opcode integers below were length-verified via the HloOpcodeString table and confirmed in the GetCollectiveCycles jump table (decompile cases 6/8, 9/11, 12, 34/36, 86, 93). Each data-carrying opcode routes to a per-kind cost branch; the async shells (-start/-done) and collective-broadcast contribute zero ICI bundle cost — the cost is charged on the data-carrying opcode.

| Collective (opcode) | Cost branch (GetCollectiveCycles) | Dense emitter / strategy | Per-page |
|---|---|---|---|
| all-gather (6), all-gather-start (8) | AllGather branch @0x130ac06c (1D ÷2, 2D ÷4) | StrategyND ND-ring (UseAllGather2D) | AllGather ND-Ring |
| all-reduce (9), all-reduce-start (11) | AllReduce ND-plane branch @0x130ac14c (÷2·num_dims) | sub-plane / hierarchical / pincer / binomial | Hierarchical / Pincer, Binomial / Recursive-Doubling |
| reduce-scatter (93) | AllReduce-family path (RS phase, ÷2·num_dims) | RS phase of the AR decomposition | ReduceScatter |
| all-to-all (12) | ComputeAllToAllCycles @0x130ae8e0 (÷EstimatePhysicalLinksUsed) | all-link saturating | AllToAll Tables |
| ragged-all-to-all (86) | ComputeRaggedAllToAllCycles @0x130aea80 (shares A2A helper) | ragged A2A | AllToAll Tables |
| collective-permute (34), -start (36) | CollectivePermute branch @0x130ac40f (÷1, single-link) | point-to-point | (cost: SPMD Link-Count Cost) |
| *-done (7/10/35), collective-broadcast (33), cp-done | default @0x130ae546 | 0 cycles | |

Notes on the cost shape (full per-kind formulas live in SPMD Link-Count Cost):

  • The all-reduce ND-plane branch charges B = 2 · operand_size (reduce-scatter + all-gather phases) over num_dims = popcnt(active torus axes), depositing into the two ICI slots of each active dimension.
  • All-to-all / ragged-all-to-all divide by EstimatePhysicalLinksUsed and a {1D→2.0, 2D→4.0} per-link table, saturating all six ICI slots 0xd..0x12.
  • Collective-permute is the single point-to-point case (÷1, no ×2 bidirectional factor); AllPairsUseSameIciLink narrows the deposit to one ICI resource when every (src,dst) pair rides the same link.
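To make the deposit shapes in the notes above concrete, here is a minimal sketch that works in bytes rather than cycles. The factor 2 for all-reduce, the ÷2·num_dims split, the {1D→2.0, 2D→4.0} table, and the slot numbers come from this page; how these terms compose into a single per-slot figure, and the function names, are assumptions. The full per-kind formulas live on the SPMD Link-Count Cost page.

```python
# Slot pairs per torus axis, as listed for GetResourceFromIciResource.
ICI_SLOTS = {"Y": (0xD, 0xE), "X": (0xF, 0x10), "Z": (0x11, 0x12)}
# Per-link factor quoted for all-to-all; no 3D entry is given on this page.
A2A_PER_LINK_FACTOR = {1: 2.0, 2: 4.0}


def all_reduce_deposit(operand_size, active_axes):
    """Spread B = 2 * operand_size over the two slots of each active axis."""
    axes = set(active_axes)
    num_dims = len(axes)  # popcount of the active torus axes
    if num_dims == 0:
        raise ValueError("all-reduce needs at least one active torus axis")
    per_slot = (2 * operand_size) / (2 * num_dims)
    return {slot: per_slot for axis in axes for slot in ICI_SLOTS[axis]}


def all_to_all_deposit(operand_size, links_used, mesh_dims):
    """Divide by links used and the per-link factor; charge all six ICI slots."""
    per_slot = operand_size / (links_used * A2A_PER_LINK_FACTOR[mesh_dims])
    return {slot: per_slot for pair in ICI_SLOTS.values() for slot in pair}
```

With two active axes, the all-reduce sketch charges four slots equally; the all-to-all sketch always touches slots 0xd..0x12.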

The cost-vs-cost decision that does happen is in the SPMD partitioner: GetCommunicationMultiplier @0x127a16c0 returns ReplicaGroupsOnNDPlane(plane=2).num_mesh_dims + 1 as the link-count divisor (confirmed in the decompile: return (unsigned int)v7 + 1 after the plane=2 query, with the GetMultiSliceTopology fork to the inter-slice rate). See SPMD Link-Count Cost.


4. Strategy selection (dense)

BaseStrategyND::SelectNDStrategy is the dense-substrate picker. It splits on the enable_sub_plane argument, then on topology and TpuCompEnv flags, producing one of five terminal strategy classes. The decision is predicate-and-flag driven; the table below summarizes the branch order (full derivation, guard predicates, VLOG names, and object sizes in SelectNDStrategy).

| Order | Guard (summary) | Strategy built | VLOG name |
|---|---|---|---|
| A | enable_sub_plane ∧ all-reduce ∧ !cross_module ∧ env[0xe1f] ∧ single ND-plane | StrategySubgroupND (0x638) | "Enabling ND sub-plane allreduce" |
| B | !enable_sub_plane ∧ IsGroupNDPlane ∧ env[0x1015] | StrategyND ND-ring (0x5f0) | "Enabling 2-D algorithm …" |
| C-i | cross_module ∧ UseSpecialStrategyNDNWay (single-slice, 2-/4-way) | StrategyND N-Way | "Enable Strategy NDNway" |
| C-ii | single-module ∧ twisted-torus shape (2·a == dim) | TwistedTorusND (0x610) | "AllReduceEmitter: Choosing twisted topology" |
| C-iii | UseStridedStrategyND (single-slice, NumNetDims==3, LDPC==1) | StrategyND strided | "Enable StridedStrategyND" |
| D | else | StrategyND default ND-ring | "Enable StrategySubgroupND." |
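The branch order in the table above can be sketched as a first-match cascade. This is an illustration only: each guard is reduced to a precomputed boolean, "single-module" is read as "not cross_module", and the flat first-match evaluation is an assumption about how the rows nest. The `Guards` type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Guards:
    enable_sub_plane: bool = False
    is_all_reduce: bool = False
    cross_module: bool = False
    env_0xe1f: bool = False
    single_nd_plane: bool = False
    is_group_nd_plane: bool = False
    env_0x1015: bool = False
    use_special_nd_nway: bool = False
    twisted_torus_shape: bool = False
    use_strided_nd: bool = False


def select_nd_strategy(g: Guards):
    """Return (table row, strategy built) for the first guard that holds."""
    if (g.enable_sub_plane and g.is_all_reduce and not g.cross_module
            and g.env_0xe1f and g.single_nd_plane):
        return "A", "StrategySubgroupND"
    if not g.enable_sub_plane and g.is_group_nd_plane and g.env_0x1015:
        return "B", "StrategyND ND-ring"
    if g.cross_module and g.use_special_nd_nway:
        return "C-i", "StrategyND N-Way"
    if not g.cross_module and g.twisted_torus_shape:
        return "C-ii", "TwistedTorusND"
    if g.use_strided_nd:
        return "C-iii", "StrategyND strided"
    return "D", "StrategyND default ND-ring"
```

No cost comparison appears anywhere in the cascade, which matches the flag-and-shape-driven contract stated at the top of the page.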

Each StrategyND then resolves UniDirection1DRingStrategy vs UniDirectionNDRingStrategy in BuildStrategy @0x137c4660 via the [obj+0xa8] gate. The per-color ring-dimension table comes from ComputeColorDimensions @0x137c3ba0, which is where the degraded-axis fault-tolerant remap folds in (a failed torus axis is demoted to the inner ring dimension; the effective dimension count drops 3→2) — see Degraded-Axis Ingest. The twisted-torus geometry is its own section: Twisted Torus.


5. The SC-offload gate

SparseCore offload is enabled only when all of the following hold (confirmed byte-exact in SparseCoreCompiler::RunHloScheduler @0x1306f820; offsets 1576 = 0x628, 1344 = 0x540 match the decompiled (*((_BYTE *)v6 + 1576) & 4) != 0 || *((_BYTE *)v6 + 1344)):

runSC =  TpuChipConfig::Megachip(Target chip-config)                    @0x1306f84c
      ∧  CoresPerChip(kSparseCore) > 0                                  @0x1306f863
      ∧  ( Target[+0x628] & 4   ∨   Target[+0x540] ≠ 0 )                @0x1306f86c / @0x1306f87a
      ∧  ModuleContainsLEMSparseCoreInstruction(module)                @0x1306fbc8
      ∧  FLAGS_xla_sc_enable_latency_hiding_scheduler                  @0x1306fc04

The two Target fields are written in jellyfish::Target::Init @0x1d60fc20:

  • Target[+0x628] & 4 — the SC-offload-capability has-bit (|= 0x4 @0x1d612121), OR'd in inside a config-append loop gated by the same SC-offload feature-detect predicate. This is the real-hardware path.
  • Target[+0x540] — a platform-type bool, set (TpuTopology[+0] == 2) @0x1d610b1b (the iss/simulator platform), which force-takes the SC path for the simulator.
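The gate reads naturally as a single boolean expression. The sketch below restates the runSC conjunction with each term passed in as a plain value; the parameter names are hypothetical, and the function does not model where the binary obtains each input.

```python
def run_sc_offload(megachip: bool,
                   sc_cores_per_chip: int,
                   target_byte_0x628: int,
                   target_bool_0x540: bool,
                   module_has_lem_sc_instruction: bool,
                   lhs_flag_enabled: bool) -> bool:
    """All five terms must hold; the capability term is an OR of two Target fields."""
    # Real-hardware has-bit (mask 0x4) OR the simulator platform bool.
    capability = bool(target_byte_0x628 & 0x4) or bool(target_bool_0x540)
    return (megachip
            and sc_cores_per_chip > 0
            and capability
            and module_has_lem_sc_instruction
            and lhs_flag_enabled)
```

Either Target field alone satisfies the capability term, so a simulator platform passes even when the has-bit is clear.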

The per-generation hardware basis is TpuVersion == 5 (the newest generation, codename obfuscated as "6acc60406" in this build): ShouldEnableConcurrentSparseCoreOffloading @0x1d6b6f80 and EnableSparseCoreOffloadQueuingInLhs @0x1d6b81e0 both default to (TpuChipParts[+0] == 5), overridable by an AutoOr<bool> proto flag. The internal TpuVersion enum is 0 jellyfish, 1 dragonfish, 2 pufferfish, 3 viperfish, 4 ghostlite, 5 "6acc60406" (proto value = internal + 1).
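The version numbering and the overridable default can be sketched as follows. The enum values and the internal + 1 proto offset are from the paragraph above; modelling AutoOr<bool> as an optional boolean (None meaning "auto") is an assumption, and the member name for value 5 is a placeholder because the real codename is obfuscated in this build.

```python
from enum import IntEnum
from typing import Optional


class InternalTpuVersion(IntEnum):
    JELLYFISH = 0
    DRAGONFISH = 1
    PUFFERFISH = 2
    VIPERFISH = 3
    GHOSTLITE = 4
    OBFUSCATED_6ACC60406 = 5  # placeholder name for the obfuscated codename


def to_proto_value(version: InternalTpuVersion) -> int:
    """Proto enum value is the internal value plus one."""
    return int(version) + 1


def default_sc_offload_enabled(version: int, override: Optional[bool] = None) -> bool:
    """Default is (version == 5) unless an explicit override is supplied."""
    if override is not None:
        return override
    return version == InternalTpuVersion.OBFUSCATED_6ACC60406
```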

5.1 Op-type classification: SparseCoreConfig.offload

Once gated in, the SparseCore op type is read from SparseCoreConfig field 2 offload, a TYPE_ENUM of type xla::jellyfish::Offload (struct offset +0x24, has-bit +0x10 mask 0x4). It is a backend-config enum — not a custom-call target name and not an MLIR op kind. This enum routes the op into the scheduler's kSparseCore* resource arms:

| Offload value | Enumerator | Resource arm (scheduler, idx = enum − 2) |
|---|---|---|
| 0 | OFFLOAD_UNSPECIFIED | (none; rt22 ×N-cores path) |
| 1 | OFFLOAD_EMBEDDING | (none; rt22 ×N-cores path); reservation map only |
| 2 | OFFLOAD_GATHER | rt23 kSparseCoreGather |
| 3 | OFFLOAD_SCATTER | rt24 kSparseCoreScatter |
| 4 | OFFLOAD_COLLECTIVE | async-body recurse |
| 5 | OFFLOAD_DATA_FORMATTING | rt25 kSparseCoreDataFormatting |
| 6 | OFFLOAD_KERNEL | rt26 kSparseCoreKernel |
| 7 | OFFLOAD_SORT | rt27 kSparseCoreSort |
| 8 | OFFLOAD_COMPUTE | (none; rt22 ×N-cores path) |

The OFFLOAD_COLLECTIVE case (4) is the one that reaches the offload collective config builder of §1.2. Full enum derivation and the reservation-map twin (GetSparseCoreResources, idx = enum − 1) are in SC Core-Selection (Offload) and SC-Offload Config Builder.
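The routing in the table can be written as a lookup with two special cases. The enumerators and arm labels are copied from the table; the Python enum, the dictionary, and the returned strings are an illustrative encoding, not the scheduler's actual jump table.

```python
from enum import IntEnum


class Offload(IntEnum):
    UNSPECIFIED = 0
    EMBEDDING = 1
    GATHER = 2
    SCATTER = 3
    COLLECTIVE = 4
    DATA_FORMATTING = 5
    KERNEL = 6
    SORT = 7
    COMPUTE = 8


DEDICATED_ARM = {
    Offload.GATHER: "rt23 kSparseCoreGather",
    Offload.SCATTER: "rt24 kSparseCoreScatter",
    Offload.DATA_FORMATTING: "rt25 kSparseCoreDataFormatting",
    Offload.KERNEL: "rt26 kSparseCoreKernel",
    Offload.SORT: "rt27 kSparseCoreSort",
}


def scheduler_arm(offload: Offload) -> str:
    """Return the scheduler arm label for an Offload value."""
    if offload == Offload.COLLECTIVE:
        return "async-body recurse"  # the case that reaches the config builder
    # Values without a dedicated arm fall through to the rt22 path.
    return DEDICATED_ARM.get(offload, "rt22 xN-cores path")
```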

5.2 What the SC substrate emits

The offload config builder (ConstructConfigForCollectiveUniDirNDGroups<*>) produces a CollectiveIciStrategyConfig proto nest rather than HLO ReplicaGroup lists. The shape of that nest is the SC substrate's counterpart to the dense StrategyND per-color ring schedule:

{AllGather|AllReduce|ReduceScatter}OffloadConfig   (sizeof 0x48, byte-identical layout)
  └─ ici_strategy_config : CollectiveIciStrategyConfig   (field 2)
       └─ color_strategies[] : PerColorIciStrategyConfig
            └─ phase_rings[]  : IciStrategyRingConfig    (ring_type, ring_dim, core_count, …)

The ring dimensions are drawn from the same X/Y/Z torus extents as the dense path, via the IciStrategyRingDim enum (8 values): ICI_RING_DIM_{X,Y,Z}_{TORUS,MESH} (1/2, 3/4, 5/6) and ICI_RING_DIM_D2D (7), with 0 invalid. UNIDIR rings emit ICI_RING_TYPE_UNIDIR_CW / _CCW. Whether the builder emits a single flat ring per axis or a multi-phase hierarchical decomposition is the HierarchicalKind decision (an AutoOr<bool> packing of xla_tpu_enable_sparse_core_hierarchical_all_reduce; AllGather/ReduceScatter are pinned flat, only AllReduce can be hierarchical in this build) — see HierarchicalKind. The SC twisted-torus path branches off the same K/2K mesh-dimension count gate the dense path uses; see Twisted Torus.
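The nesting and the ring-dimension enum can be mirrored with plain dataclasses, which may help when reading dumped configs. Field names follow the tree above and carry only the three ring fields it names; the numeric assignment reads "(1/2, 3/4, 5/6)" as TORUS then MESH per axis. The member name for value 0, the string ring types, and the `flat_config` helper are hypothetical, and this is not the generated proto API.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class IciStrategyRingDim(IntEnum):
    INVALID = 0   # value 0 is invalid; this member name is hypothetical
    X_TORUS = 1
    X_MESH = 2
    Y_TORUS = 3
    Y_MESH = 4
    Z_TORUS = 5
    Z_MESH = 6
    D2D = 7


@dataclass
class IciStrategyRingConfig:
    ring_type: str                  # e.g. "ICI_RING_TYPE_UNIDIR_CW"
    ring_dim: IciStrategyRingDim
    core_count: int


@dataclass
class PerColorIciStrategyConfig:
    phase_rings: List[IciStrategyRingConfig] = field(default_factory=list)


@dataclass
class CollectiveIciStrategyConfig:
    color_strategies: List[PerColorIciStrategyConfig] = field(default_factory=list)


def flat_config(ring_dim, core_count, ring_types):
    """Hypothetical helper: one color per ring type, one single-phase ring each."""
    colors = [
        PerColorIciStrategyConfig([IciStrategyRingConfig(rt, ring_dim, core_count)])
        for rt in ring_types
    ]
    return CollectiveIciStrategyConfig(colors)
```

A flat decomposition has exactly one entry in each `phase_rings`; a hierarchical one would carry several phases per color.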


6. Verification notes

Substrate split, end-to-end flow, op-family dispatch, and the SC-offload gate were all cross-checked against the IDA decompile of libtpu.so v0.0.40:

  • CostModel::GetCollectiveCycles @0x130abfc0: opcode jump table cases 6/8 (AllGather → UseAllGather2D), 9/11/93 (AllReduce-family → ComputeAllReduceCycles), 12 (ComputeAllToAllCycles), 86 (ComputeRaggedAllToAllCycles), 34/36 (CollectivePermute); TensorCoreFrequencyInMegaHertz; GetCollectiveOffloadConfig SC-offload probe — all present.
  • GetCommunicationMultiplier @0x127a16c0: ReplicaGroupsOnNDPlane(…, 2, 0) then return (unsigned int)v7 + 1; GetMultiSliceTopology fork; ConstructSliceTransferGroup(mode=3) — exact.
  • BaseStrategyND::SelectNDStrategy @0x137c78e0: StrategySubgroupND, StrategyND (NDNway/strided), TwistedTorusND constructed; IsGroupNDPlane, UseSpecialStrategyNDNWay, UseStridedStrategyND guards; ReplicaGroupsOnNDPlane(plane=2); VLOG strings — exact.
  • SparseCoreCompiler::RunHloScheduler @0x1306f820: Megachip ∧ CoresPerChip(SC)>0 ∧ ((Target[+0x628]&4) ∨ Target[+0x540]) ∧ ModuleContainsLEMSparseCoreInstruction ∧ FLAGS_xla_sc_enable_latency_hiding_scheduler — exact (offsets 0x628/0x540 confirmed).

[LOW] Asynchronous-shell zero-cost set: opcodes 7/10/35 and collective-broadcast (33) are assigned to the default (0-cycle) branch per the cost-table derivation, but only the data-carrying opcodes were individually re-confirmed in the live jump-table walk here. The behavior (cost charged on the data-carrying opcode) is consistent across the AllReduce/AllGather/AllToAll branches.


Cross-References

Dense TensorCore collectives

Cost model

  • SPMD Link-Count Cost: GetCommunicationMultiplier, per-kind GetCollectiveCycles formulas, ICI resource slots

SparseCore-offload substrate

Sibling subsystems

  • Routing — route-table generation, toroidal route cache, unicast emission
  • Twisted Torus — twisted-torus geometry, 2-phase replica-group construction
  • Barriers — replica / TensorCore barriers, SFLAG binding, tree-barrier vsync
  • ICI fabric — the inter-chip interconnect DMA layer