Physical-Core Placement
Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00; .text VMA == file offset 0xe63c000, .rodata VMA == file offset 0x84a0000).
Status: Reimplementation-grade · Evidence grade: Confirmed (byte-anchored): the physical_core_indices write path (AddCollectivePhysicalCoreIndicesHelper), the read-back (GetPhysicalCoreIndices), and the tensor_split_mode==2 classifier (TensorSplitPerCoreClassifier) were each cross-checked against the IDA decompile of all three functions; two residual sub-layouts are marked [LOW] below · Part XIII: On-Pod Collectives & Barriers / SparseCore-offload collectives
Abstract
This page documents the physical-core placement datapath of a SparseCore-offloaded embedding collective: how a collective's logical colors land on a concrete set of physical SparseCore (SC) cores, and how the per-core tensor-split (tensor_split_mode == 2) emission is keyed. It owns three byte-exact mechanisms:
- the physical_core_indices fill: AddCollectivePhysicalCoreIndices (@0x1c868500), a recursive async/fusion walker, and its writer AddCollectivePhysicalCoreIndicesHelper (@0x1c868920), which copies the chosen-core absl::Span<long const> verbatim (truncated long→int32) into the per-collective *OffloadConfig variant's repeated int32 physical_core_indices (proto field 4);
- the physical_core_indices index-array layout + read-back: the proto2 RepeatedField<int32> at [variant+0x18..+0x20], read back (widened to long) by GetPhysicalCoreIndices (@0x1c8692e0) into a StatusOr<absl::InlinedVector<long,4>>;
- the tensor_split_mode==2 per-core emission key: TensorSplitPerCoreClassifier (@0x13379de0), which maps a per-color UniDirRingStrategy to a classification key, either (axis class code 0/2/4) | (tensor-split bool) for the torus axes or the raw D2DUniDirRingStrategy::core_classifier_ count (no split_bit OR) for kCoresOnChip, and the CanBeCombined (@0x13379d60) predicate that merges per-color strategies into one per-SC-core partial-tensor op.
The core-selection policy — which cores are eligible and the order in which SelectCores builds the chosen list — is not owned here; it lives on SC Core-Selection (Offload) and SC Core Selection. The ND-plane / tensor-split-factor derivation lives on Tensor-Split / ND-Plane. This page picks up at the moment the chosen-core Span exists and follows it into the proto, and picks up the split-mode-2 emission at the classifier that keys it.
Contract of the placement datapath as observed in the binary:
- physical_core_indices is the sorted set of chosen physical SC core IDs produced by SparseCoreQueueAssignment::SelectCores, numerically __sorted by the caller AssignQueueIDsToAsyncStart (@0x10fdf480), then handed to AddCollectivePhysicalCoreIndices as an absl::Span<long const>.
- The fill is purely a copy: AddCollectivePhysicalCoreIndicesHelper truncates each long to int32 and appends it to the variant's RepeatedField<int32>; there is no per-element transformation, scaling, or re-indexing.
- The same Span drives two consumers: it is copied into physical_core_indices and appended to the MegaChipParallelismConfig repeated-long field, so the placement and the mega-chip parallelism share one selected-core list.
- The walker is recursive over fusion bodies: an async-wrapped fusion's called_computations are walked, and every sub-collective opcode in {6, 9, 12, 86, 93} receives the same Span.
- TensorSplitPerCoreClassifier returns a per-SC-core grouping/sort discriminant, not an instruction; for IciDim::kCoresOnChip the discriminant is the D2DUniDirRingStrategy::core_classifier_ count, i.e. the megacore cross-core split (LogicalDevicesPerChip(SparseCore) == 2), the SC analog of the dense TensorCore megacore data-split.
At a glance
| Aspect | Value (byte-anchored) |
|---|---|
| Walker (recursive) | AddCollectivePhysicalCoreIndices @0x1c868500 → bool |
| Writer | …Helper @0x1c868920 (anon-ns) → bool |
| Reader | GetPhysicalCoreIndices @0x1c8692e0 → StatusOr<InlinedVector<long,4>> |
| Producer (Span) | AssignQueueIDsToAsyncStart @0x10fdf480 (call @0x10fdff30) |
| Sibling producer | OffloadCollective @0x10fc75e0 (SetMegaChipParallelism + AddCollective…) |
| Index array | proto2 RepeatedField<int32> at [variant+0x18] flags / [+0x1c] size / [+0x20] Rep* (data at Rep+0x8) |
| Oneof variant ptrs | AR [CollectiveOffloadConfig+0x18] · AG [+0x20] · RS [+0x28] (discriminant dword [+0x10]) |
| Walker opcode set | {6, 9, 12, 86, 93} (jump table @0xb438098, mask 0x1240 + explicit cmp 0x56/0x5d) |
| Read-back error lines | 523 / 527 / 565 (backend_config_util.cc), all MakeErrorImpl<13> (INTERNAL) |
| Split-2 classifier | TensorSplitPerCoreClassifier @0x13379de0 → axis 0/2/4 \| byte[s+0x43]; kCoresOnChip → raw D2D core_classifier_ (no OR) |
| ici_dim() | vtable +0x68 = byte[s+0x80] \| 0x100 (AutoOr-engaged IciDim) |
| Combine key | core_count() (vtable +0x30 = qword[s+0x88]) via CanBeCombined @0x13379d60 |
| Reader consumer | CheckCoreAssignmentConsistency @0x1c869cc0 |
1. The physical_core_indices field — array layout
physical_core_indices is proto field 4, a repeated int32, carried inside each per-collective *OffloadConfig variant (AllGatherOffloadConfig / AllReduceOffloadConfig / ReduceScatterOffloadConfig), which are byte-identical (see SC-Offload Config Builder §4). The field is a standard proto2 RepeatedField<int32> and occupies four offsets inside the variant:
| Offset (rel. variant) | Field | Meaning |
|---|---|---|
| [variant+0x10] | int32 has-bits | bit0 = physical_core_indices present |
| [variant+0x18] | RepeatedField flags \| arena tag | low bit = "has heap Rep" |
| [variant+0x1c] | int32 current_size | element count |
| [variant+0x20] | Rep* | element data starts at Rep+0x8 (int32[]) |
Both the writer and the reader agree on this layout. In …Helper @0x1c868920 the AR fill clears *(_DWORD*)(v9+28) (= [+0x1c] size) and *(_BYTE*)(v9+16) &= ~1u (= [+0x10] has-bit), then per element stores v16[(int)v14 + 2] = v17 (= Rep+0x8 + idx*4), sets *(_DWORD*)(v9+28) = size+1, and *(_BYTE*)(v9+16) |= 1u. In GetPhysicalCoreIndices @0x1c8692e0 the read picks Rep* from *(_QWORD*)(v9+32) ([+0x20]) when the heap-bit *(_BYTE*)(v9+24) & 1 ([+0x18]) is set, takes the count from *(int*)(v9+28) ([+0x1c]), and reads element data from Rep+0x8.
NOTE: [variant+0x28] is a separate scalar that the ctor zeroes (*(_DWORD*)(v_variant+0x28)) and the fill never touches; it is not part of the physical_core_indices repeated field, which occupies [+0x18..+0x20] as tabulated above.
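To make the offsets above concrete, the following is a minimal Python sketch that decodes the field from a flat byte image. Only the offsets come from the table; the flat-memory model (a "pointer" is just an offset into the image), the little-endian layout (assumed from the x86_64 target), and the helper name `read_physical_core_indices` are illustrative assumptions. The inline case (heap-bit clear) is deliberately not modeled.

```python
import struct

# Offsets relative to the *OffloadConfig variant, taken from the table above.
HAS_BITS, FLAGS, SIZE, REP_PTR = 0x10, 0x18, 0x1C, 0x20
REP_DATA = 0x8  # element data starts at Rep+0x8

def read_physical_core_indices(mem: bytes, variant: int) -> list:
    """Decode physical_core_indices from a flat little-endian memory image.

    Hypothetical helper: `mem` and offset-as-pointer are modeling assumptions,
    not the proto2 runtime.
    """
    size = struct.unpack_from("<i", mem, variant + SIZE)[0]
    if size == 0:
        return []
    if not mem[variant + FLAGS] & 1:
        raise NotImplementedError("inline (non-heap) storage is not modeled")
    rep = struct.unpack_from("<Q", mem, variant + REP_PTR)[0]
    return list(struct.unpack_from(f"<{size}i", mem, rep + REP_DATA))

# Toy image: variant at offset 0, Rep block at offset 0x40.
image = bytearray(0x60)
struct.pack_into("<i", image, HAS_BITS, 1)       # has-bit 0 set
image[FLAGS] = 1                                 # low bit: heap Rep present
struct.pack_into("<i", image, SIZE, 3)           # three elements
struct.pack_into("<Q", image, REP_PTR, 0x40)     # Rep* (modeled as an offset)
struct.pack_into("<3i", image, 0x40 + REP_DATA, 0, 2, 5)
decoded = read_physical_core_indices(bytes(image), 0)
```

In this toy image `decoded` holds the three stored core IDs in stored order.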
1.1 The oneof variant select
physical_core_indices is not a top-level *OffloadConfig field reached directly — it is reached through the CollectiveOffloadConfig oneof wrapper. Both the writer and the reader select the variant by the discriminant dword at [CollectiveOffloadConfig+0x10]:
| Discriminant bit | Variant | Variant ptr (rel. CollectiveOffloadConfig) |
|---|---|---|
| & 0x1 | AllReduceOffloadConfig | [+0x18] (*((_QWORD*)cfg+3)) |
| & 0x2 | AllGatherOffloadConfig | [+0x20] (*((_QWORD*)cfg+4)) |
| & 0x4 | ReduceScatterOffloadConfig | [+0x28] (*((_QWORD*)cfg+5)) |
| & 0x8 | RaggedAllToAllOffloadConfig | mutable_ragged_all_to_all_offload_config |
| & 0x10 | AllToAllOffloadConfig | mutable_all_to_all_offload_config |
The two *_to_all variants are handled by a templated lambda $_0<RaggedAllToAllOffloadConfig> / $_0<AllToAllOffloadConfig> rather than the inline AR/AG/RS loop, but the field layout (proto2 RepeatedField<int32> at the same relative offsets) is identical.
In …Helper, the discriminant is v8 = *((_DWORD*)v6 + 4) (= [cfg+0x10]), tested & 2 (AG, ptr *((_QWORD*)v6+4)), & 1 (AR, *((_QWORD*)v6+3)), & 4 (RS, *((_QWORD*)v6+5)), & 8 (ragged → mutable_ragged_all_to_all_offload_config(v6) + $_0), & 0x10 (a2a → mutable_all_to_all_offload_config(v6) + $_0). GetPhysicalCoreIndices selects the identical bits (v60 & 2 gates the "no collective offload config" error; then the same & 2 / & 1 / & 4 / & 8 / & 0x10).
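The variant select can be sketched as a small lookup. The bit values, test order, and pointer offsets are the ones listed above; treating the tests as a first-match chain is an assumption (the text gives the order of the tests, not whether they are mutually exclusive), and `select_variant` is a hypothetical name.

```python
from typing import Optional, Tuple

# (discriminant bit, variant name, variant ptr offset rel. CollectiveOffloadConfig).
# Order follows the order in which …Helper tests the bits. The two *_to_all
# variants are reached through mutable_* accessors, so no inline offset is given.
VARIANT_TESTS = [
    (0x02, "AllGatherOffloadConfig", 0x20),
    (0x01, "AllReduceOffloadConfig", 0x18),
    (0x04, "ReduceScatterOffloadConfig", 0x28),
    (0x08, "RaggedAllToAllOffloadConfig", None),
    (0x10, "AllToAllOffloadConfig", None),
]

def select_variant(discriminant: int) -> Optional[Tuple[str, Optional[int]]]:
    """Return (variant name, ptr offset) for the first matching bit, else None.

    First-match semantics are an assumption of this sketch.
    """
    for bit, name, ptr_offset in VARIANT_TESTS:
        if discriminant & bit:
            return name, ptr_offset
    return None

picked = select_variant(0x01)
```

Here `picked` names the AllReduce variant and its pointer slot at +0x18; a zero discriminant yields `None`.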
2. The fill path — AddCollectivePhysicalCoreIndices + …Helper
2.1 The recursive walker (@0x1c868500)
AddCollectivePhysicalCoreIndices(HloInstruction*, absl::Span<long const>) → bool is not the writer; it is a recursive collective-walker that forwards the same Span to every sub-collective:
AddCollectivePhysicalCoreIndices(hlo, Span indices):
CHECK(hlo->IsAsynchronous()) (@0x1e592520) // FATAL "inst->IsAsynchronous()" (.cc:433)
wrapped = hlo->async_wrapped_instruction() (@0x1e5aa300)
if opcode_byte[wrapped+0xc] != 0x1b (fusion) → return 1
for each instr in wrapped->called_computations() (@0x1e5885a0):
switch opcode_byte[instr+0xc] (jump table @0xb438098, opcode-6 index):
6 / 9 / 12 → AddCollectivePhysicalCoreIndicesHelper(instr, indices) // mask 0x1240
0x56 (86) → AddCollectivePhysicalCoreIndicesHelper(instr, indices) // explicit cmp
0x5d (93) → AddCollectivePhysicalCoreIndicesHelper(instr, indices) // explicit cmp
0x28 (40) → recurse into instr->called_computations()
other → skip
return 1
The opcode set {6, 9, 12, 86, 93} is the SparseCore async-collective plus all-reduce / all-gather / reduce-scatter / all-to-all family. Opcode 0x28 (fusion) triggers a recursive descent, so a nested fusion body is fully walked. The walk is depth-first; the same Span reaches every matched leaf.
[CONFIRMED: symbol + structure] The walker symbol …AddCollectivePhysicalCoreIndicesEPNS_14HloInstructionEN4absl4SpanIKlEE @0x1c868500 is present in *_functions.json, with its two source call sites in AssignQueueIDsToAsyncStart (@0x10fdff30) and OffloadCollective (@0x10fc871c). The entry is guarded by a FATAL CHECK(inst->IsAsynchronous()) (backend_config_util.cc:433): a non-async input aborts rather than silently doing nothing. The jump-table form (@0xb438098, 0x23 dwords, opcode−6 index; mask 0x1240 = bits 6/9/12; explicit cmp eax, 0x56/0x5d; recurse target for 0x28) is byte-traced from the disassembly. The Helper call forwards the same (rsi=data, rdx=count) pair to each matched instruction.
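The walk described in this section can be sketched in Python as follows. The opcode constants are copied from the pseudocode above; the `Instr` class, the `helper` callback, and modeling the FATAL CHECK as an exception are assumptions made for illustration, and the mask is assumed to be tested against the raw opcode value.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

LEAF_MASK = 0x1240            # bits 6, 9, 12
LEAF_EXPLICIT = {0x56, 0x5D}  # 86, 93 (explicit compares)
RECURSE_OPCODE = 0x28         # the recursion case in the pseudocode
WRAPPED_OPCODE = 0x1B         # opcode the async-wrapped instruction must have

@dataclass
class Instr:
    """Hypothetical stand-in for HloInstruction."""
    opcode: int
    called_computations: List[List["Instr"]] = field(default_factory=list)
    wrapped: Optional["Instr"] = None  # set only on an asynchronous instruction

def is_leaf(opcode: int) -> bool:
    return ((LEAF_MASK >> opcode) & 1) == 1 or opcode in LEAF_EXPLICIT

def walk_body(instr: Instr, indices: List[int], helper: Callable) -> None:
    # Depth-first: every matched leaf receives the same indices object.
    for computation in instr.called_computations:
        for sub in computation:
            if is_leaf(sub.opcode):
                helper(sub, indices)
            elif sub.opcode == RECURSE_OPCODE:
                walk_body(sub, indices, helper)

def add_collective_physical_core_indices(hlo: Instr, indices, helper) -> bool:
    if hlo.wrapped is None:
        raise RuntimeError("inst->IsAsynchronous()")  # models the FATAL CHECK
    if hlo.wrapped.opcode != WRAPPED_OPCODE:
        return True
    walk_body(hlo.wrapped, indices, helper)
    return True

seen = []
body = [Instr(9), Instr(RECURSE_OPCODE, [[Instr(6), Instr(7)]]), Instr(0x5D)]
start = Instr(0, wrapped=Instr(WRAPPED_OPCODE, [body]))  # outer opcode is not inspected
ok = add_collective_physical_core_indices(
    start, [0, 2], lambda i, idx: seen.append((i.opcode, tuple(idx))))
```

In this toy tree the helper is invoked for opcodes 9, 6, and 93 in that order, each with the same index list; opcode 7 is skipped.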
2.2 The writer (…Helper @0x1c868920)
AddCollectivePhysicalCoreIndicesHelper(hlo, Span data, count) → bool:
bc = GetBackendConfig(hlo) (@0x1c8664a0) // absent → return 1
cfg = bc.CollectiveOffloadConfig ([BackendConfig+0x178], default-constructed if missing)
select variant by discriminant [cfg+0x10] (AR/AG/RS/ragged/a2a, §1.1)
// FILL (byte-identical for AR/AG/RS):
variant.physical_core_indices.current_size = 0 ([variant+0x1c] = 0)
variant.has-bit &= ~1 ([variant+0x10] &= 0xfe)
for i in 0 .. count-1:
v = (int32) data[i] // movsxd/trunc: long → int32
RepeatedField<int32>::Add(v) // GrowNoAnnotate @0xe68d9e0 if full;
// store [Rep+0x8 + size*4]
variant.physical_core_indices.current_size = i+1 ([variant+0x1c] = i+1)
variant.has-bit |= 1 ([variant+0x10] |= 1)
// FINALIZE:
cloned = CloneBackendConfigProto(bc) (@0x1e60dac0)
BackendConfigWrapper::operator=([hlo+0x68], cloned) (@0x1e60de40)
return 1
The fill is a verbatim copy: the only transformation is the long→int32 truncation that proto field 4's int32 element type forces. The selected-core IDs the producer chose are written unchanged, in the order the caller sorted them.
The AR fill (v8 & 1 branch) in the decompile: *(_DWORD*)(v9+28) = 0, *(_BYTE*)(v9+16) &= ~1u, loop reads v17 = *(_DWORD*)(v11 + 8*v14) (truncating the 8-byte Span element to a 4-byte store), GrowNoAnnotate<…>(v13, v9, …) on full, stores v16[(int)v14 + 2] = v17, sets *(_DWORD*)(v9+28) = v18 (size+1) and *(_BYTE*)(v9+16) |= 1u. AG (v8 & 2) and RS (v8 & 4) branches are structurally identical (only the variant ptr differs). Finalize: CloneBackendConfigProto((xla*)v44, …) then BackendConfigWrapper::operator=((char*)v53 + 104, …) = [hlo+0x68]. The success VLOG string "backend_config_util::AddCollectivePhysicalCoreIndices( inst, sparse_core_ids) is OK" is present in .rodata.
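As a rough model of the fill, the sketch below reproduces the clear, the per-element long→int32 truncation, and the has-bit handling on a plain dictionary. The dictionary standing in for the proto variant and both function names are assumptions; growth (GrowNoAnnotate) and the backend-config clone/write-back are not modeled.

```python
def truncate_to_int32(value: int) -> int:
    """Keep the low 32 bits of a 64-bit value and reinterpret them as signed."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low

def fill_physical_core_indices(variant: dict, chosen: list) -> None:
    """Hypothetical model of the AR/AG/RS fill on a dict-shaped variant."""
    variant["physical_core_indices"] = []   # current_size = 0
    variant["has_bits"] &= ~1               # clear the presence bit
    for core_id in chosen:
        variant["physical_core_indices"].append(truncate_to_int32(core_id))
        variant["has_bits"] |= 1            # set inside the loop, as in the pseudocode

variant = {"has_bits": 1, "physical_core_indices": [9, 9]}  # stale contents
fill_physical_core_indices(variant, [0, 2, 5])
```

Because the presence bit is set inside the loop, an empty Span leaves the field cleared with the bit off in this model.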
3. The read-back — GetPhysicalCoreIndices (@0x1c8692e0)
GetPhysicalCoreIndices(const HloInstruction* hlo) → StatusOr<InlinedVector<long,4>>:
bc = GetBackendConfig(hlo)
if !bc → MakeErrorImpl<13>("No backend config found", line 523)
if !(disc & 2 present) → MakeErrorImpl<13>("No collective offload config found", line 527)
select the SAME variant by the SAME discriminant bits (§1.1)
if variant absent / field empty → MakeErrorImpl<13>("No physical core indices found", line 565)
count = [variant+0x1c]
Rep* = [variant+0x20] (or inline [variant+0x18] when heap-bit clear)
for each int32 at Rep+0x8 + i*4: widen int32 → long (vpmovsxdq, 4-wide)
build InlinedVector<long,4>: count <= 4 → inline; >= 9 → exact heap; 5..8 → cap-8 heap
return StatusOr OK { sret+0=1, sret+8=data, sret+0x10=size, sret+0x18=cap }
The reader is the exact inverse of the writer: same oneof variant select, same [+0x1c]/[+0x20] field offsets, widening each stored int32 back to a long. The result type is an absl::InlinedVector<long,4>: counts ≤ 4 stay in the inline buffer, and larger lists go to a heap allocation (_size_returning_new). All three error paths are MakeErrorImpl<13> in backend_config_util.cc; 13 is absl::StatusCode::kInternal, the same numeric value as gRPC's INTERNAL. The reader's consumer is CheckCoreAssignmentConsistency (@0x1c869cc0), which re-validates that the placed cores agree across the collective's instructions.
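A minimal sketch of the two reader-side steps, the int32→long widening and the InlinedVector<long,4> storage choice, is given below. The thresholds are the ones in the pseudocode above; the function names and the (kind, capacity) return shape are assumptions of this sketch, and Python integers stand in for 64-bit longs.

```python
import struct

INLINE_CAPACITY = 4  # absl::InlinedVector<long, 4>

def widen_int32_to_long(raw: bytes) -> list:
    """Sign-extend packed little-endian int32 values (the per-lane effect of vpmovsxdq)."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}i", raw[: count * 4]))

def storage_for(count: int) -> tuple:
    """Return (storage kind, capacity) following the count split described above."""
    if count <= INLINE_CAPACITY:
        return ("inline", INLINE_CAPACITY)
    if count >= 9:
        return ("heap", count)   # exact-size heap allocation
    return ("heap", 8)           # 5..8 elements: capacity-8 heap allocation

stored = struct.pack("<3i", 0, 2, -1)
widened = widen_int32_to_long(stored)
```

The negative element is included only to show that the widening is sign-extending; core IDs written by the fill are expected to be non-negative.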
The three error strings + source line numbers read byte-exact from the decompile: "No backend config found" (line 523), "No collective offload config found" (line 527), "No physical core indices found" (line 565), all via absl::status_internal::MakeErrorImpl<13>(…, "platforms/xla/service/jellyfish/lowering/backend_config_util.cc"). The int32→long widen is the vpmovsxdq 4-wide unrolled loop; the cmp v12,5 (count<5 inline) / v12 >= 9 (exact) split selects the InlinedVector<long,4> storage. The OK sret layout ([+0]=1 tag, [+8] data, [+0x10] size, [+0x18] cap) is byte-confirmed.
4. The Span producer — where the chosen cores come from
The Span the fill copies is built one step upstream by SparseCoreQueueAssignment::AssignQueueIDsToAsyncStart (@0x10fdf480). The full selection policy — GetAllowedCores, the five-phase SelectCores greedy filter, the cost tie-break — is owned by SC Core-Selection (Offload) and SC Core Selection; here is only the hand-off into this page's fill:
AssignQueueIDsToAsyncStart(hlo):
mega = GetMegaChipParallelism(hlo) (@0x1c867b00) // StatusOr<InlinedVector<long,4>>
split_axis0 = mega[0] >> 1 // megacore: 2 cores per chip
allowed = GetAllowedCores(hlo) (@0x10fda3c0) // btree_set<long> candidate pool
chosen = SelectCores(hlo, allowed, …) (@0x10fdc4e0) // unsorted, {phase, cost} order
__sort(chosen) (@0x10fdfde7) // ASCENDING numeric core ID
AddCollectivePhysicalCoreIndices(hlo, Span{chosen.data, chosen.size}) (@0x10fdff30)
// the SAME chosen array is ALSO appended to MegaChipParallelismConfig (repeated long, loop @0x10fdfe80)
The critical observation for this page: SelectCores returns the cores in build order ({same-ND-plane, data-dep, assignment-group, not-different-plane, fallback} × ascending cost), but the caller numerically __sorts the list before the fill. Therefore physical_core_indices is always stored in ascending physical-core-ID order, not in selection-priority order — the selection order is consumed internally and is not visible in the proto.
The sibling producer SparseCoreCollectiveOffload::OffloadCollective (@0x10fc75e0) passes the same chosen Span to both SetMegaChipParallelism (@0x1c867680, call site @0x10fc86f1) and AddCollectivePhysicalCoreIndices (call site @0x10fc871c), confirming that the placement list and the mega-chip parallelism list are one and the same array.
[CONFIRMED: symbol + cross-call] AssignQueueIDsToAsyncStart, GetMegaChipParallelism, SetMegaChipParallelism, and OffloadCollective are present in *_functions.json; the AddCollectivePhysicalCoreIndices call site @0x10fdff30 (and @0x10fc871c) and the __sort @0x10fdfde7 are byte-traced. [LOW] The exact per-core numeric ID SelectCores assigns (the logical-color → physical-core bijection / tie-break arithmetic) is the selection policy's concern and is documented (as a 5-phase greedy filter, not a closed-form scorer) on the core-selection pages; this page only needs that the output is the sorted ID set.
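The hand-off can be illustrated with a very small sketch: the selection-order list is sorted numerically, and the same sorted list feeds both consumers. The function name `hand_off` and the plain lists standing in for the Span and the repeated-long field are assumptions; the selection itself is not modeled.

```python
def hand_off(chosen_in_selection_order: list, mega_chip_parallelism: list) -> list:
    """Model of the caller-side hand-off: sort, then feed both consumers."""
    ordered = sorted(chosen_in_selection_order)   # models the caller's __sort
    mega_chip_parallelism.extend(ordered)         # appended to the repeated-long field
    return ordered                                # the Span passed to the fill

mega = []
physical_core_indices = hand_off([5, 1, 3], mega)
```

After the call, both lists hold the IDs in ascending order, so the priority order in which the cores were selected can no longer be recovered from either of them.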
5. The tensor_split_mode==2 per-core emission
When the offload substrate adopts the split-tensor mode (tensor_split_mode == 2, the "Adopting split tensor mode." path gated upstream — see Tensor-Split / ND-Plane), a collective's per-color rings are emitted as per-SC-core partial-tensor ops. The grouping that decides which colors collapse into one per-core op is keyed by TensorSplitPerCoreClassifier.
5.1 TensorSplitPerCoreClassifier (@0x13379de0)
TensorSplitPerCoreClassifier(UniDirRingStrategy* s) → long:
split_bit = (byte[s+0x43] != 0) // the tensor-split / per-core bool, bit0
switch (s->ici_dim() & 0x1ff): // ici_dim() = vtable+0x68 = byte[s+0x80] | 0x100
0x100 IciDim::kX → return 0 | split_bit
0x101 IciDim::kY → return 2 | split_bit
0x102 IciDim::kZ → return 4 | split_bit
0x103 IciDim::kCoresOnChip → d2d = dynamic_cast<D2DUniDirRingStrategy*>(s)
CHECK d2d != nullptr → FATAL streams "Expected D2DUniDirRingStrategy if the ici_dim is kCoresOnChip" (.cc:156)
c = d2d->core_classifier_ ([d2d+0x90])
CHECK c >= 0 "core_classifier_ >= 0" (.h:375)
return c // NO `| split_bit` on this path
default → FATAL CHECK "strategy->ici_dim() == IciDim::kCoresOnChip" (.cc:153)
The classifier returns a per-SC-core key. For the torus-axis dims (kX → 0, kY → 2, kZ → 4 — which axis the per-core partial-tensor ring iterates) the low bit is OR'd in with the tensor-split flag (class | split_bit). For kCoresOnChip (the megacore cross-core split — the split-2 datapath proper) the classifier returns the raw D2DUniDirRingStrategy::core_classifier_ count directly, without the split_bit OR (return v9; in the decompile). The kCoresOnChip D2D strategy is the SC analog of the dense TensorCore megacore data-split.
The decompile of @0x13379de0 matches byte-exact: v3 = *((_BYTE*)this + 67) != 0 ([+0x43]); the ici_dim() virtual via *(_QWORD*)this + 104LL (vtable +0x68) masked & 0x1FF; comparisons 0x100 → v4=0, 0x101 → v4=2, 0x102 → v4=4, each reaching LABEL_5: LOBYTE(v2)=v3; return v4 | v2 (the class | split_bit OR); 0x103 → _dynamic_cast(this, typeinfo UniDirRingStrategy, typeinfo D2DUniDirRingStrategy, 0) then v9 = v11[18] (= [+0x90], 18×8) with CHECK v9 >= 0, returning v9 directly (the kCoresOnChip path does not OR in split_bit). The three diagnostic strings + source lines are byte-exact: "strategy->ici_dim() == IciDim::kCoresOnChip" (offload_collective_strategies.cc:153); the D2D null-check is CHECK(d2d_strategy != nullptr) at .cc:156 streaming "Expected D2DUniDirRingStrategy if the ici_dim is kCoresOnChip"; "core_classifier_ >= 0" (offload_collective_strategies.h:375, streaming "Color id is not set.").
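The classifier logic can be mirrored in a few lines of Python. The key values, the `| 0x100` engaged bit, and the check messages are taken from the pseudocode above; the `RingStrategy` dataclass, using `None` to model a failed dynamic_cast, and raising exceptions in place of FATAL CHECKs are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

K_X, K_Y, K_Z, K_CORES_ON_CHIP = 0x100, 0x101, 0x102, 0x103  # ici_dim() values
AXIS_CLASS = {K_X: 0, K_Y: 2, K_Z: 4}

@dataclass
class RingStrategy:
    """Hypothetical stand-in for UniDirRingStrategy / D2DUniDirRingStrategy."""
    ici_dim_byte: int                       # byte[s+0x80]
    tensor_split: bool                      # byte[s+0x43]
    core_classifier: Optional[int] = None   # [d2d+0x90]; None = not a D2D strategy

    def ici_dim(self) -> int:
        return self.ici_dim_byte | 0x100    # AutoOr-engaged bit

def tensor_split_per_core_classifier(s: RingStrategy) -> int:
    dim = s.ici_dim() & 0x1FF
    if dim in AXIS_CLASS:
        return AXIS_CLASS[dim] | int(s.tensor_split)   # class | split_bit
    if dim != K_CORES_ON_CHIP:
        raise RuntimeError("strategy->ici_dim() == IciDim::kCoresOnChip")
    if s.core_classifier is None:
        raise RuntimeError(
            "Expected D2DUniDirRingStrategy if the ici_dim is kCoresOnChip")
    if s.core_classifier < 0:
        raise RuntimeError("core_classifier_ >= 0")
    return s.core_classifier                           # no split_bit OR on this path

keys = [
    tensor_split_per_core_classifier(RingStrategy(0, True)),      # kX, split
    tensor_split_per_core_classifier(RingStrategy(1, False)),     # kY, no split
    tensor_split_per_core_classifier(RingStrategy(2, True)),      # kZ, split
    tensor_split_per_core_classifier(RingStrategy(3, True, 2)),   # kCoresOnChip
]
```

Note that in this model a kCoresOnChip strategy with a stored value of 2 produces the same numeric key as a kY strategy without the split flag; whether the real callers ever compare keys across those two cases is not established here.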
5.2 The UniDirRingStrategy field map
The strategy fields the classifier reads (from the ImplicitUniDirRingStrategy ctor @0x1339bca0 base and the D2DUniDirRingStrategy ctor @0x1339ba60 override):
| Offset | Field | Notes |
|---|---|---|
| [s+0x42] | bool | ctor bool #1 |
| [s+0x43] | bool | the tensor-split / per-core flag the classifiers read |
| [s+0x44] | byte RingDir | |
| [s+0x80] | byte IciDim | 0 kX / 1 kY / 2 kZ / 3 kCoresOnChip; D2D ctor hardwires 3 |
| [s+0x88] | long core_count | the combine key (CanBeCombined); D2D hardwires 2 |
| [s+0x90] | base bool · D2D long core_classifier_ | the per-core split count; D2D ctor CHECKs target.LogicalDevicesPerChip() == 2 (.h:361) |
| [s+0x91] | bool | |
The D2D vtable (@0x21908db0, vptr at +0x10) resolves slot +0x30 → core_count (@0x13399000 = qword[s+0x88]) and slot +0x68 → ici_dim (@0x13399020 = byte[s+0x80] | 0x100). The | 0x100 is the AutoOr<IciDim>-engaged bit (the same AutoOr packing the offload config builder uses for HierarchicalKind — see SC-Offload Config Builder §3).
[CONFIRMED: vtable + ctor] D2DUniDirRingStrategy is present in *_functions.json; the vtable slot relocations (+0x30 → 0x13399000, +0x68 → 0x13399020) and the D2D ctor hardwires ([+0x80]=3, [+0x88]=2, [+0x90]=trailing long arg, and the FATAL CHECK(target.LogicalDevicesPerChip() == 2) at offload_collective_strategies.h:361) are byte-traced. [LOW] Which caller computes the core_classifier_ long (i.e. whether it equals tensor_split_factor, NumScOffloadDevices/LDPC, or the megacore 2-core count) was not traced to the AllReduceUnidirNdStrategy::TryCreate lambda that constructs the D2D strategy; the classifier returns the stored value directly, and its provenance is a tensor-split-factor concern (Tensor-Split / ND-Plane).
5.3 The combine consumer — CanBeCombined (@0x13379d60)
CanBeCombined(a, b) → bool:
return a->core_count() == b->core_count() // vtable +0x30 = [s+0x88]
&& !FLAGS_xla_tpu_impure_coff_never_combine_colors_test_only
The classifier key from §5.1 plus CanBeCombined are what the DefaultStrategyCombiner<*> family (e.g. @0x133a4660 / @0x133a3c60) and MoreRelaxedStrategyCombiner use to merge per-color UniDirRingStrategy objects into one per-SC-core partial-tensor emission: colors with the same classifier key and equal core_count (and the never-combine flag off) collapse together. The simpler sibling DimPerCoreClassifier (@0x13379dc0) keys on the tensor-split bool alone (byte[s+0x43] != 0).
CanBeCombined @0x13379d60: v3 = (*(this->vtable+48))(this, a2, a3) compared to (*(a2->vtable+48))(a2) (the core_count virtual at +0x30), gated by !FLAGS_xla_tpu_impure_coff_never_combine_colors_test_only (flag-impl cached at qword_2231E730, read via FlagImpl::ReadOneBool). DimPerCoreClassifier @0x13379dc0 is literally return *((_BYTE*)this + 67) != 0; (= byte[+0x43]). Both byte-exact. The CanBeCombined overloads for SinglePhaseRSTransferStrategy / SinglePhaseAGTransferStrategy are present as separate symbols. [LOW] The DefaultStrategyCombiner<*> merge-loop body (how the key actually groups colors and which combined IciStrategyRingConfig the merged per-SC-core op emits) was confirmed by call-edge to CanBeCombined / the classifier but not expanded to the per-field emission.
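The grouping rule stated in this section (same classifier key, equal core_count, never-combine flag off) can be sketched as below. This is only a paraphrase of that rule: the real DefaultStrategyCombiner merge loop is not decoded, so `group_colors`, its tuple input, and the first-seen output order are all assumptions.

```python
from collections import defaultdict

def can_be_combined(core_count_a: int, core_count_b: int,
                    never_combine: bool = False) -> bool:
    """Model of CanBeCombined: equal core_count and the test-only flag off."""
    return core_count_a == core_count_b and not never_combine

def group_colors(colors, never_combine: bool = False) -> list:
    """Group colors that share (classifier key, core_count).

    `colors` is a list of (color_id, classifier_key, core_count) tuples;
    this shape is hypothetical, chosen for the sketch.
    """
    if never_combine:
        return [[color_id] for color_id, _, _ in colors]
    groups = defaultdict(list)
    for color_id, key, core_count in colors:
        groups[(key, core_count)].append(color_id)
    return list(groups.values())   # dicts preserve first-seen order

colors = [(0, 1, 4), (1, 1, 4), (2, 2, 4), (3, 1, 8)]
merged = group_colors(colors)
isolated = group_colors(colors, never_combine=True)
```

In this example colors 0 and 1 collapse into one group, while color 2 (different key) and color 3 (different core_count) stay separate; with the flag on, every color remains on its own.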
6. Placement + split-mode-2 — relationship table
| Quantity | Source | Role |
|---|---|---|
| chosen physical cores (sorted) | SelectCores @0x10fdc4e0 → __sort (caller) | the array written to physical_core_indices |
| physical_core_indices write | AddCollectivePhysicalCoreIndices @0x1c868500 + …Helper @0x1c868920 | verbatim Span→RepeatedField<int32> copy |
| index-array layout | [variant+0x1c]=size · [+0x20]=Rep* (data Rep+0x8) | proto2 RepeatedField<int32> (field 4) |
| read-back | GetPhysicalCoreIndices @0x1c8692e0 | StatusOr<InlinedVector<long,4>> (int32→long) |
| read-back consumer | CheckCoreAssignmentConsistency @0x1c869cc0 | cross-instruction core agreement |
| shared list | MegaChipParallelismConfig repeated long (@0x10fdfe80) | same chosen array → mega-chip parallelism |
| per-core class key | TensorSplitPerCoreClassifier @0x13379de0 | axis: (axis 0/2/4) \| split bool; kCoresOnChip: raw D2D core_classifier_ |
| ici_dim | byte[s+0x80] \| 0x100 (vtable +0x68) | kX / kY / kZ / kCoresOnChip |
| D2D per-core split | D2DUniDirRingStrategy::core_classifier_ [s+0x90] (LDPC(SC)==2) | megacore cross-core split count |
| combine key | core_count() [s+0x88] (vtable +0x30) | CanBeCombined grouping |
7. Verification notes
Cross-checked against the IDA decompile of libtpu.so v0.0.40 (build-id 89edbbe8…):
- Writer (…Helper @0x1c868920): GetBackendConfig → v43 == 1 gate; the CollectiveOffloadConfig oneof discriminant v8 = *((_DWORD*)v6+4) ([cfg+0x10]) with bits 2/1/4/8/0x10 and variant ptrs *((_QWORD*)v6+4)/+3/+5 (= [+0x20]/[+0x18]/[+0x28]); the clear ([+0x1c]=0, [+0x10]&=~1) + the verbatim long→int32 copy loop (GrowNoAnnotate @0xe68d9e0, store [Rep+0x8+i*4], [+0x1c]++, [+0x10]|=1); the CloneBackendConfigProto + BackendConfigWrapper::operator=([hlo+0x68]) write-back. All byte-exact.
- Reader (GetPhysicalCoreIndices @0x1c8692e0): the same oneof bits; Rep*=[+0x20], count=[+0x1c], data Rep+0x8; vpmovsxdq int32→long; the cmp 5 inline / >= 9 exact heap InlinedVector<long,4> split; the three MakeErrorImpl<13> strings at lines 523/527/565; the OK sret layout +0/+8/+0x10/+0x18. All byte-exact.
- Field-offset agreement: writer [+0x1c]/[+0x10] and reader [+0x1c]/[+0x18]/[+0x20] independently confirm the RepeatedField<int32> at [variant+0x18..+0x20].
- Classifier (TensorSplitPerCoreClassifier @0x13379de0): byte[+0x43] read; ici_dim() vtable +0x68 masked & 0x1FF; 0x100/0x101/0x102 → 0/2/4; 0x103 → dynamic_cast<D2DUniDirRingStrategy> (CHECK != nullptr, .cc:156) + [+0x90] core_classifier_ (CHECK >= 0); default FATAL. The axis paths return axis | split_bit; the kCoresOnChip path returns the raw core_classifier_ (return v9;) with no split_bit OR. Source lines 153/156 (.cc) and 375 (.h) byte-exact.
- Combine (CanBeCombined @0x13379d60, DimPerCoreClassifier @0x13379dc0): core_count() equality via vtable +0x30, gated by FLAGS_xla_tpu_impure_coff_never_combine_colors_test_only; DimPerCoreClassifier = byte[+0x43] != 0. Both byte-exact.
- Symbols: AddCollectivePhysicalCoreIndices, …Helper (+ $_0<Ragged…> / $_0<AllToAll…> lambdas), GetPhysicalCoreIndices, AssignQueueIDsToAsyncStart, OffloadCollective, GetMegaChipParallelism, SetMegaChipParallelism, D2DUniDirRingStrategy, CheckCoreAssignmentConsistency, and the CanBeCombined overloads are all present in *_functions.json.

[LOW] Confirmed by structure / call-edge but not fully numeric-decoded:
- The SelectCores per-candidate placement arithmetic (the logical-color → physical-core bijection / tie-break), owned by the core-selection pages; here only the sorted output matters.
- The D2DUniDirRingStrategy::core_classifier_ construction site (which caller computes the stored long, and whether it equals tensor_split_factor / NumScOffloadDevices-derived / the megacore 2-core count); the classifier returns it directly.
- The DefaultStrategyCombiner<*> merge-loop body (how the classifier key groups colors into the emitted per-SC-core IciStrategyRingConfig).
Cross-References
SparseCore-offload placement & selection
- SC Core-Selection (Offload): GetAllowedCores candidate mask + the cost/resource model that feeds SelectCores
- SC Core Selection: the five-phase SelectCores greedy filter that produces the chosen-core list this page copies
- Tensor-Split / ND-Plane: tensor_split_factor / NumScOffloadDevices + NDPlaneInfo, the gate that selects tensor_split_mode==2
Config + substrate
- SC-Offload Config Builder: the *OffloadConfig struct family carrying physical_core_indices, and the AutoOr packing the IciDim read mirrors
- On-Pod Collectives, Section Map: the substrate split and the SC-offload gate
Sibling subsystems
- HierarchicalKind: the AutoOr<bool> flat-vs-hierarchical split the offload builder dispatches on