Physical-Core Placement

Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00; .text VMA == file offset 0xe63c000, .rodata VMA == file offset 0x84a0000).

Status: Reimplementation-grade.

Evidence grade: Confirmed (byte-anchored). The physical_core_indices write path (AddCollectivePhysicalCoreIndicesHelper), the read-back (GetPhysicalCoreIndices), and the tensor_split_mode==2 classifier (TensorSplitPerCoreClassifier) were each cross-checked against the IDA decompile of all three functions; two residual sub-layouts are marked [LOW] below.

Part XIII: On-Pod Collectives & Barriers / SparseCore-offload collectives

Abstract

This page documents the physical-core placement datapath of a SparseCore-offloaded embedding collective: how a collective's logical colors land on a concrete set of physical SparseCore (SC) cores, and how the per-core tensor-split (tensor_split_mode == 2) emission is keyed. It owns three byte-exact mechanisms:

  1. the physical_core_indices fill: AddCollectivePhysicalCoreIndices (@0x1c868500), a recursive async/fusion walker, and its writer AddCollectivePhysicalCoreIndicesHelper (@0x1c868920), which copies the chosen-core absl::Span<long const> verbatim (truncated long → int32) into the per-collective *OffloadConfig variant's repeated int32 physical_core_indices (proto field 4);
  2. the physical_core_indices index-array layout + read-back: the proto2 RepeatedField<int32> at [variant+0x18..+0x20], read back (widened to long) by GetPhysicalCoreIndices (@0x1c8692e0) into a StatusOr<absl::InlinedVector<long,4>>;
  3. the tensor_split_mode==2 per-core emission key: TensorSplitPerCoreClassifier (@0x13379de0), which maps a per-color UniDirRingStrategy to a classification key, (axis class code 0/2/4) | (tensor-split bool) for the torus axes or the raw D2DUniDirRingStrategy::core_classifier_ count (no split_bit OR) for kCoresOnChip, and the CanBeCombined (@0x13379d60) predicate that merges per-color strategies into one per-SC-core partial-tensor op.

The core-selection policy (which cores are eligible and the order in which SelectCores builds the chosen list) is not owned here; it lives on SC Core-Selection (Offload) and SC Core Selection. The ND-plane / tensor-split-factor derivation lives on Tensor-Split / ND-Plane. This page picks up at the moment the chosen-core Span exists and follows it into the proto, and picks up the split-mode-2 emission at the classifier that keys it.

Contract of the placement datapath as observed in the binary:

  • physical_core_indices is the sorted set of chosen physical SC core IDs produced by SparseCoreQueueAssignment::SelectCores, numerically __sorted by the caller AssignQueueIDsToAsyncStart (@0x10fdf480), then handed to AddCollectivePhysicalCoreIndices as an absl::Span<long const>.
  • The fill is purely a copy: AddCollectivePhysicalCoreIndicesHelper truncates each long to int32 and appends it to the variant's RepeatedField<int32> — there is no per-element transformation, scaling, or re-indexing.
  • The same Span drives two consumers: it is copied into physical_core_indices and appended to the MegaChipParallelismConfig repeated-long field — the placement and the mega-chip parallelism share one selected-core list.
  • The walker is recursive over fusion bodies: an async-wrapped fusion's called_computations are walked, and every sub-collective opcode in {6, 9, 12, 86, 93} receives the same Span.
  • TensorSplitPerCoreClassifier returns a per-SC-core grouping/sort discriminant, not an instruction; for IciDim::kCoresOnChip the discriminant is the D2DUniDirRingStrategy::core_classifier_ count — the megacore cross-core split (LogicalDevicesPerChip(SparseCore) == 2), the SC analog of the dense TensorCore megacore data-split.

At a glance

| Aspect | Value (byte-anchored) |
| --- | --- |
| Walker (recursive) | AddCollectivePhysicalCoreIndices @0x1c868500 → bool |
| Writer | …Helper @0x1c868920 (anon-ns) → bool |
| Reader | GetPhysicalCoreIndices @0x1c8692e0 → StatusOr<InlinedVector<long,4>> |
| Producer (Span) | AssignQueueIDsToAsyncStart @0x10fdf480 (call @0x10fdff30) |
| Sibling producer | OffloadCollective @0x10fc75e0 (SetMegaChipParallelism + AddCollective…) |
| Index array | proto2 RepeatedField<int32> at [variant+0x18] flags / [+0x1c] size / [+0x20] Rep* (data at Rep+0x8) |
| Oneof variant ptrs | AR [CollectiveOffloadConfig+0x18] · AG [+0x20] · RS [+0x28] (discriminant dword [+0x10]) |
| Walker opcode set | {6, 9, 12, 86, 93} (jump table @0xb438098, mask 0x1240 + explicit cmp 0x56/0x5d) |
| Read-back error lines | 523 / 527 / 565 (backend_config_util.cc), all MakeErrorImpl<13> (INTERNAL) |
| Split-2 classifier | TensorSplitPerCoreClassifier @0x13379de0 → axis 0/2/4 \| byte[s+0x43]; kCoresOnChip → raw D2D core_classifier_ (no OR) |
| ici_dim() | vtable +0x68 = byte[s+0x80] \| 0x100 (AutoOr-engaged IciDim) |
| Combine key | core_count() (vtable +0x30 = qword[s+0x88]) via CanBeCombined @0x13379d60 |
| Reader consumer | CheckCoreAssignmentConsistency @0x1c869cc0 |

1. The physical_core_indices field — array layout

physical_core_indices is proto field 4, a repeated int32, carried inside each per-collective *OffloadConfig variant (AllGatherOffloadConfig / AllReduceOffloadConfig / ReduceScatterOffloadConfig), which are byte-identical (see SC-Offload Config Builder §4). The field is a standard proto2 RepeatedField<int32> and occupies four offsets inside the variant:

| Offset (rel. variant) | Field | Meaning |
| --- | --- | --- |
| [variant+0x10] | int32 has-bits | bit0 = physical_core_indices present |
| [variant+0x18] | RepeatedField flags \| arena tag | low bit = "has heap Rep" |
| [variant+0x1c] | int32 current_size | element count |
| [variant+0x20] | Rep* | element data starts at Rep+0x8 (int32[]) |

Both the writer and the reader agree on this layout. In …Helper @0x1c868920 the AR fill clears *(_DWORD*)(v9+28) (= [+0x1c] size) and *(_BYTE*)(v9+16) &= ~1u (= [+0x10] has-bit), then per element stores v16[(int)v14 + 2] = v17 (= Rep+0x8 + idx*4), sets *(_DWORD*)(v9+28) = size+1, and *(_BYTE*)(v9+16) |= 1u. In GetPhysicalCoreIndices @0x1c8692e0 the read picks Rep* from *(_QWORD*)(v9+32) ([+0x20]) when the heap-bit *(_BYTE*)(v9+24) & 1 ([+0x18]) is set, count from *(int*)(v9+28) ([+0x1c]), and reads element data from Rep+0x8.

NOTE — [variant+0x28] is a separate scalar that the ctor zeroes (*(_DWORD*)(v_variant+0x28)) and the fill never touches; it is not part of the physical_core_indices repeated field, which occupies [+0x18..+0x20] as tabulated above.
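The layout above can be exercised with a small Python sketch. It is a toy model under stated assumptions: memory is a single little-endian bytearray, "pointers" are offsets into that buffer, and only the heap-Rep case is handled. The function name read_repeated_int32 and the buffer layout are hypothetical; only the field offsets come from this page.

```python
import struct

# Offsets relative to the variant, taken from the table in this section.
FLAGS, SIZE, REP_PTR = 0x18, 0x1C, 0x20
REP_DATA = 0x8  # element data starts at Rep+0x8


def read_repeated_int32(mem, variant):
    """Decode physical_core_indices from a flat little-endian memory image.

    Toy model: "pointers" are offsets into `mem`, and only the heap-Rep case
    (low bit of the flags byte set) is handled.
    """
    flags = mem[variant + FLAGS]
    if not flags & 1:
        raise NotImplementedError("inline storage is not modelled in this sketch")
    (size,) = struct.unpack_from("<i", mem, variant + SIZE)
    (rep,) = struct.unpack_from("<Q", mem, variant + REP_PTR)
    return list(struct.unpack_from("<%di" % size, mem, rep + REP_DATA))


# Build a toy image: variant at offset 0, Rep at offset 0x40, cores [1, 5, 9].
mem = bytearray(0x80)
mem[FLAGS] = 1                                    # heap-Rep bit
struct.pack_into("<i", mem, SIZE, 3)              # current_size
struct.pack_into("<Q", mem, REP_PTR, 0x40)        # Rep* (an offset in this model)
struct.pack_into("<3i", mem, 0x40 + REP_DATA, 1, 5, 9)
cores = read_repeated_int32(mem, 0)
```

The same three reads (flags byte, size dword, Rep qword) are what the reader performs at [+0x18], [+0x1c], and [+0x20].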

1.1 The oneof variant select

physical_core_indices is not a top-level *OffloadConfig field reached directly — it is reached through the CollectiveOffloadConfig oneof wrapper. Both the writer and the reader select the variant by the discriminant dword at [CollectiveOffloadConfig+0x10]:

| Discriminant bit | Variant | Variant ptr (rel. CollectiveOffloadConfig) |
| --- | --- | --- |
| & 0x1 | AllReduceOffloadConfig | [+0x18] (*((_QWORD*)cfg+3)) |
| & 0x2 | AllGatherOffloadConfig | [+0x20] (*((_QWORD*)cfg+4)) |
| & 0x4 | ReduceScatterOffloadConfig | [+0x28] (*((_QWORD*)cfg+5)) |
| & 0x8 | RaggedAllToAllOffloadConfig | mutable_ragged_all_to_all_offload_config |
| & 0x10 | AllToAllOffloadConfig | mutable_all_to_all_offload_config |

The two *_to_all variants are handled by a templated lambda $_0<RaggedAllToAllOffloadConfig> / $_0<AllToAllOffloadConfig> rather than the inline AR/AG/RS loop, but the field layout (proto2 RepeatedField<int32> at the same relative offsets) is identical.

In …Helper, the discriminant is v8 = *((_DWORD*)v6 + 4) (= [cfg+0x10]), tested & 2 (AG, ptr *((_QWORD*)v6+4)), & 1 (AR, *((_QWORD*)v6+3)), & 4 (RS, *((_QWORD*)v6+5)), & 8 (ragged → mutable_ragged_all_to_all_offload_config(v6) + $_0), & 0x10 (a2a → mutable_all_to_all_offload_config(v6) + $_0). GetPhysicalCoreIndices selects the identical bits (v60 & 2 gates the "no collective offload config" error; then the same & 2 / & 1 / & 4 / & 8 / & 0x10).
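As an illustration of the select described above, the following Python sketch maps a discriminant dword to a variant. It assumes first-match semantics in the test order listed for the writer (AG, AR, RS, ragged, a2a), which this page does not state explicitly; select_variant and VARIANT_TESTS are hypothetical names.

```python
# (bit, variant name, pointer offset inside CollectiveOffloadConfig).
# The offset is None where the variant is reached through a mutable_* accessor.
# Order follows the test order this page lists for the writer.
VARIANT_TESTS = [
    (0x02, "AllGatherOffloadConfig", 0x20),
    (0x01, "AllReduceOffloadConfig", 0x18),
    (0x04, "ReduceScatterOffloadConfig", 0x28),
    (0x08, "RaggedAllToAllOffloadConfig", None),
    (0x10, "AllToAllOffloadConfig", None),
]


def select_variant(discriminant):
    """Return (name, ptr_offset) for the first matching bit, else None.

    First-match-wins is an assumption of this sketch.
    """
    for bit, name, ptr_offset in VARIANT_TESTS:
        if discriminant & bit:
            return name, ptr_offset
    return None
```

For example, select_variant(0x4) yields the ReduceScatterOffloadConfig entry with pointer offset 0x28, matching the table in this section.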


2. The fill path — AddCollectivePhysicalCoreIndices + …Helper

2.1 The recursive walker (@0x1c868500)

AddCollectivePhysicalCoreIndices(HloInstruction*, absl::Span<long const>) → bool is not the writer; it is a recursive collective-walker that forwards the same Span to every sub-collective:

AddCollectivePhysicalCoreIndices(hlo, Span indices):
  CHECK(hlo->IsAsynchronous())                    (@0x1e592520)            // FATAL "inst->IsAsynchronous()" (.cc:433)
  wrapped = hlo->async_wrapped_instruction()       (@0x1e5aa300)
  if opcode_byte[wrapped+0xc] != 0x1b (fusion)                              → return 1
  for each instr in wrapped->called_computations() (@0x1e5885a0):
     switch opcode_byte[instr+0xc]  (jump table @0xb438098, opcode-6 index):
        6 / 9 / 12  → AddCollectivePhysicalCoreIndicesHelper(instr, indices)   // mask 0x1240
        0x56 (86)   → AddCollectivePhysicalCoreIndicesHelper(instr, indices)   // explicit cmp
        0x5d (93)   → AddCollectivePhysicalCoreIndicesHelper(instr, indices)   // explicit cmp
        0x28 (40)   → recurse into instr->called_computations()
        other       → skip
  return 1

The opcode set {6, 9, 12, 86, 93} is the SparseCore async-collective plus all-reduce / all-gather / reduce-scatter / all-to-all family. Opcode 0x28 (fusion) triggers a recursive descent, so a nested fusion body is fully walked. The walk is depth-first; the same Span reaches every matched leaf.

[CONFIRMED — symbol + structure] The walker symbol …AddCollectivePhysicalCoreIndicesEPNS_14HloInstructionEN4absl4SpanIKlEE @0x1c868500 is present in *_functions.json, with its two source call sites in AssignQueueIDsToAsyncStart (@0x10fdff30) and OffloadCollective (@0x10fc871c). The entry is guarded by a FATAL CHECK(inst->IsAsynchronous()) (backend_config_util.cc:433) — a non-async input aborts, it is not a silent no-op. The jump-table form (@0xb438098, 0x23 dwords, opcode−6 index; mask 0x1240 = bits 6/9/12; explicit cmp eax, 0x56 / 0x5d; recurse target for 0x28) is byte-traced from the disassembly. The Helper call forwards the same (rsi=data, rdx=count) pair to each matched instruction.
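The walk can be modelled on a toy instruction tree. The sketch below is not the binary's code: Instr is a hypothetical stand-in for HloInstruction, each called computation is flattened into a plain list, and the Helper call is replaced by recording the indices on the matched leaf. The opcode constants are the ones quoted on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LEAF_MASK = 0x1240            # bits 6, 9, 12
LEAF_EXPLICIT = {0x56, 0x5D}  # the two explicit compares
RECURSE_OPCODE = 0x28         # descend into called computations
GATE_OPCODE = 0x1B            # wrapped instruction must have this opcode


@dataclass
class Instr:  # hypothetical stand-in for HloInstruction
    opcode: int
    body: List["Instr"] = field(default_factory=list)
    received: Optional[List[int]] = None


def is_leaf(opcode):
    """True for the opcode set {6, 9, 12, 86, 93}."""
    in_mask = 0 <= opcode < 16 and (LEAF_MASK >> opcode) & 1 == 1
    return in_mask or opcode in LEAF_EXPLICIT


def walk(body, indices):
    """Depth-first walk; every matched leaf gets the same index list."""
    for instr in body:
        if is_leaf(instr.opcode):
            instr.received = list(indices)  # stands in for the Helper call
        elif instr.opcode == RECURSE_OPCODE:
            walk(instr.body, indices)


def add_indices(wrapped, indices):
    if wrapped.opcode == GATE_OPCODE:
        walk(wrapped.body, indices)
    return True


nested = Instr(0x28, [Instr(0x5D), Instr(3)])
wrapped = Instr(0x1B, [Instr(9), nested, Instr(7)])
add_indices(wrapped, [0, 2])
```

After the call, the opcode-9 leaf and the nested opcode-0x5d leaf both hold [0, 2], while opcodes 3 and 7 are skipped.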

2.2 The writer (…Helper @0x1c868920)

AddCollectivePhysicalCoreIndicesHelper(hlo, Span data, count) → bool:
  bc = GetBackendConfig(hlo)            (@0x1c8664a0)         // absent → return 1
  cfg = bc.CollectiveOffloadConfig      ([BackendConfig+0x178], default-constructed if missing)
  select variant by discriminant [cfg+0x10]   (AR/AG/RS/ragged/a2a, §1.1)
  // FILL (byte-identical for AR/AG/RS):
  variant.physical_core_indices.current_size = 0      ([variant+0x1c] = 0)
  variant.has-bit &= ~1                                ([variant+0x10] &= 0xfe)
  for i in 0 .. count-1:
     v = (int32) data[i]                               // movsxd/trunc: long → int32
     RepeatedField<int32>::Add(v)                      // GrowNoAnnotate @0xe68d9e0 if full;
                                                        //   store [Rep+0x8 + size*4]
     variant.physical_core_indices.current_size = i+1  ([variant+0x1c] = i+1)
     variant.has-bit |= 1                              ([variant+0x10] |= 1)
  // FINALIZE:
  cloned = CloneBackendConfigProto(bc)                 (@0x1e60dac0)
  BackendConfigWrapper::operator=([hlo+0x68], cloned)  (@0x1e60de40)
  return 1

The fill is a verbatim copy: the only transformation is the long → int32 truncation that proto field 4's int32 element type forces. The selected-core IDs the producer chose are written unchanged, in the order the caller sorted them.

The AR fill (v8 & 1 branch) in the decompile: *(_DWORD*)(v9+28) = 0, *(_BYTE*)(v9+16) &= ~1u, loop reads v17 = *(_DWORD*)(v11 + 8*v14) (truncating the 8-byte Span element to a 4-byte store), GrowNoAnnotate<…>(v13, v9, …) on full, stores v16[(int)v14 + 2] = v17, sets *(_DWORD*)(v9+28) = v18 (size+1) and *(_BYTE*)(v9+16) |= 1u. AG (v8 & 2) and RS (v8 & 4) branches are structurally identical (only the variant ptr differs). Finalize: CloneBackendConfigProto((xla*)v44, …) then BackendConfigWrapper::operator=((char*)v53 + 104, …) = [hlo+0x68]. The success VLOG string "backend_config_util::AddCollectivePhysicalCoreIndices( inst, sparse_core_ids) is OK" is present in .rodata.
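A minimal Python model of the fill, assuming two's-complement truncation for the long → int32 step. The list stands in for the RepeatedField and the boolean for has-bit 0; both function names are hypothetical.

```python
def truncate_to_int32(value):
    """Keep the low 32 bits of `value` and reinterpret them as signed."""
    return ((value + 0x80000000) & 0xFFFFFFFF) - 0x80000000


def fill_physical_core_indices(chosen):
    """Clear the field, then append each element truncated to int32.

    Returns (field, has_bit). No per-element scaling or re-indexing happens.
    """
    field = []        # current_size = 0
    has_bit = False   # has-bit cleared before the loop
    for value in chosen:
        field.append(truncate_to_int32(value))
        has_bit = True  # set after each append, as in the pseudocode above
    return field, has_bit
```

For core IDs that fit in 32 bits the output equals the input. An empty Span leaves the field empty with the has-bit cleared, which is what the clear-then-loop structure implies.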


3. The read-back — GetPhysicalCoreIndices (@0x1c8692e0)

GetPhysicalCoreIndices(const HloInstruction* hlo) → StatusOr<InlinedVector<long,4>>:
  bc = GetBackendConfig(hlo)
  if !bc                          → MakeErrorImpl<13>("No backend config found", line 523)
  if !(disc & 2 present)          → MakeErrorImpl<13>("No collective offload config found", line 527)
  select the SAME variant by the SAME discriminant bits (§1.1)
  if variant absent / field empty → MakeErrorImpl<13>("No physical core indices found", line 565)
  count = [variant+0x1c]
  Rep*  = [variant+0x20]  (or inline [variant+0x18] when heap-bit clear)
  for each int32 at Rep+0x8 + i*4:  widen int32 → long   (vpmovsxdq, 4-wide)
  build InlinedVector<long,4>:  count <= 4 → inline; >= 9 → exact heap; 5..8 → cap-8 heap
  return StatusOr OK { sret+0=1, sret+8=data, sret+0x10=size, sret+0x18=cap }

The reader is the exact inverse of the writer: same oneof variant select, same [+0x1c]/[+0x20] field offsets, widening each stored int32 back to a long. The result type is an absl::InlinedVector<long,4>: counts ≤ 4 stay in the inline buffer; larger lists go to a heap allocation (_size_returning_new). All three error paths are MakeErrorImpl<13> (status code 13 = INTERNAL, i.e. absl::StatusCode::kInternal) in backend_config_util.cc. The reader's consumer is CheckCoreAssignmentConsistency (@0x1c869cc0), which re-validates that the placed cores agree across the collective's instructions.

The three error strings + source line numbers read byte-exact from the decompile: "No backend config found" (line 523), "No collective offload config found" (line 527), "No physical core indices found" (line 565), all via absl::status_internal::MakeErrorImpl<13>(…, "platforms/xla/service/jellyfish/lowering/backend_config_util.cc"). The int32 → long widen is the vpmovsxdq 4-wide unrolled loop; the cmp v12,5 (count<5 inline) / v12 >= 9 (exact) split selects the InlinedVector<long,4> storage. The OK sret layout ([+0]=1 tag, [+8] data, [+0x10] size, [+0x18] cap) is byte-confirmed.
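The read-back's error ladder and storage-class split can be sketched in Python. The backend config is modelled as a plain dict with hypothetical keys, and results are (code, payload) tuples rather than a real StatusOr; the error strings and the count thresholds are the ones quoted above.

```python
INTERNAL = 13  # status code used by all three error paths


def get_physical_core_indices(backend_config):
    """Return (0, indices) on success or (INTERNAL, message) on failure."""
    if backend_config is None:
        return INTERNAL, "No backend config found"
    cfg = backend_config.get("collective_offload_config")  # hypothetical key
    if cfg is None:
        return INTERNAL, "No collective offload config found"
    stored = cfg.get("physical_core_indices", [])
    if not stored:
        return INTERNAL, "No physical core indices found"
    return 0, [int(v) for v in stored]  # int32 -> long widen is a no-op here


def inlined_vector_storage(count):
    """Return (storage kind, capacity) for an InlinedVector<long,4> result."""
    if count <= 4:
        return "inline", 4
    if count >= 9:
        return "heap", count   # exact-size heap allocation
    return "heap", 8           # counts 5..8 share a capacity-8 allocation
```

A four-core placement therefore stays inline, a six-core placement lands in a capacity-8 heap block, and twelve cores get an exact allocation.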


4. The Span producer — where the chosen cores come from

The Span the fill copies is built one step upstream by SparseCoreQueueAssignment::AssignQueueIDsToAsyncStart (@0x10fdf480). The full selection policy (GetAllowedCores, the five-phase SelectCores greedy filter, the cost tie-break) is owned by SC Core-Selection (Offload) and SC Core Selection; only the hand-off into this page's fill is shown here:

AssignQueueIDsToAsyncStart(hlo):
  mega = GetMegaChipParallelism(hlo)              (@0x1c867b00)  // StatusOr<InlinedVector<long,4>>
  split_axis0 = mega[0] >> 1                       // megacore: 2 cores per chip
  allowed = GetAllowedCores(hlo)                  (@0x10fda3c0)  // btree_set<long> candidate pool
  chosen  = SelectCores(hlo, allowed, …)          (@0x10fdc4e0)  // unsorted, {phase, cost} order
  __sort(chosen)                                  (@0x10fdfde7)  // ASCENDING numeric core ID
  AddCollectivePhysicalCoreIndices(hlo, Span{chosen.data, chosen.size})   (@0x10fdff30)
  // the SAME chosen array is ALSO appended to MegaChipParallelismConfig (repeated long, loop @0x10fdfe80)

The critical observation for this page: SelectCores returns the cores in build order ({same-ND-plane, data-dep, assignment-group, not-different-plane, fallback} × ascending cost), but the caller numerically __sorts the list before the fill. Therefore physical_core_indices is always stored in ascending physical-core-ID order, not in selection-priority order — the selection order is consumed internally and is not visible in the proto.
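The effect of the caller-side sort is easy to see in a small Python sketch. The function name hand_off and the `existing` parameter (whatever the MegaChipParallelismConfig field already holds) are hypothetical; the sort-then-copy-then-append sequence follows the pseudocode above.

```python
def hand_off(selection_order, existing=()):
    """Sort the chosen cores, then feed both consumers from the same list.

    Returns (physical_core_indices, mega_chip_parallelism).
    """
    chosen = sorted(selection_order)               # ascending numeric core ID
    physical_core_indices = list(chosen)           # verbatim copy into the proto
    mega_chip_parallelism = list(existing) + chosen  # same list, appended
    return physical_core_indices, mega_chip_parallelism


# Selection-priority order in, ascending-ID order out.
indices, mega = hand_off([6, 2, 7, 3])
```

Here the selection order [6, 2, 7, 3] is stored as [2, 3, 6, 7]; the priority order cannot be recovered from the proto.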

The sibling producer SparseCoreCollectiveOffload::OffloadCollective (@0x10fc75e0) pairs the same chosen Span between SetMegaChipParallelism (@0x1c867680, @0x10fc86f1) and AddCollectivePhysicalCoreIndices (@0x10fc871c) — confirming the placement list and the mega-chip parallelism list are one and the same array.

[CONFIRMED — symbol + cross-call] AssignQueueIDsToAsyncStart, GetMegaChipParallelism, SetMegaChipParallelism, and OffloadCollective are present in *_functions.json; the AddCollectivePhysicalCoreIndices call site @0x10fdff30 (and @0x10fc871c) and the __sort @0x10fdfde7 are byte-traced. [LOW] The exact per-core numeric ID SelectCores assigns (the logical-color → physical-core bijection / tie-break arithmetic) is the selection policy's concern and is documented (as a 5-phase greedy filter, not a closed-form scorer) on the core-selection pages — this page only needs that the output is the sorted ID set.


5. The tensor_split_mode==2 per-core emission

When the offload substrate adopts the split-tensor mode (tensor_split_mode == 2, the "Adopting split tensor mode." path gated upstream — see Tensor-Split / ND-Plane), a collective's per-color rings are emitted as per-SC-core partial-tensor ops. The grouping that decides which colors collapse into one per-core op is keyed by TensorSplitPerCoreClassifier.

5.1 TensorSplitPerCoreClassifier (@0x13379de0)

TensorSplitPerCoreClassifier(UniDirRingStrategy* s) → long:
  split_bit = (byte[s+0x43] != 0)                  // the tensor-split / per-core bool, bit0
  switch (s->ici_dim() & 0x1ff):                   // ici_dim() = vtable+0x68 = byte[s+0x80] | 0x100
     0x100  IciDim::kX           → return 0 | split_bit
     0x101  IciDim::kY           → return 2 | split_bit
     0x102  IciDim::kZ           → return 4 | split_bit
     0x103  IciDim::kCoresOnChip → d2d = dynamic_cast<D2DUniDirRingStrategy*>(s)
                                    CHECK d2d != nullptr  → FATAL streams "Expected
                                       D2DUniDirRingStrategy if the ici_dim is kCoresOnChip" (.cc:156)
                                    c = d2d->core_classifier_   ([d2d+0x90])
                                    CHECK c >= 0     "core_classifier_ >= 0"   (.h:375)
                                    return c                    // NO `| split_bit` on this path
     default → FATAL CHECK "strategy->ici_dim() == IciDim::kCoresOnChip"   (.cc:153)

The classifier returns a per-SC-core key. For the torus-axis dims (kX → 0, kY → 2, kZ → 4 — which axis the per-core partial-tensor ring iterates) the low bit is OR'd in with the tensor-split flag (class | split_bit). For kCoresOnChip (the megacore cross-core split — the split-2 datapath proper) the classifier returns the raw D2DUniDirRingStrategy::core_classifier_ count directly, without the split_bit OR (return v9; in the decompile). The kCoresOnChip D2D strategy is the SC analog of the dense TensorCore megacore data-split.

The decompile of @0x13379de0 matches byte-exact: v3 = *((_BYTE*)this + 67) != 0 ([+0x43]); the ici_dim() virtual via *(_QWORD*)this + 104LL (vtable +0x68) masked & 0x1FF; comparisons 0x100 → v4=0, 0x101 → v4=2, 0x102 → v4=4, each reaching LABEL_5: LOBYTE(v2)=v3; return v4 | v2 (the class | split_bit OR); 0x103 → __dynamic_cast(this, typeinfo UniDirRingStrategy, typeinfo D2DUniDirRingStrategy, 0) then v9 = v11[18] (= [+0x90], 18×8) with CHECK v9 >= 0, returning v9 directly (the kCoresOnChip path does not OR in split_bit). The three diagnostic strings + source lines are byte-exact: "strategy->ici_dim() == IciDim::kCoresOnChip" (offload_collective_strategies.cc:153); the D2D null-check is CHECK(d2d_strategy != nullptr) at .cc:156 streaming "Expected D2DUniDirRingStrategy if the ici_dim is kCoresOnChip"; "core_classifier_ >= 0" (offload_collective_strategies.h:375, streaming "Color id is not set.").
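The classifier's decision table is small enough to restate as a Python sketch. The strategy object is reduced to three plain arguments, the dynamic_cast is modelled as "core_classifier is None means not a D2D strategy", and the FATAL CHECKs become AssertionError; these modelling choices are assumptions of the sketch, while the constants and messages are the ones quoted above.

```python
K_X, K_Y, K_Z, K_CORES_ON_CHIP = 0x100, 0x101, 0x102, 0x103
AXIS_CLASS = {K_X: 0, K_Y: 2, K_Z: 4}


def tensor_split_per_core_classifier(ici_dim, split_flag, core_classifier=None):
    """Return the per-SC-core grouping key for one strategy."""
    split_bit = 1 if split_flag else 0
    dim = ici_dim & 0x1FF
    if dim in AXIS_CLASS:
        return AXIS_CLASS[dim] | split_bit      # torus axes: class | split_bit
    if dim != K_CORES_ON_CHIP:
        raise AssertionError("strategy->ici_dim() == IciDim::kCoresOnChip")
    if core_classifier is None:                 # models a failed dynamic_cast
        raise AssertionError(
            "Expected D2DUniDirRingStrategy if the ici_dim is kCoresOnChip")
    if core_classifier < 0:
        raise AssertionError("core_classifier_ >= 0")
    return core_classifier                      # no split_bit OR on this path
```

For instance, a kY strategy with the split flag set maps to key 3, while a kCoresOnChip strategy with core_classifier_ = 2 maps to 2 regardless of the split flag.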

5.2 The UniDirRingStrategy field map

The strategy fields the classifier reads (from the ImplicitUniDirRingStrategy ctor @0x1339bca0 base and the D2DUniDirRingStrategy ctor @0x1339ba60 override):

| Offset | Field | Notes |
| --- | --- | --- |
| [s+0x42] | bool | ctor bool #1 |
| [s+0x43] | bool | the tensor-split / per-core flag the classifiers read |
| [s+0x44] | byte RingDir | |
| [s+0x80] | byte IciDim | 0 kX / 1 kY / 2 kZ / 3 kCoresOnChip; D2D ctor hardwires 3 |
| [s+0x88] | long core_count | the combine key (CanBeCombined); D2D hardwires 2 |
| [s+0x90] | base bool · D2D long core_classifier_ | the per-core split count; D2D ctor CHECKs target.LogicalDevicesPerChip() == 2 (.h:361) |
| [s+0x91] | bool | |

The D2D vtable (@0x21908db0, vptr at +0x10) resolves slot +0x30 → core_count (@0x13399000 = qword[s+0x88]) and slot +0x68 → ici_dim (@0x13399020 = byte[s+0x80] | 0x100). The | 0x100 is the AutoOr<IciDim>-engaged bit (the same AutoOr packing the offload config builder uses for HierarchicalKind; see SC-Offload Config Builder §3).
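The byte-plus-engaged-bit packing can be illustrated in Python. The pack side follows the byte[s+0x80] | 0x100 expression above; the decode side, including returning None when bit 0x100 is clear, is an assumption about how an unengaged value would look and is not taken from the binary.

```python
ENGAGED = 0x100
ICI_DIM_NAMES = {0: "kX", 1: "kY", 2: "kZ", 3: "kCoresOnChip"}


def ici_dim(raw_byte):
    """Model of the ici_dim() virtual: stored byte with the engaged bit set."""
    return (raw_byte & 0xFF) | ENGAGED


def decode_ici_dim(packed):
    """Map a packed value back to a name (hypothetical helper)."""
    value = packed & 0x1FF
    if not value & ENGAGED:
        return None  # assumption: a clear engaged bit means "no value"
    return ICI_DIM_NAMES.get(value & 0xFF)
```

With the D2D ctor's hardwired byte 3, ici_dim(3) gives 0x103, which is the kCoresOnChip case the classifier compares against.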

[CONFIRMED — vtable + ctor] D2DUniDirRingStrategy is present in *_functions.json; the vtable slot relocations (+0x30 → 0x13399000, +0x68 → 0x13399020) and the D2D ctor hardwires ([+0x80]=3, [+0x88]=2, [+0x90]= trailing long arg, and the FATAL CHECK(target.LogicalDevicesPerChip() == 2) at offload_collective_strategies.h:361) are byte-traced. [LOW] Which caller computes the core_classifier_ long (i.e. whether it equals tensor_split_factor, NumScOffloadDevices/LDPC, or the megacore 2-core count) was not traced to the AllReduceUnidirNdStrategy::TryCreate lambda that constructs the D2D strategy — the classifier returns the stored value directly; its provenance is a tensor-split-factor concern (Tensor-Split / ND-Plane).

5.3 The combine consumer — CanBeCombined (@0x13379d60)

CanBeCombined(a, b) → bool:
  return a->core_count() == b->core_count()                          // vtable +0x30 = [s+0x88]
      && !FLAGS_xla_tpu_impure_coff_never_combine_colors_test_only

The classifier key from §5.1 plus CanBeCombined are what the DefaultStrategyCombiner<*> family (e.g. @0x133a4660 / @0x133a3c60) and MoreRelaxedStrategyCombiner use to merge per-color UniDirRingStrategy objects into one per-SC-core partial-tensor emission: colors with the same classifier key and equal core_count (and the never-combine flag off) collapse together. The simpler sibling DimPerCoreClassifier (@0x13379dc0) keys on the tensor-split bool alone (byte[s+0x43] != 0).

CanBeCombined @0x13379d60: v3 = (*(this->vtable+48))(this, a2, a3) compared to (*(a2->vtable+48))(a2) (the core_count virtual at +0x30), gated by !FLAGS_xla_tpu_impure_coff_never_combine_colors_test_only (flag-impl cached at qword_2231E730, read via FlagImpl::ReadOneBool). DimPerCoreClassifier @0x13379dc0 is literally return *((_BYTE*)this + 67) != 0; (= byte[+0x43]). Both byte-exact. The CanBeCombined overloads for SinglePhaseRSTransferStrategy / SinglePhaseAGTransferStrategy are present as separate symbols. [LOW] The DefaultStrategyCombiner<*> merge-loop body (how the key actually groups colors and which combined IciStrategyRingConfig the merged per-SC-core op emits) was confirmed by call-edge to CanBeCombined/the classifier but not expanded to the per-field emission.
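A hedged Python sketch of how the two pieces could group colors. CanBeCombined follows the predicate above; the grouping loop is purely illustrative, since the real DefaultStrategyCombiner merge-loop body is not decoded on this page. Strategies are modelled as dicts with hypothetical keys "name", "key" (the classifier result), and "core_count".

```python
def can_be_combined(a, b, never_combine_flag=False):
    """Equal core_count and the never-combine test flag off."""
    return a["core_count"] == b["core_count"] and not never_combine_flag


def group_colors(strategies, never_combine_flag=False):
    """Greedy grouping by classifier key + CanBeCombined (illustrative only)."""
    groups = []
    for s in strategies:
        for g in groups:
            head = g[0]
            if head["key"] == s["key"] and can_be_combined(
                    head, s, never_combine_flag):
                g.append(s)
                break
        else:
            groups.append([s])  # no compatible group found
    return [[s["name"] for s in g] for g in groups]


colors = [
    {"name": "a", "key": 0, "core_count": 2},
    {"name": "b", "key": 0, "core_count": 2},
    {"name": "c", "key": 2, "core_count": 2},
    {"name": "d", "key": 0, "core_count": 4},
]
grouped = group_colors(colors)
```

In this toy input only colors a and b collapse together; c differs in classifier key and d differs in core_count. Setting the never-combine flag leaves every color in its own group.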


6. Placement + split-mode-2 — relationship table

| Quantity | Source | Role |
| --- | --- | --- |
| chosen physical cores (sorted) | SelectCores @0x10fdc4e0 → __sort (caller) | the array written to physical_core_indices |
| physical_core_indices write | AddCollectivePhysicalCoreIndices @0x1c868500 + …Helper @0x1c868920 | verbatim Span → RepeatedField<int32> copy |
| index-array layout | [variant+0x1c]=size · [+0x20]=Rep* (data Rep+0x8) | proto2 RepeatedField<int32> (field 4) |
| read-back | GetPhysicalCoreIndices @0x1c8692e0 | StatusOr<InlinedVector<long,4>> (int32 → long) |
| read-back consumer | CheckCoreAssignmentConsistency @0x1c869cc0 | cross-instruction core agreement |
| shared list | MegaChipParallelismConfig repeated long (@0x10fdfe80) | same chosen array → mega-chip parallelism |
| per-core class key | TensorSplitPerCoreClassifier @0x13379de0 | axis: (axis 0/2/4) \| split bool; kCoresOnChip: raw D2D core_classifier_ |
| ici_dim | byte[s+0x80] \| 0x100 (vtable +0x68) | kX / kY / kZ / kCoresOnChip |
| D2D per-core split | D2DUniDirRingStrategy::core_classifier_ [s+0x90] (LDPC(SC)==2) | megacore cross-core split count |
| combine key | core_count() [s+0x88] (vtable +0x30) | CanBeCombined grouping |

7. Verification notes

Cross-checked against the IDA decompile of libtpu.so v0.0.40 (build-id 89edbbe8…):

  • Writer (…Helper @0x1c868920): GetBackendConfig → v43 == 1 gate; the CollectiveOffloadConfig oneof discriminant v8 = *((_DWORD*)v6+4) ([cfg+0x10]) with bits 2/1/4/8/0x10 and variant ptrs *((_QWORD*)v6+4)/+3/+5 (= [+0x20]/[+0x18]/[+0x28]); the clear ([+0x1c]=0, [+0x10]&=~1) + the verbatim long → int32 copy loop (GrowNoAnnotate @0xe68d9e0, store [Rep+0x8+i*4], [+0x1c]++, [+0x10]|=1); the CloneBackendConfigProto + BackendConfigWrapper::operator=([hlo+0x68]) write-back. All byte-exact.
  • Reader (GetPhysicalCoreIndices @0x1c8692e0): the same oneof bits; Rep*=[+0x20], count=[+0x1c], data Rep+0x8; vpmovsxdq int32 → long; the cmp 5 inline / >= 9 exact heap InlinedVector<long,4> split; the three MakeErrorImpl<13> strings at lines 523/527/565; the OK sret layout +0/+8/+0x10/+0x18. All byte-exact.
  • Field-offset agreement: writer [+0x1c]/[+0x10] and reader [+0x1c]/[+0x18]/[+0x20] independently confirm the RepeatedField<int32> at [variant+0x18..+0x20].
  • Classifier (TensorSplitPerCoreClassifier @0x13379de0): byte[+0x43] read; ici_dim() vtable +0x68 masked & 0x1FF; 0x100/0x101/0x102 → 0/2/4; 0x103 → dynamic_cast<D2DUniDirRingStrategy> (CHECK != nullptr, .cc:156) + [+0x90] core_classifier_ (CHECK >= 0); default FATAL. The axis paths return axis | split_bit; the kCoresOnChip path returns the raw core_classifier_ (return v9;) with no split_bit OR. Source lines 153/156 (.cc) and 375 (.h) byte-exact.
  • Combine (CanBeCombined @0x13379d60, DimPerCoreClassifier @0x13379dc0): core_count() equality via vtable +0x30, gated by FLAGS_xla_tpu_impure_coff_never_combine_colors_test_only; DimPerCoreClassifier = byte[+0x43] != 0. Both byte-exact.
  • Symbols: AddCollectivePhysicalCoreIndices, …Helper (+ $_0<Ragged…> / $_0<AllToAll…> lambdas), GetPhysicalCoreIndices, AssignQueueIDsToAsyncStart, OffloadCollective, GetMegaChipParallelism, SetMegaChipParallelism, D2DUniDirRingStrategy, CheckCoreAssignmentConsistency, and the CanBeCombined overloads are all present in *_functions.json.

[LOW] Confirmed by structure / call-edge but not fully numeric-decoded:

  • The SelectCores per-candidate placement arithmetic (the logical-color → physical-core bijection / tie-break) — owned by the core-selection pages; here only the sorted output matters.
  • The D2DUniDirRingStrategy::core_classifier_ construction site (which caller computes the stored long, and whether it equals tensor_split_factor / NumScOffloadDevices-derived / the megacore 2-core count) — the classifier returns it directly.
  • The DefaultStrategyCombiner<*> merge-loop body (how the classifier key groups colors into the emitted per-SC-core IciStrategyRingConfig).

Cross-References

SparseCore-offload placement & selection

  • SC Core-Selection (Offload): GetAllowedCores candidate mask + the cost/resource model that feeds SelectCores
  • SC Core Selection: the five-phase SelectCores greedy filter that produces the chosen-core list this page copies
  • Tensor-Split / ND-Plane: tensor_split_factor / NumScOffloadDevices + NDPlaneInfo, the gate that selects tensor_split_mode==2

Config + substrate

Sibling subsystems