TensorCore Barrier Assignment and InitializeOnScs
Every address, field offset, opcode value, and enum value on this page was read from
libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d; build libtpu_lts_20260413_b_RC00; 781,691,048 B, not stripped — full C++ symbols). .text VMA equals file offset at 0xe63c000; .rodata/.lrodata are identity-mapped. All addresses are VMA. Other wheel versions differ.
Abstract
This page owns two byte-anchored pieces that bracket the TensorCore (TC) on-chip barrier path. The first is the TC barrier-assignment pass — TensorCoreBarrierAssignment::Run @0x109c7420 and its per-key kind selector DetermineBarrierConfigForKey @0x109c6fa0 — the TensorCore counterpart of the SparseCore coloring/assignment. It walks every TC collective, builds a TensorCoreBarrierKey per op, consults the merged conflict set produced by the two greedy coloring passes, and assigns each distinct key a BarrierConfig {type, id} that is written back into the collective's HLO BackendConfig. The second is ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60 and its lookup callback (ExplicitRingRecord @0x133a9a40 at strategy+0x98 for the plain ring, ExplicitAllToAllRingRecord @0x133a94a0 at strategy+0xb0 for the all-to-all twin): the SparseCore-side runtime init that turns the static explicit ring-config table into a programmed next_chip / ordinal / reorder triple bound onto the strategy object.
These are two distinct subsystems joined by the chip SFLAG model. The TC assignment pass decides a barrier kind per key (global vs per-key) and the InitializeOnScs callback binds the per-ring transfer geometry the SC sequencer's barrier sync_add targets — both ultimately program the same chip SFLAG block. The generic greedy coloring engine is owned by Barrier Coloring (this page consumes its conflict set, it does not re-derive it); the BarrierConfig.id → concrete SFLAG-number formulas are owned by Barrier → SFLAG Binding (this page produces the {type, id}, it does not lower it). The BarrierType enum, the InferBarrierConfig normaliser, and the reserved-block geometry are on the overview.
For reimplementation, the contract is:
- TC dedups by graph color, SC by static-key hash. Run runs the two BarrierColoring passes (Barrier Coloring), merges their conflict sets, then per distinct TensorCoreBarrierKey calls DetermineBarrierConfigForKey(key, config, has_conflict). has_conflict arrives as the selector's force_global argument (§1.3), so a key whose first instruction is in the merged conflict set is forced to GLOBAL(1); a non-conflicting key takes GLOBAL(1) only through the heuristic or the saturation flag, and otherwise gets a shared REPLICA(2) or a fresh CUSTOM(3) id.
- DetermineBarrierConfigForKey writes only {1,2,3}. Type at message+0x20, id at +0x18, presence hasbits +0x10 |= 3. GLOBAL(1) → id = -1; REPLICA(2) / CUSTOM(3) → id = replica_group_count (key+0x10), with 2 chosen on the saturation arm or when a std::map<key, BarrierConfig> dedup finds the key already present. MEGACORE(4) is never written here.
- IsGlobalBarrierBeneficial is a narrow heuristic. Global is "beneficial" only for a channelled all-to-all(0xc) whose tested topology axis is unit-sized and which spans a single replica group — the degenerate case where a custom per-key barrier would synchronise a trivial group.
- InitializeOnScs is the static-config → hardware final hop. It folds the runtime core index by CoresPerChip / LogicalDevicesPerChip, then calls [strategy+0x98] with the captured table descriptor (strategy+0x88) and the runtime global_core_id / ordinal; the callback indexes the const-literals table and returns (next_chip, ordinal, reorder), written to strategy+0x58 (ordinal_) / +0x60 (next_chip_) / +0x78 (reordering_map_). next_chip_ is what the per-ring SyncAdd targets.
| TC assignment pass | xla::jellyfish::TensorCoreBarrierAssignment::Run @0x109c7420 |
| Per-key kind selector | DetermineBarrierConfigForKey @0x109c6fa0 (writes {1,2,3}; never 4) |
| Global heuristic | IsGlobalBarrierBeneficial @0x109c6ee0 |
| Collective filter | ForEachCollective @0x109c7060 (opcode bitmask 0x1400001340 + 0x56/0x5d/0x31) |
| Key | TensorCoreBarrierKey ctor @0x109d6200, operator< @0x109d6620 |
| Coloring engine (sibling) | BarrierColoring<…>::Run @0x109cf600 / 0x109d1a60 → Barrier Coloring |
| Config carrier | BackendConfig.BarrierConfig {type @+0x20, id @+0x18, hasbits @+0x10} |
| Explicit-ring runtime init | ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60 |
| strategy+0x98 callback (plain) | ExplicitRingRecord lambda @0x133a9a40 ($_2, kind selector 0x4) |
| strategy+0xb0 callback (a2a) | ExplicitAllToAllRingRecord lambda @0x133a94a0 ($_1, kind selector 0x2) |
| Writeback | strategy+0x58 = ordinal_, +0x60 = next_chip_, +0x78 = reordering_map_ |
| Confidence | CONFIRMED (both bodies decompiled and re-checked) unless a row says otherwise |
1. The TC barrier-assignment pass
1.1 Where Run sits — the SC/TC fork
BarrierAssignment::RunImpl @0x109c8c00 is the shared entry. It receives a flat_hash_set<string_view> of XLA thread-pool names and forks on which threads are present:
- kTensorCoreThread @0x217f78d8 present → build a TensorCoreBarrierAssignment on the stack from the base config fields and call Run (call site @0x109c8d6e).
- kSparseCoreThread @0x217f5ca8 present → build a SparseCoreBarrierAssignment and call its Run (call site @0x109c8e8e).
The two result hasbits OR together, so a module compiled for both a TensorCore thread and a SparseCore thread runs both passes. The on-chip barrier model is the union of the colored-map TC barriers and the static-keyed SC barriers; both program chip SFLAGs (overview §1), differing only in how a barrier identity is chosen.
BarrierAssignment::RunImpl @0x109c8c00
├─ if kTensorCoreThread present → TensorCoreBarrierAssignment::Run @0x109c7420 (THIS PAGE §1.2)
│ coloring ×2 → conflict set → per-key DetermineBarrierConfigForKey → BackendConfig writeback
└─ if kSparseCoreThread present → SparseCoreBarrierAssignment::Run (call site @0x109c8e8e; static ring-key dedup)
result.changed = TC.changed | SC.changed
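The fork can be modelled in a few lines. This is a minimal Python sketch of the control flow only, not the recovered code: the two thread-name strings are hypothetical stand-ins for the kTensorCoreThread / kSparseCoreThread constants (their values are not given on this page), and the two callables stand in for the two Run methods.

```python
from typing import Callable, Set

# Hypothetical stand-ins for kTensorCoreThread / kSparseCoreThread;
# the real string values are not stated on this page.
TC_THREAD = "tensorcore"
SC_THREAD = "sparsecore"

def run_impl(threads: Set[str],
             run_tc: Callable[[], bool],
             run_sc: Callable[[], bool]) -> bool:
    """Run whichever passes the thread set selects and OR their results."""
    changed = False
    if TC_THREAD in threads:
        changed |= run_tc()
    if SC_THREAD in threads:
        changed |= run_sc()
    return changed
```

A module whose thread set holds both names runs both passes; an empty set runs neither and reports no change.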
1.2 TensorCoreBarrierAssignment::Run @0x109c7420
Run is the TC counterpart of the SparseCore coloring/assignment. Its five phases:
- Build the policy base. Construct an AsyncOpPolicy<TensorCoreBarrierKey> (vtable @0x217f7a68) carrying the megacore byte (this+0x18), the core-id std::function<int(HloInstruction*)> (this+0x58), and two policy bools (this+0x60, this+0x68). Re-vtable it to AsyncCollectivePermutePolicy<TensorCoreBarrierKey> (vtable @0x217f7ac8).
- First coloring pass. BarrierColoring<AsyncCollectivePermutePolicy<…>>::Run @0x109cf600 → StatusOr<pair<map<HloInstruction*,long>, flat_hash_set<HloInstruction*>>>: a per-op color and a conflict-op set. Merge the colored ops into a std::map<HloInstruction*,long> (frame -0x158) and the conflict ops into a FlatHashSet (frame -0x140).
- Second coloring pass. Re-vtable the policy to AsyncAllToAllWithAsyncBarrierPolicy<TensorCoreBarrierKey> (vtable @0x217f7b30) and run BarrierColoring<…>::Run @0x109d1a60; merge its coloring + conflict set into the same -0x158 map / -0x140 set.
- Collect collectives. ForEachCollective @0x109c7060 (§1.4) walks every TC-thread collective, builds each TensorCoreBarrierKey, and appends (key, InlinedVector<HloInstruction*, 2>) pairs into a vector (entry stride 0x78). std::__stable_sort with the Run::$_2 comparator @0x109ccdc0 makes the iteration order deterministic.
- Per-key assign loop (@0x109c7ed0 … 0x109c8250, r12 = entry, stride 0x78):
  - look up the entry's first instruction in the merged conflict FlatHashSet (raw_hash_set::find @0x109cf200) → has_conflict;
  - config = HloModule->config (module+0x20);
  - DetermineBarrierConfigForKey(&result, key, config, has_conflict) @0x109c6fa0 (§1.3) → a BarrierConfig;
  - ragged-all-to-all special-case @0x109c7f89: if opcode == ragged-all-to-all(0x56) AND the resulting kind != 3 (not a fresh per-key id) AND the policy bool this+0x68 is set → divert to the a2a async-barrier special handling (jump 0x109c8371);
  - for every HloInstruction in the entry's InlinedVector (each colored op sharing this key): open its BackendConfig (BackendConfigWrapper::GetProto @0x1e60dc60), set its BarrierConfig sub-message = the DetermineBarrierConfigForKey result (BarrierConfig::CopyFrom @0x1d6f1fc0), set the BackendConfig hasbit (or BYTE [-0x600], 0x10), and re-serialise via CloneBackendConfigProto @0x1e60dac0 + BackendConfigWrapper::operator= @0x1e60de40.
NOTE — dedup by color, not by key hash. Ops the coloring engine proves can share a barrier (non-conflicting in the async dependency graph) get the same color and can share a barrier id; ops with the same TensorCoreBarrierKey but a graph conflict are split apart. This is the structural difference from the SparseCore FlatHashMap dedup, which keys purely on the static ring-config and has no notion of schedule overlap. The interference-graph construction and first-fit color search are owned by Barrier Coloring; this page only consumes the conflict set as has_conflict.
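The per-key assign loop reduces to a small amount of bookkeeping. The sketch below is a simplified Python model under stated assumptions: instructions are plain strings, the BackendConfig writeback is a dict, the kind selector is passed in as a callable, sorting by the key itself stands in for the Run::$_2 comparator (whose ordering is not described here), and the ragged-all-to-all diversion is omitted.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Config = Tuple[int, int]  # (type, id)

def assign_barriers(entries: List[Tuple[Hashable, List[str]]],
                    conflict_ops: Set[str],
                    determine: Callable[[Hashable, bool], Config]
                    ) -> Dict[str, Config]:
    """Map every instruction name to the (type, id) chosen for its key."""
    backend_config: Dict[str, Config] = {}
    # sorted() is stable, like std::__stable_sort. Ordering by the key itself
    # is an assumption standing in for the real comparator.
    for key, instrs in sorted(entries, key=lambda e: e[0]):
        # Only the entry's first instruction is probed in the conflict set.
        has_conflict = instrs[0] in conflict_ops
        config = determine(key, has_conflict)
        # Every op that shares the key receives a copy of the same config.
        for instr in instrs:
            backend_config[instr] = config
    return backend_config
```

Because only the first instruction of an entry is looked up, the conflict verdict applies to the whole entry, which matches the loop's single raw_hash_set::find per key.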
1.3 DetermineBarrierConfigForKey @0x109c6fa0 — the kind selector
Effective signature: BarrierConfig DetermineBarrierConfigForKey(const TensorCoreBarrierKey& key, const HloModuleConfig& config, bool force_global). The decompiled body is short and pins every store:
// 0x109c6fa0 (this = &result BarrierConfig; a2 = key; a3 = config; a5 = has_conflict/force_global)
BarrierConfig::BarrierConfig(this, 0); // zero-init result
beneficial = IsGlobalBarrierBeneficial(key, config); // 0x109c6ee0 (§1.5)
if (force_global || beneficial) { // → GLOBAL
LABEL_global:
*(int *)(this + 0x20) = 1; // type = GLOBAL(1)
*(long *)(this + 0x18) = -1; // id = -1 (none)
*(char *)(this + 0x10) |= 3; // hasbits: type+id present
return this;
}
v9 = *(long*)(key + 0x10); // replica_group_count
if (v9 == *(long*)(key + 0x20) - 1) { // spans all-but-one group (num_groups - 1)
if (*(char*)(config + 0x50) != 1) { // "use global on saturation" flag NOT set
*(int *)(this + 0x20) = 2; // type = REPLICA(2) (shared)
*(long*)(this + 0x18) = v9; // id = replica_group_count
*(char*)(this + 0x10) |= 3;
return this;
}
goto LABEL_global; // saturation + flag set → GLOBAL(1)
}
// ELSE: provisional fresh per-key
*(int *)(this + 0x20) = 3; // type = CUSTOM(3) (fresh)
*(long*)(this + 0x18) = v9; // id = replica_group_count (provisional)
*(char*)(this + 0x10) |= 3;
v11 = std::map<key,BarrierConfig>::__try_key_extraction_impl(key, key, config, this); // 0x109cbd60
if ((dl & 1) == 0) // emplace did NOT insert ⇒ key already present (cache HIT)
BarrierConfig::CopyFrom(this, (BarrierConfig*)(v11 + 0x80)); // adopt the existing config (its type==2)
return this; // cache MISS ⇒ keep the fresh type-3 config
The numeric layout (this+0x20 is the DWORD type at byte offset 0x20; this+0x18 is the QWORD id; this+0x10 is the hasbit byte):
| Outcome | type @+0x20 | id @+0x18 | when |
|---|---|---|---|
| GLOBAL | 1 | -1 | force_global OR IsGlobalBarrierBeneficial OR (saturation AND config+0x50 == 1) |
| REPLICA (shared) | 2 | replica_group_count | saturation (rgc == num_groups − 1) AND config+0x50 != 1; or map dedup found the key present |
| CUSTOM (fresh) | 3 | replica_group_count | otherwise, on a std::map cache miss |
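The three stores can be reproduced against a raw byte buffer to check the offsets. This is a minimal Python sketch assuming little-endian x86-64 and a 0x28-byte buffer; the buffer size is an assumption (just large enough to cover the three fields), not the real sizeof(BarrierConfig), and the helper names are hypothetical.

```python
import struct

HASBITS_OFF, ID_OFF, TYPE_OFF = 0x10, 0x18, 0x20

def write_barrier_config(buf: bytearray, type_: int, id_: int) -> None:
    """Apply the three stores: DWORD type, QWORD id, hasbits |= 3."""
    struct.pack_into("<i", buf, TYPE_OFF, type_)
    struct.pack_into("<q", buf, ID_OFF, id_)
    buf[HASBITS_OFF] |= 3  # mark both type and id as present

def read_barrier_config(buf: bytes):
    """Return (type, id, presence bits) from the same offsets."""
    (type_,) = struct.unpack_from("<i", buf, TYPE_OFF)
    (id_,) = struct.unpack_from("<q", buf, ID_OFF)
    return type_, id_, buf[HASBITS_OFF] & 3
```

Writing GLOBAL with id -1 leaves eight 0xff bytes at +0x18, which is the pattern to look for in a memory dump.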
GOTCHA — the saturation arm. The saturation case (replica_group_count == num_groups − 1) does not simply produce GLOBAL. It produces a shared REPLICA(2) (id = the real replica_group_count) unless the module-config flag config+0x50 == 1, in which case it falls to GLOBAL(1). The 2 write (*(int*)(this+0x20) = 2) at 0x109c704d is reached both from the saturation arm and from the std::map cache-hit CopyFrom. MEGACORE(4) is never written — there is no movl $4 into +0x20 in this function (full-.text xref; overview §2).
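The branch order of the selector can be captured in a small Python model. It is a sketch under stated assumptions, not the binary's code: Key is a hypothetical three-field stand-in for TensorCoreBarrierKey, the heuristic result is passed in as a bool, and the dedup map is a dict. What the map stores on a miss is not pinned on this page; the sketch stores the fresh config, which is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Tuple

GLOBAL, REPLICA, CUSTOM = 1, 2, 3

@dataclass(frozen=True)
class Key:
    ident: Hashable  # hypothetical stand-in for the full key identity
    rgc: int         # key+0x10, read as replica_group_count
    limit: int       # key+0x20, compared as rgc == limit - 1

def determine(key: Key, flag_0x50: bool, force_global: bool, beneficial: bool,
              cache: Dict[Hashable, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (type, id) following the branch order of the decompiled body."""
    if force_global or beneficial:
        return (GLOBAL, -1)
    if key.rgc == key.limit - 1:  # saturation arm
        return (GLOBAL, -1) if flag_0x50 else (REPLICA, key.rgc)
    fresh = (CUSTOM, key.rgc)
    # setdefault mirrors a try-emplace: a hit returns the stored config,
    # a miss stores the fresh one (assumption) and returns it.
    return cache.setdefault(key.ident, fresh)
```

Note that MEGACORE(4) has no path in this model, matching the absence of a store of 4 in the function.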
1.4 ForEachCollective @0x109c7060 — which ops are TC collectives
ForEachCollective iterates HloModule::computations(kTensorCoreThread) @0x10944b40, walks each instruction, and invokes the callback for each op whose opcode byte (instr+0xc) is in the collective set. The filter (@0x109c72b5 … 0x109c736d):
- a bitmask bt 0x1400001340, opcode over opcode <= 0x24 — set bits: all-gather(0x06), all-gather-start(0x08), all-reduce(0x09), all-to-all(0x0c), collective-permute(0x22), collective-permute-start(0x24);
- cmp 0x56 → ragged-all-to-all accepted;
- cmp 0x5d → reduce-scatter accepted;
- otherwise (custom-call(0x31)): GetCustomCallCollectiveId @0x109d6540 reads the CustomCallConfig backend-config field (hasbit 0x40, returns +0x78) → accepted iff it carries a TPU collective id.
The callback also threads the conflict graph through async-start/async-done pairs by recursing into operand positions (instr+0x40 / instr+0x48).
| byte | mnemonic | filter mechanism | barrier-key role |
|---|---|---|---|
| 0x06 | all-gather | bitmask | replica-group keyed |
| 0x08 | all-gather-start | bitmask | replica-group keyed (async start) |
| 0x09 | all-reduce | bitmask | replica-group keyed |
| 0x0c | all-to-all | bitmask | replica-group keyed; only opcode eligible for IsGlobalBarrierBeneficial |
| 0x22 | collective-permute | bitmask | source-target-pair keyed (+0x20) |
| 0x24 | collective-permute-start | bitmask | source-target-pair keyed (async start) |
| 0x31 | custom-call | GetCustomCallCollectiveId | accepted iff it has a TPU collective id |
| 0x56 | ragged-all-to-all | eq-check | replica-group keyed; +0x51/+0x58 special-case |
| 0x5d | reduce-scatter | eq-check | replica-group keyed (also the a2a-5d rewrite target) |
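The filter is easy to verify by decoding the mask. The following Python sketch models the three mechanisms in the table; the function names are hypothetical, and the has_collective_id flag stands in for the GetCustomCallCollectiveId result.

```python
COLLECTIVE_MASK = 0x1400001340
EQ_CHECKED = {0x56: "ragged-all-to-all", 0x5D: "reduce-scatter"}
CUSTOM_CALL = 0x31

def set_bits(mask: int) -> list:
    """List the bit positions that are set in mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

def is_tc_collective(opcode: int, has_collective_id: bool = False) -> bool:
    """Model of the opcode filter: bitmask, two equality checks, custom-call."""
    if opcode <= 0x24:
        return bool((COLLECTIVE_MASK >> opcode) & 1)
    if opcode in EQ_CHECKED:
        return True
    return opcode == CUSTOM_CALL and has_collective_id
```

Decoding the mask yields exactly the six opcodes listed as "bitmask" rows: 0x06, 0x08, 0x09, 0x0c, 0x22 and 0x24.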
NOTE — opcode→mnemonic map. The map is recovered from the HloOpcodeString length array @0x421c9c0 (lengths 10/16/10/10/18/24/11/17/14 match all-gather/all-gather-start/all-reduce/all-to-all/collective-permute/collective-permute-start/custom-call/ragged-all-to-all/reduce-scatter) plus .lrodata string presence; the char* table itself is relocated, so values are resolved by length + presence.
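The length claim can be checked mechanically. This short Python sketch pairs each opcode byte from the table with its mnemonic and compares the string lengths with the nine values quoted in the note; the dictionary itself is just a transcription of the table.

```python
MNEMONICS = {
    0x06: "all-gather",
    0x08: "all-gather-start",
    0x09: "all-reduce",
    0x0C: "all-to-all",
    0x22: "collective-permute",
    0x24: "collective-permute-start",
    0x31: "custom-call",
    0x56: "ragged-all-to-all",
    0x5D: "reduce-scatter",
}
EXPECTED_LENGTHS = [10, 16, 10, 10, 18, 24, 11, 17, 14]

def lengths_match() -> bool:
    """Compare mnemonic lengths, in ascending opcode order, with the note."""
    names = [MNEMONICS[op] for op in sorted(MNEMONICS)]
    return [len(name) for name in names] == EXPECTED_LENGTHS
```

Three of the nine names share length 10, so length alone cannot separate them, which is why the resolution also relies on string presence.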
1.5 IsGlobalBarrierBeneficial @0x109c6ee0 — the heuristic
Byte-exact body:
// 0x109c6ee0 (key = a1; config = a2) → bool
if (*(char*)(key + 0x51) == 0) // "barrier-heuristic candidate" byte not set
return false; // (VLOG(10) diagnostic path)
if (*(byte*)(key + 0x00) != 0x0c) // opcode != all-to-all
return false;
if (*(char*)(key + 0x50) != 0) // channel-id parity (cross-module) != 0
return false;
if (*(long*)(config + 0x170) == 1 || // num_partitions / replica_count along axis == 1 (cmpq)
*(long*)(config + 0x178) == 1)
return (*(long*)(key + 0x10) == 1); // replica_group_count == 1 ⇒ single group
return false;
A global barrier is "beneficial" exactly for a single-replica-group, channelled all-to-all on a topology axis whose tested partition/replica count is 1 — the degenerate all-to-all where a per-key barrier would synchronise a trivial group and a single shared global barrier is strictly cheaper.
GOTCHA — config+0x170 / config+0x178. The two == 1 tests are 64-bit cmpq compares (cmpq $0x1,0x170(%rdx) / cmpq $0x1,0x178(%rdx) @0x109c6ef3 / 0x109c6efd), byte-confirmed at those offsets and read as "single-device-along-axis" from use, but their proto field names (num_partitions vs replica_count vs a derived device count of HloModuleConfig) are inferred, not field-matched against the descriptor. The behaviour (== 1 on either ⇒ test replica_group_count == 1, itself a cmpq $0x1,0x10(%rsi)) is CERTAIN; the field naming is LOW.
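For testing a reimplementation, the heuristic can be expressed as a pure function over the six scalars it reads. This Python sketch keeps the parameters named by offset rather than by proto field, since the field names are inferred; the function name and parameter names are otherwise hypothetical.

```python
ALL_TO_ALL = 0x0C

def is_global_barrier_beneficial(candidate: bool, opcode: int, byte_0x50: int,
                                 cfg_0x170: int, cfg_0x178: int,
                                 replica_group_count: int) -> bool:
    """Mirror the gate order: key+0x51, key+0x00, key+0x50, config, key+0x10."""
    if not candidate:             # key+0x51 not set
        return False
    if opcode != ALL_TO_ALL:      # key+0x00
        return False
    if byte_0x50 != 0:            # key+0x50
        return False
    if cfg_0x170 == 1 or cfg_0x178 == 1:
        return replica_group_count == 1  # key+0x10
    return False
```

Only one combination returns true: a candidate all-to-all with a zero +0x50 byte, at least one of the two config scalars equal to 1, and a single replica group.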
1.6 TensorCoreBarrierKey ctor @0x109d6200 / operator< @0x109d6620
The key is what two collectives must agree on to be candidates for sharing a barrier (the coloring then decides whether they actually may). Ctor TensorCoreBarrierKey(HloInstruction* hlo, function<int(HloInstruction*)> core_id_fn, bool b):
| Offset | Field | Source |
|---|---|---|
| +0x00 | opcode byte | hlo+0xc; all-reduce(0x09) is rewritten to reduce-scatter(0x5d) when its computation has ≥2 caller instructions (cmpb $0x9 @0x109d6297 → caller_instructions() ≥ 2 → movb $0x5d @0x109d62db) |
| +0x08…+0x18 | sorted vector<ReplicaGroup> | copied from hlo+0xd0/hlo+0xd8 (__assign_with_size @0x109d6f60), __introsort'd @0x109d7580 (order-independent); +0x10 = group count |
| +0x20…+0x28 | sorted vector<pair<long,long>> | source-target pairs (for collective-permute 0x22/0x24): __assign_with_size @0x109d7220 + __introsort @0x109da4e0 |
| +0x38 | custom-call config scalar | all-to-all(0xc) custom-calls: reads CustomCallConfig+0x78 when hasbit 0x40 set; also sets +0x50 true |
| +0x40 | core-id (int) | -1 default, else core_id_fn(hlo) (invoked via [fn+0x10] when the function target is non-null) |
| +0x50 | channel-id parity | hlo->channel_id() & 1 (HloInstruction::channel_id @0x1e59ff80) — the byte IsGlobalBarrierBeneficial gates on |
| +0x51 | heuristic-candidate flag | the byte IsGlobalBarrierBeneficial gates on (cmpb $0x0,0x51(%rdi) @0x109c6ee0); not written by this ctor — the ctor only zero-touches +0x08…+0x37, +0x38, +0x40, +0x48, +0x50, +0x58, so the candidate byte is set on a different path (standing gap) |
| +0x58 | secondary discriminant | -1 sentinel written only when the ctor's bool b is set AND opcode == ragged-all-to-all(0x56) (test %r15b,%r15b @0x109d64dd → cmpb $0x56,0xc(%r14) @0x109d64e2 → movq $-1,0x58(%rbx) @0x109d64e9); the ragged-all-to-all global-disable case |
operator< @0x109d6620 compares in order: +0x58, +0x38, +0x00 (opcode), +0x50 (channel parity), then the replica-group vector (+0x10 count, then per-group sorted device ids). Two collectives share a TC barrier key iff same opcode, same channel parity, same custom-call scalar, same secondary discriminant, and identical sorted replica-group membership — the dense-collective analog of the SparseCore ring key, keyed on HLO replica groups rather than ICI ring offsets.
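The comparison order maps directly onto tuple comparison. The Python sketch below models only the fields operator< is described as reading; BarrierKey and make_key are hypothetical names, the zero defaults reflect the ctor's zero-touch of those offsets, and canonicalising by sorting both the ids inside each group and the groups themselves is an assumption about what "order-independent" covers.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class BarrierKey:
    opcode: int                                  # +0x00
    replica_groups: Tuple[Tuple[int, ...], ...]  # +0x08, stored sorted
    cc_scalar: int = 0                           # +0x38
    channel_parity: int = 0                      # +0x50
    secondary: int = 0                           # +0x58

    def sort_key(self):
        groups = self.replica_groups
        # Field order follows the documented compare order.
        return (self.secondary, self.cc_scalar, self.opcode,
                self.channel_parity, len(groups), groups)

    def __lt__(self, other: "BarrierKey") -> bool:
        return self.sort_key() < other.sort_key()

def make_key(opcode, groups, **fields) -> BarrierKey:
    # Sort ids inside each group, then the groups, so input order is irrelevant.
    canon = tuple(sorted(tuple(sorted(g)) for g in groups))
    return BarrierKey(opcode, canon, **fields)
```

Two collectives whose replica groups differ only in listing order build equal keys, and a key carrying the -1 secondary discriminant sorts ahead of every key that leaves it at zero.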
2. ExplicitUniDirRingStrategy::InitializeOnScs and the strategy+0x98 callback
This is a different subsystem from §1: the SparseCore-side ring strategy runtime init. Where the TC pass decides a barrier kind, InitializeOnScs binds the per-ring transfer geometry the SC sequencer's barrier sync_add will target. It is the final hop of the static IciStrategyRingConfig EXPLICIT-table → programmed next_chip bridge.
2.1 InitializeOnScs @0x1337aa60 — the fold + callback dispatch
System-V args: this = strategy, OpBuilder&, LocationGenerator, then three Values (scs ordinal, scs partition, a third Value triple). The decompiled body confirms every step:
// 0x1337aa60 (a1 = strategy; a2 = OpBuilder; a5 = scs partition Value; a3 = scs ordinal Value)
if (!a5) // line 351
return RetCheckFailSlowPath("config != nullptr");
UniDirRingStrategy::InitializeOnScs(a1, a2, a3, a4, ...); // base init; stores +0x68/+0x70
if (base != OK) return AddSourceLocationImpl(354);
CoreIndex = OffloadFactory::GetCoreIndex(a1+8, a2, a4); // 0x133e6aa0 (per-replica core index)
v17 = *(a1+8); // OffloadFactory / Target subobject
v18 = Target::CoresPerChip(v17, /*SC*/2); // 0x1d615b40
v19 = Target::LogicalDevicesPerChip(v17, /*SC*/2); // 0x1d615b00
fold = v18 / v19; // unsigned divide (megacore fold)
v22 = OffloadFactory::IdxConst(a1+8, a2, fold); // 0x133e6ba0 (MLIR index constant)
v23 = OffloadFactory::DivU(a1+8, a2, CoreIndex, v22); // 0x133e6a60 → arith::DivUIOp
ordinal = v23;
global_core_id = OffloadFactory::ToGlobalCoreId(a1+8, a2, partition, v23); // 0x133e6880
// THE CALLBACK — call QWORD PTR [strategy + 0x98]
(*(a1 + 0x98))(&record /*StatusOr<ExplicitRingRecord>*/,
a1 + 0x88 /*captured table descriptor*/,
a1 + 0x08 /*OffloadFactory subobject*/,
a2 /*OpBuilder*/,
&global_core_id,
&ordinal);
if (record == OK) { // writeback (double-init guarded)
CHECK(strategy+0x58 == null, "ordinal_ == nullptr"); // line 250
*(a1 + 0x58) = record.ordinal; // ordinal_
CHECK(strategy+0x60 == null, "next_chip_ == nullptr"); // line 245
*(a1 + 0x60) = record.next_chip; // next_chip_
CHECK(strategy+0x78 == null, "reordering_map_ == nullptr"); // line 255
*(a1 + 0x78) = record.reorder; // reordering_map_
return OK;
}
return AddSourceLocationImpl(362); // line 362
The three writeback fields are named in the LogMessageFatal strings the decompile carries: ordinal_ at strategy+0x58, next_chip_ at +0x60, reordering_map_ at +0x78 — each guarded against a second init (a re-init dies fatal). next_chip_ is then consumed by next_chip() @0x1337a200 → GlobalCoreIdToPhysicalChipId → ComputeRemoteCoreIndex → the SyncAddOp the SC custom-barrier emits (Barrier → SFLAG Binding).
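The double-init guard is a write-once pattern. The Python sketch below is a behavioural model only: RingBinding is a hypothetical class, an exception stands in for the fatal log, and the message text reuses the CHECK strings quoted above.

```python
class RingBinding:
    """Write-once holder for the three fields the callback returns."""
    _FIELDS = ("ordinal_", "next_chip_", "reordering_map_")

    def __init__(self):
        for name in self._FIELDS:
            setattr(self, name, None)

    def bind(self, ordinal, next_chip, reorder) -> None:
        # Each field is checked immediately before it is written,
        # in the same order as the decompiled writeback.
        for name, value in zip(self._FIELDS, (ordinal, next_chip, reorder)):
            if getattr(self, name) is not None:
                raise RuntimeError(f"{name} == nullptr")  # CHECK failed
            setattr(self, name, value)
```

A second bind fails on the first field, ordinal_, before anything is overwritten.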
NOTE — line numbers. The RetCheck / AddSourceLocationImpl line constants (351, 354, 362) and the writeback CHECK lines (245/250/255) match platforms/xla/sparse_core/offload_collective_strategies.cc exactly (the immediates 0x15f=351, 0x162=354, 0x16a=362). The CoresPerChip(2) / LogicalDevicesPerChip(2) fold is the same megacore-fold divisor used by ToPartnerGlobalCoreId.
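The fold arithmetic is two unsigned integer divisions. The Python sketch below models it on plain integers; in the binary the second division is emitted as an MLIR arith::DivUIOp rather than computed on the host, and the chip parameters in the usage line are illustrative values, not figures for any real TPU generation.

```python
def fold_core_index(core_index: int, cores_per_chip: int,
                    logical_devices_per_chip: int) -> int:
    """ordinal = core_index / (CoresPerChip / LogicalDevicesPerChip)."""
    if min(core_index, cores_per_chip, logical_devices_per_chip) < 0:
        raise ValueError("unsigned operands expected")
    fold = cores_per_chip // logical_devices_per_chip
    if fold == 0:
        raise ZeroDivisionError("fold divisor is zero")
    return core_index // fold

# Illustrative only: 2 cores per chip exposed as 1 logical device
# folds core indices 0..3 to ordinals 0, 0, 1, 1.
ordinals = [fold_core_index(i, 2, 1) for i in range(4)]
```

When the two per-chip counts are equal the fold is 1 and the ordinal equals the core index.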
2.2 The callback body — ExplicitRingRecord @0x133a9a40 (plain) / ExplicitAllToAllRingRecord @0x133a94a0 (a2a)
The strategy+0x98 slot is installed by the per-color factory closures inside CreateRingStrategiesForNdFromExplicitTable (emitter_helpers). The factory → strategy map (the demangled $_N lambdas confirm the record types):
| Factory | Builds | Callback slot | Record type | kind selector |
|---|---|---|---|---|
| $_0 @0x133a9080 | D2DUniDirRingStrategy (sizeof 0x98) | — | — | — |
| $_1 @0x133a91e0 | ExplicitUniDirAllToAllRingStrategy (sizeof 0xe8, vtable @0x21908ec8) | +0xb0 = 0x133a94a0 | ExplicitAllToAllRingRecord | 0x2 |
| $_2 @0x133a9840 | plain ExplicitUniDirRingStrategy (sizeof 0xc0, vtable @0x21908e30) | +0x98 = 0x133a9a40 | ExplicitRingRecord | 0x4 |
NOTE — the factory map. The demangled symbols pin the three lambda slots: $_0 builds D2DUniDirRingStrategy, $_1 builds the all-to-all strategy (returning ExplicitAllToAllRingRecord), and $_2 builds the plain explicit ring (returning ExplicitRingRecord). The $_2 install at 0x133a99d9 writes the 0x133a9a40 lambda into strategy+0x98 — the exact call target of §2.1.
The plain lookup ExplicitRingRecord @0x133a9a40 body:
- ctx = [strategy+0x88] (the table descriptor) → r14: r14[0] = an inner const-literals accessor object (its operator() at accessor+0x10); r14[0x10] = id_info_offset (movsxd int32); r14[0x18] = the table base ptr; r14[0x20] / r14[0x21] = a byte selector + a variant tag.
- inner call: push (kind = 0x4, base, byte-selector, &out, global_core_id, ordinal); call QWORD PTR [accessor+0x10] @0x133a9a96 — the accessor that slices the const-literals int vector at [base + id_info_offset…] and materialises the per-core neighbor/id/group constant as MLIR Values.
- on success: next_chip = out[0], ordinal = out[8 or 0x10] (the r14+0x21 variant tag selects the field), reorder = out[0x20] → written into the StatusOr<ExplicitRingRecord> (record.next_chip +0x8, record.ordinal +0x10, record.reorder +0x18, record.ok +0x0). Free the temporary vector. RetCheck line 0xfe5.
The all-to-all twin ExplicitAllToAllRingRecord @0x133a94a0 is identical in shape but uses kind selector 0x2 and threads the result through OffloadFactory::location @0x131e9ca0 + Target::GranuleBytes @0x1d617f80 + BufferOffset::Create @0x133ea5c0 + BufferOffset::WithOffsetElements @0x133eace0 to convert the group-info table slice into a byte BufferOffset (the per-core group-info window). RetCheck line 0xfb9.
NOTE — the final hop. The callback is where the static EXPLICIT offsets become a programmed Value: id_info_offset (table_desc+0x10) + the table base (+0x18) index the runtime const-literals vector, keyed by the runtime global_core_id / ordinal, to produce the next_chip the per-ring SyncAdd targets. The plain ring returns raw Values; the a2a ring returns a BufferOffset-wrapped group window. The 0x4 vs 0x2 selector chooses which.
INFERRED — the inner accessor body. The accessor object reached via call [accessor+0x10] has three proven captured members (+0x10 = id_info_offset, +0x18 = table base, +0x20/+0x21 = byte + variant selectors) and a pinned call signature (global_core_id, ordinal, base, selector, kind 0x4/0x2 → out triple), but its own body — the closure that performs the spmem load / DivU-Mod indexing into the const-literals vector — was not separately disassembled. Its CALL signature is CONFIRMED; the const-literals vector entry semantics are the standing producer gap (LOW).
3. The two subsystems, one chip SFLAG block
The TC assignment pass (§1) and the explicit-ring InitializeOnScs (§2) sit on opposite ends of the same on-chip barrier model: the TC pass chooses a BarrierConfig for dense TC collectives; InitializeOnScs binds the per-ring transfer geometry for SC embedding collectives. Both ultimately program chip SFLAGs (Barrier → SFLAG Binding).
| Aspect | TensorCore (§1) | SparseCore explicit ring (§2) |
|---|---|---|
| Pass / init | TensorCoreBarrierAssignment::Run @0x109c7420 | ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60 |
| Identity unit | TensorCoreBarrierKey (opcode, channel parity, replica-group / src-tgt vectors, cc scalar) | static IciStrategyRingConfig EXPLICIT offsets (id_info / group_info) |
| Dedup / bind mechanism | greedy graph coloring (Barrier Coloring) → BarrierConfig {1,2,3} | strategy+0x98 const-literals lookup → (next_chip, ordinal, reorder) |
| Output destination | HLO BackendConfig.BarrierConfig submessage | strategy+0x58/+0x60/+0x78 (ordinal_ / next_chip_ / reordering_map_) |
| Hardware sink | chip SFLAG block via the kernel emitter | per-ring SyncAdd targeting next_chip_ on the chip SFLAG block |
The split is purely in how a barrier identity is chosen and bound: the TC pass colors the live async-collective conflict graph; the explicit ring indexes a static config table by the runtime core id. Both arms end at the chip SFLAG tier (overview §1).
4. Verification notes
Byte-exact in libtpu.so v0.0.40:
- DetermineBarrierConfigForKey @0x109c6fa0: BarrierConfig::BarrierConfig(this,0); IsGlobalBarrierBeneficial; force_global || beneficial → type=1, id=-1; saturation (key+0x10 == key+0x20 − 1) with config+0x50 != 1 → type=2, id=replica_group_count; else type=3 fresh + std::map __try_key_extraction_impl → cache hit CopyFrom(node+0x80). Type at +0x20, id at +0x18, hasbits +0x10 |= 3. No movl $4 — exact.
- ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60: if (!config) RetCheck line 351; base UniDirRingStrategy::InitializeOnScs → AddSourceLocationImpl(354); GetCoreIndex; CoresPerChip(2) / LogicalDevicesPerChip(2) unsigned-divide fold; IdxConst; DivU; ToGlobalCoreId; call [a1+0x98] with (&record, a1+0x88, a1+0x08, OpBuilder, &global_core_id, &ordinal); writeback *(a1+0x58)=ordinal_, *(a1+0x60)=next_chip_, *(a1+0x78)=reordering_map_, each LogMessageFatal-guarded (lines 250/245/255); AddSourceLocationImpl(362) — exact.
- Symbol confirmation: TensorCoreBarrierAssignment::{Run, ForEachCollective, IsGlobalBarrierBeneficial, DetermineBarrierConfigForKey} and TensorCoreBarrierKey present with full demangled symbols; ExplicitUniDirRingStrategy::InitializeOnScs and the CreateRingStrategiesForNdFromExplicitTable $_1 (→ ExplicitAllToAllRingRecord) / $_2 (→ ExplicitRingRecord) lambdas present, confirming the factory→record map and the (OffloadFactory&, OpBuilder&, Value, Value) → StatusOr<…Record> callback signature.
[LOW]
- IsGlobalBarrierBeneficial's config+0x170 / config+0x178 scalars are offset-confirmed and read as "single-device-along-axis", but the proto field names (num_partitions vs replica_count) are inferred, not descriptor-matched.
- The inner const-literals accessor body (call [accessor+0x10], §2.2): its three captured members and call signature are pinned, but the closure that slices the const-literals vector and materialises the per-core Values was not separately disassembled.
- The TensorCoreBarrierKey sub-member semantics (which vector is replica-groups vs src-tgt-pairs) are read from the __assign_with_size / __introsort sources and the opcode branch; the offsets are CONFIRMED, the role naming is structural.
Cross-References
Barrier algorithms (this section)
- Overview — the SFLAG-based barrier model, the BarrierType enum, and the InferBarrierConfig normaliser (the second BarrierConfig writer this page's producer feeds)
- Barrier Coloring — the greedy interference-graph engine whose merged conflict set is the has_conflict input to DetermineBarrierConfigForKey
- Barrier → SFLAG Binding — how a chosen BarrierConfig {type, id} (and the bound next_chip_) becomes a concrete chip SFLAG number + signal/wait ops
- Replica Barrier — the within-replica-group tree barrier (REPLICA(2)), the shared-id outcome this pass can emit
Sibling subsystems
- Megacore Fusion — the megacore-fold collective fusion whose ring strategies InitializeOnScs binds
Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d).