TensorCore Barrier Assignment and InitializeOnScs

Every address, field offset, opcode value, and enum value on this page was read from libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d; build libtpu_lts_20260413_b_RC00; 781,691,048 B, not stripped — full C++ symbols). .text VMA equals file offset at 0xe63c000; .rodata / .lrodata are identity-mapped. All addresses are VMA. Other wheel versions differ.

Abstract

This page owns two byte-anchored pieces that bracket the TensorCore (TC) on-chip barrier path. The first is the TC barrier-assignment pass TensorCoreBarrierAssignment::Run @0x109c7420 and its per-key kind selector DetermineBarrierConfigForKey @0x109c6fa0, the TensorCore counterpart of the SparseCore coloring/assignment. It walks every TC collective, builds a TensorCoreBarrierKey per op, takes the merged conflict set of the two greedy coloring passes as input, and assigns each distinct key a BarrierConfig {type, id} that is written back into the collective's HLO BackendConfig. The second is ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60 and its lookup callback (ExplicitRingRecord @0x133a9a40 at strategy+0x98 for the plain ring, ExplicitAllToAllRingRecord @0x133a94a0 at strategy+0xb0 for the all-to-all twin): the SparseCore-side runtime init that turns the static explicit ring-config table into a programmed next_chip / ordinal / reorder triple bound onto the strategy object.

These are two distinct subsystems joined by the chip SFLAG model. The TC assignment pass decides a barrier kind per key (global vs per-key) and the InitializeOnScs callback binds the per-ring transfer geometry the SC sequencer's barrier sync_add targets — both ultimately program the same chip SFLAG block. The generic greedy coloring engine is owned by Barrier Coloring (this page consumes its conflict set, it does not re-derive it); the BarrierConfig.id → concrete SFLAG-number formulas are owned by Barrier → SFLAG Binding (this page produces the {type, id}, it does not lower it). The BarrierType enum, the InferBarrierConfig normaliser, and the reserved-block geometry are on the overview.

For reimplementation, the contract is:

  • TC dedups by graph color, SC by static-key hash. Run runs the two BarrierColoring passes (Barrier Coloring), merges their conflict sets, then per distinct TensorCoreBarrierKey calls DetermineBarrierConfigForKey(key, config, has_conflict). A conflicting key (one whose first instruction is in the merged conflict set) enters the selector with has_conflict set, which the decompiled body treats as force_global and resolves to GLOBAL(1); a non-conflicting key takes GLOBAL(1), a shared REPLICA(2), or a fresh CUSTOM(3) according to the arms in §1.3.
  • DetermineBarrierConfigForKey writes only {1,2,3}. Type at message +0x20, id at +0x18, presence hasbits +0x10 |= 3. GLOBAL(1) carries id = -1; REPLICA(2) and CUSTOM(3) carry id = replica_group_count (key+0x10). REPLICA(2) is chosen on the saturation arm, or when the std::map<key, BarrierConfig> dedup finds the key already present. MEGACORE(4) is never written here.
  • IsGlobalBarrierBeneficial is a narrow heuristic. Global is "beneficial" only for a channelled all-to-all(0xc) whose tested topology axis is unit-sized and which spans a single replica group — the degenerate case where a custom per-key barrier would synchronise a trivial group.
  • InitializeOnScs is the static-config → hardware final hop. It folds the runtime core index by CoresPerChip / LogicalDevicesPerChip, then calls [strategy+0x98] with the captured table descriptor (strategy+0x88) and the runtime global_core_id / ordinal; the callback indexes the const-literals table and returns (next_chip, ordinal, reorder), written to strategy+0x58 (ordinal_) / +0x60 (next_chip_) / +0x78 (reordering_map_). next_chip_ is what the per-ring SyncAdd targets.

| Item | Value |
|---|---|
| TC assignment pass | xla::jellyfish::TensorCoreBarrierAssignment::Run @0x109c7420 |
| Per-key kind selector | DetermineBarrierConfigForKey @0x109c6fa0 (writes {1,2,3}; never 4) |
| Global heuristic | IsGlobalBarrierBeneficial @0x109c6ee0 |
| Collective filter | ForEachCollective @0x109c7060 (opcode bitmask 0x1400001340 + 0x56/0x5d/0x31) |
| Key | TensorCoreBarrierKey ctor @0x109d6200, operator< @0x109d6620 |
| Coloring engine (sibling) | BarrierColoring<…>::Run @0x109cf600 / 0x109d1a60 → Barrier Coloring |
| Config carrier | BackendConfig.BarrierConfig {type @+0x20, id @+0x18, hasbits @+0x10} |
| Explicit-ring runtime init | ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60 |
| strategy+0x98 callback (plain) | ExplicitRingRecord lambda @0x133a9a40 ($_2, kind selector 0x4) |
| strategy+0xb0 callback (a2a) | ExplicitAllToAllRingRecord lambda @0x133a94a0 ($_1, kind selector 0x2) |
| Writeback | strategy+0x58 = ordinal_, +0x60 = next_chip_, +0x78 = reordering_map_ |
| Confidence | CONFIRMED (both bodies decompiled and re-checked) unless a row says otherwise |

1. The TC barrier-assignment pass

1.1 Where Run sits — the SC/TC fork

BarrierAssignment::RunImpl @0x109c8c00 is the shared entry. It receives a flat_hash_set<string_view> of XLA thread-pool names and forks on which threads are present:

  • kTensorCoreThread @0x217f78d8 present → build a TensorCoreBarrierAssignment on the stack from the base config fields and call Run (call site @0x109c8d6e).
  • kSparseCoreThread @0x217f5ca8 present → build a SparseCoreBarrierAssignment and call its Run (call site @0x109c8e8e).

The two passes' changed results are ORed together, so a module compiled for both a TensorCore thread and a SparseCore thread runs both passes. The on-chip barrier model is the union of the colored-map TC barriers and the static-keyed SC barriers; both program chip SFLAGs (overview §1), differing only in how a barrier identity is chosen.

BarrierAssignment::RunImpl @0x109c8c00
  ├─ if kTensorCoreThread present → TensorCoreBarrierAssignment::Run @0x109c7420   (THIS PAGE §1.2)
  │     coloring ×2 → conflict set → per-key DetermineBarrierConfigForKey → BackendConfig writeback
  └─ if kSparseCoreThread present → SparseCoreBarrierAssignment::Run (call site @0x109c8e8e; static ring-key dedup)
  result.changed = TC.changed | SC.changed
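
The fork can be modelled as a small Python function. This is a minimal sketch under stated assumptions: run_impl and the two callables are hypothetical stand-ins for the real passes, and the thread-name strings are placeholders for the kTensorCoreThread / kSparseCoreThread constants, whose spellings this page does not give.

```python
# Placeholder spellings; the binary compares against the
# kTensorCoreThread / kSparseCoreThread string constants.
TENSOR_CORE_THREAD = "tensorcore"
SPARSE_CORE_THREAD = "sparsecore"


def run_impl(thread_names, run_tc, run_sc):
    """Run each pass whose thread is present and OR the 'changed' results."""
    changed = False
    if TENSOR_CORE_THREAD in thread_names:
        changed |= run_tc()   # TensorCore pass
    if SPARSE_CORE_THREAD in thread_names:
        changed |= run_sc()   # SparseCore pass
    return changed


# Both threads present: both passes run, and one 'changed' is enough.
print(run_impl({TENSOR_CORE_THREAD, SPARSE_CORE_THREAD},
               lambda: False, lambda: True))
```

With neither thread present, neither pass runs and the result stays False.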

1.2 TensorCoreBarrierAssignment::Run @0x109c7420

Run is the TC counterpart of the SparseCore coloring/assignment. Its five phases:

  1. Build the policy base. Construct an AsyncOpPolicy<TensorCoreBarrierKey> (vtable @0x217f7a68) carrying the megacore byte (this+0x18), the core-id std::function<int(HloInstruction*)> (this+0x58), and two policy bools (this+0x60, this+0x68). Re-vtable it to AsyncCollectivePermutePolicy<TensorCoreBarrierKey> (vtable @0x217f7ac8).
  2. First coloring pass. BarrierColoring<AsyncCollectivePermutePolicy<…>>::Run @0x109cf600 returns StatusOr<pair<map<HloInstruction*,long>, flat_hash_set<HloInstruction*>>>: a per-op color and a conflict-op set. Merge the colored ops into a std::map<HloInstruction*,long> (frame -0x158) and the conflict ops into a FlatHashSet (frame -0x140).
  3. Second coloring pass. Re-vtable the policy to AsyncAllToAllWithAsyncBarrierPolicy<TensorCoreBarrierKey> (vtable @0x217f7b30) and run BarrierColoring<…>::Run @0x109d1a60; merge its coloring + conflict set into the same -0x158 map / -0x140 set.
  4. Collect collectives. ForEachCollective @0x109c7060 (§1.4) walks every TC-thread collective, builds each TensorCoreBarrierKey, and appends (key, InlinedVector<HloInstruction*, 2>) pairs into a vector (entry stride 0x78). std::__stable_sort with the Run::$_2 comparator @0x109ccdc0 makes the iteration order deterministic.
  5. Per-key assign loop (@0x109c7ed0 … 0x109c8250, r12 = entry, stride 0x78):
    • look up the entry's first instruction in the merged conflict FlatHashSet (raw_hash_set::find @0x109cf200) → has_conflict;
    • config = HloModule->config (module+0x20);
    • DetermineBarrierConfigForKey(&result, key, config, has_conflict) @0x109c6fa0 (§1.3) → a BarrierConfig;
    • ragged-all-to-all special-case @0x109c7f89: if opcode == ragged-all-to-all(0x56) AND the resulting kind != 3 (not a fresh per-key id) AND the policy bool this+0x68 is set → divert to the a2a async-barrier special handling (jump 0x109c8371);
    • for every HloInstruction in the entry's InlinedVector (each colored op sharing this key): open its BackendConfig (BackendConfigWrapper::GetProto @0x1e60dc60), set its BarrierConfig sub-message = the DetermineBarrierConfigForKey result (BarrierConfig::CopyFrom @0x1d6f1fc0), set the BackendConfig hasbit (or BYTE [-0x600], 0x10), and re-serialise via CloneBackendConfigProto @0x1e60dac0 + BackendConfigWrapper::operator= @0x1e60de40.

NOTE — dedup by color, not by key hash. Ops the coloring engine proves can share a barrier (non-conflicting in the async dependency graph) get the same color and can share a barrier id; ops with the same TensorCoreBarrierKey but a graph conflict are split apart. This is the structural difference from the SparseCore FlatHashMap dedup, which keys purely on the static ring-config and has no notion of schedule overlap. The interference-graph construction and first-fit color search are owned by Barrier Coloring; this page only consumes the conflict set as has_conflict.
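
The phase 4 and phase 5 flow can be modelled in a few lines of Python. This is a minimal sketch, not the binary's code: assign_barriers, the tuple keys, and the determine callback are hypothetical stand-ins; sorting by key stands in for the Run::$_2 comparator, whose ordering this page does not pin; and the ragged-all-to-all diversion is omitted.

```python
def assign_barriers(entries, conflict_ops, determine):
    """Model of the per-key assign loop.

    entries:      list of (key, [op, ...]) pairs, one per distinct key
    conflict_ops: merged conflict set from both coloring passes
    determine:    callable (key, has_conflict) -> (type, id)
    """
    assigned = {}
    # sorted() is stable, like std::__stable_sort; ordering by key is an assumption.
    for key, ops in sorted(entries, key=lambda entry: entry[0]):
        # Only the entry's first instruction is probed in the conflict set.
        has_conflict = ops[0] in conflict_ops
        config = determine(key, has_conflict)
        # Every op sharing the key receives the same BarrierConfig.
        for op in ops:
            assigned[op] = config
    return assigned


def toy_determine(key, has_conflict):
    # Hypothetical selector: GLOBAL(1) on conflict, else CUSTOM(3) with a dummy id.
    return (1, -1) if has_conflict else (3, key[1])


entries = [(("all-reduce", 2), ["ar0", "ar1"]), (("all-gather", 4), ["ag0"])]
result = assign_barriers(entries, {"ag0"}, toy_determine)
print(result)
```

The point of the model is the data flow: one conflict lookup and one selector call per key, then a fan-out of the same config to every op in the entry.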

1.3 DetermineBarrierConfigForKey @0x109c6fa0 — the kind selector

Effective signature: BarrierConfig DetermineBarrierConfigForKey(const TensorCoreBarrierKey& key, const HloModuleConfig& config, bool force_global). The decompiled body is short and pins every store:

// 0x109c6fa0  (this = &result BarrierConfig; a2 = key; a3 = config; a5 = has_conflict/force_global)
BarrierConfig::BarrierConfig(this, 0);                       // zero-init result
beneficial = IsGlobalBarrierBeneficial(key, config);         // 0x109c6ee0 (§1.5)

if (force_global || beneficial) {                            // → GLOBAL
LABEL_global:
    *(int  *)(this + 0x20) = 1;     // type  = GLOBAL(1)
    *(long *)(this + 0x18) = -1;    // id    = -1 (none)
    *(char *)(this + 0x10) |= 3;    // hasbits: type+id present
    return this;
}

v9 = *(long*)(key + 0x10);          // replica_group_count
if (v9 == *(long*)(key + 0x20) - 1) {           // spans all-but-one group (num_groups - 1)
    if (*(char*)(config + 0x50) != 1) {          // "use global on saturation" flag NOT set
        *(int *)(this + 0x20) = 2;  // type = REPLICA(2)  (shared)
        *(long*)(this + 0x18) = v9; // id   = replica_group_count
        *(char*)(this + 0x10) |= 3;
        return this;
    }
    goto LABEL_global;              // saturation + flag set → GLOBAL(1)
}

// ELSE: provisional fresh per-key
*(int *)(this + 0x20) = 3;          // type = CUSTOM(3)  (fresh)
*(long*)(this + 0x18) = v9;         // id   = replica_group_count (provisional)
*(char*)(this + 0x10) |= 3;
v11 = std::map<key,BarrierConfig>::__try_key_extraction_impl(key, key, config, this);  // 0x109cbd60
if ((dl & 1) == 0)                  // emplace did NOT insert ⇒ key already present (cache HIT)
    BarrierConfig::CopyFrom(this, (BarrierConfig*)(v11 + 0x80));   // adopt the existing config (its type==2)
return this;                        // cache MISS ⇒ keep the fresh type-3 config

The numeric layout (this+0x20 is the DWORD type at byte offset 0x20; this+0x18 is the QWORD id; this+0x10 is the hasbit byte):

| Outcome | type @+0x20 | id @+0x18 | when |
|---|---|---|---|
| GLOBAL | 1 | -1 | force_global OR IsGlobalBarrierBeneficial OR (saturation AND config+0x50 == 1) |
| REPLICA (shared) | 2 | replica_group_count | saturation (rgc == num_groups − 1) AND config+0x50 != 1; or map dedup found the key present |
| CUSTOM (fresh) | 3 | replica_group_count | otherwise, on a std::map cache miss |

GOTCHA — the saturation arm. The saturation case (replica_group_count == num_groups − 1) does not simply produce GLOBAL. It produces a shared REPLICA(2) (id = the real replica_group_count) unless the module-config flag config+0x50 == 1, in which case it falls to GLOBAL(1). The 2 write (*(int*)(this+0x20) = 2) at 0x109c704d is reached both from the saturation arm and from the std::map cache-hit CopyFrom. MEGACORE(4) is never written — there is no movl $4 into +0x20 in this function (full-.text xref; overview §2).
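
The three arms can be restated as a small Python model, which makes the saturation branch easy to exercise. This is a sketch under stated assumptions: determine_barrier_config and its parameter names are hypothetical, the inputs are passed as plain values rather than read from key/config offsets, and a dict stands in for the std::map. What a cache miss inserts is an assumption here (the fresh config); the decompile only pins that a hit adopts the stored config.

```python
GLOBAL, REPLICA, CUSTOM = 1, 2, 3


def determine_barrier_config(key, rgc, key_0x20, cfg_0x50,
                             force_global, beneficial, cache):
    """Return (type, id) following the arms of the decompiled body.

    rgc      : value read at key+0x10 (replica_group_count)
    key_0x20 : value read at key+0x20 (compared as key_0x20 - 1)
    cfg_0x50 : byte read at config+0x50
    cache    : dict standing in for the std::map<key, BarrierConfig>
    """
    if force_global or beneficial:
        return (GLOBAL, -1)
    if rgc == key_0x20 - 1:          # saturation arm
        if cfg_0x50 != 1:
            return (REPLICA, rgc)    # shared, id = replica_group_count
        return (GLOBAL, -1)          # saturation + flag set
    fresh = (CUSTOM, rgc)
    # Assumption: a miss inserts the fresh config; a hit returns the stored one.
    return cache.setdefault(key, fresh)


print(determine_barrier_config("k", 3, 4, 0, False, False, {}))
```

Flipping cfg_0x50 to 1 in the call above moves the same key from the REPLICA arm to the GLOBAL arm, which is the behaviour the GOTCHA describes.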

1.4 ForEachCollective @0x109c7060 — which ops are TC collectives

ForEachCollective iterates HloModule::computations(kTensorCoreThread) @0x10944b40, walks each instruction, and invokes the callback for each op whose opcode byte (instr+0xc) is in the collective set. The filter (@0x109c72b5 … 0x109c736d):

  • a bitmask test (bt against 0x1400001340) for opcode <= 0x24; set bits: all-gather(0x06), all-gather-start(0x08), all-reduce(0x09), all-to-all(0x0c), collective-permute(0x22), collective-permute-start(0x24);
  • cmp 0x56 → ragged-all-to-all accepted;
  • cmp 0x5d → reduce-scatter accepted;
  • otherwise (custom-call(0x31)): GetCustomCallCollectiveId @0x109d6540 reads the CustomCallConfig backend-config field (hasbit 0x40, returns +0x78) → accepted iff it carries a TPU collective id.

The callback also threads the conflict graph through async-start/async-done pairs by recursing into operand positions (instr+0x40 / instr+0x48).

| byte | mnemonic | filter mechanism | barrier-key role |
|---|---|---|---|
| 0x06 | all-gather | bitmask | replica-group keyed |
| 0x08 | all-gather-start | bitmask | replica-group keyed (async start) |
| 0x09 | all-reduce | bitmask | replica-group keyed |
| 0x0c | all-to-all | bitmask | replica-group keyed; only opcode eligible for IsGlobalBarrierBeneficial |
| 0x22 | collective-permute | bitmask | source-target-pair keyed (+0x20) |
| 0x24 | collective-permute-start | bitmask | source-target-pair keyed (async start) |
| 0x31 | custom-call | GetCustomCallCollectiveId | accepted iff it has a TPU collective id |
| 0x56 | ragged-all-to-all | eq-check | replica-group keyed; +0x51/+0x58 special-case |
| 0x5d | reduce-scatter | eq-check | replica-group keyed (also the a2a-5d rewrite target) |
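
The bitmask can be decoded mechanically to confirm the six opcodes it admits. The sketch below is a Python model of the filter, not the binary's code: mask_opcodes and is_tc_collective are hypothetical names, and the custom-call collective-id lookup is reduced to a boolean argument.

```python
MASK = 0x1400001340


def mask_opcodes(mask=MASK):
    """List the bit positions (opcode bytes) set in the filter mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def is_tc_collective(opcode, custom_call_has_collective_id=False):
    """Model of the filter: bitmask for opcode <= 0x24, then the explicit compares."""
    if opcode <= 0x24 and (MASK >> opcode) & 1:
        return True
    if opcode in (0x56, 0x5D):       # ragged-all-to-all, reduce-scatter
        return True
    if opcode == 0x31:               # custom-call: needs a TPU collective id
        return custom_call_has_collective_id
    return False


print([hex(op) for op in mask_opcodes()])
```

The decoded bit positions are 0x06, 0x08, 0x09, 0x0c, 0x22 and 0x24, matching the six bitmask rows of the table.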

NOTE — opcode→mnemonic map. The map is recovered from the HloOpcodeString length array @0x421c9c0 (lengths 10/16/10/10/18/24/11/17/14 match all-gather / all-gather-start / all-reduce / all-to-all / collective-permute / collective-permute-start / custom-call / ragged-all-to-all / reduce-scatter) plus .lrodata string presence; the char* table itself is relocated, so values are resolved by length + presence.
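
The length-matching argument is easy to re-check: each mnemonic's character count must equal the corresponding entry of the quoted length list. A short Python check, with the mapping copied from the table above (the variable names are illustrative only):

```python
# Opcode byte -> mnemonic, as listed in the filter table.
MNEMONICS = {
    0x06: "all-gather",
    0x08: "all-gather-start",
    0x09: "all-reduce",
    0x0C: "all-to-all",
    0x22: "collective-permute",
    0x24: "collective-permute-start",
    0x31: "custom-call",
    0x56: "ragged-all-to-all",
    0x5D: "reduce-scatter",
}

# Lengths quoted from the HloOpcodeString length array, in opcode order.
QUOTED_LENGTHS = [10, 16, 10, 10, 18, 24, 11, 17, 14]

computed = [len(MNEMONICS[op]) for op in sorted(MNEMONICS)]
print(computed == QUOTED_LENGTHS)
```

Several mnemonics share a length (three of them are 10 characters long), which is why the note pairs the length match with string presence rather than relying on length alone.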

1.5 IsGlobalBarrierBeneficial @0x109c6ee0 — the heuristic

Byte-exact body:

// 0x109c6ee0  (key = a1; config = a2)  → bool
if (*(char*)(key + 0x51) == 0)                  // "barrier-heuristic candidate" byte not set
    return false;                               // (VLOG(10) diagnostic path)
if (*(byte*)(key + 0x00) != 0x0c)               // opcode != all-to-all
    return false;
if (*(char*)(key + 0x50) != 0)                  // channel-id parity (cross-module) != 0
    return false;
if (*(long*)(config + 0x170) == 1 ||            // num_partitions / replica_count along axis == 1 (cmpq)
    *(long*)(config + 0x178) == 1)
    return (*(long*)(key + 0x10) == 1);         // replica_group_count == 1 ⇒ single group
return false;

A global barrier is "beneficial" exactly for a single-replica-group, channelled all-to-all on a topology axis whose tested partition/replica count is 1 — the degenerate all-to-all where a per-key barrier would synchronise a trivial group and a single shared global barrier is strictly cheaper.

GOTCHA — config+0x170 / config+0x178. The two == 1 tests are 64-bit cmpq compares (cmpq $0x1,0x170(%rdx) / cmpq $0x1,0x178(%rdx) @0x109c6ef3/0x109c6efd), byte-confirmed at those offsets and read as "single-device-along-axis" from use, but their proto field names (num_partitions vs replica_count vs a derived device count of HloModuleConfig) are inferred, not field-matched against the descriptor. The behaviour (==1 on either ⇒ test replica_group_count == 1, itself a cmpq $0x1,0x10(%rsi)) is CERTAIN; the field naming is LOW.
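
The heuristic reduces to four gates followed by one equality test. A Python restatement, offered as a sketch: the function and parameter names are hypothetical, the offsets are passed in as plain values, and cfg_0x170 / cfg_0x178 are deliberately named by offset because their proto field names are not confirmed.

```python
ALL_TO_ALL = 0x0C


def is_global_barrier_beneficial(candidate_0x51, opcode, parity_0x50,
                                 rgc, cfg_0x170, cfg_0x178):
    """Model of the gate order in the decompiled body."""
    if candidate_0x51 == 0:          # key+0x51: heuristic-candidate byte
        return False
    if opcode != ALL_TO_ALL:         # key+0x00: opcode byte
        return False
    if parity_0x50 != 0:             # key+0x50 must be zero
        return False
    if cfg_0x170 == 1 or cfg_0x178 == 1:
        return rgc == 1              # key+0x10: replica_group_count
    return False


print(is_global_barrier_beneficial(1, ALL_TO_ALL, 0, 1, 1, 8))
```

Any single gate failing yields False; only the fully degenerate case (candidate set, all-to-all, zero parity byte, a unit axis, one replica group) yields True.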

1.6 TensorCoreBarrierKey ctor @0x109d6200 / operator< @0x109d6620

The key is what two collectives must agree on to be candidates for sharing a barrier (the coloring then decides whether they actually may). Ctor TensorCoreBarrierKey(HloInstruction* hlo, function<int(HloInstruction*)> core_id_fn, bool b):

| Offset | Field | Source |
|---|---|---|
| +0x00 | opcode byte | hlo+0xc; all-reduce(0x09) is rewritten to reduce-scatter(0x5d) when its computation has ≥2 caller instructions (cmpb $0x9 @0x109d6297 → caller_instructions() ≥ 2 → movb $0x5d @0x109d62db) |
| +0x08…+0x18 | sorted vector<ReplicaGroup> | copied from hlo+0xd0/hlo+0xd8 (__assign_with_size @0x109d6f60), __introsort'd @0x109d7580 (order-independent); +0x10 = group count |
| +0x20…+0x28 | sorted vector<pair<long,long>> | source-target pairs (for collective-permute 0x22/0x24): __assign_with_size @0x109d7220 + __introsort @0x109da4e0 |
| +0x38 | custom-call config scalar | all-to-all(0xc) custom-calls: reads CustomCallConfig+0x78 when hasbit 0x40 set; also sets +0x50 true |
| +0x40 | core-id (int) | -1 default, else core_id_fn(hlo) (invoked via [fn+0x10] when the function target is non-null) |
| +0x50 | channel-id parity | hlo->channel_id() & 1 (HloInstruction::channel_id @0x1e59ff80); the byte IsGlobalBarrierBeneficial gates on |
| +0x51 | heuristic-candidate flag | the byte IsGlobalBarrierBeneficial gates on (cmpb $0x0,0x51(%rdi) @0x109c6ee0); not written by this ctor, which only zero-touches +0x08…+0x37, +0x38, +0x40, +0x48, +0x50, +0x58, so the candidate byte is set on a different path (standing gap) |
| +0x58 | secondary discriminant | -1 sentinel written only when the ctor's bool b is set AND opcode == ragged-all-to-all(0x56) (test %r15b,%r15b @0x109d64dd → cmpb $0x56,0xc(%r14) @0x109d64e2 → movq $-1,0x58(%rbx) @0x109d64e9); the ragged-all-to-all global-disable case |

operator< @0x109d6620 compares in order: +0x58, +0x38, +0x00 (opcode), +0x50 (channel parity), then the replica-group vector (+0x10 count, then per-group sorted device ids). Two collectives share a TC barrier key iff same opcode, same channel parity, same custom-call scalar, same secondary discriminant, and identical sorted replica-group membership — the dense-collective analog of the SparseCore ring key, keyed on HLO replica groups rather than ICI ring offsets.
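
The comparison order maps naturally onto a lexicographic tuple. The Python sketch below models only that ordering: TcBarrierKey and sort_key are hypothetical names, the field defaults are illustrative rather than recovered values, the source-target-pair vector is left out, and the core-id at +0x40 is omitted because it is not in the comparison order listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TcBarrierKey:
    opcode: int                      # +0x00
    replica_groups: tuple = ()       # +0x08..+0x18, tuple of device-id tuples
    custom_call_scalar: int = 0      # +0x38
    channel_parity: int = 0          # +0x50
    secondary: int = 0               # +0x58 (-1 sentinel in the ragged case)

    def sort_key(self):
        # Normalise membership so group order and in-group order do not matter.
        groups = tuple(sorted(tuple(sorted(g)) for g in self.replica_groups))
        # Same order as operator<: +0x58, +0x38, opcode, parity, count, groups.
        return (self.secondary, self.custom_call_scalar, self.opcode,
                self.channel_parity, len(groups), groups)


a = TcBarrierKey(0x09, ((0, 1), (2, 3)))
b = TcBarrierKey(0x09, ((3, 2), (1, 0)))   # same membership, different order
c = TcBarrierKey(0x0C, ((0, 1), (2, 3)))
print(a.sort_key() == b.sort_key(), a.sort_key() < c.sort_key())
```

Keys a and b compare equal because their sorted membership is identical, so they are candidates for sharing a barrier; c differs in opcode and orders after them.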


2. ExplicitUniDirRingStrategy::InitializeOnScs and the strategy+0x98 callback

This is a different subsystem from §1: the SparseCore-side ring strategy runtime init. Where the TC pass decides a barrier kind, InitializeOnScs binds the per-ring transfer geometry the SC sequencer's barrier sync_add will target. It is the final hop of the static IciStrategyRingConfig EXPLICIT-table → programmed next_chip bridge.

2.1 InitializeOnScs @0x1337aa60 — the fold + callback dispatch

System-V args: this = strategy, OpBuilder&, LocationGenerator, then three Values (scs ordinal, scs partition, a third Value triple). The decompiled body confirms every step:

// 0x1337aa60  (a1 = strategy; a2 = OpBuilder; a5 = scs partition Value; a3 = scs ordinal Value)
if (!a5)                                                         // line 351
    return RetCheckFailSlowPath("config != nullptr");

UniDirRingStrategy::InitializeOnScs(a1, a2, a3, a4, ...);        // base init; stores +0x68/+0x70
if (base != OK) return AddSourceLocationImpl(354);

CoreIndex = OffloadFactory::GetCoreIndex(a1+8, a2, a4);          // 0x133e6aa0 (per-replica core index)
v17  = *(a1+8);                                                  // OffloadFactory / Target subobject
v18  = Target::CoresPerChip(v17, /*SC*/2);                       // 0x1d615b40
v19  = Target::LogicalDevicesPerChip(v17, /*SC*/2);             // 0x1d615b00
fold = v18 / v19;                                                // unsigned divide (megacore fold)
v22  = OffloadFactory::IdxConst(a1+8, a2, fold);                 // 0x133e6ba0 (MLIR index constant)
v23  = OffloadFactory::DivU(a1+8, a2, CoreIndex, v22);           // 0x133e6a60 → arith::DivUIOp
ordinal       = v23;
global_core_id = OffloadFactory::ToGlobalCoreId(a1+8, a2, partition, v23);   // 0x133e6880

// THE CALLBACK — call QWORD PTR [strategy + 0x98]
(*(a1 + 0x98))(&record /*StatusOr<ExplicitRingRecord>*/,
               a1 + 0x88 /*captured table descriptor*/,
               a1 + 0x08 /*OffloadFactory subobject*/,
               a2        /*OpBuilder*/,
               &global_core_id,
               &ordinal);

if (record == OK) {                                              // writeback (double-init guarded)
    CHECK(strategy+0x58 == null, "ordinal_ == nullptr");         // line 250
    *(a1 + 0x58) = record.ordinal;                               // ordinal_
    CHECK(strategy+0x60 == null, "next_chip_ == nullptr");       // line 245
    *(a1 + 0x60) = record.next_chip;                             // next_chip_
    CHECK(strategy+0x78 == null, "reordering_map_ == nullptr");  // line 255
    *(a1 + 0x78) = record.reorder;                               // reordering_map_
    return OK;
}
return AddSourceLocationImpl(362);                               // line 362

The three writeback fields are named in the LogMessageFatal strings the decompile carries: ordinal_ at strategy+0x58, next_chip_ at +0x60, reordering_map_ at +0x78. Each is guarded against a second init (a re-init dies fatal). next_chip_ is then consumed by next_chip() @0x1337a200 → GlobalCoreIdToPhysicalChipId → ComputeRemoteCoreIndex → the SyncAddOp the SC custom-barrier emits (Barrier → SFLAG Binding).

NOTE — line numbers. The RetCheck / AddSourceLocationImpl line constants (351, 354, 362) and the writeback CHECK lines (245/250/255) match platforms/xla/sparse_core/offload_collective_strategies.cc exactly (the immediates 0x15f=351, 0x162=354, 0x16a=362). The CoresPerChip(2) / LogicalDevicesPerChip(2) fold is the same megacore-fold divisor used by ToPartnerGlobalCoreId.
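
The fold arithmetic and the double-init guard can be modelled with plain integers. This is a sketch under stated assumptions: the real code builds MLIR Values through OffloadFactory, whereas this model uses Python ints; RingStrategyModel, bind, and folded_ordinal are hypothetical names; and the chip geometry in the example (4 cores, 2 logical devices) is made up for illustration.

```python
def folded_ordinal(core_index, cores_per_chip, logical_devices_per_chip):
    """fold = CoresPerChip / LogicalDevicesPerChip, then an unsigned divide."""
    fold = cores_per_chip // logical_devices_per_chip
    return core_index // fold


class RingStrategyModel:
    """Holds the three writeback fields; each may be bound exactly once."""

    def __init__(self):
        self.ordinal = None          # strategy+0x58
        self.next_chip = None        # strategy+0x60
        self.reordering_map = None   # strategy+0x78

    def bind(self, record):
        # Check-then-write per field, in the order ordinal, next_chip, reorder.
        for field, source in (("ordinal", "ordinal"),
                              ("next_chip", "next_chip"),
                              ("reordering_map", "reorder")):
            if getattr(self, field) is not None:
                raise RuntimeError(f"{field}_ == nullptr check failed")
            setattr(self, field, record[source])


strategy = RingStrategyModel()
strategy.bind({"ordinal": folded_ordinal(5, 4, 2), "next_chip": 7, "reorder": [1, 0]})
print(strategy.ordinal, strategy.next_chip, strategy.reordering_map)
```

A second bind on the same object raises in this model, mirroring the fatal CHECK on re-init.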

2.2 The callback body — ExplicitRingRecord @0x133a9a40 (plain) / ExplicitAllToAllRingRecord @0x133a94a0 (a2a)

The strategy+0x98 slot is installed by the per-color factory closures inside CreateRingStrategiesForNdFromExplicitTable (emitter_helpers). The factory → strategy map (the demangled $_N lambdas confirm the record types):

| Factory | Builds | Callback slot | Record type | kind selector |
|---|---|---|---|---|
| $_0 @0x133a9080 | D2DUniDirRingStrategy (sizeof 0x98) | | | |
| $_1 @0x133a91e0 | ExplicitUniDirAllToAllRingStrategy (sizeof 0xe8, vtable @0x21908ec8) | +0xb0 = 0x133a94a0 | ExplicitAllToAllRingRecord | 0x2 |
| $_2 @0x133a9840 | plain ExplicitUniDirRingStrategy (sizeof 0xc0, vtable @0x21908e30) | +0x98 = 0x133a9a40 | ExplicitRingRecord | 0x4 |

NOTE — the factory map. The demangled symbols pin the three lambda slots: $_0 builds D2DUniDirRingStrategy, $_1 builds the all-to-all strategy (returning ExplicitAllToAllRingRecord), and $_2 builds the plain explicit ring (returning ExplicitRingRecord). The $_2 install at 0x133a99d9 writes the 0x133a9a40 lambda into strategy+0x98 — the exact call target of §2.1.

The plain lookup ExplicitRingRecord @0x133a9a40 body:

  • ctx = [strategy+0x88] (the table descriptor) → r14: r14[0] = an inner const-literals accessor object (its operator() at accessor+0x10); r14[0x10] = id_info_offset (movsxd int32); r14[0x18] = the table base ptr; r14[0x20]/r14[0x21] = a byte selector + a variant tag.
  • inner call: push (kind = 0x4, base, byte-selector, &out, global_core_id, ordinal); call QWORD PTR [accessor+0x10] @0x133a9a96 — the accessor that slices the const-literals int vector at [base + id_info_offset…] and materialises the per-core neighbor/id/group constant as MLIR Values.
  • on success: next_chip = out[0], ordinal = out[8 or 0x10] (the r14+0x21 variant tag selects the field), reorder = out[0x20] → written into the StatusOr<ExplicitRingRecord> (record.next_chip +0x8, record.ordinal +0x10, record.reorder +0x18, record.ok +0x0). Free the temporary vector. RetCheck line 0xfe5.

The all-to-all twin ExplicitAllToAllRingRecord @0x133a94a0 is identical in shape but uses kind selector 0x2 and threads the result through OffloadFactory::location @0x131e9ca0 + Target::GranuleBytes @0x1d617f80 + BufferOffset::Create @0x133ea5c0 + BufferOffset::WithOffsetElements @0x133eace0 to convert the group-info table slice into a byte BufferOffset (the per-core group-info window). RetCheck line 0xfb9.

NOTE — the final hop. The callback is where the static EXPLICIT offsets become a programmed Value: id_info_offset (table_desc+0x10) + the table base (+0x18) index the runtime const-literals vector, keyed by the runtime global_core_id / ordinal, to produce the next_chip the per-ring SyncAdd targets. The plain ring returns raw Values; the a2a ring returns a BufferOffset-wrapped group window. The 0x4 vs 0x2 selector chooses which.

INFERRED — the inner accessor body. The accessor object reached via call [accessor+0x10] has three proven captured members (+0x10 = id_info_offset, +0x18 = table base, +0x20/+0x21 = byte + variant selectors) and a pinned call signature (global_core_id, ordinal, base, selector, kind 0x4/0x2 → out triple), but its own body — the closure that performs the spmem load / DivU-Mod indexing into the const-literals vector — was not separately disassembled. Its CALL signature is CONFIRMED; the const-literals vector entry semantics are the standing producer gap (LOW).


3. The two subsystems, one chip SFLAG block

The TC assignment pass (§1) and the explicit-ring InitializeOnScs (§2) sit on opposite ends of the same on-chip barrier model: the TC pass chooses a BarrierConfig for dense TC collectives; InitializeOnScs binds the per-ring transfer geometry for SC embedding collectives. Both ultimately program chip SFLAGs (Barrier → SFLAG Binding).

| Aspect | TensorCore (§1) | SparseCore explicit ring (§2) |
|---|---|---|
| Pass / init | TensorCoreBarrierAssignment::Run @0x109c7420 | ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60 |
| Identity unit | TensorCoreBarrierKey (opcode, channel parity, replica-group / src-tgt vectors, cc scalar) | static IciStrategyRingConfig EXPLICIT offsets (id_info / group_info) |
| Dedup / bind mechanism | greedy graph coloring (Barrier Coloring) → BarrierConfig {1,2,3} | strategy+0x98 const-literals lookup → (next_chip, ordinal, reorder) |
| Output destination | HLO BackendConfig.BarrierConfig submessage | strategy+0x58/+0x60/+0x78 (ordinal_ / next_chip_ / reordering_map_) |
| Hardware sink | chip SFLAG block via the kernel emitter | per-ring SyncAdd targeting next_chip_ on the chip SFLAG block |

The split is purely in how a barrier identity is chosen and bound: the TC pass colors the live async-collective conflict graph; the explicit ring indexes a static config table by the runtime core id. Both arms end at the chip SFLAG tier (overview §1).


4. Verification notes

Byte-exact in libtpu.so v0.0.40:

  • DetermineBarrierConfigForKey @0x109c6fa0: BarrierConfig::BarrierConfig(this,0); IsGlobalBarrierBeneficial; force_global || beneficial → type=1, id=-1; saturation (key+0x10 == key+0x20 − 1) with config+0x50 != 1 → type=2, id=replica_group_count; else type=3 fresh + std::map __try_key_extraction_impl → cache hit CopyFrom(node+0x80). Type at +0x20, id at +0x18, hasbits +0x10 |= 3. No movl $4. Exact.
  • ExplicitUniDirRingStrategy::InitializeOnScs @0x1337aa60: if (!config) RetCheck line 351; base UniDirRingStrategy::InitializeOnScs → AddSourceLocationImpl(354) on failure; GetCoreIndex; CoresPerChip(2) / LogicalDevicesPerChip(2) unsigned-divide fold; IdxConst; DivU; ToGlobalCoreId; call [a1+0x98] with (&record, a1+0x88, a1+0x08, OpBuilder, &global_core_id, &ordinal); writeback *(a1+0x58)=ordinal_, *(a1+0x60)=next_chip_, *(a1+0x78)=reordering_map_, each LogMessageFatal-guarded (lines 250/245/255); AddSourceLocationImpl(362). Exact.
  • Symbol confirmation: TensorCoreBarrierAssignment::{Run, ForEachCollective, IsGlobalBarrierBeneficial, DetermineBarrierConfigForKey} and TensorCoreBarrierKey present with full demangled symbols; ExplicitUniDirRingStrategy::InitializeOnScs and the CreateRingStrategiesForNdFromExplicitTable $_1 (→ ExplicitAllToAllRingRecord) / $_2 (→ ExplicitRingRecord) lambdas present, confirming the factory→record map and the (OffloadFactory&, OpBuilder&, Value, Value) → StatusOr<…Record> callback signature.

[LOW]

  • IsGlobalBarrierBeneficial's config+0x170 / config+0x178 scalars are offset-confirmed and read as "single-device-along-axis", but the proto field names (num_partitions vs replica_count) are inferred, not descriptor-matched.
  • The inner const-literals accessor body (call [accessor+0x10], §2.2) — its three captured members and call signature are pinned, but the closure that slices the const-literals vector and materialises the per-core Values was not separately disassembled.
  • The TensorCoreBarrierKey sub-member semantics (which vector is replica-groups vs src-tgt-pairs) are read from the __assign_with_size / __introsort sources and the opcode branch; the offsets are CONFIRMED, the role naming is structural.

Cross-References

Barrier algorithms (this section)

  • Overview — the SFLAG-based barrier model, the BarrierType enum, and the InferBarrierConfig normaliser (the second BarrierConfig writer this page's producer feeds)
  • Barrier Coloring — the greedy interference-graph engine whose merged conflict set is the has_conflict input to DetermineBarrierConfigForKey
  • Barrier → SFLAG Binding — how a chosen BarrierConfig {type, id} (and the bound next_chip_) becomes a concrete chip SFLAG number + signal/wait ops
  • Replica Barrier — the within-replica-group tree barrier (REPLICA(2)), the shared-id outcome this pass can emit

Sibling subsystems

Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d).