Megacore Collective Fusion
All addresses, symbols, and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped, .text VMA == file offset, build libtpu_lts_20260413_b_RC00). Other versions will differ; treat every VA as version-pinned.
Abstract
A megacore chip carries two TensorCores per logical device that the compiler folds into a single XLA device. A collective issued by that logical device must therefore be split across the two physical cores (the even/odd pair) and then re-synchronized so the two halves agree before the result leaves the chip. This page owns the three pieces of that fold a reimplementer cannot recover from the op surface or from the per-axis ring schedule:
- The MEGACORE(4) marker is never materialized. BarrierType has five arms, but only GLOBAL(1) / REPLICA(2) / CUSTOM(3) are ever written into a BarrierConfig; BARRIER_INVALID(0) is the all-zeros default. No compiler pass writes barrier_type == 4, and the type-4 arm of net_util::GetBarrierSyncFlag (0x1c69ad00) is structurally dead. The actual TensorCore megacore barrier is a runtime construct, LloRegionBuilder::BarrierMegacore (0x1d5222a0), gated by TpuChipConfig::Megacore() and completely outside the BarrierConfig → GetBarrierSyncFlag path.
- The two-core fold. BarrierMegacore and its program-descriptor twin net_util::SynchronizeProgramDescriptorStatesMegacore (0x1c697540) each build a Fingerprint2011-derived barrier id, then run a CoreIndex → SaddS32 → SmodU32(core_count=2) even/odd sync-add on the reserved megacore SFLAG slot (Target::GetMegacoreBarrierSyncFlagNumber 0x1d60f4e0, address label "megacore barrier sync flag"), with CHECK(core_count == 2) / CHECK(TensorCoresPerLogicalDevice() == 2) enforcing the even/odd pair.
- The TC SFLAG block init. Target::Init (0x1d60fc20) carves the megacore / AllReduce / Global accessors out of the chip's special-purpose SFLAG range: base = arr[0] → Target+0x8c0, count = arr.size() − 5 → Target+0x8c4. The −5 reserves the top five slots for the named runtime accessors, of which base+count+0 is the megacore slot.
The even/odd twist topology (which physical core is "primary" vs "secondary", how the two cores map onto the torus) is on Megacore Even/Odd Split; the SFLAG handshake protocol (VsyncAdd / VwaitGeSV semantics, the full BarrierType taxonomy) is in Barriers / TensorCore Barrier / SpecialPurposeSyncFlags. This page concentrates on the MEGACORE(4) marker, the two-core collective fold, and the megacore SFLAG slot init.
| Item | Symbol and address |
|---|---|
| Megacore TC barrier (runtime fold) | LloRegionBuilder::BarrierMegacore — 0x1d5222a0 |
| Program-descriptor state fold | net_util::SynchronizeProgramDescriptorStatesMegacore — 0x1c697540 |
| Megacore SFLAG-slot accessor | Target::GetMegacoreBarrierSyncFlagNumber — 0x1d60f4e0 (base+count) |
| SFLAG-block filler | Target::Init — 0x1d60fc20 (+0x8c0 = arr[0], +0x8c4 = size − 5) |
| Special-purpose SFLAG source | TpuChipConfig::GetSpecialPurposeSyncFlags(TpuCoreType) — 0x20afcf40 |
| The dead type-4 arm | net_util::GetBarrierSyncFlag — 0x1c69ad00 |
| Megacore mode gate | TpuChipConfig::Megacore(); TpuChipConfig+0x7c ≥ 2; CoresPerChip(TC) == 2 |
| Source TUs | platforms/xla/service/jellyfish/target.cc, .../llo_region_builder.cc, .../lowering/net_util.cc |
1. The MEGACORE(4) Marker Is Never a BarrierConfig
BarrierType is a 5-valued enum, stored as BarrierConfig.barrier_type at struct offset +0x20 (field 1). The decisive structural result of this fold: only three of the five arms are ever serialized into a BarrierConfig.
| BarrierType | Produced as a BarrierConfig? | Realized as | SFLAG slot |
|---|---|---|---|
| BARRIER_INVALID(0) | default (all-zeros, never written) | GetBarrierSyncFlag CHECK-fail target | n/a (illegal) |
| GLOBAL(1) | yes: DetermineBarrierConfigForKey / InferBarrierConfig | GetGlobalBarrierSyncFlagNumber | base+count+4 (TC) |
| REPLICA(2) | yes: same producers | per-id window | base + id (id < count) |
| CUSTOM(3) | yes: same producers | per-id window | base + id (id < count) |
| MEGACORE(4) | no, never written anywhere | runtime BarrierMegacore (not a config) | base+count (separate accessor) |
BarrierConfig.barrier_type has two writers: DetermineBarrierConfigForKey (0x109c6fa0), which writes 1 / 2 / 3 only, and the normalizer InferBarrierConfig (0x1376c240), which rewrites CUSTOM→GLOBAL on a singleton channel and can set REPLICA. Both are confirmed never to emit 4. There is no megacore-specific assignment pass, so MEGACORE remains an enumerator that is never materialized into a config.
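The producer rule above can be written down as a small C++ sketch. The numeric values come from the table; the enum declaration and the helper IsProducibleConfigType are hypothetical names introduced here for illustration and are not symbols recovered from the binary.

```cpp
#include <cassert>
#include <cstdint>

// Numeric values follow the table above; the declaration itself is a sketch.
enum class BarrierType : int32_t {
  BARRIER_INVALID = 0,  // all-zeros default, never written by a producer
  GLOBAL = 1,
  REPLICA = 2,
  CUSTOM = 3,
  MEGACORE = 4,  // declared, but never materialized into a BarrierConfig
};

// Hypothetical helper: true only for the arms a config producer may write.
constexpr bool IsProducibleConfigType(BarrierType t) {
  return t == BarrierType::GLOBAL || t == BarrierType::REPLICA ||
         t == BarrierType::CUSTOM;
}
```

A reimplementation could assert this predicate at every site that stores barrier_type, which keeps both 0 and 4 out of serialized configs.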
1.1 The dead type-4 arm of GetBarrierSyncFlag
net_util::GetBarrierSyncFlag (0x1c69ad00) is the lowering of a BarrierConfig to an SFLAG number. The decompile shows barrier_type (a1+0x20, i.e. *(int*)(config+32)) drives exactly two branches plus a CHECK — there is no per-type split for 2 vs 3 vs 4:
// net_util::GetBarrierSyncFlag — 0x1c69ad00 (decompiled, condensed)
v3 = *(_DWORD *)(a1 + 32); // barrier_type (field 1, offset 0x20)
if (v3 == 1) { // GLOBAL
n = Target::GetGlobalBarrierSyncFlagNumber(target); // base+count+4
return SflagImmPtr(n, "global barrier sync flag", 24);
}
if (!v3) // BARRIER_INVALID
CHECK("barrier.barrier_type() != BarrierType::BARRIER_INVALID"); // net_util.cc:2065
// ---- ALL OF {2,3,4} share this one arm: ----
v4 = *(_QWORD *)(a1 + 24); // barrier.id()
CHECK(v4 < *(int*)(target + 2244)); // id < GetBarrierSyncFlagCount() (target+0x8c4) net_util.cc:2070
n = v4 + *(_DWORD *)(target + 2240); // base(+0x8c0) + id
return SflagImmPtr(n, "barrier sync flag number", 24);
2240 = 0x8c0 (base), 2244 = 0x8c4 (count). REPLICA(2), CUSTOM(3), and the unreachable MEGACORE(4) would all resolve identically to base + id (the per-id window), bound-checked id < count. There is no byte that distinguishes them — the per-type semantics live entirely upstream in which emitter runs. Since neither producer ever writes 4, this arm is dead for type 4, and — critically — even if a type-4 config were synthesized it would resolve to base + id, not the reserved base+count megacore slot. The megacore barrier deliberately bypasses BarrierConfig for exactly this reason.
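A minimal model of the slot resolution makes the aliasing concrete. This is a sketch under assumptions: SflagBlock and ResolveBarrierSlot are hypothetical names, exceptions stand in for the CHECK failures, and the base/count values used in the usage note are illustrative, not taken from any chip config.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the two Target fields at +0x8c0 and +0x8c4.
struct SflagBlock {
  uint32_t base;
  uint32_t count;
};

// Model of the branch structure; throws where the binary would CHECK-fail.
uint32_t ResolveBarrierSlot(const SflagBlock& blk, int barrier_type,
                            uint64_t id) {
  if (barrier_type == 1) return blk.base + blk.count + 4;  // GLOBAL
  if (barrier_type == 0) throw std::invalid_argument("BARRIER_INVALID");
  // Types 2, 3 and the never-produced 4 all take this single arm.
  if (id >= blk.count) throw std::out_of_range("id >= count");
  return blk.base + static_cast<uint32_t>(id);
}
```

With an illustrative block of base 100 and count 11, a synthesized type-4 config with id 3 would land on slot 103, the same slot as a REPLICA or CUSTOM barrier with that id, and not on the reserved slot 111.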
NOTE — the full per-codename
BarrierConfigproducer/consumer taxonomy (theDetermineBarrierConfigForKeycoloring pass, theInferBarrierConfignormalizer, the GLOBALbase+count+4slot, the per-id window) lives in Barriers, InferBarrierConfig, and SpecialPurposeSyncFlags. This page only needs the result: type 4 is not a config.
2. The TensorCore Megacore Barrier (BarrierMegacore)
The actual TensorCore megacore barrier is emitted directly by the LLO region builder. LloRegionBuilder::BarrierMegacore (0x1d5222a0) folds the two physical TensorCores of one logical device on the reserved megacore SFLAG slot.
2.1 The megacore-mode gate
The function is a no-op unless all three of the megacore conditions hold (decompiled):
// BarrierMegacore — 0x1d5222a0 (gate, decompiled)
v8 = target; // region()->...->target
if (TpuChipConfig::Megacore(chip_config(target))) { // Megacore() mode
if (*(int*)(chip_config + 124) >= 2) { // TpuChipConfig+0x7c >= 2 (cores per megachip)
v9 = Target::CoresPerChip(target, 0); // TC cores
if ((int)v9 >= 2) {
CHECK(!region()->InPrimaryOrSecondaryRegion()); // llo_region_builder.cc:1751
... // fold body
}
}
}
// otherwise: no barrier emitted
The three gate conditions and the region check are:
- TpuChipConfig::Megacore(): the megacore-mode discriminant (offset path target[+0x3b8] → +0x18 → Megacore()).
- TpuChipConfig+0x7c ≥ 2: the cores-per-megachip field (offset confirmed; the ≥ 2 test is the megacore condition).
- CoresPerChip(kTensorCore) == 2: the gate itself tests ≥ 2, and the fold body then re-checks the count as CHECK(core_count == 2) at llo_region_builder.cc:1772 (scalar path) and :1788 (vector path). The function aborts if a megacore chip ever reports anything other than exactly two TensorCores. This is the even/odd pair invariant.
- CHECK(!region()->InPrimaryOrSecondaryRegion()) (llo_region_builder.cc:1751, VLOG "BarrierMegacore in primary or secondary region, hlo: "): the barrier may not be emitted inside one of the two per-core sub-regions; it is emitted at the fold point that joins them.
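The gate can be sketched as a plain predicate. MegacoreGateInputs and ShouldEmitMegacoreBarrier are hypothetical names; the three fields simply stand in for the values the decompile reads, and an exception models the CHECK(core_count == 2) abort.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical bundle of the three values the gate reads.
struct MegacoreGateInputs {
  bool megacore_mode;         // TpuChipConfig::Megacore()
  int cores_per_megachip;     // field at TpuChipConfig+0x7c
  int tensor_cores_per_chip;  // CoresPerChip(TC)
};

// Returns true when the fold body should be emitted, false for the no-op
// case, and throws where the binary would CHECK-fail.
bool ShouldEmitMegacoreBarrier(const MegacoreGateInputs& in) {
  if (!in.megacore_mode) return false;
  if (in.cores_per_megachip < 2) return false;
  if (in.tensor_cores_per_chip < 2) return false;
  if (in.tensor_cores_per_chip != 2)
    throw std::logic_error("core_count == 2");
  return true;
}
```

Splitting the silent no-op cases from the hard failure mirrors the decompile: a non-megacore chip simply emits nothing, while a megacore chip with an unexpected core count aborts.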
2.2 The fingerprint barrier id
Unlike a BarrierConfig (which carries a colored id), the megacore barrier id is a per-instruction fingerprint built at emit time — stable across the two cores so both sides compute the same SFLAG cookie:
// BarrierMegacore — 0x1d5222a0 (id construction, decompiled)
v19 = Fingerprint2011(operand_ptr, ...); // hash of the region's operand
if (hlo) { // HloInstruction*
FormatPack(&buf, "%d", /*one arg=*/2); // "2" -> a literal discriminator
v24 = Fingerprint2011(buf, ...);
v19 = FingerprintCat2011(v19, v24);
HloInstruction::unique_id(hlo); // folded into the str-format args
}
// per-module monotone cookie:
v27 = *(int*)(module + 68); // cookie_ordinal_
CHECK(v27 < INT32_MAX, "cookie_ordinal_ < std::numeric_limits<int32_t>::max()"); // llo_module.h:147
*(int*)(module + 68) = v27 + 1; // post-increment
v28 = FingerprintCat2011(v19, v27);
cookie = (uint32_t)(v28 ^ (v28 >> 32)); // fold to 32 bits
// annotation: "BarrierMegacore cookie: %u @ %s:%d"
The cookie is annotated "BarrierMegacore cookie: %u @ %s:%d" (with the absl::SourceLocation file/line) so the two cores' barriers are recognizable as a matched pair in dumps. The barrier id therefore comes from HloInstruction::unique_id() + a per-module cookie_ordinal_, not from a BarrierConfig.id().
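Two steps of the id construction are simple enough to sketch in isolation: the 64-to-32-bit fold and the guarded post-increment of the per-module ordinal. FoldTo32 and ModuleCookieCounter are hypothetical names, and the Fingerprint2011 / FingerprintCat2011 hashing is deliberately left out because its implementation is not described on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Fold a 64-bit fingerprint to 32 bits by XOR-ing its two halves.
constexpr uint32_t FoldTo32(uint64_t fp) {
  return static_cast<uint32_t>(fp ^ (fp >> 32));
}

// Hypothetical stand-in for the module's cookie_ordinal_ field.
struct ModuleCookieCounter {
  int32_t cookie_ordinal = 0;

  // Returns the current ordinal, then increments it. Throws where the
  // binary would CHECK-fail on cookie_ordinal_ < INT32_MAX.
  int32_t Next() {
    if (cookie_ordinal >= std::numeric_limits<int32_t>::max())
      throw std::overflow_error("cookie_ordinal_ < INT32_MAX");
    return cookie_ordinal++;
  }
};
```

Because both cores compile the same module in the same order, a monotone per-module counter of this kind gives each barrier a distinct but reproducible input to the fingerprint.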
2.3 The even/odd sync-add fold
The fold itself resolves the megacore SFLAG slot, emits one remote sync-flag add per partner core, and then waits for the partner's bump.
// BarrierMegacore — 0x1d5222a0 (fold, core_count == 2, decompiled & condensed)
chip = LloRegionBuilder::ChipId(this, 0);
core = LloRegionBuilder::CoreIndex(this); // this core's index (even/odd)
n = Target::GetMegacoreBarrierSyncFlagNumber(target); // = base+count (reserved megacore slot)
sflag = LloRegionBuilder::SflagImmPtr(this, n, "megacore barrier sync flag", 26);
for (v62 = 1; v62 != core_count /*==2*/; ++v62) { // one partner when core_count==2
delta = ScalarU32Constant(v62); // +1
partner = SaddS32(core, delta); // core + 1
partner = SmodU32(partner, core_count /*==2*/); // (core + 1) mod 2 -> the OTHER core
addr = EncodeRemoteSyncFlagAddress(this, sflag, &chip, /*multicast=*/0);
op = LloInstruction::CreateVectorSyncFlagAddRemote(addr, ScalarU32Constant(1), region, ...);
LloRegion::AppendInstruction(region, op, 0); // bump partner's megacore SFLAG by 1
}
wait = VwaitGeSV(this, sflag, ScalarU32Constant(core_count - 1) /*==1*/, 0); // wait for partner's bump
wait.set_annotation_if_not_constant("barrier-megacore-wait");
VsyncAdd(this, sflag, ScalarU32Constant(1 - core_count) /*== -1*/); // decrement back to 0
// ... ScheckEq verification on the slot ...
The arithmetic SmodU32(CoreIndex + step, core_count) is the even/odd partner map: for core_count == 2 it sends to (core + 1) mod 2, i.e. core 0 signals core 1 and core 1 signals core 0. Each core bumps the partner's copy of the reserved megacore SFLAG by 1 (VectorSyncFlagAddRemote, addressed through EncodeRemoteSyncFlagAddress), waits until its own slot reaches core_count − 1 == 1 (VwaitGeSV, annotated "barrier-megacore-wait"), then adds 1 − core_count == −1 (VsyncAdd) to return the slot to zero for reuse. A MegacoreBarrierAnn annotation variant is attached to the control-flow push (VCcfPush / SCcfPush) so a later pass can recognize the fold.
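The handshake can be modeled sequentially to check that the counters balance. This is a sketch, not the emitted LLO: MegacorePair, PartnerOf and the three methods are hypothetical names, each core's SFLAG copy is a plain integer, and real hardware concurrency is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kCoreCount = 2;

// The even/odd partner map: (core + step) mod core_count.
constexpr int PartnerOf(int core, int step) {
  return (core + step) % kCoreCount;
}

// Sequential model: each core owns one copy of the megacore SFLAG.
struct MegacorePair {
  std::array<int32_t, kCoreCount> sflag{};  // both copies start at zero

  // Remote sync-flag add: bump every partner's copy by 1.
  void SignalPartners(int core) {
    for (int step = 1; step != kCoreCount; ++step)
      sflag[PartnerOf(core, step)] += 1;
  }
  // Wait condition: own copy has reached core_count - 1.
  bool WaitSatisfied(int core) const {
    return sflag[core] >= kCoreCount - 1;
  }
  // Local sync-add of 1 - core_count, returning the copy to zero.
  void Reset(int core) { sflag[core] += 1 - kCoreCount; }
};
```

Running signal, wait, reset on both cores in either order leaves both copies at zero, which is the property that lets the same slot be reused by the next fold.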
NOTE: the function emits two near-identical bodies: a vector path (VectorU32Constant → VCcfPush → … → VCcfPop → SimplifyVtos, taken on the main core_count == 2 arm) and a scalar path (ScalarU32ConstantImpl → SCcfPush, taken when the predication byte selects it). Both gate on core_count == 2 and both run the same GetMegacoreBarrierSyncFlagNumber → SflagImmPtr("megacore barrier sync flag") → SaddS32/SmodU32 → VsyncAddRemote fold; the difference is the vector-vs-scalar control-flow push. The detailed VsyncAdd / VwaitGeSV SFLAG handshake semantics are in Barriers.
2.4 The program-descriptor state fold (the runtime twin)
net_util::SynchronizeProgramDescriptorStatesMegacore (0x1c697540) is the second consumer of the megacore SFLAG slot. It rendezvouses the two cores' program-descriptor state word (an SMEM control word) rather than a data barrier, and is gated identically:
// SynchronizeProgramDescriptorStatesMegacore — 0x1c697540 (decompiled, condensed)
if (!TpuChipConfig::Megacore(chip_config) || *(int*)(chip_config + 124) < 2)
return; // same Megacore() + 0x7c>=2 gate
// fingerprint id: unique_id + "%d"(2) + ToShortString + cookie_ordinal_ (++)
// annotation "SynchronizeProgramDescriptorStatesMegacore cookie: %u @ %s:%d"
CHECK(CoresPerChip(0) / LogicalDevicesPerChip(0) == 2,
"sync_b.target().TensorCoresPerLogicalDevice() == 2"); // net_util.cc:1337
n = Target::GetMegacoreBarrierSyncFlagNumber(target); // base+count
sflag = SflagImmPtr(b, n, "megacore barrier sync flag", 26);
pds = SmemWordImmPtr(b, Target::ProgramDescriptorStateWordOffset(target),
"program descriptor state", 24);
// PrimaryCoreRegionBuilder + SecondaryCore -> EnqueueRemoteSst(pds) to the partner;
VwaitGeSV(b, sflag, 1, 0).set_annotation_if_not_constant("...wait");
VsyncAdd(b, sflag, SimmS32(-1));
Sfence(b);
// ScheckEq verification
The CHECK TensorCoresPerLogicalDevice() == 2 (= CoresPerChip(TC) / LogicalDevicesPerChip(TC), net_util.cc:1337) is the third independent statement of the two-cores-per-logical-device megacore invariant. This fold uses EnqueueRemoteSst to copy the primary core's program-descriptor state to the secondary, fenced by the same reserved megacore SFLAG slot.
2.5 The megacore SFLAG slot has exactly three consumers
A full .text E8/E9 cross-reference of Target::GetMegacoreBarrierSyncFlagNumber (0x1d60f4e0) finds exactly three callers — confirming the reserved base+count slot is consumed only by runtime/analysis paths, never via a BarrierConfig:
| Consumer (VMA) | Role |
|---|---|
| LloRegionBuilder::BarrierMegacore (0x1d522641) | the TC even/odd data barrier (§2.3) |
| net_util::SynchronizeProgramDescriptorStatesMegacore (0x1c6977fc) | the program-descriptor state fold (§2.4) |
| RaceAnalyzerStepper::PreProcessEvent (0x10bb3229) | reserves/accounts the slot in the LLO race analyzer |
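An E8/E9 cross-reference of this kind can be approximated with a linear byte scan, sketched below. FindDirectBranchSites is a hypothetical helper, not the tool used for this page. It relies on the x86-64 encoding in which E8 (call rel32) and E9 (jmp rel32) are followed by a signed 32-bit displacement relative to the end of the 5-byte instruction, and it assumes a little-endian host. A naive scan like this can report false positives when an E8/E9 byte occurs inside another instruction, so hits still need disassembly to confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the VAs of E8/E9 rel32 instructions in `text` whose computed
// destination equals `target_va`. `text_va` is the VA of text[0].
std::vector<uint64_t> FindDirectBranchSites(const std::vector<uint8_t>& text,
                                            uint64_t text_va,
                                            uint64_t target_va) {
  std::vector<uint64_t> sites;
  for (std::size_t i = 0; i + 5 <= text.size(); ++i) {
    if (text[i] != 0xE8 && text[i] != 0xE9) continue;
    int32_t rel;
    std::memcpy(&rel, &text[i + 1], sizeof rel);  // little-endian host assumed
    // Destination = address of the next instruction + signed displacement.
    uint64_t dest = text_va + i + 5 + static_cast<int64_t>(rel);
    if (dest == target_va) sites.push_back(text_va + i);
  }
  return sites;
}
```

Since this build has .text VMA equal to file offset, text_va for such a scan would be the file offset of the first byte scanned. The VMAs in the table fall inside the named functions, which is consistent with them being call-site addresses rather than function entry points.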
3. GetMegacoreBarrierSyncFlagNumber — the Reserved Slot Accessor
The megacore SFLAG number is computed by Target::GetMegacoreBarrierSyncFlagNumber (0x1d60f4e0) as base + count — i.e. the first slot above the usable per-id barrier window:
// Target::GetMegacoreBarrierSyncFlagNumber — 0x1d60f4e0 (decompiled)
__int64 Target::GetMegacoreBarrierSyncFlagNumber(Target *this) {
if (!TpuChipConfig::Megacore(chip_config(this)))
CHECK_FAIL("topology_->chip_config().Megacore()"); // target.cc:154
return (uint32_t)(this[560] + this[561]); // base(+0x8c0) + count(+0x8c4)
}
this[560] and this[561] are _DWORD indices, i.e. byte offsets 0x8c0 (base) and 0x8c4 (count). The accessor is Megacore()-gated with a hard CHECK (target.cc:154) — calling it on a non-megacore chip aborts. Contrast the two sibling reserved accessors that index the same block top:
| Accessor (VMA) | Formula | Slot | Consumers |
|---|---|---|---|
| GetMegacoreBarrierSyncFlagNumber (0x1d60f4e0) | base + count | base+count+0 | BarrierMegacore / SyncProgDesc…Megacore / RaceAnalyzer (§2.5) |
| GetAllReduceSyncFlagNumber(n) (0x1d60f440) | base + count + n + 1 (0 < n < 3) | +2, +3 | RotatedPincer* / AsyncPincer* InitSyncFlags (Hierarchical / Pincer) |
| GetGlobalBarrierSyncFlagNumber (0x1d60f420) | base + count + 4 | +4 | GetBarrierSyncFlag(GLOBAL) + tree barriers |
GetAllReduceSyncFlagNumber CHECKs its phase argument phase > 0 (target.cc:143) and phase < 3 (target.cc:144), so n ∈ {1, 2} map to base+count+2 and base+count+3; the base+count+1 slot (the illegal n=0) is an unused gap. These four occupied slots plus the gap are exactly the five reserved by Target::Init (§4).
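The three accessors and the field-offset arithmetic can be sketched together. TargetModel and the three function names are hypothetical stand-ins; exceptions model the CHECK failures, and whether the AllReduce and Global accessors carry additional gates in the binary is not covered here.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// A _DWORD index i into the Target object is byte offset 4 * i.
constexpr uint32_t DwordIndexToByteOffset(uint32_t index) { return index * 4u; }
static_assert(DwordIndexToByteOffset(560) == 0x8c0, "base field offset");
static_assert(DwordIndexToByteOffset(561) == 0x8c4, "count field offset");

// Hypothetical stand-in for the three values the accessors read.
struct TargetModel {
  bool megacore;   // TpuChipConfig::Megacore()
  uint32_t base;   // Target+0x8c0
  uint32_t count;  // Target+0x8c4
};

// Throws where the binary would CHECK-fail (target.cc:154).
uint32_t MegacoreBarrierSlot(const TargetModel& t) {
  if (!t.megacore) throw std::logic_error("chip_config().Megacore()");
  return t.base + t.count;
}

// phase must be 1 or 2 (the target.cc:143 and :144 checks).
uint32_t AllReduceSlot(const TargetModel& t, int phase) {
  if (phase <= 0 || phase >= 3) throw std::out_of_range("phase");
  return t.base + t.count + static_cast<uint32_t>(phase) + 1u;
}

uint32_t GlobalBarrierSlot(const TargetModel& t) {
  return t.base + t.count + 4u;
}
```

For an illustrative base of 100 and count of 11, this yields 111 for the megacore slot, 113 and 114 for the two AllReduce phases, and 115 for the global barrier, leaving 112 as the unused gap.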
4. The TC SFLAG Block Init (Target::Init)
Target::Init (0x1d60fc20, target.cc) fills the TensorCore reserved-barrier block from the chip's special-purpose SFLAG range at boot. The two field stores are byte-confirmed in the decompile:
// Target::Init — 0x1d60fc20 (decompiled, the TC barrier-block filler)
SpecialPurposeSyncFlags = TpuChipConfig::GetSpecialPurposeSyncFlags(chip_config /* kTensorCore */);
if (!SpecialPurposeSyncFlags)
DieBecauseNull("chip_config.GetSpecialPurposeSyncFlags(::tpu::TpuCoreType::kTensorCore)"); // target.cc
// ... copy + contiguity-check the int range (must be a contiguous ascending run) ...
*((_DWORD *)target + 560) = *v286; // Target+0x8c0 = arr[0] (TC base)
*((_DWORD *)target + 561) = v284 - 5; // Target+0x8c4 = arr.size() - 5 (TC usable count)
- GetSpecialPurposeSyncFlags(kTensorCore) (0x20afcf40) returns a {ptr, count} vector of SFLAG numbers, a per-TpuCoreType slice of the chip-config table (TpuChipConfig+0x2a0 + core). Target::Init copies it, then asserts (an unrolled inc/cmp/jne loop) that the numbers form a contiguous ascending integer range so that base + offset addressing is valid.
- Target+0x8c0 = arr[0]: the first special-purpose SFLAG number (the block base).
- Target+0x8c4 = arr.size() − 5: the usable per-id barrier count. The −5 reserves the top five slots of the range for the named runtime accessors:
TpuChipConfig::GetSpecialPurposeSyncFlags(kTensorCore) = [ base ............................. base+size-1 ]
                                                         |<--- usable per-id window --->|<-- 5 reserved -->|
Target+0x8c0 = base
Target+0x8c4 = size - 5 = count
    base .. base+count-1 : REPLICA(2)/CUSTOM(3) ids (id < count)
    base+count+0         : MEGACORE (GetMegacoreBarrierSyncFlagNumber)
    base+count+1         : (unused gap; GetAllReduceSyncFlagNumber(0) illegal)
    base+count+2         : AllReduce phase 1
    base+count+3         : AllReduce phase 2
    base+count+4         : GLOBAL (GetGlobalBarrierSyncFlagNumber)
So the megacore barrier slot is the first reserved slot (base+count+0), sitting directly above the usable per-id window that REPLICA/CUSTOM draw from. The base Target constructor (0x1d615340) zero-defaults the block (+0x8c0 = 0, +0x8c4 = 0) before Init installs the real per-generation values.
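The block init reduces to a contiguity check plus two stores, sketched here. InitTcSflagBlock, SflagBlock and kReservedTopSlots are hypothetical names. The minimum-size guard is an assumption added so the subtraction cannot underflow; whether the binary performs an equivalent check is not established on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the fields at Target+0x8c0 and Target+0x8c4.
struct SflagBlock {
  uint32_t base;
  uint32_t count;
};

constexpr uint32_t kReservedTopSlots = 5;

// Model of the TC block filler: validate the range, then derive base/count.
SflagBlock InitTcSflagBlock(const std::vector<uint32_t>& flags) {
  // Assumption: guard against ranges too small to hold the reserved slots.
  if (flags.size() < kReservedTopSlots)
    throw std::invalid_argument("range smaller than the reserved top");
  // The numbers must form a contiguous ascending run.
  for (std::size_t i = 1; i < flags.size(); ++i)
    if (flags[i] != flags[i - 1] + 1)
      throw std::invalid_argument("range is not contiguous ascending");
  return SflagBlock{flags.front(),
                    static_cast<uint32_t>(flags.size()) - kReservedTopSlots};
}
```

For an illustrative 16-entry range 100..115 this gives base 100 and count 11, so the GLOBAL slot base+count+4 coincides with the last entry of the range, as the layout above requires.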
NOTE: the literal GetSpecialPurposeSyncFlags(TC).size() per chip codename (hence the concrete Target+0x8c4 count) is populated from the chip_config / chip_parts proto and was not extracted from the binary. The writer and its formula (base = arr[0], count = size − 5) are byte-confirmed; confidence is low only for the per-codename literal. The SparseCore mirror (SparseCoreTarget::Init, base +0x1d0, count +0x1d4, with no −5) and the three-region SFLAG partition are in SpecialPurposeSyncFlags; the SC engine reserves nothing at the top because its barriers carry a reserved id within the block.
4.1 Why the megacore slot is base+count, not base+id
This is the crux of why MEGACORE(4) deliberately bypasses BarrierConfig. A per-collective REPLICA/CUSTOM barrier lives inside the [base, base+count) window, indexed by a colored id (§1.1). The megacore fold is not per-collective — it is a fixed chip-level rendezvous between the two cores, so it needs a single dedicated slot that no per-id barrier can collide with. Placing it at base+count (above the usable window) and reaching it through GetMegacoreBarrierSyncFlagNumber (not through GetBarrierSyncFlag) guarantees that. Had MEGACORE(4) been a real BarrierConfig, it would have resolved to base + id — inside the per-collective window — which is exactly the collision the design avoids.
5. Relation to the Higher-Level MegacoreFusion HLO Pass
The runtime fold above is distinct from the compiler-level xla::MegacoreFusion HLO pass (RunImpl 0x110d8f00, DoMegacoreFusion 0x110d8860, TU tpu_megacore_fusion.cc). That pass runs before lowering: it pairs an all-reduce with a compute op (e.g. FindARPair 0x110d5560, FindConvARsMatch 0x110d2980, GetAllReduceCosts 0x110d1bc0) so that one core's collective overlaps the other core's matmul/convolution — a scheduling fusion across the megacore pair. The BarrierMegacore / SynchronizeProgramDescriptorStatesMegacore constructs on this page are the lowering-time rendezvous that make such a paired schedule correct: they are what forces the two cores back into agreement at the fold point. A reimplementer must keep the two layers separate — the HLO pass decides what overlaps, the runtime fold guarantees the two halves synchronize on the reserved megacore SFLAG.
NOTE: the MegacoreFusion HLO pass internals (the AR/conv pairing heuristic, the latency-bound matching, the cost model it queries) are out of scope here; this page documents only that it is a separate, earlier layer. The pincer / binomial AllReduce emitters whose arms it may pair are in Hierarchical / Pincer and Binomial / Recursive-Doubling.
6. Reimplementation Checklist
- Do not emit MEGACORE(4). No pass writes BarrierConfig.barrier_type == 4. Restrict the config producers to GLOBAL(1) / REPLICA(2) / CUSTOM(3); leave 0 / 4 unreachable. In GetBarrierSyncFlag, types {2,3,4} share one base + id arm, bound-checked id < count; there is no per-type 2/3/4 logic.
- Gate the runtime fold three ways. Emit BarrierMegacore only when TpuChipConfig::Megacore() ∧ TpuChipConfig+0x7c ≥ 2 ∧ CoresPerChip(TC) == 2. Re-assert CHECK(core_count == 2) (and TensorCoresPerLogicalDevice() == 2 for the descriptor fold): the even/odd fold is defined only for exactly two cores per logical device.
- Build the barrier id from a fingerprint, not a config id. Fingerprint2011(operand) ⊕ "%d"-format of the discriminator + HloInstruction::unique_id() + a per-module monotone cookie_ordinal_ (CHECK < INT32_MAX, post-increment), folded to 32 bits. Annotate "BarrierMegacore cookie: %u @ %s:%d".
- Run the even/odd sync-add. Per partner step v ∈ [1, core_count): partner = SmodU32(SaddS32(CoreIndex, v), core_count); VectorSyncFlagAddRemote(EncodeRemoteSyncFlagAddress(megacore_sflag, chip), +1). Then VwaitGeSV(megacore_sflag, core_count − 1) (annotate "barrier-megacore-wait") and VsyncAdd(megacore_sflag, 1 − core_count) to reset to zero.
- Reserve the megacore slot. In Target::Init, base = arr[0] → +0x8c0, count = GetSpecialPurposeSyncFlags(TC).size() − 5 → +0x8c4; the contiguity-checked range's top five are {Megacore @base+count, gap @+1, AllReduce1 @+2, AllReduce2 @+3, Global @+4}. Compute the megacore slot as base + count via a Megacore()-gated accessor, never as base + id.
- Synchronize program-descriptor state too. SynchronizeProgramDescriptorStatesMegacore copies the primary core's ProgramDescriptorStateWordOffset SMEM word to the secondary (EnqueueRemoteSst), fenced (Sfence) on the same reserved megacore SFLAG.
Cross-References
- Collectives Overview: the two-substrate collective stack and the strategy picker; megacore is a per-chip fold orthogonal to the per-axis ring schedule.
- AllReduce Hierarchical / Pincer: the bidirectional-ring fusion whose arms the MegacoreFusion HLO pass may pair across the two cores; also the consumer of the reserved AllReduce SFLAG slots (base+count+2 / +3).
- Binomial / Recursive-Doubling: the self-completing butterfly emitter (latency-bound), the other AllReduce family.
- Physical Core Placement: how logical device ids map to physical TensorCores (the even/odd pair this fold synchronizes).
- ReduceScatter: the reduce-scatter phase that runs per core before the megacore fold.
- Megacore Even/Odd Split: the twist topology, i.e. which core is primary vs secondary and how the pair maps onto the torus.
- Barriers: the SFLAG-based barrier model and the full BarrierType taxonomy / handshake protocol.
- TensorCore Barrier: the TensorCore-side barrier lowering and SFLAG binding.
- SpecialPurposeSyncFlags: the Target::Init / SparseCoreTarget::Init SFLAG-block fillers and the three-region partition.
- InferBarrierConfig: the BarrierConfig normalizer (one of the two type producers, confirmed never to write 4).