GetSparseCoreConfig — The Offload Op-Type Enum Source
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions differ. All .text/.rodata addresses are virtual; for this binary .text VMA == file offset 0xe63c000, .rodata VMA == file offset 0x84a0000, and .data.rel.ro VMA − 0x200000 == file offset.
Abstract
GetSparseCoreConfig is the one resolver that every SparseCore-offload classification site funnels through: given an async HLO instruction, it parses the instruction's BackendConfig and hands back a fully-materialized xla::jellyfish::SparseCoreConfig proto (a copy, never a borrowed pointer). The single field the scheduler cares about is field 2 offload, a TYPE_ENUM of type .xla.jellyfish.Offload, stored in the C++ object at +0x24 with its presence has-bit at +0x10 mask 0x4. That one enum decides whether an offloaded async op consumes a SparseCore gather / scatter / data-formatting / kernel / sort resource lane, recurses into a wrapped collective, or falls through to the per-core general SparseCore resource — i.e. which physical SparseCore engine class the latency-hiding scheduler throttles the op against.
This page recovers three byte-exact things. First, the GetSparseCoreConfig resolver itself (0x1c868d20): its thread-name guard (kSparseCoreThread / kSparseCoreOffloadCandidateThread), its copy-construction of SparseCoreConfig with a globals fallback, and the consumer-side read of offload at +0x24 behind has-bit +0x10 & 4. Second, the full xla::jellyfish::Offload enum (OFFLOAD_UNSPECIFIED 0 .. OFFLOAD_COMPUTE 8) and exactly which enumerator maps to which scheduler resource arm (enum − 2 indexing in the scheduler, enum − 1 in the reservation map). Third, the SC-offload gate bits — Target[+0x628] & 4 (an SC-offload-capability has-bit) and Target[+0x540] (a platform_type == 2 bool) — traced to where jellyfish::Target::Init sets them, plus the per-generation default basis (TpuVersion == 5).
For reimplementation, the contract is:
- GetSparseCoreConfig(async_start) → SparseCoreConfig; the consumer reads offload (field 2, object +0x24) only when has-bit +0x10 & 4 is set, else the op takes no offload-specific lane (only the general id-22 path).
- The offload enumerator selects the SC resource lane: GATHER/SCATTER/DATA_FORMATTING/KERNEL/SORT → kSparseCore{Gather,Scatter,DataFormatting,Kernel,Sort} (scheduler ids 23/24/25/26/27); COLLECTIVE → recurse into the async-wrapped instruction; UNSPECIFIED/EMBEDDING/COMPUTE → no {23..27} lane (the general id-22 path).
- The SC-offload scheduler sub-pass runs only when the gate holds: Megachip ∧ CoresPerChip(SC) > 0 ∧ (Target[+0x628] & 4 ∨ Target[+0x540]) ∧ ModuleContainsLEMSparseCoreInstruction ∧ FLAGS_xla_sc_enable_latency_hiding_scheduler.
- On real hardware the gate bit Target[+0x628] & 4 is set per-generation; the SC-offload-concurrency defaults key on TpuVersion == 5 (the newest generation), overridable by a TCE AutoOr<bool> flag.
| Resolver | xla::jellyfish::backend_config_util::GetSparseCoreConfig @ 0x1c868d20 |
| Returns | xla::jellyfish::SparseCoreConfig (copy; ctor-copy @ 0x1d6df7c0) |
| Classifier field | offload (field 2, enum .xla.jellyfish.Offload), object +0x24, has-bit +0x10 & 4 |
| Offload enum | 9 enumerators, OFFLOAD_UNSPECIFIED 0 .. OFFLOAD_COMPUTE 8 |
| Scheduler consumer | TpuAsyncTracker::MayAddSparseCoreResource @ 0x11000480 (index enum − 2) |
| Reservation consumer | (anon)::GetSparseCoreResources @ 0x10fdc0a0 (index enum − 1) |
| Gate bits | Target[+0x628] & 4 (SC-offload-capability) ∨ Target[+0x540] != 0 (platform_type == 2) |
| Gate site | SparseCoreCompiler::RunHloScheduler @ 0x1306f820 |
| Bits set in | jellyfish::Target::Init @ 0x1d60fc20 |
| Per-gen default | TpuVersion == 5 (ShouldEnableConcurrentSparseCoreOffloading @ 0x1d6b6f80) |
| Source file | platforms/xla/service/jellyfish/lowering/backend_config_util.cc |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |
The Resolver — GetSparseCoreConfig @ 0x1c868d20
The resolver takes the async-start HLO and returns a SparseCoreConfig by value (caller-provided sret buffer). It is not a pointer accessor: it copy-constructs a fresh proto, so the consumer's offload/has-bit reads are on a private copy that outlives the instruction's backend-config arena.
// backend_config_util::GetSparseCoreConfig(this_sret, async_start) @ 0x1c868d20
SparseCoreConfig *GetSparseCoreConfig(SparseCoreConfig *this, const HloInstruction *a2) {
if (!a2) { SparseCoreConfig::SparseCoreConfig(this, /*arena=*/0); return this; } // null → default
CHECK(a2->opcode() == kAsyncStart); // [a2+0xc] == 17 (src line 472)
// thread-name guard (src line 475):
auto t = a2->async_execution_thread();
CHECK(t == "sparsecore" // len 10 "sparsecore"
|| t == kSparseCoreOffloadCandidateThread); // len 28
auto cfg = a2->backend_config<jellyfish::BackendConfig>(); // StatusOr<BackendConfig>
if (cfg.ok()) {
const SparseCoreConfig *src = cfg->sparse_core_config();
if (!src) src = &SparseCoreConfig_globals_; // default-instance fallback
SparseCoreConfig::SparseCoreConfig(this, /*arena=*/0, src);// copy-construct
} else {
SparseCoreConfig::SparseCoreConfig(this, /*arena=*/0); // parse failed → default
}
/* unref the StatusOr's status rep, destroy the BackendConfig temp */
return this;
}
Two guards bracket the parse:
- Opcode guard. The instruction must be kAsyncStart (opcode 17 = 0x11). The CHECK message is "async_start->opcode() == HloOpcode::kAsyncStart" (src line 472).
- Thread-name guard. The async-execution thread must be "sparsecore" (the 10-byte kSparseCoreThread, compared via the little-endian immediates 0x6F63657372617073 = "sparseco" and 0x6572 = "re") or the 28-byte kSparseCoreOffloadCandidateThread. This is the SIMD vptest compare in the disassembly; the CHECK message names both threads (src line 475).
NOTE: the resolver returns a default, never a null pointer. On a null instruction, a non-kAsyncStart opcode (in the MayAdd* callers it is pre-filtered, not CHECKed), a parse failure, or a BackendConfig with no sparse_core_config sub-message, the result is a default-constructed SparseCoreConfig with all has-bits clear. The consumer's has_offload() test (+0x10 & 4) is therefore the real gate: a default proto has the bit clear and selects no offload-specific arm. A reimplementation must not treat the resolver as fallible at the call site; the fallibility is folded into the cleared has-bits.
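The immediate-based thread-name compare can be checked with a small scalar model. This is only a sketch: the binary uses a SIMD vptest, whereas the code below assembles the same little-endian words by hand, and the helper names LoadLittleEndian and IsSparseCoreThread are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Assembles n bytes (n <= 8) into a little-endian integer, independent of host byte order.
inline uint64_t LoadLittleEndian(const char* p, int n) {
  uint64_t v = 0;
  for (int i = 0; i < n; ++i) {
    v |= static_cast<uint64_t>(static_cast<uint8_t>(p[i])) << (8 * i);
  }
  return v;
}

// Scalar model of the 10-byte compare: an 8-byte word plus a 2-byte tail.
inline bool IsSparseCoreThread(const std::string& thread) {
  if (thread.size() != 10) return false;
  return LoadLittleEndian(thread.data(), 8) == 0x6F63657372617073ULL &&  // "sparseco"
         LoadLittleEndian(thread.data() + 8, 2) == 0x6572ULL;            // "re"
}
```

Only the 10-byte name is modelled; the 28-byte kSparseCoreOffloadCandidateThread string is not reproduced on this page, so it is left out rather than guessed.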
The SparseCoreConfig Proto — Field Map
GetSparseCoreConfig returns the full SparseCoreConfig; only offload (field 2) is read by the scheduler/reservation classifiers, but the complete field map (from SparseCoreConfig::_InternalSerialize @ 0x1d6dfae0) fixes the object layout and confirms offload lives at object +0x24 behind has-bit +0x10 & 4.
| field | name | object off | has-bit (obj+0x10) | proto type |
|---|---|---|---|---|
| 1 | tiling | +0x20 (i32) | 0x2 | enum .xla.jellyfish.Tiling |
| 2 | offload ← the SC op-type enum | +0x24 (i32) | 0x4 | enum .xla.jellyfish.Offload |
| 3 | comp_env | +0x18 (msg) | 0x1 | msg ScCompilationEnvironment |
| 4 | enable_megacore | +0x2c (bool) | 0x10 | bool |
| 5 | hbm_bandwidth_adjustment_factor | +0x28 (f32) | 0x8 | float (fixed32) |
| 6 | function_mode | +0x2d (byte) | 0x20 | enum/bool |
| 7 | dedup_id | +0x30 (i64) | 0x80 | int64 |
| 8 | enable_program_barrier | +0x2e (bool) | 0x40 | bool |
| 9 | load_dat | +0x38 (bool) | 0x100 | bool |
The offload FieldDescriptorProto is carved byte-exact: tag 0a07 ("offload", 7 bytes), 1802 (number = 2), 2001 (label = optional), 280e (type = TYPE_ENUM = 14), 3216 (.xla.jellyfish.Offload, 0x16 bytes). The descriptor type strings .xla.jellyfish.Offload, .xla.jellyfish.Tiling, and the field names enable_megacore / hbm_bandwidth_adjustment_factor / function_mode / dedup_id / enable_program_barrier / load_dat are all present in .rodata.
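The carve can be replayed with a minimal protobuf wire-format reader. The sketch below rebuilds the record from the tag/value pairs quoted above and decodes it; FieldDesc, ReadVarint, DecodeFieldDescriptor, and OffloadFieldBytes are hypothetical names, and the decoder handles only the two wire types that occur in this record.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// The FieldDescriptorProto members quoted in the text.
struct FieldDesc {
  std::string name;       // field 1, tag 0x0a
  uint64_t number = 0;    // field 3, tag 0x18
  uint64_t label = 0;     // field 4, tag 0x20
  uint64_t type = 0;      // field 5, tag 0x28
  std::string type_name;  // field 6, tag 0x32
};

// Reads one base-128 varint at buf[pos] and advances pos.
inline uint64_t ReadVarint(const std::vector<uint8_t>& buf, size_t& pos) {
  uint64_t value = 0;
  int shift = 0;
  while (pos < buf.size()) {
    const uint8_t byte = buf[pos++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
    shift += 7;
  }
  return value;
}

// Decodes wire type 0 (varint) and 2 (length-delimited); stops on anything else.
inline FieldDesc DecodeFieldDescriptor(const std::vector<uint8_t>& buf) {
  FieldDesc out;
  size_t pos = 0;
  while (pos < buf.size()) {
    const uint64_t tag = ReadVarint(buf, pos);
    const uint64_t field = tag >> 3;
    const uint64_t wire = tag & 7;
    if (wire == 0) {
      const uint64_t v = ReadVarint(buf, pos);
      if (field == 3) out.number = v;
      else if (field == 4) out.label = v;
      else if (field == 5) out.type = v;
    } else if (wire == 2) {
      const uint64_t len = ReadVarint(buf, pos);
      if (len > buf.size() - pos) break;  // truncated input
      const std::string s(buf.begin() + pos, buf.begin() + pos + len);
      pos += len;
      if (field == 1) out.name = s;
      else if (field == 6) out.type_name = s;
    } else {
      break;
    }
  }
  return out;
}

// Rebuilds the record: 0a07 "offload" 1802 2001 280e 3216 ".xla.jellyfish.Offload".
inline std::vector<uint8_t> OffloadFieldBytes() {
  std::vector<uint8_t> b = {0x0a, 0x07};
  const std::string name = "offload";
  b.insert(b.end(), name.begin(), name.end());
  const std::vector<uint8_t> mid = {0x18, 0x02, 0x20, 0x01, 0x28, 0x0e, 0x32, 0x16};
  b.insert(b.end(), mid.begin(), mid.end());
  const std::string type_name = ".xla.jellyfish.Offload";
  b.insert(b.end(), type_name.begin(), type_name.end());
  return b;
}
```

Decoding OffloadFieldBytes() yields name "offload", number 2, label 1, type 14, and a 22-byte (0x16) type name, which matches the carve.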
GOTCHA: offload is a backend-config enum, not a custom-call target or MLIR op kind. The op-type classification keyed by GetSparseCoreConfig is a proto field on the instruction's backend config. This is a different mechanism from the plain SparseCoreAsyncTracker, which keys on the custom-call target string ("AllToAllDynamic") via a separate enum (SparseCoreOperationType); see The Plain Tracker Keys on a Target Name, Not offload. Do not conflate the two: offload (this page) and SparseCoreOperationType are distinct enums with distinct value spaces and distinct consumers.
The xla::jellyfish::Offload Enum
Decoded byte-exact from the EnumDescriptorProto (each value is a 10 NN EnumValueDescriptorProto). All nine enumerator name strings (OFFLOAD_UNSPECIFIED .. OFFLOAD_COMPUTE) are present in .rodata.
| value | enumerator | semantic |
|---|---|---|
| 0 | OFFLOAD_UNSPECIFIED | unset / default |
| 1 | OFFLOAD_EMBEDDING | embedding lookup/update offload |
| 2 | OFFLOAD_GATHER | gather op-class |
| 3 | OFFLOAD_SCATTER | scatter op-class |
| 4 | OFFLOAD_COLLECTIVE | collective (recurse into wrapped op) |
| 5 | OFFLOAD_DATA_FORMATTING | data-formatting op-class |
| 6 | OFFLOAD_KERNEL | generic SC kernel op-class |
| 7 | OFFLOAD_SORT | sort op-class |
| 8 | OFFLOAD_COMPUTE | compute offload |
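For a reimplementation, the table transcribes directly into an enum. The following is a sketch, not the generated protobuf header: the k-prefixed enumerator spellings, IsKnownOffload, OffloadName, and the "UNKNOWN" fallback string are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>

// Values as recovered from the EnumDescriptorProto; C++ spellings are hypothetical.
enum class Offload : int32_t {
  kUnspecified = 0,
  kEmbedding = 1,
  kGather = 2,
  kScatter = 3,
  kCollective = 4,
  kDataFormatting = 5,
  kKernel = 6,
  kSort = 7,
  kCompute = 8,
};

// True when raw is one of the nine recovered values.
inline bool IsKnownOffload(int32_t raw) { return raw >= 0 && raw <= 8; }

// Maps a raw value to its descriptor name; "UNKNOWN" is a placeholder for out-of-range input.
inline std::string OffloadName(int32_t raw) {
  static const char* const kNames[] = {
      "OFFLOAD_UNSPECIFIED", "OFFLOAD_EMBEDDING",       "OFFLOAD_GATHER",
      "OFFLOAD_SCATTER",     "OFFLOAD_COLLECTIVE",      "OFFLOAD_DATA_FORMATTING",
      "OFFLOAD_KERNEL",      "OFFLOAD_SORT",            "OFFLOAD_COMPUTE"};
  return IsKnownOffload(raw) ? kNames[raw] : "UNKNOWN";
}
```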
Scheduler Consumer — MayAddSparseCoreResource @ 0x11000480
TpuAsyncTracker::MayAddSparseCoreResource is the producer that turns an offloaded async op into scheduler resource ids. It first re-applies the same thread guard (async + thread "sparsecore" / kSparseCoreOffloadCandidateThread, else return), calls GetSparseCoreConfig, then switches on offload only if the has-bit is set. The source-level switch is on offload itself; in the compiled code an add 0xfffffffe (that is, offload − 2) produces the jump-table index, so the table covers values {2..7}.
// MayAddSparseCoreResource(this, async_start, &out) @ 0x11000480 (condensed)
SparseCoreConfig cfg; GetSparseCoreConfig(&cfg, async_chain_start(async_start));
int usage = /* 2 = kResourceOccupy on start, 1 = kResourceRelease on done */;
if (cfg.has_bits & 0x4) { // v55 & 4 → has_offload()
switch (cfg.offload) { // v56 = [&cfg + 0x24]
case 2: out.push_back({23, usage}); break; // OFFLOAD_GATHER → kSparseCoreGather
case 3: out.push_back({24, usage}); break; // OFFLOAD_SCATTER → kSparseCoreScatter
case 4: MayAddSparseCoreResource$_0(&out, // OFFLOAD_COLLECTIVE → recurse on
async_wrapped_instruction(async_start), usage); break; // async-wrapped op
case 5: out.push_back({25, usage}); break; // OFFLOAD_DATA_FORMATTING → kSparseCoreDataFormatting
case 6: out.push_back({26, usage}); break; // OFFLOAD_KERNEL → kSparseCoreKernel
case 7: out.push_back({27, usage}); break; // OFFLOAD_SORT → kSparseCoreSort
default: break; // 0/1/8 → no {23..27} arm
}
}
// independent of the offload switch — the general per-core SparseCore resource (id 22):
if (this[+0x13b] == 1) { // a1+315 — "per-core" gate
for (i = 0; i < GetNumSparseCoresUsed(async_chain_start(async_start)); i++)
out.push_back({22, usage}); // id 22 once per used SC core
} else {
out.push_back({22, usage}); // id 22 once
}
| offload | enumerator | scheduler arm (MayAddSparseCoreResource, idx enum − 2) |
|---|---|---|
| 0 | OFFLOAD_UNSPECIFIED | no arm — id 22 path only |
| 1 | OFFLOAD_EMBEDDING | no arm — id 22 path only |
| 2 | OFFLOAD_GATHER | id 23 kSparseCoreGather |
| 3 | OFFLOAD_SCATTER | id 24 kSparseCoreScatter |
| 4 | OFFLOAD_COLLECTIVE | recurse into async_wrapped_instruction ($_0 @ 0x110008e0) |
| 5 | OFFLOAD_DATA_FORMATTING | id 25 kSparseCoreDataFormatting |
| 6 | OFFLOAD_KERNEL | id 26 kSparseCoreKernel |
| 7 | OFFLOAD_SORT | id 27 kSparseCoreSort |
| 8 | OFFLOAD_COMPUTE | no arm — id 22 path only |
The resource ids {22..27} are the SparseCore engine classes in the 47-id scheduler ResourceType enum; see ResourceType Taxonomy for their concurrency caps and hazard classes.
QUIRK: OFFLOAD_COLLECTIVE recurses, it does not emit a lane. Value 4 is the one arm that does not push a kSparseCore* id. Instead it re-enters MayAddSparseCoreResource (the $_0 lambda at 0x110008e0) on the async-wrapped instruction, so a collective wrapped inside an SC-offload async op is classified by the wrapped op's own offload field. A reimplementation that maps value 4 to a resource id will double-count or mis-throttle wrapped collectives.
NOTE: the id-22 general path is independent of offload. The per-core kSparseCore (id 22) emission runs after the offload switch and is gated by a separate byte (this+0x13b == 1), which selects "one id-22 per used SC core" (GetNumSparseCoresUsed) vs "one id-22". So even OFFLOAD_UNSPECIFIED/EMBEDDING/COMPUTE ops, which hit no {23..27} arm, still consume the general SparseCore resource. The offload enum refines which sub-engine; id 22 is the always-present per-core occupancy.
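The scheduler-side behaviour described in this section can be modelled as a standalone function. This is a sketch under stated assumptions: SparseCoreConfigView, ResourceUse, LaneForOffload, and ClassifyForScheduler are hypothetical names, and the $_0 recursion is represented by a caller-supplied callback because this page does not recover what the recursion emits.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical view of the copied proto: only the members the classifier reads.
struct SparseCoreConfigView {
  uint32_t has_bits = 0;  // object +0x10
  int32_t offload = 0;    // object +0x24
};

using ResourceUse = std::pair<int, int>;  // {scheduler resource id, usage}
using Emitted = std::vector<ResourceUse>;

// Sub-engine lane for an offload value, or -1 when the value has no lane.
inline int LaneForOffload(int32_t offload) {
  switch (offload) {
    case 2: return 23;   // OFFLOAD_GATHER
    case 3: return 24;   // OFFLOAD_SCATTER
    case 5: return 25;   // OFFLOAD_DATA_FORMATTING
    case 6: return 26;   // OFFLOAD_KERNEL
    case 7: return 27;   // OFFLOAD_SORT
    default: return -1;  // 0, 1, 4, 8
  }
}

// on_collective stands in for the recursion on the async-wrapped instruction.
inline Emitted ClassifyForScheduler(
    const SparseCoreConfigView& cfg, int usage, bool per_core, int cores_used,
    const std::function<void(Emitted&, int)>& on_collective) {
  Emitted out;
  if (cfg.has_bits & 0x4) {  // has_offload()
    if (cfg.offload == 4) {  // OFFLOAD_COLLECTIVE: recurse, emit no lane here
      if (on_collective) on_collective(out, usage);
    } else {
      const int lane = LaneForOffload(cfg.offload);
      if (lane >= 0) out.push_back({lane, usage});
    }
  }
  // General id-22 path: independent of the offload switch.
  const int copies = per_core ? cores_used : 1;
  for (int i = 0; i < copies; ++i) out.push_back({22, usage});
  return out;
}
```

With has_bits = 0x4 and offload = 3 the function returns {24, usage} followed by {22, usage}; clearing the has-bit leaves only the id-22 entry.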
Reservation Consumer — GetSparseCoreResources @ 0x10fdc0a0
The reservation-map twin reads the same SparseCoreConfig.offload field (same +0x24 value, same +0x10 & 4 has-bit), but indexes with enum − 1 (a dec then cmp 6), so its live range is {1..7} — it additionally covers OFFLOAD_EMBEDDING (value 1) in its own arm.
// (anon)::GetSparseCoreResources(async_start) @ 0x10fdc0a0 (condensed)
SparseCoreConfig cfg; GetSparseCoreConfig(&cfg, async_start);
if ((cfg.has_bits & 4) != 0) { // v29[16] & 4 → has_offload()
switch (cfg.offload) { // index = enum, table base enum−1, range {1..7}
case 1: /* OFFLOAD_EMBEDDING → embedding/general reservation arm */ ...
case 2: /* OFFLOAD_GATHER → kSparseCoreGather */ ...
case 3: /* OFFLOAD_SCATTER → kSparseCoreScatter */ ...
case 4: /* OFFLOAD_COLLECTIVE → collective arm */ ...
case 5: /* OFFLOAD_DATA_FORMATTING → kSparseCoreDataFmt */ ...
case 6: /* OFFLOAD_KERNEL → kSparseCoreKernel */ ...
case 7: /* OFFLOAD_SORT → kSparseCoreSort */ ...
}
}
GOTCHA: scheduler indexes enum − 2, reservation indexes enum − 1. The two consumers read the identical proto field at the identical offset, but their jump-table bases differ by one. The scheduler (MayAddSparseCoreResource) starts its dense table at value 2 (OFFLOAD_GATHER), so OFFLOAD_EMBEDDING (1) hits no scheduler arm; the reservation map (GetSparseCoreResources) starts at value 1, so it does reserve for embedding. A reimplementer must keep both index bases: do not assume the two classifiers share a switch.
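The two index bases can be written out as unsigned range checks, the usual shape of a compiled dense switch. This is a sketch: the function names are hypothetical, and the scheduler's upper bound of 5 is an assumption inferred from the stated {2..7} coverage (the reservation side's compare against 6 is stated above).

```cpp
#include <cstdint>

// Scheduler: add 0xfffffffe (subtract 2), then an unsigned bound check for {2..7}.
inline bool SchedulerTableIndex(int32_t offload, uint32_t* index) {
  const uint32_t i = static_cast<uint32_t>(offload) + 0xfffffffeu;
  if (i > 5) return false;  // assumed bound; values 0, 1 wrap to large numbers
  *index = i;
  return true;
}

// Reservation: dec, then compare against 6, giving the range {1..7}.
inline bool ReservationTableIndex(int32_t offload, uint32_t* index) {
  const uint32_t i = static_cast<uint32_t>(offload) - 1u;
  if (i > 6) return false;  // value 0 wraps to 0xffffffff
  *index = i;
  return true;
}
```

OFFLOAD_EMBEDDING (1) is the value that separates the two: it fails the scheduler check and maps to slot 0 of the reservation table.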
See SC Queue Assignment & Reservation for the reservation-map's resource→limit structure.
The SC-Offload Gate Bits
The SparseCore-offload scheduler sub-pass runs only when SparseCoreCompiler::RunHloScheduler (@ 0x1306f820) finds the gate predicate true. Two of its conjuncts are Target bitfield reads:
// SparseCoreCompiler::RunHloScheduler gate @ 0x1306f820 (object offsets in bytes)
runSC = TpuChipConfig::Megachip( Target[+0x3b8][+0x18] ) // @0x1306f84c
&& *(int*)( Target[+0x3b8] + 0x94 ) > 0 // CoresPerChip(kSparseCore) > 0 @0x1306f863
&& ( (Target[+0x628] & 4) != 0 || Target[+0x540] != 0 ) // the two gate bits
&& offloader_util::ModuleContainsLEMSparseCoreInstruction(M) // @0x1306fbc8
&& FLAGS_xla_sc_enable_latency_hiding_scheduler; // @0x1306fc04
In the decompile this is (*((_BYTE*)this + 1576) & 4) != 0 || *((_BYTE*)this + 1344) — byte 1576 = 0x628, byte 1344 = 0x540. The *(int*)(... + 148) > 0 is the 0x94 CoresPerChip(SC) read. The whole predicate appears twice (the eager check and the SC-path re-test).
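Written as a plain predicate, the gate looks like the sketch below. ScGateInputs and both function names are hypothetical; each member stands for one of the reads listed above rather than for a real accessor.

```cpp
#include <cstdint>

// One member per conjunct of the gate; names are illustrative only.
struct ScGateInputs {
  bool megachip;                       // TpuChipConfig::Megachip(...)
  int sc_cores_per_chip;               // dword at +0x94, CoresPerChip(kSparseCore)
  uint64_t target_bits_628;            // qword at Target +0x628
  bool target_byte_540;                // byte at Target +0x540
  bool module_has_lem_sc_instruction;  // ModuleContainsLEMSparseCoreInstruction
  bool lhs_flag;                       // FLAGS_xla_sc_enable_latency_hiding_scheduler
};

// The two Target reads: capability bit-2, or the platform_type == 2 byte.
inline bool TargetAllowsScOffload(const ScGateInputs& in) {
  return (in.target_bits_628 & 0x4) != 0 || in.target_byte_540;
}

// Full conjunction, in the order the decompile evaluates it.
inline bool ShouldRunScOffloadScheduler(const ScGateInputs& in) {
  return in.megachip && in.sc_cores_per_chip > 0 && TargetAllowsScOffload(in) &&
         in.module_has_lem_sc_instruction && in.lhs_flag;
}
```

Note that bit-0 of the same qword does not satisfy the gate on its own; only mask 0x4 or the +0x540 byte does.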
Where the bits are set — Target::Init @ 0x1d60fc20
Both fields are written inside jellyfish::Target::Init. The register r12/v342 is the Target* being initialized; v98/*v98 is the first scalar of the TpuTopology (the platform-type enum).
// jellyfish::Target::Init @ 0x1d60fc20 (relevant writes)
Target[+0x540] = (TpuTopology[+0] == 2); // platform_type == 2 (decompile: _R12+1344)
Target[+0x541] = (TpuTopology[+0] == 1); // platform_type == 1 (decompile: _R12+1345)
// inside the predicate-gated config-append loop:
Target[+0x628] |= 1; // bit-0 (config sub-field A has-bit) @0x1d611d52
Target[+0x628] |= 4; // bit-2 (SC-offload-capability has-bit) @0x1d612121
- Target[+0x540] is a bool = (TpuTopology[+0] == 2). The first TpuTopology scalar is the internal platform-type enum; value 2 is the iss (simulator) path (platform_type == 1 lands in the sibling byte Target[+0x541]). So Target[+0x540] != 0 force-takes the SC path on the simulator regardless of the capability bit.
- Target[+0x628] is a _has_bits_-style qword (decompile *((_QWORD*)_R12 + 197); qword index 197 = byte 0x628). Bit-2 (mask 0x4) is OR'd in inside an unrolled config-append loop that is itself gated by the SC-offload feature-detect; it is the SC-offload-capability has-bit, set for the eligible (newest-gen) part. Bit-0 (mask 0x1) is OR'd earlier in the same loop for a sibling config sub-field. The gate predicate (Target[+0x628] & 4) == 0 → read Target[+0x540] is replayed verbatim inside Target::Init itself (combined with Megachip ∧ CoresPerChip(SC) > 0); this is the SC-offload feature-detect.
| gate bit | object off | meaning | set in Target::Init |
|---|---|---|---|
| Target[+0x628] & 4 (bit-2) | +0x628 qword | SC-offload-capability has-bit (per-gen, predicate-gated) | OR'd \|= 4 @ 0x1d612121 |
| Target[+0x540] | +0x540 byte | platform_type == 2 (iss/simulator) | = (TpuTopology[+0] == 2) @ 0x1d610b1b |
| Target[+0x541] | +0x541 byte | platform_type == 1 (sibling, not in gate) | = (TpuTopology[+0] == 1) @ 0x1d610b29 |
| Target[+0x628] & 1 (bit-0) | +0x628 qword | sibling config-append has-bit (not in gate) | OR'd \|= 1 @ 0x1d611d52 |
NOTE: the exact proto sub-field naming bit-2 was not isolated. The bit-set site, mask, and value are byte-exact, and the bit sits in the predicate-gated config-append loop alongside bit-0 and two SSO strings (object +0x580/+0x5f0). But the single descriptor entry that names this SC-offload-capability sub-field (a nested config field copied from the chip's vector_isa/TpuSequencerParts) was not pinned to one descriptor. The role (an SC-offload-capability flag the scheduler gate reads) is byte-exact regardless. Confidence: bit position CONFIRMED; sub-field proto name INFERRED.
GOTCHA: platform_type enum order is descriptor-string order. TpuTopology[+0] is the topology's first scalar = the internal TpuPlatformType enum (per the ValidateArgs(TpuPlatformType, …) signature and the proven TpuTopology[+0x8] = TpuChipParts* layout). The gate compares it == 2; the value→name pairing {0 hardware, 1 grm, 2 iss} is taken from descriptor-string order (TpuPlatformTypeToProto = type + 1), not a separately decoded platform_type() getter. Confidence: the == 2 comparison and its gate role CONFIRMED; the enum value→name pairing INFERRED.
Per-Generation Default Basis — TpuVersion == 5
The two SC-offload-concurrency knobs that feed the scheduler default to enabled on exactly one chip generation. Both compute their hardware default as TpuChipParts[+0] == 5 (i.e. tpu::TpuVersion == 5), then let a TpuCompilationEnvironment AutoOr<bool> flag override via the 0x100 "is-set" bit.
// ShouldEnableConcurrentSparseCoreOffloading(tce_view, topo, b) @ 0x1d6b6f80
hw_default = (TpuChipParts[+0] == 5) & ~b; // *(_DWORD*)(topo+8) == 5 → TpuVersion 5
flag = tce[+0x458] ? tce[+0x458] : &AutoProto_globals_; // a1 + 1112 = 0x458
v = AutoOr<bool>::FromProtoOrDie(flag);
return (v & 0x100) ? /*flag set → use flag value*/ : hw_default;
// EnableSparseCoreOffloadQueuingInLhs(tce_view, topo) @ 0x1d6b81e0
hw_default = (TpuChipParts[+0] == 5); // *(_DWORD*)(topo+8) == 5
flag = tce[+0x730] ? tce[+0x730] : &AutoProto_globals_; // a1 + 1840 = 0x730
... same AutoOr<bool> 0x100 override ...
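The override logic shared by both knobs can be sketched as follows. Two points are assumptions, not recovered facts: that the low byte of the packed AutoOr<bool> result carries the flag value (only the 0x100 is-set bit is byte-exact), and the parameter name suppress for the third argument b. All function names are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t kAutoOrIsSet = 0x100;  // "is-set" bit from the decompile

// Flag value wins when the is-set bit is present, else the hardware default.
// ASSUMPTION: the low byte of the packed value holds the bool.
inline bool ResolveAutoOrBool(uint32_t packed, bool hw_default) {
  if (packed & kAutoOrIsSet) return (packed & 0xff) != 0;
  return hw_default;
}

// Concurrency knob: hardware default is (TpuVersion == 5) & ~b.
inline bool ShouldEnableConcurrentOffloading(int tpu_version, bool suppress,
                                             uint32_t packed_flag) {
  const bool hw_default = (tpu_version == 5) && !suppress;
  return ResolveAutoOrBool(packed_flag, hw_default);
}

// Queuing-in-LHS knob: hardware default is TpuVersion == 5, with no extra argument.
inline bool EnableOffloadQueuingInLhs(int tpu_version, uint32_t packed_flag) {
  return ResolveAutoOrBool(packed_flag, tpu_version == 5);
}
```

An unset flag (packed value 0) therefore yields true only for TpuVersion 5, while an explicitly set flag overrides the generation default in either direction.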
TpuChipParts[+0] is the TpuVersion (the 0-based chip-generation enum); TpuChipParts::ToProto → TpuVersionToProto(v) = v + 1 (confirmed: the decompiled body is literally return v + 1). The codename table from TpuVersionFromString (@ 0x20b3a5a0, init-list @ 0x220117b0):
| TpuVersion (internal) | codename | proto value (= internal + 1) |
|---|---|---|
| 0 | jellyfish | 1 |
| 1 | dragonfish | 2 |
| 2 | pufferfish | 3 |
| 3 | viperfish | 4 |
| 4 | ghostlite | 5 |
| 5 | 6acc60406 | 6 ← SC-offload concurrency default ON |
All six codenames are present verbatim in .rodata (init-list off_220117B0; the value-5 string "6acc60406" lives at .rodata VMA 0x863f0cf, len 9, paired with TpuVersion == 5). See TPU Version Codename Matrix for the full generation map.
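The codename table and the +1 proto mapping reduce to a lookup and an increment. The sketch below uses hypothetical function names and a -1 return for unknown names, which is an illustrative choice rather than the behaviour of TpuVersionFromString.

```cpp
#include <string>

// Internal TpuVersion values 0..5, in init-list order.
inline const char* const* TpuCodenames() {
  static const char* const kNames[6] = {"jellyfish", "dragonfish", "pufferfish",
                                        "viperfish", "ghostlite",  "6acc60406"};
  return kNames;
}

// Returns the internal version for a codename, or -1 if it is not in the table.
inline int TpuVersionFromCodename(const std::string& name) {
  for (int v = 0; v < 6; ++v) {
    if (name == TpuCodenames()[v]) return v;
  }
  return -1;
}

// Proto value is the internal value plus one.
inline int TpuVersionToProtoValue(int internal_version) { return internal_version + 1; }

// The per-generation default basis used by both SC-offload knobs.
inline bool IsScOffloadDefaultOnGeneration(int internal_version) {
  return internal_version == 5;
}
```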
NOTE: the override flag field numbers were not decoded. Both knobs read an AutoOr<bool> at TCE _impl_ offsets 0x458 (concurrency) / 0x730 (offload-queuing-in-LHS), falling back to AutoProto_globals_ when unset; the offsets and the 0x100-bit override are byte-exact, but the two proto field numbers (the _InternalSerialize tags at those offsets) were not isolated. Confidence: offsets + override mechanism CONFIRMED; field numbers PARTIAL. See TpuCompilationEnvironment and TCE Field Offsets & Defaults.
The Plain Tracker Keys on a Target Name, Not offload
Not every SparseCore scheduler keys on GetSparseCoreConfig. When the gate holds, the SC-offload schedule is produced by a plain SparseCoreAsyncTracker first, and that tracker classifies async-schedulable ops by opcode + custom-call target name, not the offload backend enum. This is the cleanest way to see that offload is one of two distinct SC classification mechanisms in the binary.
// SparseCoreAsyncTracker::IsSupportedAsyncStart(hlo) @ 0x134964c0
bool IsSupportedAsyncStart(const HloInstruction *h) {
int op = h->opcode(); // [h+0xc]
if (op == 12) return true; // 0xc all-to-all
if (op == 17) return true; // 0x11 async-start
if (op == 49) // 0x31 custom-call
return SparseCoreOperationTypeFromString(h->custom_call_target()) == 8; // "AllToAllDynamic"
return false;
}
// IsSupportedAsyncDone @ 0x13496520 is identical except op 16 (0x10 async-done) replaces 17.
SparseCoreOperationTypeFromString (@ 0x14b7f060) is a chained EqualsIgnoreCase mapper over a separate enum, SparseCoreOperationType, whose first eight values are confirmed in order: 1 SparseMap, 2 CooToCsr, 3 CooToEll, 4 SparseMapRow, 5 SortLexicographic, 6 ReduceDuplicates, 7 EllToCsr, 8 AllToAllDynamic (then 9 ScSendToTc, 10 ScReceiveFromTc, … continuing well past 8). The plain tracker gates only on == 8 ("AllToAllDynamic") — it overlaps SC all-to-all ops, and (via PostProcessScheduleGraph → FindNearestAllToAlls @ 0x13496600) biases the schedule toward them.
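A chained case-insensitive mapper over the eight confirmed names, plus the plain tracker's start predicate, can be sketched as below. The function names, the return value 0 for an unmatched string, and the omission of values 9 and above are assumptions; the real table continues past 8.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// ASCII case-insensitive equality.
inline bool EqualsIgnoreCaseAscii(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  for (size_t i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i]))) {
      return false;
    }
  }
  return true;
}

// Maps a custom-call target to values 1..8; returns 0 when nothing matches (assumed).
inline int SparseCoreOperationTypeFromName(const std::string& target) {
  static const char* const kNames[8] = {
      "SparseMap",         "CooToCsr",         "CooToEll", "SparseMapRow",
      "SortLexicographic", "ReduceDuplicates", "EllToCsr", "AllToAllDynamic"};
  for (int i = 0; i < 8; ++i) {
    if (EqualsIgnoreCaseAscii(target, kNames[i])) return i + 1;
  }
  return 0;
}

// Model of IsSupportedAsyncStart: opcodes 12 and 17 pass; custom-call (49) needs type 8.
inline bool IsSupportedAsyncStartModel(int opcode, const std::string& custom_call_target) {
  if (opcode == 12 || opcode == 17) return true;
  if (opcode == 49) return SparseCoreOperationTypeFromName(custom_call_target) == 8;
  return false;
}
```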
| classifier | keyed on | enum | values used |
|---|---|---|---|
| MayAddSparseCoreResource (this page) | SparseCoreConfig.offload (field 2) | xla::jellyfish::Offload | {2..7} → ids {23..27} + recurse |
| GetSparseCoreResources (reservation) | SparseCoreConfig.offload (field 2) | xla::jellyfish::Offload | {1..7} |
| SparseCoreAsyncTracker::IsSupportedAsync{Start,Done} | custom-call target string | SparseCoreOperationType | == 8 (AllToAllDynamic) |
QUIRK: two enums, one phrase "SparseCore op type". xla::jellyfish::Offload (a backend-config enum, 9 values, drives resource lanes) and SparseCoreOperationType (a custom-call target-name enum, ≥ 8 values, drives async-schedulability) are easy to conflate because both describe "what kind of SparseCore op this is". They are wholly separate: different namespaces, different value spaces, different read paths, different consumers. GetSparseCoreConfig resolves the first; custom_call_target() + SparseCoreOperationTypeFromString the second.
Worked Example — An OFFLOAD_SCATTER Async Op on a Newest-Gen Part
A SparseCore-offloaded scatter, on a Megachip newest-generation (TpuVersion == 5) part:
%sc = async-start(%coo), execution_thread="sparsecore",
backend_config = { sparse_core_config { offload: OFFLOAD_SCATTER } }
%sc.d = async-done(%sc)
Walking the resolver and the classifiers:
- Gate. Target::Init set Target[+0x628] |= 4 for the newest-gen part, so (Target[+0x628] & 4) != 0 is true; with Megachip, CoresPerChip(SC) > 0, an LEM SparseCore instruction in the module, and the LHS flag on, SparseCoreCompiler::RunHloScheduler enters the SC-offload sub-passes.
- Resolve. For %sc, MayAddSparseCoreResource re-checks the thread ("sparsecore", len 10, passes) and calls GetSparseCoreConfig, which copy-constructs the SparseCoreConfig. has_offload() (+0x10 & 4) is set; offload (+0x24) reads 3 (OFFLOAD_SCATTER).
- Classify. The switch on 3 emits {24, kResourceOccupy}, scheduler resource id 24 (kSparseCoreScatter), on the async-start, and the matching {24, kResourceRelease} on %sc.d. Independently, the id-22 path emits the general per-core kSparseCore occupancy.
- Throttle. The scheduler caps concurrent id-24 ops at GetNumAvailableResources(24) (the ..._scatter_overlap_limit TCE knob) and treats id 24 as unsharable (hazard 0); see ResourceType Taxonomy.
- Default-instance contrast. Had the backend config omitted sparse_core_config, GetSparseCoreConfig would return the default instance, has_offload() would be clear, and %sc would consume only the general id-22 lane, with no scatter-specific throttle.
This is exactly why offload is the SC op-type classifier: it is the one proto field that refines a generic SparseCore async op into a specific engine-class resource the scheduler can model.
Confidence Summary
| Claim | Evidence |
|---|---|
| GetSparseCoreConfig returns a copy-constructed SparseCoreConfig (globals fallback) | 0x1c868d20: SparseCoreConfig::SparseCoreConfig(this, 0, src), SparseCoreConfig_globals_ |
| Thread guard: "sparsecore" (len 10) ∨ kSparseCoreOffloadCandidateThread (len 28); opcode kAsyncStart | 0x1c868d20 SIMD compare + CHECK strings (src 472/475) |
| offload = field 2, enum .xla.jellyfish.Offload, object +0x24, has-bit +0x10 & 4 | _InternalSerialize @ 0x1d6dfae0; FieldDescriptorProto carve; consumer reads |
| Offload enum OFFLOAD_UNSPECIFIED 0 .. OFFLOAD_COMPUTE 8 (9 values) | EnumDescriptorProto @ 0xbfa1f9f; all 9 strings in .rodata |
| Scheduler arm map {2→23, 3→24, 4→recurse, 5→25, 6→26, 7→27}, idx enum − 2 | MayAddSparseCoreResource @ 0x11000480 switch (decompiled) |
| Reservation map reads same field, idx enum − 1, range {1..7} (covers EMBEDDING) | GetSparseCoreResources @ 0x10fdc0a0: v29[16] & 4, case 1..7 |
| id-22 (kSparseCore) path is independent of offload, gated this+0x13b == 1 | 0x11000480: post-switch loop on GetNumSparseCoresUsed |
| Gate predicate (Target[+0x628] & 4) ∨ Target[+0x540] plus Megachip/CoresPerChip(SC)/LEM/flag | RunHloScheduler @ 0x1306f820: this+1576 & 4, this+1344, +148 > 0 |
| Target[+0x540] = (TpuTopology[+0] == 2); Target[+0x541] = (== 1) | Target::Init @ 0x1d60fc20 lines _R12+1344/+1345 = *v98 == 2/1 |
| Target[+0x628] \|= 4 (bit-2) and \|= 1 (bit-0) in predicate-gated config loop | Target::Init: v342+197 = … \| 4 / \| 1 (qword 197 = 0x628) |
| The proto sub-field that names bit-2 (SC-offload-capability) | bit-set site byte-exact; descriptor entry not isolated |
| platform_type value→name pairing {0 hardware, 1 grm, 2 iss} | descriptor-string order + ToProto = type+1; == 2 comparison byte-exact |
| Per-gen default TpuVersion == 5; AutoOr<bool> 0x100 override at TCE +0x458/+0x730 | 0x1d6b6f80 (*(topo+8)==5, a1+1112); 0x1d6b81e0 (a1+1840) |
| TCE override field numbers for the two SC-offload knobs | offsets + override byte-exact; field numbers not decoded |
| TpuVersion 0..5 = jellyfish/dragonfish/pufferfish/viperfish/ghostlite/6acc60406; proto = +1 | TpuVersionFromString init-list @ 0x220117b0 (v5 string "6acc60406" @ .rodata 0x863f0cf); TpuVersionToProto body v+1 |
| Plain SparseCoreAsyncTracker keys on opcode {0xc, 0x11/0x10, 0x31} + SparseCoreOperationType == 8 | IsSupportedAsyncStart/Done @ 0x134964c0/0x13496520; FromString @ 0x14b7f060 |
| SparseCoreOperationType values 1..8 = SparseMap..AllToAllDynamic | 0x14b7f060 chained EqualsIgnoreCase (decompiled, in order) |
Cross-References
- SparseCore Overview: the SCS/TAC/TEC engine model the offloaded op-classes target.
- SparseCore Architecture: the hardware engines behind the kSparseCore{Gather,Scatter,DataFormatting,Kernel,Sort} resource lanes.
- SC Back-End Pipeline: where SC-offload scheduling sits in the SparseCore compile flow.
- Stream Gather/Scatter: the gather/scatter datapath that OFFLOAD_GATHER/OFFLOAD_SCATTER route to.
- SC Queue Assignment & Reservation: the reservation-map (GetSparseCoreResources) twin of this classifier.
- SparseCore vs Neuron MatMultSparse: cross-architecture contrast of the sparse offload model.
- ResourceType Taxonomy: the 47-id scheduler enum; the caps/hazards for ids 22..27 this page emits, and the three-tracker selection gate.
- LatencyHidingScheduler Core: the list scheduler that consumes the resource ids.
- TpuCompilationEnvironment: the TCE that carries the AutoOr<bool> SC-offload override flags (+0x458/+0x730).
- TCE Field Offsets & Defaults: TCE _impl_ offset table.
- TPU Version Codename Matrix: the TpuVersion 0..5 generation map (default-on basis = version 5).
- TPU Topology Struct: the TpuTopology whose first scalar (platform_type) sets Target[+0x540].
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part IX, SparseCore & BarnaCore / SparseCore back-end