GetSparseCoreConfig — The Offload Op-Type Enum Source

Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions differ. All .text/.rodata addresses are virtual; for this binary .text VMA == file offset 0xe63c000, .rodata VMA == file offset 0x84a0000, and .data.rel.ro VMA − 0x200000 == file offset.

Abstract

GetSparseCoreConfig is the one resolver that every SparseCore-offload classification site funnels through: given an async HLO instruction, it parses the instruction's BackendConfig and hands back a fully-materialized xla::jellyfish::SparseCoreConfig proto (a copy, never a borrowed pointer). The single field the scheduler cares about is field 2 offload, a TYPE_ENUM of type .xla.jellyfish.Offload, stored in the C++ object at +0x24 with its presence has-bit at +0x10 mask 0x4. That one enum decides whether an offloaded async op consumes a SparseCore gather / scatter / data-formatting / kernel / sort resource lane, recurses into a wrapped collective, or falls through to the per-core general SparseCore resource — i.e. which physical SparseCore engine class the latency-hiding scheduler throttles the op against.

This page recovers three byte-exact things. First, the GetSparseCoreConfig resolver itself (0x1c868d20): its thread-name guard (kSparseCoreThread / kSparseCoreOffloadCandidateThread), its copy-construction of SparseCoreConfig with a globals fallback, and the consumer-side read of offload at +0x24 behind has-bit +0x10 & 4. Second, the full xla::jellyfish::Offload enum (OFFLOAD_UNSPECIFIED 0 .. OFFLOAD_COMPUTE 8) and exactly which enumerator maps to which scheduler resource arm (enum − 2 indexing in the scheduler, enum − 1 in the reservation map). Third, the two SC-offload gate bits, Target[+0x628] & 4 (an SC-offload-capability has-bit) and Target[+0x540] (a platform_type == 2 bool), traced to where jellyfish::Target::Init sets them, plus the per-generation default basis (TpuVersion == 5).

For reimplementation, the contract is:

  • GetSparseCoreConfig(async_start) → SparseCoreConfig; the consumer reads offload (field 2, object +0x24) only when has-bit +0x10 & 4 is set, else the op classifies as no SC resource.
  • The offload enumerator selects the SC resource lane: GATHER/SCATTER/DATA_FORMATTING/KERNEL/SORT → kSparseCore{Gather,Scatter,DataFormatting,Kernel,Sort} (scheduler ids 23/24/25/26/27); COLLECTIVE → recurse into the async-wrapped instruction; UNSPECIFIED/EMBEDDING/COMPUTE → no {23..27} lane (the general id-22 path).
  • The SC-offload scheduler sub-pass runs only when the gate holds: Megachip ∧ CoresPerChip(SC) > 0 ∧ (Target[+0x628] & 4 ∨ Target[+0x540]) ∧ ModuleContainsLEMSparseCoreInstruction ∧ FLAGS_xla_sc_enable_latency_hiding_scheduler.
  • On real hardware the gate bit Target[+0x628] & 4 is set per-generation; the SC-offload-concurrency defaults key on TpuVersion == 5 (the newest generation), overridable by a TCE AutoOr<bool> flag.

| Item | Value |
|---|---|
| Resolver | xla::jellyfish::backend_config_util::GetSparseCoreConfig @ 0x1c868d20 |
| Returns | xla::jellyfish::SparseCoreConfig (copy; ctor-copy @ 0x1d6df7c0) |
| Classifier field | offload (field 2, enum .xla.jellyfish.Offload), object +0x24, has-bit +0x10 & 4 |
| Offload enum | 9 enumerators, OFFLOAD_UNSPECIFIED 0 .. OFFLOAD_COMPUTE 8 |
| Scheduler consumer | TpuAsyncTracker::MayAddSparseCoreResource @ 0x11000480 (index enum − 2) |
| Reservation consumer | (anon)::GetSparseCoreResources @ 0x10fdc0a0 (index enum − 1) |
| Gate bits | Target[+0x628] & 4 (SC-offload-capability) ∨ Target[+0x540] != 0 (platform_type == 2) |
| Gate site | SparseCoreCompiler::RunHloScheduler @ 0x1306f820 |
| Bits set in | jellyfish::Target::Init @ 0x1d60fc20 |
| Per-gen default | TpuVersion == 5 (ShouldEnableConcurrentSparseCoreOffloading @ 0x1d6b6f80) |
| Source file | platforms/xla/service/jellyfish/lowering/backend_config_util.cc |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |

The Resolver — GetSparseCoreConfig @ 0x1c868d20

The resolver takes the async-start HLO and returns a SparseCoreConfig by value (caller-provided sret buffer). It is not a pointer accessor: it copy-constructs a fresh proto, so the consumer's offload/has-bit reads are on a private copy that outlives the instruction's backend-config arena.

// backend_config_util::GetSparseCoreConfig(this_sret, async_start)  @ 0x1c868d20
SparseCoreConfig *GetSparseCoreConfig(SparseCoreConfig *this, const HloInstruction *a2) {
    if (!a2) { SparseCoreConfig::SparseCoreConfig(this, /*arena=*/0); return this; }  // null → default
    CHECK(a2->opcode() == kAsyncStart);                          // [a2+0xc] == 17  (src line 472)
    // thread-name guard (src line 475):
    auto t = a2->async_execution_thread();
    CHECK(t == "sparsecore"                                      // len 10 "sparsecore"
       || t == kSparseCoreOffloadCandidateThread);               // len 28
    auto cfg = a2->backend_config<jellyfish::BackendConfig>();    // StatusOr<BackendConfig>
    if (cfg.ok()) {
        const SparseCoreConfig *src = cfg->sparse_core_config();
        if (!src) src = &SparseCoreConfig_globals_;               // default-instance fallback
        SparseCoreConfig::SparseCoreConfig(this, /*arena=*/0, src);// copy-construct
    } else {
        SparseCoreConfig::SparseCoreConfig(this, /*arena=*/0);     // parse failed → default
    }
    /* unref the StatusOr's status rep, destroy the BackendConfig temp */
    return this;
}

Two guards bracket the parse:

  • Opcode guard. The instruction must be kAsyncStart (opcode 17 = 0x11). The CHECK message is "async_start->opcode() == HloOpcode::kAsyncStart" (src line 472).
  • Thread-name guard. The async-execution thread must be "sparsecore" (the 10-byte kSparseCoreThread, compared via the 8-byte immediate 0x6F63657372617073 = "sparseco" plus the 2-byte immediate 0x6572 = "re", both read little-endian) or the 28-byte kSparseCoreOffloadCandidateThread. This is the SIMD vptest compare in the disassembly; the CHECK message names both threads (src line 475).
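
The immediates in the thread-name guard can be checked by unpacking them byte by byte. The sketch below assumes little-endian byte order (x86-64); the helper names are hypothetical, and because the 28-byte candidate-thread string is not quoted on this page, the caller has to supply it.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Unpack the low `n` bytes of a little-endian immediate into characters.
inline std::string BytesOfImmediate(std::uint64_t imm, int n) {
    std::string out;
    for (int i = 0; i < n; ++i) {
        out.push_back(static_cast<char>((imm >> (8 * i)) & 0xff));
    }
    return out;
}

// The 10-byte thread name as the two compare immediates spell it.
inline std::string SparseCoreThreadFromImmediates() {
    return BytesOfImmediate(0x6F63657372617073ULL, 8) + BytesOfImmediate(0x6572, 2);
}

// Hypothetical stand-in for the guard. `candidate_thread` is the 28-byte
// kSparseCoreOffloadCandidateThread value, which must come from the binary.
inline bool PassesThreadGuard(std::string_view thread, std::string_view candidate_thread) {
    const std::string name = SparseCoreThreadFromImmediates();
    return thread == std::string_view(name) || thread == candidate_thread;
}
```

The 8-byte immediate unpacks to "sparseco" and the 2-byte immediate to "re", which together give the 10-byte name.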

NOTE — the resolver returns a default, never a null pointer. On a null instruction, a non-kAsyncStart opcode (in the MayAdd* callers it is pre-filtered, not CHECKed), a parse failure, or a BackendConfig with no sparse_core_config sub-message, the result is a default-constructed SparseCoreConfig — all has-bits clear. The consumer's has_offload() test (+0x10 & 4) is therefore the real gate: a default proto has the bit clear and classifies as "no SC resource". A reimplementation must not treat the resolver as fallible at the call site; the fallibility is folded into the cleared has-bits.


The SparseCoreConfig Proto — Field Map

GetSparseCoreConfig returns the full SparseCoreConfig; only offload (field 2) is read by the scheduler/reservation classifiers, but the complete field map (from SparseCoreConfig::_InternalSerialize @ 0x1d6dfae0) fixes the object layout and confirms offload lives at object +0x24 behind has-bit +0x10 & 4.

| field | name | object offset | has-bit (obj+0x10) | proto type |
|---|---|---|---|---|
| 1 | tiling | +0x20 (i32) | 0x2 | enum .xla.jellyfish.Tiling |
| 2 | offload ← the SC op-type enum | +0x24 (i32) | 0x4 | enum .xla.jellyfish.Offload |
| 3 | comp_env | +0x18 (msg) | 0x1 | msg ScCompilationEnvironment |
| 4 | enable_megacore | +0x2c (bool) | 0x10 | bool |
| 5 | hbm_bandwidth_adjustment_factor | +0x28 (f32) | 0x8 | float (fixed32) |
| 6 | function_mode | +0x2d (byte) | 0x20 | enum/bool |
| 7 | dedup_id | +0x30 (i64) | 0x80 | int64 |
| 8 | enable_program_barrier | +0x2e (bool) | 0x40 | bool |
| 9 | load_dat | +0x38 (bool) | 0x100 | bool |
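
The field map can be pinned down as a layout-only mirror struct whose offsets are checked at compile time. This is a sketch, not the real protobuf class: the struct name, the opaque 16-byte header, and the word at +0x14 are assumptions (the page describes neither), and comp_env is modelled as a 64-bit slot rather than a real message pointer.

```cpp
#include <cstddef>
#include <cstdint>

// Layout-only mirror of SparseCoreConfig (hypothetical name).
struct SparseCoreConfigLayout {
    std::uint8_t  header[0x10];                     // +0x00 not described here (assumed opaque)
    std::uint32_t has_bits;                         // +0x10 presence bits
    std::uint32_t unknown_14;                       // +0x14 not described here (assumption)
    std::uint64_t comp_env;                         // +0x18 field 3, message slot
    std::int32_t  tiling;                           // +0x20 field 1
    std::int32_t  offload;                          // +0x24 field 2
    float         hbm_bandwidth_adjustment_factor;  // +0x28 field 5
    bool          enable_megacore;                  // +0x2c field 4
    std::uint8_t  function_mode;                    // +0x2d field 6
    bool          enable_program_barrier;           // +0x2e field 8
    std::int64_t  dedup_id;                         // +0x30 field 7
    bool          load_dat;                         // +0x38 field 9
};

static_assert(offsetof(SparseCoreConfigLayout, has_bits) == 0x10, "has-bits word");
static_assert(offsetof(SparseCoreConfigLayout, comp_env) == 0x18, "comp_env");
static_assert(offsetof(SparseCoreConfigLayout, tiling) == 0x20, "tiling");
static_assert(offsetof(SparseCoreConfigLayout, offload) == 0x24, "offload");
static_assert(offsetof(SparseCoreConfigLayout, hbm_bandwidth_adjustment_factor) == 0x28, "hbm");
static_assert(offsetof(SparseCoreConfigLayout, enable_megacore) == 0x2c, "enable_megacore");
static_assert(offsetof(SparseCoreConfigLayout, function_mode) == 0x2d, "function_mode");
static_assert(offsetof(SparseCoreConfigLayout, enable_program_barrier) == 0x2e, "barrier");
static_assert(offsetof(SparseCoreConfigLayout, dedup_id) == 0x30, "dedup_id");
static_assert(offsetof(SparseCoreConfigLayout, load_dat) == 0x38, "load_dat");

constexpr std::uint32_t kHasOffloadMask = 0x4;  // has-bit for field 2

// Consumer-side presence test: read offload only when this returns true.
inline bool HasOffload(const SparseCoreConfigLayout& c) {
    return (c.has_bits & kHasOffloadMask) != 0;
}
```

A value-initialized instance has every has-bit clear, which matches the "default proto classifies as no SC resource" behaviour described for the resolver.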

The offload FieldDescriptorProto is carved byte-exact: tag 0a07 ("offload", 7 bytes), 1802 (number = 2), 2001 (label = optional), 280e (type = TYPE_ENUM = 14), 3216 (.xla.jellyfish.Offload, 0x16 bytes). The descriptor type strings .xla.jellyfish.Offload, .xla.jellyfish.Tiling, and the field names enable_megacore / hbm_bandwidth_adjustment_factor / function_mode / dedup_id / enable_program_barrier / load_dat are all present in .rodata.
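
The carve above is ordinary protobuf wire format, so it can be replayed through a small decoder. The sketch below is a minimal hand-rolled parser, not the protobuf library: it handles only the varint and length-delimited wire types that appear in this carve, and the type and function names are hypothetical. The field numbers (name = 1, number = 3, label = 4, type = 5, type_name = 6) are those of FieldDescriptorProto in descriptor.proto.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct FieldInfo {
    std::string name;           // FieldDescriptorProto field 1
    std::uint64_t number = 0;   // field 3
    std::uint64_t label = 0;    // field 4
    std::uint64_t type = 0;     // field 5
    std::string type_name;      // field 6
};

// Read one base-128 varint; returns false on truncation.
inline bool ReadVarint(const std::vector<std::uint8_t>& b, std::size_t& pos, std::uint64_t& out) {
    out = 0;
    for (int shift = 0; shift < 64 && pos < b.size(); shift += 7) {
        const std::uint8_t byte = b[pos++];
        out |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) return true;
    }
    return false;
}

inline bool ParseFieldDescriptor(const std::vector<std::uint8_t>& b, FieldInfo& f) {
    std::size_t pos = 0;
    while (pos < b.size()) {
        std::uint64_t tag = 0;
        if (!ReadVarint(b, pos, tag)) return false;
        const std::uint64_t field = tag >> 3, wire = tag & 7;
        if (wire == 0) {                       // varint payload
            std::uint64_t v = 0;
            if (!ReadVarint(b, pos, v)) return false;
            if (field == 3) f.number = v;
            else if (field == 4) f.label = v;
            else if (field == 5) f.type = v;
        } else if (wire == 2) {                // length-delimited payload
            std::uint64_t len = 0;
            if (!ReadVarint(b, pos, len) || len > b.size() - pos) return false;
            const auto first = b.begin() + static_cast<std::ptrdiff_t>(pos);
            const std::string s(first, first + static_cast<std::ptrdiff_t>(len));
            pos += static_cast<std::size_t>(len);
            if (field == 1) f.name = s;
            else if (field == 6) f.type_name = s;
        } else {
            return false;                      // other wire types do not occur in this carve
        }
    }
    return true;
}

// Rebuild the byte sequence quoted in the text: 0a07 "offload" 1802 2001 280e 3216 <type name>.
inline std::vector<std::uint8_t> OffloadFieldCarve() {
    std::vector<std::uint8_t> b = {0x0a, 0x07};
    const std::string name = "offload";
    b.insert(b.end(), name.begin(), name.end());
    const std::uint8_t mid[] = {0x18, 0x02, 0x20, 0x01, 0x28, 0x0e, 0x32, 0x16};
    b.insert(b.end(), mid, mid + sizeof(mid));
    const std::string type_name = ".xla.jellyfish.Offload";
    b.insert(b.end(), type_name.begin(), type_name.end());
    return b;
}
```

Parsing the rebuilt bytes yields number 2, label 1 (optional), type 14 (TYPE_ENUM), and a 0x16-byte type name, matching the carve.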

GOTCHA — offload is a backend-config enum, not a custom-call target or MLIR op kind. The op-type classification keyed by GetSparseCoreConfig is a proto field on the instruction's backend config. This is a different mechanism from the plain SparseCoreAsyncTracker, which keys on the custom-call target string ("AllToAllDynamic") via a separate enum (SparseCoreOperationType) — see The Plain Tracker Keys on a Target Name, Not offload. Do not conflate the two: offload (this page) and SparseCoreOperationType are distinct enums with distinct value spaces and distinct consumers.


The xla::jellyfish::Offload Enum

Decoded byte-exact from the EnumDescriptorProto (each value is a 10 NN EnumValueDescriptorProto). All nine enumerator name strings (OFFLOAD_UNSPECIFIED .. OFFLOAD_COMPUTE) are present in .rodata.

| value | enumerator | semantic |
|---|---|---|
| 0 | OFFLOAD_UNSPECIFIED | unset / default |
| 1 | OFFLOAD_EMBEDDING | embedding lookup/update offload |
| 2 | OFFLOAD_GATHER | gather op-class |
| 3 | OFFLOAD_SCATTER | scatter op-class |
| 4 | OFFLOAD_COLLECTIVE | collective (recurse into wrapped op) |
| 5 | OFFLOAD_DATA_FORMATTING | data-formatting op-class |
| 6 | OFFLOAD_KERNEL | generic SC kernel op-class |
| 7 | OFFLOAD_SORT | sort op-class |
| 8 | OFFLOAD_COMPUTE | compute offload |

Scheduler Consumer — MayAddSparseCoreResource @ 0x11000480

TpuAsyncTracker::MayAddSparseCoreResource is the producer that turns an offloaded async op into scheduler resource ids. It first re-applies the same thread guard (async + thread "sparsecore" / kSparseCoreOffloadCandidateThread, else return), calls GetSparseCoreConfig, then switches on offload only if the has-bit is set. The decompiled switch is written on the offload value itself; in the machine code an add 0xfffffffe rebases that value, so the jump table is indexed by enum − 2 and covers values {2..7}.

// MayAddSparseCoreResource(this, async_start, &out)  @ 0x11000480  (condensed)
SparseCoreConfig cfg; GetSparseCoreConfig(&cfg, async_chain_start(async_start));
int usage = /* 2 = kResourceOccupy on start, 1 = kResourceRelease on done */;
if (cfg.has_bits & 0x4) {                       // v55 & 4  → has_offload()
    switch (cfg.offload) {                      // v56 = [&cfg + 0x24]
        case 2: out.push_back({23, usage}); break;  // OFFLOAD_GATHER          → kSparseCoreGather
        case 3: out.push_back({24, usage}); break;  // OFFLOAD_SCATTER         → kSparseCoreScatter
        case 4: MayAddSparseCoreResource$_0(&out,    // OFFLOAD_COLLECTIVE      → recurse on
                  async_wrapped_instruction(async_start), usage); break;  //    async-wrapped op
        case 5: out.push_back({25, usage}); break;  // OFFLOAD_DATA_FORMATTING → kSparseCoreDataFormatting
        case 6: out.push_back({26, usage}); break;  // OFFLOAD_KERNEL          → kSparseCoreKernel
        case 7: out.push_back({27, usage}); break;  // OFFLOAD_SORT            → kSparseCoreSort
        default: break;                             // 0/1/8 → no {23..27} arm
    }
}
// independent of the offload switch — the general per-core SparseCore resource (id 22):
if (this[+0x13b] == 1) {                         // a1+315 — "per-core" gate
    for (i = 0; i < GetNumSparseCoresUsed(async_chain_start(async_start)); i++)
        out.push_back({22, usage});              // id 22 once per used SC core
} else {
    out.push_back({22, usage});                  // id 22 once
}

| offload | enumerator | scheduler arm (MayAddSparseCoreResource, idx enum − 2) |
|---|---|---|
| 0 | OFFLOAD_UNSPECIFIED | no arm; id 22 path only |
| 1 | OFFLOAD_EMBEDDING | no arm; id 22 path only |
| 2 | OFFLOAD_GATHER | id 23 kSparseCoreGather |
| 3 | OFFLOAD_SCATTER | id 24 kSparseCoreScatter |
| 4 | OFFLOAD_COLLECTIVE | recurse into async_wrapped_instruction ($_0 @ 0x110008e0) |
| 5 | OFFLOAD_DATA_FORMATTING | id 25 kSparseCoreDataFormatting |
| 6 | OFFLOAD_KERNEL | id 26 kSparseCoreKernel |
| 7 | OFFLOAD_SORT | id 27 kSparseCoreSort |
| 8 | OFFLOAD_COMPUTE | no arm; id 22 path only |

The resource ids {22..27} are the SparseCore engine classes in the 47-id scheduler ResourceType enum; see ResourceType Taxonomy for their concurrency caps and hazard classes.

QUIRK — OFFLOAD_COLLECTIVE recurses, it does not emit a lane. Value 4 is the one arm that does not push a kSparseCore* id. Instead it re-enters MayAddSparseCoreResource (the $_0 lambda at 0x110008e0) on the async-wrapped instruction, so a collective wrapped inside an SC-offload async op is classified by the wrapped op's own offload field. A reimplementation that maps value 4 to a resource id will double-count or mis-throttle wrapped collectives.

NOTE — the id-22 general path is independent of offload. The per-core kSparseCore (id 22) emission runs after the offload switch and is gated by a separate byte (this+0x13b == 1), which selects "one id-22 per used SC core" (GetNumSparseCoresUsed) vs "one id-22". So even OFFLOAD_UNSPECIFIED/EMBEDDING/COMPUTE ops — which hit no {23..27} arm — still consume the general SparseCore resource. The offload enum refines which sub-engine; id 22 is the always-present per-core occupancy.
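
For reimplementation, the scheduler classifier can be sketched as a small pure function. Everything here other than the enum values and the resource ids is a simplification: AsyncOp is a hypothetical stand-in for the HLO instruction, std::optional models the has-bit, and the COLLECTIVE arm is assumed to classify the wrapped op by its own offload value, as described in the QUIRK above.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

enum class Offload : std::int32_t {
    kUnspecified = 0, kEmbedding = 1, kGather = 2, kScatter = 3, kCollective = 4,
    kDataFormatting = 5, kKernel = 6, kSort = 7, kCompute = 8,
};

// Hypothetical stand-in for an async-start instruction.
struct AsyncOp {
    std::optional<Offload> offload;     // nullopt models a clear has-bit
    const AsyncOp* wrapped = nullptr;   // async-wrapped instruction, if any
    int sparse_cores_used = 1;          // stand-in for GetNumSparseCoresUsed
};

using ResourceUse = std::pair<int, int>;  // {scheduler resource id, usage}

// Emit the {23..27} lane for this op, recursing through COLLECTIVE wrappers.
inline void AddLane(const AsyncOp& op, int usage, std::vector<ResourceUse>& out) {
    if (!op.offload) return;
    switch (*op.offload) {
        case Offload::kGather:         out.push_back({23, usage}); break;
        case Offload::kScatter:        out.push_back({24, usage}); break;
        case Offload::kCollective:     if (op.wrapped) AddLane(*op.wrapped, usage, out); break;
        case Offload::kDataFormatting: out.push_back({25, usage}); break;
        case Offload::kKernel:         out.push_back({26, usage}); break;
        case Offload::kSort:           out.push_back({27, usage}); break;
        default: break;                // 0, 1, 8: no lane
    }
}

// Lane first, then the general id-22 resource, once or once per used core.
inline std::vector<ResourceUse> ClassifySketch(const AsyncOp& op, int usage, bool per_core) {
    std::vector<ResourceUse> out;
    AddLane(op, usage, out);
    const int n = per_core ? op.sparse_cores_used : 1;
    for (int i = 0; i < n; ++i) out.push_back({22, usage});
    return out;
}
```

With usage 2 (occupy), a scatter op yields {24, 2} followed by {22, 2}, while an embedding op yields only {22, 2}.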


Reservation Consumer — GetSparseCoreResources @ 0x10fdc0a0

The reservation-map twin reads the same SparseCoreConfig.offload field (same +0x24 value, same +0x10 & 4 has-bit), but indexes with enum − 1 (a dec then cmp 6), so its live range is {1..7} — it additionally covers OFFLOAD_EMBEDDING (value 1) in its own arm.

// (anon)::GetSparseCoreResources(async_start)  @ 0x10fdc0a0  (condensed)
SparseCoreConfig cfg; GetSparseCoreConfig(&cfg, async_start);
if ((cfg.has_bits & 4) != 0) {                  // v29[16] & 4  → has_offload()
    switch (cfg.offload) {                      // index = enum, table base enum−1, range {1..7}
        case 1: /* OFFLOAD_EMBEDDING → embedding/general reservation arm */ ...
        case 2: /* OFFLOAD_GATHER          → kSparseCoreGather   */ ...
        case 3: /* OFFLOAD_SCATTER         → kSparseCoreScatter  */ ...
        case 4: /* OFFLOAD_COLLECTIVE      → collective arm      */ ...
        case 5: /* OFFLOAD_DATA_FORMATTING → kSparseCoreDataFmt  */ ...
        case 6: /* OFFLOAD_KERNEL          → kSparseCoreKernel   */ ...
        case 7: /* OFFLOAD_SORT            → kSparseCoreSort     */ ...
    }
}

GOTCHA — scheduler indexes enum − 2, reservation indexes enum − 1. The two consumers read the identical proto field at the identical offset but with a one-off difference in jump-table base. The scheduler (MayAddSparseCoreResource) starts its dense table at value 2 (OFFLOAD_GATHER), so OFFLOAD_EMBEDDING (1) hits no scheduler arm; the reservation map (GetSparseCoreResources) starts at value 1, so it does reserve for embedding. A reimplementer must keep both index bases: do not assume the two classifiers share a switch.
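
The one-off difference is easiest to see as arithmetic on the rebased index. This sketch computes the jump-table slot each consumer would use, or -1 when the value falls outside the table. The reservation bound (cmp 6) is from the disassembly described above; the scheduler bound of 5 is an assumption inferred from its {2..7} range, and the function names are hypothetical.

```cpp
#include <cstdint>

// Scheduler: add 0xfffffffe (that is, enum - 2), table covers {2..7}.
constexpr int SchedulerSlot(std::int32_t offload) {
    const std::uint32_t idx = static_cast<std::uint32_t>(offload) - 2u;
    return idx <= 5u ? static_cast<int>(idx) : -1;  // bound assumed from the range
}

// Reservation map: dec (enum - 1) then cmp 6, table covers {1..7}.
constexpr int ReservationSlot(std::int32_t offload) {
    const std::uint32_t idx = static_cast<std::uint32_t>(offload) - 1u;
    return idx <= 6u ? static_cast<int>(idx) : -1;
}

static_assert(SchedulerSlot(1) == -1, "EMBEDDING has no scheduler arm");
static_assert(ReservationSlot(1) == 0, "EMBEDDING is the first reservation arm");
```

The unsigned subtraction is what makes values below the table base wrap to a large index and fail the bound check, so OFFLOAD_UNSPECIFIED (0) misses both tables.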

See SC Queue Assignment & Reservation for the reservation-map's resource→limit structure.


The SC-Offload Gate Bits

The SparseCore-offload scheduler sub-pass runs only when SparseCoreCompiler::RunHloScheduler (@ 0x1306f820) finds the gate predicate true. Two of its conjuncts are Target bitfield reads:

// SparseCoreCompiler::RunHloScheduler gate  @ 0x1306f820  (object offsets in bytes)
runSC =  TpuChipConfig::Megachip( Target[+0x3b8][+0x18] )                 // @0x1306f84c
      && *(int*)( Target[+0x3b8] + 0x94 ) > 0       // CoresPerChip(kSparseCore) > 0  @0x1306f863
      && ( (Target[+0x628] & 4) != 0  ||  Target[+0x540] != 0 )           // the two gate bits
      && offloader_util::ModuleContainsLEMSparseCoreInstruction(M)        // @0x1306fbc8
      && FLAGS_xla_sc_enable_latency_hiding_scheduler;                    // @0x1306fc04

In the decompile this is (*((_BYTE*)this + 1576) & 4) != 0 || *((_BYTE*)this + 1344) — byte 1576 = 0x628, byte 1344 = 0x540. The *(int*)(... + 148) > 0 is the 0x94 CoresPerChip(SC) read. The whole predicate appears twice (the eager check and the SC-path re-test).
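
The gate reduces to a five-way conjunction with one inner disjunction. The sketch below restates it over a flattened input struct; GateInputs and its field names are hypothetical, standing in for the Target reads, the module scan, and the flag.

```cpp
#include <cstdint>

// Hypothetical flattened view of everything the gate reads.
struct GateInputs {
    bool megachip;                        // TpuChipConfig::Megachip(...)
    int sparse_cores_per_chip;            // *(int*)(Target[+0x3b8] + 0x94)
    std::uint64_t target_628;             // has-bits style qword at Target[+0x628]
    bool target_540;                      // Target[+0x540], platform_type == 2
    bool module_has_lem_sc_instruction;   // ModuleContainsLEMSparseCoreInstruction
    bool flag_sc_enable_lhs;              // FLAGS_xla_sc_enable_latency_hiding_scheduler
};

constexpr bool ShouldRunScOffloadScheduler(const GateInputs& g) {
    return g.megachip
        && g.sparse_cores_per_chip > 0
        && ((g.target_628 & 4) != 0 || g.target_540)
        && g.module_has_lem_sc_instruction
        && g.flag_sc_enable_lhs;
}

// The decimal offsets in the decompile match the hex offsets used on this page.
static_assert(1576 == 0x628 && 1344 == 0x540 && 148 == 0x94, "offset arithmetic");
```

Note that bit 0 of the +0x628 qword alone does not satisfy the gate; only bit 2 (mask 0x4) or the platform-type byte does.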

Where the bits are set — Target::Init @ 0x1d60fc20

Both fields are written inside jellyfish::Target::Init. The register r12/v342 is the Target* being initialized; v98/*v98 is the first scalar of the TpuTopology (the platform-type enum).

// jellyfish::Target::Init  @ 0x1d60fc20  (relevant writes)
Target[+0x540] = (TpuTopology[+0] == 2);     // platform_type == 2   (decompile: _R12+1344)
Target[+0x541] = (TpuTopology[+0] == 1);     // platform_type == 1   (decompile: _R12+1345)
// inside the predicate-gated config-append loop:
Target[+0x628] |= 1;                         // bit-0  (config sub-field A has-bit)  @0x1d611d52
Target[+0x628] |= 4;                         // bit-2  (SC-offload-capability has-bit) @0x1d612121
  • Target[+0x540] is a bool = (TpuTopology[+0] == 2). The first TpuTopology scalar is the internal platform-type enum; value 2 is the iss (simulator) path (platform_type == 1 lands in the sibling byte Target[+0x541]). So Target[+0x540] != 0 force-takes the SC path on the simulator regardless of the capability bit.
  • Target[+0x628] is a _has_bits_-style qword (decompile *((_QWORD*)_R12 + 197) — qword index 197 = byte 0x628). Bit-2 (mask 0x4) is OR'd in inside an unrolled config-append loop that is itself gated by the SC-offload feature-detect; it is the SC-offload-capability has-bit, set for the eligible (newest-gen) part. Bit-0 (mask 0x1) is OR'd earlier in the same loop for a sibling config sub-field. The gate predicate (Target[+0x628] & 4) == 0 → read Target[+0x540] is replayed verbatim inside Target::Init itself (combined with Megachip ∧ CoresPerChip(SC) > 0) — the SC-offload feature-detect.

| gate bit | object offset | meaning | set in Target::Init |
|---|---|---|---|
| Target[+0x628] & 4 (bit-2) | +0x628 qword | SC-offload-capability has-bit (per-gen, predicate-gated) | OR'd \|= 4 @ 0x1d612121 |
| Target[+0x540] | +0x540 byte | platform_type == 2 (iss/simulator) | = (TpuTopology[+0] == 2) @ 0x1d610b1b |
| Target[+0x541] | +0x541 byte | platform_type == 1 (sibling, not in gate) | = (TpuTopology[+0] == 1) @ 0x1d610b29 |
| Target[+0x628] & 1 (bit-0) | +0x628 qword | sibling config-append has-bit (not in gate) | OR'd \|= 1 @ 0x1d611d52 |

NOTE — the exact proto sub-field naming bit-2 was not isolated. The bit-set site, mask, and value are byte-exact, and the bit sits in the predicate-gated config-append loop alongside bit-0 and two SSO strings (object +0x580/+0x5f0). But the single descriptor entry that names this SC-offload-capability sub-field (a nested config field copied from the chip's vector_isa/TpuSequencerParts) was not pinned to one descriptor. The role — an SC-offload-capability flag the scheduler gate reads — is byte-exact regardless. Confidence: bit position CONFIRMED; sub-field proto name INFERRED.

GOTCHA — platform_type enum order is descriptor-string order. TpuTopology[+0] is the topology's first scalar = the internal TpuPlatformType enum (per the ValidateArgs(TpuPlatformType, …) signature and the proven TpuTopology[+0x8] = TpuChipParts* layout). The gate compares it == 2; the value→name pairing {0 hardware, 1 grm, 2 iss} is taken from descriptor-string order (TpuPlatformTypeToProto = type + 1), not a separately decoded platform_type() getter. Confidence: the == 2 comparison and its gate role CONFIRMED; the enum value→name pairing INFERRED.


Per-Generation Default Basis — TpuVersion == 5

The two SC-offload-concurrency knobs that feed the scheduler default to enabled on exactly one chip generation. Both compute their hardware default as TpuChipParts[+0] == 5 (i.e. tpu::TpuVersion == 5), then let a TpuCompilationEnvironment AutoOr<bool> flag override via the 0x100 "is-set" bit.

// ShouldEnableConcurrentSparseCoreOffloading(tce_view, topo, b)  @ 0x1d6b6f80
hw_default = (TpuChipParts[+0] == 5) & ~b;          // *(_DWORD*)(topo+8) == 5  → TpuVersion 5
flag = tce[+0x458] ? tce[+0x458] : &AutoProto_globals_;   // a1 + 1112 = 0x458
v = AutoOr<bool>::FromProtoOrDie(flag);
return (v & 0x100) ? /*flag set → use flag value*/ : hw_default;

// EnableSparseCoreOffloadQueuingInLhs(tce_view, topo)  @ 0x1d6b81e0
hw_default = (TpuChipParts[+0] == 5);               // *(_DWORD*)(topo+8) == 5
flag = tce[+0x730] ? tce[+0x730] : &AutoProto_globals_;   // a1 + 1840 = 0x730
... same AutoOr<bool> 0x100 override ...

TpuChipParts[+0] is the TpuVersion (the 0-based chip-generation enum); TpuChipParts::ToProto → TpuVersionToProto(v) = v + 1 (confirmed: the decompiled body is literally return v + 1). The codename table from TpuVersionFromString (@ 0x20b3a5a0, init-list @ 0x220117b0):

| TpuVersion (internal) | codename | proto value (= internal + 1) |
|---|---|---|
| 0 | jellyfish | 1 |
| 1 | dragonfish | 2 |
| 2 | pufferfish | 3 |
| 3 | viperfish | 4 |
| 4 | ghostlite | 5 |
| 5 | 6acc60406 | 6 ← SC-offload concurrency default ON |

All six codenames are present verbatim in .rodata (init-list off_220117B0; the value-5 string "6acc60406" lives at .rodata VMA 0x863f0cf, len 9, paired with TpuVersion == 5). See TPU Version Codename Matrix for the full generation map.
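
The per-generation default and its override can be restated compactly. This is a sketch under stated assumptions: the enumerator names are hypothetical spellings of the codenames, and while the 0x100 "is-set" bit is from the decompile, carrying the flag's value in bit 0 of the packed result is an assumption, since the page elides how the value is read once the flag is set.

```cpp
#include <cstdint>

enum class TpuVersion : std::int32_t {
    kJellyfish = 0, kDragonfish = 1, kPufferfish = 2,
    kViperfish = 3, kGhostlite = 4, k6acc60406 = 5,
};

// Matches the decompiled body: return v + 1.
constexpr std::int32_t TpuVersionToProto(TpuVersion v) {
    return static_cast<std::int32_t>(v) + 1;
}

// hw_default = (TpuVersion == 5) & ~b, with b the extra bool argument.
constexpr bool ConcurrentOffloadHwDefault(TpuVersion v, bool b) {
    return v == TpuVersion::k6acc60406 && !b;
}

// hw_default = (TpuVersion == 5).
constexpr bool QueuingInLhsHwDefault(TpuVersion v) {
    return v == TpuVersion::k6acc60406;
}

// Bit 0x100 set means the flag was given explicitly and wins over the default.
// ASSUMPTION: the flag's boolean value sits in bit 0 of the packed word.
constexpr bool ResolveAutoOr(std::uint32_t packed, bool hw_default) {
    return (packed & 0x100u) != 0 ? (packed & 0x1u) != 0 : hw_default;
}

static_assert(TpuVersionToProto(TpuVersion::k6acc60406) == 6, "proto = internal + 1");
```

An unset flag (packed word without 0x100) leaves the hardware default in force, so only the TpuVersion == 5 part defaults to concurrency on.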

NOTE — the override flag field numbers were not decoded. Both knobs read an AutoOr<bool> at TCE _impl_ offsets 0x458 (concurrency) / 0x730 (offload-queuing-in-LHS), falling back to AutoProto_globals_ when unset; the offsets and the 0x100-bit override are byte-exact, but the two proto field numbers (the _InternalSerialize tags at those offsets) were not isolated. Confidence: offsets + override mechanism CONFIRMED; field numbers PARTIAL. See TpuCompilationEnvironment and TCE Field Offsets & Defaults.


The Plain Tracker Keys on a Target Name, Not offload

Not every SparseCore scheduler keys on GetSparseCoreConfig. When the gate holds, the SC-offload schedule is produced by a plain SparseCoreAsyncTracker first, and that tracker classifies async-schedulable ops by opcode + custom-call target name, not the offload backend enum. This is the cleanest way to see that offload is one of two distinct SC classification mechanisms in the binary.

// SparseCoreAsyncTracker::IsSupportedAsyncStart(hlo)  @ 0x134964c0
bool IsSupportedAsyncStart(const HloInstruction *h) {
    int op = h->opcode();                          // [h+0xc]
    if (op == 12) return true;                     // 0xc  all-to-all
    if (op == 17) return true;                     // 0x11 async-start
    if (op == 49)                                  // 0x31 custom-call
        return SparseCoreOperationTypeFromString(h->custom_call_target()) == 8;  // "AllToAllDynamic"
    return false;
}
// IsSupportedAsyncDone @ 0x13496520 is identical except op 16 (0x10 async-done) replaces 17.

SparseCoreOperationTypeFromString (@ 0x14b7f060) is a chained EqualsIgnoreCase mapper over a separate enum, SparseCoreOperationType, whose first eight values are confirmed in order: 1 SparseMap, 2 CooToCsr, 3 CooToEll, 4 SparseMapRow, 5 SortLexicographic, 6 ReduceDuplicates, 7 EllToCsr, 8 AllToAllDynamic (then 9 ScSendToTc, 10 ScReceiveFromTc, … continuing well past 8). The plain tracker gates only on == 8 ("AllToAllDynamic") — it overlaps SC all-to-all ops, and (via PostProcessScheduleGraphFindNearestAllToAlls @ 0x13496600) biases the schedule toward them.
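
The plain tracker's classification can be sketched as a case-insensitive name lookup followed by an opcode test. Only the eight confirmed names are included; the values from 9 onward are left out because the list above is elided. Returning 0 for an unrecognized name is an assumption, and the function names are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// ASCII case-insensitive equality, standing in for EqualsIgnoreCase.
inline bool EqualsIgnoreCaseAscii(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) {
            return false;
        }
    }
    return true;
}

// Values 1..8 in the confirmed order; 0 for anything else (assumption).
inline int SparseCoreOperationTypeFromStringSketch(std::string_view s) {
    static constexpr std::string_view kNames[] = {
        "SparseMap", "CooToCsr", "CooToEll", "SparseMapRow",
        "SortLexicographic", "ReduceDuplicates", "EllToCsr", "AllToAllDynamic",
    };
    for (int i = 0; i < 8; ++i) {
        if (EqualsIgnoreCaseAscii(s, kNames[i])) return i + 1;
    }
    return 0;
}

// Opcode test from IsSupportedAsyncStart: 12 all-to-all, 17 async-start,
// 49 custom-call (accepted only for the AllToAllDynamic target).
inline bool IsSupportedAsyncStartSketch(int opcode, std::string_view custom_call_target) {
    if (opcode == 12 || opcode == 17) return true;
    if (opcode == 49) return SparseCoreOperationTypeFromStringSketch(custom_call_target) == 8;
    return false;
}
```

Nothing in this path reads the backend config, which is the practical difference from the offload-keyed classifiers.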

| classifier | keyed on | enum | values used |
|---|---|---|---|
| MayAddSparseCoreResource (this page) | SparseCoreConfig.offload (field 2) | xla::jellyfish::Offload | {2..7} → ids {23..27} + recurse |
| GetSparseCoreResources (reservation) | SparseCoreConfig.offload (field 2) | xla::jellyfish::Offload | {1..7} |
| SparseCoreAsyncTracker::IsSupportedAsync{Start,Done} | custom-call target string | SparseCoreOperationType | == 8 (AllToAllDynamic) |

QUIRK — two enums, one word "SparseCore op type". xla::jellyfish::Offload (a backend-config enum, 9 values, drives resource lanes) and SparseCoreOperationType (a custom-call target-name enum, ≥ 8 values, drives async-schedulability) are easy to conflate because both describe "what kind of SparseCore op this is". They are wholly separate: different namespaces, different value spaces, different read paths, different consumers. GetSparseCoreConfig resolves the first; custom_call_target() + SparseCoreOperationTypeFromString the second.


Worked Example — An OFFLOAD_SCATTER Async Op on a Newest-Gen Part

A SparseCore-offloaded scatter, on a Megachip newest-generation (TpuVersion == 5) part:

%sc = async-start(%coo), execution_thread="sparsecore",
        backend_config = { sparse_core_config { offload: OFFLOAD_SCATTER } }
%sc.d = async-done(%sc)

Walking the resolver and the classifiers:

  • Gate. Target::Init set Target[+0x628] |= 4 for the newest-gen part, so (Target[+0x628] & 4) != 0 is true; with Megachip, CoresPerChip(SC) > 0, an LEM SparseCore instruction in the module, and the LHS flag on, SparseCoreCompiler::RunHloScheduler enters the SC-offload sub-passes.
  • Resolve. For %sc, MayAddSparseCoreResource re-checks the thread ("sparsecore", len 10 — passes), calls GetSparseCoreConfig, which copy-constructs the SparseCoreConfig. has_offload() (+0x10 & 4) is set; offload (+0x24) reads 3 (OFFLOAD_SCATTER).
  • Classify. The switch on 3 emits {24, kResourceOccupy} — scheduler resource id 24 (kSparseCoreScatter) — on the async-start, and the matching {24, kResourceRelease} on %sc.d. Independently, the id-22 path emits the general per-core kSparseCore occupancy.
  • Throttle. The scheduler caps concurrent id-24 ops at GetNumAvailableResources(24) (the ..._scatter_overlap_limit TCE knob) and treats id 24 as unsharable (hazard 0) — see ResourceType Taxonomy.
  • Default-instance contrast. Had the backend config omitted sparse_core_config, GetSparseCoreConfig would return the default instance, has_offload() would be clear, and %sc would consume only the general id-22 lane — no scatter-specific throttle.

This is exactly why offload is the SC op-type classifier: it is the one proto field that refines a generic SparseCore async op into a specific engine-class resource the scheduler can model.


Confidence Summary

| Claim | Evidence |
|---|---|
| GetSparseCoreConfig returns a copy-constructed SparseCoreConfig (globals fallback) | 0x1c868d20: SparseCoreConfig::SparseCoreConfig(this, 0, src), SparseCoreConfig_globals_ |
| Thread guard: "sparsecore" (len 10) ∨ kSparseCoreOffloadCandidateThread (len 28); opcode kAsyncStart | 0x1c868d20 SIMD compare + CHECK strings (src 472/475) |
| offload = field 2, enum .xla.jellyfish.Offload, object +0x24, has-bit +0x10 & 4 | _InternalSerialize @ 0x1d6dfae0; FieldDescriptorProto carve; consumer reads |
| Offload enum OFFLOAD_UNSPECIFIED 0 .. OFFLOAD_COMPUTE 8 (9 values) | EnumDescriptorProto @ 0xbfa1f9f; all 9 strings in .rodata |
| Scheduler arm map {2→23, 3→24, 4→recurse, 5→25, 6→26, 7→27}, idx enum − 2 | MayAddSparseCoreResource @ 0x11000480 switch (decompiled) |
| Reservation map reads same field, idx enum − 1, range {1..7} (covers EMBEDDING) | GetSparseCoreResources @ 0x10fdc0a0: v29[16] & 4, case 1..7 |
| id-22 (kSparseCore) path is independent of offload, gated this+0x13b == 1 | 0x11000480: post-switch loop on GetNumSparseCoresUsed |
| Gate predicate (Target[+0x628] & 4) ∨ Target[+0x540] plus Megachip/CoresPerChip(SC)/LEM/flag | RunHloScheduler @ 0x1306f820: this+1576 & 4, this+1344, +148 > 0 |
| Target[+0x540] = (TpuTopology[+0] == 2); Target[+0x541] = (== 1) | Target::Init @ 0x1d60fc20 lines _R12+1344/+1345 = *v98 == 2/1 |
| Target[+0x628] \|= 4 (bit-2) and \|= 1 (bit-0) in predicate-gated config loop | Target::Init: v342+197 = … \| 4 / \| 1 (qword 197 = 0x628) |
| The proto sub-field that names bit-2 (SC-offload-capability) | bit-set site byte-exact; descriptor entry not isolated |
| platform_type value→name pairing {0 hardware, 1 grm, 2 iss} | descriptor-string order + ToProto = type+1; == 2 comparison byte-exact |
| Per-gen default TpuVersion == 5; AutoOr<bool> 0x100 override at TCE +0x458/+0x730 | 0x1d6b6f80 (*(topo+8)==5, a1+1112); 0x1d6b81e0 (a1+1840) |
| TCE override field numbers for the two SC-offload knobs | offsets + override byte-exact; field numbers not decoded |
| TpuVersion 0..5 = jellyfish/dragonfish/pufferfish/viperfish/ghostlite/6acc60406; proto = +1 | TpuVersionFromString init-list @ 0x220117b0 (v5 string "6acc60406" @ .rodata 0x863f0cf); TpuVersionToProto body v+1 |
| Plain SparseCoreAsyncTracker keys on opcode {0xc, 0x11/0x10, 0x31} + SparseCoreOperationType == 8 | IsSupportedAsyncStart/Done @ 0x134964c0/0x13496520; FromString @ 0x14b7f060 |
| SparseCoreOperationType values 1..8 = SparseMap..AllToAllDynamic | 0x14b7f060 chained EqualsIgnoreCase (decompiled, in order) |

Cross-References