Resource Enum
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.
Abstract
The TPU cost model models a VLIW bundle as a vector of functional-unit occupancies. ResourceVector is a flat array of 23 double slots, one per abstract hardware resource: the two MXU pipes (matmul-issue and gain-push), the cross-lane unit, the two-plus-one vector ALU lanes, the vector EUP / load / store ports, four memory-transfer terms, six ICI ring links, and three SparseCore engines. Each LLO instruction deposits its throughput cycles into the slot named by CycleTable::GetResource, and the bundle's issue cost is a structured reduction — MaxResourceCycles — over the whole vector. This page recovers the enum value→name table, the per-gen slot counts, and the reduction's overlap model.
The reader should not confuse this with the scheduler's ResourceType enum. They are two different "resource" abstractions in the same binary, and conflating them is the central trap. The cost-model Resource (this page, 23 slots, an enum nested in ResourceVector) measures cycle weight — how many cycles a unit is busy. The scheduler ResourceType (ResourceType Taxonomy, 47 IDs, an AsyncTracker enum) measures concurrency limits — how many async collectives of a kind may co-issue. They overlap semantically (both name the six ICI links), but they are distinct enums with distinct value spaces, distinct print paths, and distinct consumers. This page documents the cost-model Resource; the contrast with ResourceType is drawn at the end.
The names come from one place: ResourceVectorToString @ 0x1c89bde0 reads the 22 leading doubles in physical-offset order and formats them with a single printf template, so the string order is the enum order. The 23rd slot (R[22]) is a real, Acc-writable, MaxResourceCycles-read slot, but ToString never prints it — it has no name string in the binary.
For reimplementation, the contract is:
- The 23-slot Resource enum value→name→offset table and the Acc bound that fixes the count at 23.
- The MaxResourceCycles reduction: which slots overlap (plain MAX), which blend at 50% (the vector-ALU port balance), and which serialize (the memory group).
- The per-gen slot counts (PF=20, VF=28, GL=31 from named kResources symbols; a fourth 31-slot table is present but its gen attribution is unconfirmed) and how the wider gens extend the enum past R[22].
- The gen-invariant CycleTable::GetResource op→slot table.
- The boundary between this enum and the scheduler's ResourceType enum.
| Property | Value |
|---|---|
| Enum | xla::jellyfish::ResourceVector::Resource (nested) |
| Slot count | 23 (Acc bound < 0x17); ToString names 22 |
| Vector layout | 23 × double (+0x00..+0xb0) + operand-bytes map +0xb8 + output-bytes map +0xd8 |
| Name source | ResourceVectorToString @ 0x1c89bde0 (format string @ 0x87f40ea) |
| Deposit | ResourceVector::Acc(Resource, double) @ 0x1c89adc0 |
| Reduction | ResourceVector::MaxResourceCycles() @ 0x1c89b9e0 |
| Op→slot map | CycleTable::GetResource(Instruction) @ 0x1c89ce20 (table @ 0xb438aec) |
| Per-gen counts | PF=20, VF=28, GL=31 (named kResources symbols); GF=31 (unnamed table; gen attribution inferred, UNVERIFIED) |
The 23-Slot Resource Enum
Names
ResourceVectorToString @ 0x1c89bde0 feeds the 22 leading doubles into the format string at .rodata 0x87f40ea in physical-offset order. The decompile confirms the full template byte-for-byte:
RV[Matpush: %.0f, Matmul: %.0f, Xlu: %.0f, VectorAlu0: %.0f, VectorAlu1: %.0f,
VectorAluAny: %.0f, VectorEup: %.0f, VectorLoad: %.0f, VectorStore: %.0f,
MemXferInputLatency: %.0f, MemXferInputBandwidth: %.0f, MemXferOutputLatency: %.0f,
MemXferOutputBandwidth: %.0f, IciYPlus: %.0f, IciYMinus: %.0f, IciXPlus: %.0f,
IciXMinus: %.0f, IciZPlus: %.0f, IciZMinus: %.0f, ScScs: %.0f, ScTile: %.0f,
ScCollective: %.0f]
Since each double is an 8-byte slot read in order, the enum value equals the array index equals offset / 8:
| idx | offset | name | functional unit |
|---|---|---|---|
| R[0] | +0x00 | Matpush | MXU gain/latch-push (matpush) pipe |
| R[1] | +0x08 | Matmul | MXU matmul-issue (matprep) pipe |
| R[2] | +0x10 | Xlu | cross-lane unit (matres read / EUP / transcendental staging) |
| R[3] | +0x18 | VectorAlu0 | vector ALU lane 0 (dedicated) |
| R[4] | +0x20 | VectorAlu1 | vector ALU lane 1 (dedicated) |
| R[5] | +0x28 | VectorAluAny | vector ALU "any" (load-balanced) lane |
| R[6] | +0x30 | VectorEup | vector extended-precision unit |
| R[7] | +0x38 | VectorLoad | vector load port |
| R[8] | +0x40 | VectorStore | vector store port |
| R[9] | +0x48 | MemXferInputLatency | input-DMA startup-latency term |
| R[10] | +0x50 | MemXferInputBandwidth | input-DMA per-byte bandwidth term |
| R[11] | +0x58 | MemXferOutputLatency | output-DMA startup-latency term |
| R[12] | +0x60 | MemXferOutputBandwidth | output-DMA per-byte bandwidth term |
| R[13] | +0x68 | IciYPlus | ICI ring link +Y |
| R[14] | +0x70 | IciYMinus | ICI ring link −Y |
| R[15] | +0x78 | IciXPlus | ICI ring link +X |
| R[16] | +0x80 | IciXMinus | ICI ring link −X |
| R[17] | +0x88 | IciZPlus | ICI ring link +Z |
| R[18] | +0x90 | IciZMinus | ICI ring link −Z |
| R[19] | +0x98 | ScScs | SparseCore SCS sequencer |
| R[20] | +0xa0 | ScTile | SparseCore tile-execute core |
| R[21] | +0xa8 | ScCollective | SparseCore collective engine |
| R[22] | +0xb0 | (unnamed; valid) | reserved/scalar — read by MaxResourceCycles, written by Acc; never printed |
ResourceVector::Acc(Resource, double) @ 0x1c89adc0 bounds-checks resource >= 0x17 (23) with a trapping ud1, then does vector[resource] += cycles over an 8-byte stride. So 23 slots are valid (R[0..22]); the print path covers only the first 22. The total vector body is 0xf8 bytes: 23 doubles (+0x00..+0xb0), an operand-bytes flat_hash_map (+0xb8), and an output-bytes flat_hash_map (+0xd8).
GOTCHA: R[22] is real but anonymous. Acc accepts it (its bound is < 23, not < 22) and MaxResourceCycles reads [+0xb0] in the plain-MAX group, yet ResourceVectorToString stops at 22 arguments. A reimplementer who sizes the vector at 22 from the printf template alone will buffer-overrun on the first deposit into slot 22 and silently drop a cycle term in the reduction. Size it at 23. Its semantic role (likely a scalar-sequencer / reserved count) has no string evidence (MEDIUM).
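The table and the Acc bound can be captured in a few lines. The following is a minimal Python sketch, not the recovered implementation: the class, method, and helper names are illustrative assumptions, and IndexError merely stands in for the ud1 trap. Only the slot names, their order, the 8-byte stride, and the 23-slot bound come from the text above. The two byte-count maps at +0xb8 and +0xd8 are not modelled.

```python
# Slot names in printf order, copied from the ResourceVectorToString template.
SLOT_NAMES = [
    "Matpush", "Matmul", "Xlu",
    "VectorAlu0", "VectorAlu1", "VectorAluAny",
    "VectorEup", "VectorLoad", "VectorStore",
    "MemXferInputLatency", "MemXferInputBandwidth",
    "MemXferOutputLatency", "MemXferOutputBandwidth",
    "IciYPlus", "IciYMinus", "IciXPlus", "IciXMinus", "IciZPlus", "IciZMinus",
    "ScScs", "ScTile", "ScCollective",
]
NUM_SLOTS = 23   # Acc bound (0x17): one more than the 22 printed names
SLOT_BYTES = 8   # each slot is a double


def slot_offset(index: int) -> int:
    """Byte offset of a slot: enum value == array index == offset / 8."""
    return index * SLOT_BYTES


class ResourceVector:
    """Illustrative stand-in for the 23-double part of the vector."""

    def __init__(self) -> None:
        self.slots = [0.0] * NUM_SLOTS

    def acc(self, resource: int, cycles: float) -> None:
        # The binary traps (ud1) when resource >= 0x17; IndexError is a stand-in.
        if not 0 <= resource < NUM_SLOTS:
            raise IndexError(f"resource {resource} out of range")
        self.slots[resource] += cycles

    def to_string(self) -> str:
        # zip() stops at the 22 names, so slot 22 is never printed.
        body = ", ".join(
            f"{name}: {value:.0f}" for name, value in zip(SLOT_NAMES, self.slots)
        )
        return f"RV[{body}]"
```

Sizing the backing list from NUM_SLOTS rather than from len(SLOT_NAMES) is the point of the sketch: a deposit into slot 22 succeeds even though the string form never shows it.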
The MaxResourceCycles Reduction
MaxResourceCycles @ 0x1c89b9e0 is not a flat maximum — it is three rules layered over the slots, and the layering is the heart of the cost model's overlap assumptions:
function MaxResourceCycles(rv): // sub_1C89B9E0
a = rv[3]; b = rv[4]; c = rv[5] // VectorAlu0, VectorAlu1, VectorAluAny
if c > 0: // (A) 50% port-balance blend
if a > b: d = min(a-b, c); c -= d; b += d // fill the less-busy dedicated lane
elif b > a: d = min(b-a, c); c -= d; a += d
c *= 0.5 // const @0xa2df5c8
a += c; b += c
vec_alu = max(a, b)
mem = rv[9] + rv[10] + rv[11] + rv[12] // (B) MemXfer serial sum
acc = max(vec_alu, mem)
for r in {0,1,2,6,7,8, 13..22}: // (C) plain MAX (independent units overlap)
acc = max(acc, rv[r])
return acc
Three groups:
- (A) 50% blend on {R[3], R[4], R[5]}: the two-lane vector ALU. "Any"-lane work (R[5]) load-balances across the two dedicated lanes (R[3], R[4]): it first tops up whichever dedicated lane is less busy until the two are level, and any remainder is then split evenly, half to each lane (the 0.5 constant). This models a 2-lane ALU, not an XLU blend.
- (B) serial sum on {R[9..12]}: the memory subsystem. Input latency + input bandwidth + output latency + output bandwidth add: a transfer's startup, payload, the store's startup, and its payload happen in sequence.
- (C) plain MAX over everything else, including R[0]=Matpush, R[1]=Matmul, and R[22].
QUIRK: Matpush and Matmul are in the plain-MAX group, so back-to-back MXU operations in one bundle overlap (the MXU is pipelined). The only thing that serializes is the memory transfer (R[9..12]). A naive cost model that serializes matmuls will over-cost MXU-bound bundles by a large factor on every generation. Trust the MAX.
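The three rules are easier to check against concrete numbers. Below is a runnable Python transcription of the pseudocode above, offered as a sketch rather than as the recovered code: the function and helper names are assumptions, and the vector is modelled as a plain list of 23 floats.

```python
from typing import Dict, List, Sequence

NUM_SLOTS = 23
PLAIN_MAX_SLOTS = (0, 1, 2, 6, 7, 8, *range(13, 23))  # group (C)


def max_resource_cycles(rv: Sequence[float]) -> float:
    """Bundle issue cost under the blend / serial-sum / plain-MAX rules."""
    if len(rv) != NUM_SLOTS:
        raise ValueError("expected a 23-slot vector")

    # (A) Vector ALU: level the two dedicated lanes with 'any' work,
    # then give half of whatever 'any' work remains to each lane.
    a, b, c = rv[3], rv[4], rv[5]
    if c > 0:
        if a > b:
            d = min(a - b, c)
            c -= d
            b += d
        elif b > a:
            d = min(b - a, c)
            c -= d
            a += d
        c *= 0.5
        a += c
        b += c
    vec_alu = max(a, b)

    # (B) Memory transfer terms serialize: they are summed, not maxed.
    mem = rv[9] + rv[10] + rv[11] + rv[12]

    # (C) Every other unit overlaps: plain maximum.
    acc = max(vec_alu, mem)
    for r in PLAIN_MAX_SLOTS:
        acc = max(acc, rv[r])
    return acc


def make_vector(deposits: Dict[int, float]) -> List[float]:
    """Hypothetical helper: build a 23-slot vector from {slot: cycles}."""
    rv = [0.0] * NUM_SLOTS
    for slot, cycles in deposits.items():
        rv[slot] += cycles
    return rv


# Matpush=7 and Matmul=9 overlap, so the bundle costs 9, not 16.
mxu_cost = max_resource_cycles(make_vector({0: 7, 1: 9}))
# Memory terms 3+4+5+6 serialize to 18.
mem_cost = max_resource_cycles(make_vector({9: 3, 10: 4, 11: 5, 12: 6}))
# Alu0=10, Alu1=4, Any=8: 6 cycles level Alu1, the remaining 2 split 1+1.
alu_cost = max_resource_cycles(make_vector({3: 10, 4: 4, 5: 8}))
```

With these inputs the three costs come out as 9, 18, and 11 respectively, which matches a hand evaluation of the pseudocode.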
Add / Scale — the Amortized-DMA Model
Two more reductions handle loop scaling. Add (@ 0x1c89b820) MAX-combines the two DMA latency slots {R[9], R[11]} while every other slot ADDs — you pay a DMA startup once across combined sub-vectors, but bytes-moved accumulates. ScaleResource (@ 0x1c89b6a0) skip-scales those same two latency slots when multiplying a per-iteration bundle by a trip count: bandwidth and compute scale linearly, DMA startup does not. The result is a clean amortized-startup model — latency paid once, bandwidth paid per iteration — which the loop emitter exploits via GetSubset (a prologue subset that keeps only the input-DMA slots, a tail subset that keeps compute + output-DMA). See Bundle-Aware Cost for the loop composition.
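A small sketch may help make the amortized-startup rule concrete. It assumes only what the paragraph above states (slots 9 and 11 are MAX-combined by Add and skipped by ScaleResource; every other slot adds or scales linearly). The function names, the list-of-floats representation, and the omission of the byte-count maps and of GetSubset are simplifications of this example, not recovered details.

```python
from typing import List, Sequence

NUM_SLOTS = 23
DMA_LATENCY_SLOTS = frozenset({9, 11})  # MemXferInputLatency, MemXferOutputLatency


def add(lhs: Sequence[float], rhs: Sequence[float]) -> List[float]:
    """Combine two vectors: DMA startup is paid once (MAX), the rest accumulates."""
    return [
        max(x, y) if i in DMA_LATENCY_SLOTS else x + y
        for i, (x, y) in enumerate(zip(lhs, rhs))
    ]


def scale(rv: Sequence[float], trip_count: float) -> List[float]:
    """Multiply a per-iteration vector by a trip count, leaving DMA startup unscaled."""
    return [
        v if i in DMA_LATENCY_SLOTS else v * trip_count
        for i, v in enumerate(rv)
    ]


# One loop iteration: 100 cycles of input-DMA startup, 10 of input bandwidth,
# and 5 cycles on the Matmul pipe.
body = [0.0] * NUM_SLOTS
body[9], body[10], body[1] = 100.0, 10.0, 5.0

four_iterations = scale(body, 4)
```

After scaling by four, the bandwidth slot holds 40 and the Matmul slot 20, while the startup slot still holds 100: latency is charged once, bandwidth and compute per iteration.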
Op→Slot Mapping — CycleTable::GetResource
Each LLO instruction's throughput cycles are deposited into one named slot. The op→slot map is a single gen-invariant flat-table lookup — CycleTable::GetResource(Instruction) @ 0x1c89ce20 is literally return table[Instruction] over .rodata 0xb438aec (4 bytes/entry), not a per-gen virtual. The Instruction value is the ~33-bucket collapse of the LLO opcode space (CycleTable Family).
| CycleTable::Instruction family | example buckets | → Resource slot |
|---|---|---|
| matprep (matmul-issue) | 0..4 (bf16 / fp8 / int8 matprep) | R[1] Matmul |
| gain-latch / push | 5..16 (bf16/int4/fp8 latches, incl. transposed) | R[0] Matpush |
| matrix-result read + EUP + sin/cos | 23, 27, 28, 29, 30, 31 (matres TC/0/1, eup0/1, sincos) | R[2] Xlu |
| input rotate (RotIn/RotOut) | 18, 19 | R[4] VectorAlu1 |
| shuffle / permute | 20 | R[3] VectorAlu0 |
| broadcast / reduce / cross-lane / tan | 21, 22, 25, 32 | R[5] VectorAluAny |
| extended-precision / lane-compare | 17, 24, 26 | R[6] VectorEup |
RecordHloCycles @ 0x130bbfe0 ties it together: r = GetResource(Instruction) (the named slot, gen-invariant), c = vtable[+0x10](Instruction) (per-gen GetCyclesForThroughput), then Acc(r, c). Memory ops bypass this table — the cost_model_util::RecordMemXfer* family deposits directly into R[9..12].
NOTE: the rotate/shuffle/broadcast/reduce family fans across exactly the three VectorAlu slots that the MaxResourceCycles 50%-blend group reduces (R[3]/R[4]/R[5]). That is by design: shuffle pins VectorAlu0, rotate pins VectorAlu1, and the symmetric "any" ops (broadcast/reduce/cross-lane/tan) go to VectorAluAny so the blend can load-balance them.
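The lookup-then-deposit flow can be sketched as a flat table plus a callback. The bucket numbers below are copied from the table in this section (whose column is labelled "example buckets"), so treat them as illustrative rather than as a dump of the .rodata table. The names build_op_to_slot, record_hlo_cycles, and the throughput_cycles callback (standing in for the per-gen GetCyclesForThroughput virtual) are assumptions of this example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

NUM_BUCKETS = 33  # the "~33-bucket" Instruction space, taken here as 0..32

# (buckets, Resource slot) pairs transcribed from the table above.
FAMILIES: List[Tuple[Iterable[int], int]] = [
    (range(0, 5), 1),                  # matprep        -> Matmul
    (range(5, 17), 0),                 # gain-latch/push -> Matpush
    ((23, 27, 28, 29, 30, 31), 2),     # matres / EUP / sincos -> Xlu
    ((18, 19), 4),                     # rotate         -> VectorAlu1
    ((20,), 3),                        # shuffle        -> VectorAlu0
    ((21, 22, 25, 32), 5),             # broadcast etc. -> VectorAluAny
    ((17, 24, 26), 6),                 # ext-precision / lane-compare -> VectorEup
]


def build_op_to_slot() -> List[int]:
    """Expand the family list into a flat bucket -> slot table."""
    table: List[Optional[int]] = [None] * NUM_BUCKETS
    for buckets, slot in FAMILIES:
        for bucket in buckets:
            if table[bucket] is not None:
                raise ValueError(f"bucket {bucket} assigned twice")
            table[bucket] = slot
    if any(slot is None for slot in table):
        raise ValueError("some buckets have no slot")
    return [slot for slot in table if slot is not None]


OP_TO_SLOT = build_op_to_slot()


def record_hlo_cycles(
    rv: List[float],
    instruction: int,
    throughput_cycles: Callable[[int], float],
) -> None:
    """Gen-invariant slot lookup, per-gen cycle lookup, then deposit."""
    slot = OP_TO_SLOT[instruction]
    rv[slot] += throughput_cycles(instruction)
```

Keeping the slot lookup as data and the cycle count as a callback mirrors the split described above: the first is shared across generations, the second is the only per-gen part.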
Per-Gen Slot Counts
The 23-slot enum is the TensorCore ResourceVector print surface, but the per-gen Performance::GetResources()::kResources arrays declare how many slots each gen actually populates — and in what iteration order. Each kResources is a 1-byte-per-entry permutation of the contiguous range [0..N-1], with N = the per-gen resource count. The first three rows are byte-anchored to named symbols (nm -C resolves each address to a …Performance::GetResources() const::kResources symbol); the decoded bytes are permutations of exactly [0..N-1]:
| Gen | kResources address | slot count N | extends past R[22]? |
|---|---|---|---|
| Pufferfish | 0xb43cd94 (named PufferfishPerformance symbol) | 20 | no (no SparseCore / extra-ICI slots) |
| Viperfish | 0xb43cda8 (named ViperfishPerformance symbol) | 28 | yes (adds ICI / Sc sub-slots) |
| Ghostlite | 0xb43cdc4 (named GhostlitePerformance symbol) | 31 | yes |
| 6acc60406 (GF) | 0xb43cde3 (unnamed; inferred) | 31 | yes |
Pufferfish's 20-slot permutation never references R[20]/R[21]/R[22] — it has no SparseCore or extra-ICI resources. The wider gens (VF/GL) extend the enum past R[22] with additional ICI sub-links and SparseCore sub-units; those high-index slots have no name in the TensorCore ResourceVectorToString (it only prints 22). Naming them would require a SparseCore-side print path that this build does not expose through the TensorCore estimator (LOW / not recovered).
UNVERIFIED: the fourth row (0xb43cde3) is not a symbol-anchored gen. nm resolves exactly three Performance::GetResources()::kResources symbols (PF/VF/GL); the next named symbol after Ghostlite is kProgramSharedRegistryInitialized @ 0xb43ce10. The 31 bytes at 0xb43cde3 are a real second permutation of [0..30] (immediately following Ghostlite's 31-byte table, both under the single Ghostlite symbol, i.e. kResources is a [2][31] array Ghostlite indexes by a sub-version discriminant), but no symbol or TpuVersion reference in this build attributes it to 6acc60406 / TPU7x. The 6acc60406 label is a plausible inference from the adjacency and slot count, not a recovered fact.
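Two of the structural claims in this section can be checked mechanically: that the four tables sit back to back, and that a table's bytes form a permutation of [0..N-1]. The sketch below uses the addresses and counts from the table above. The helper names are hypothetical, and because this page does not reproduce the table bytes, the permutation check is exercised only on made-up data.

```python
from typing import List, Tuple

# (label, address, slot count) as listed in the per-gen table.
K_RESOURCES: List[Tuple[str, int, int]] = [
    ("Pufferfish", 0xB43CD94, 20),
    ("Viperfish", 0xB43CDA8, 28),
    ("Ghostlite", 0xB43CDC4, 31),
    ("unattributed", 0xB43CDE3, 31),
]


def is_permutation(table: bytes) -> bool:
    """True if the bytes are exactly the values 0..len(table)-1 in some order."""
    return sorted(table) == list(range(len(table)))


def tables_are_contiguous(entries: List[Tuple[str, int, int]]) -> bool:
    """True if each 1-byte-per-entry table ends where the next one starts."""
    return all(
        addr + count == next_addr
        for (_, addr, count), (_, next_addr, _) in zip(entries, entries[1:])
    )


contiguous = tables_are_contiguous(K_RESOURCES)

# Synthetic data only: a valid 3-entry permutation and an invalid one.
good = is_permutation(bytes([2, 0, 1]))
bad = is_permutation(bytes([0, 0, 2]))
```

The contiguity check passes for the listed addresses, which is the adjacency the UNVERIFIED note relies on; it says nothing about which generation owns the fourth table.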
Boundary — Cost-Model Resource vs Scheduler ResourceType
The binary has a second, larger "resource" enum that a reimplementer will inevitably meet: ResourceType, the LatencyHidingScheduler's AsyncTracker taxonomy. They are not the same enum and must not be merged.
| | Cost-model Resource (this page) | Scheduler ResourceType |
|---|---|---|
| Enum | ResourceVector::Resource | AsyncTracker ResourceType |
| Values | 23 (R[0..22]) | 47 (base {0..12} + target {13..46}) |
| Measures | cycle weight (how busy a unit is) | concurrency cap (how many co-issue) |
| Name source | ResourceVectorToString @ 0x1c89bde0 | GetResourceName @ 0x13616500 (base) / 0x10fff420 (TPU) |
| Consumer | MaxResourceCycles → bundle issue cost | GetNumAvailableResources → co-issue throttle |
| Per-op map | CycleTable::GetResource (LLO opcode → slot) | GetResourceTypeForOp (HLO opcode → rt) |
The two enums overlap on one physical concept — the six ICI ring links. The cost model deposits cycle weight into R[13..18] (Ici{Y,X,Z}{Plus,Minus}); the scheduler caps concurrent issue of the same links via ResourceType ids 14..19 (also kIci{Y,X,Z}{Plus,Minus}), bounded by the xla_tpu_sparse_core_ici_overlap_limit knob (field 1130). A bundle that uses an ICI link both costs cycles (this model) and consumes a concurrency slot (the other model). The scheduler enum, its producers, and its available-count sources are documented in full on the dedicated ResourceType Taxonomy page; the contrast is summarized here only to fix the boundary.
GOTCHA: in the MaxResourceCycles reduction, the 0.5 blend on {3,4,5} is the VectorAlu0/1/Any port-balance group (not an XLU blend), and the serial sum on {9..12} is the MemXfer{Input,Output}{Latency,Bandwidth} memory group (not an MXU group). The MXU pipes R[0]/R[1] are in the plain-MAX group and overlap, not serialize.
Related Components
| Component | Relationship |
|---|---|
| CycleTable Family | Provides GetResource (op→slot) and GetCyclesForThroughput (slot weight) |
| Bundle-Aware Cost | The GetSubset / loop composition that consumes MaxResourceCycles |
| ResourceType Taxonomy | The scheduler's distinct 47-ID concurrency-cap enum |
| Cost Model Overview | The three class families that build the per-gen Performance slot data |
Cross-References
- Cost Model Overview: the factory dispatch and per-gen class table that produce ResourceVector data
- CycleTable Family: the Instruction bucket enum and the per-gen throughput cycles deposited into slots
- Bundle-Aware Cost: GetSubset prologue/tail partition and the software-pipelined loop cost
- Per-Opcode Cycle Constants: the per-gen cycle values that fill each slot
- ResourceType Taxonomy: the scheduler's 47-ID AsyncTracker resource model (concurrency, not cycle weight)
- Scheduler Overview: the LatencyHidingScheduler consumer of both resource models
- MXU Slot: the LLO MXU opcodes that map to Matmul / Matpush / Xlu slots