Per-Opcode Cycle Constants
Every offset, value, and address on this page was read byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ. All .rodata addresses are virtual addresses; for this binary .rodata VMA == file offset (section [11] at 0x84a0000).
Abstract
This page is the data sheet for the TensorCore cost model: the actual cycle-cost integers baked into libtpu.so's .rodata, grouped by generation and engine, plus the rule by which the bundle-latency cost model sums them. The numbers come from two places that a reimplementer must keep distinct. (1) The throughput cycle of a cycle class is the value a per-gen CycleTable returns from GetCyclesForThroughput — for Jellyfish/Dragonfish a flat read Performance[offsetLUT[class]], for Pufferfish and later a switch that calls Performance::GetResourceUsage(instruction_id, resource). (2) The latency of an op is a separate per-gen Performance array entry the LatencyTable* reads. Both arrays live in the same per-gen Performance object; the cycle constants below are the contents of those arrays.
The oldest gens (JF/DF) are fully byte-pinned here: a 33-entry int64 offset LUT, the 16-of-33 priced subset, the seven 8-cycle MXU cells, the nine 1-cycle vector cells, and the constructor source blocks that fill them. For Pufferfish through 6acc60406, the per-class throughput integers are pinned from the GetResourceUsage(instr,res) call constants in each gen's switch, and the per-gen latency-value distributions are pinned from the constructor store stream; the full per-(instruction × resource) 2D grids are partially extracted (flagged below).
The contract this page documents:
- A cycle class's throughput cost is Performance[offset] (JF/DF) or GetResourceUsage(instr,res) (PF+). The default for an unpriced class is 1 cycle.
- The baked grid defaults to the sentinel 0x7FFFFFFF ("unsupported / not schedulable") and is overwritten by the per-gen Performance constructor with a hand-crafted broadcast/block-copy pattern.
- Dragonfish == Jellyfish except for two cells ([+0x28], [+0x2c]), neither a throughput-LUT target, so the JF and DF throughput tables are byte-identical.
- Bundle issue cost is per-lane max, not a global sum. Per-op cycles accumulate into a ResourceVector keyed on GetResource(class); same-lane costs add, cross-lane costs overlap.
- Transcendentals are constants from scalar virtual overrides, not from the grid.
| Field | Value |
|---|---|
| Offset LUT (JF/DF) | qword_B438B70 @ 0xb438b70 — 33 × int64, class → Performance byte offset |
| Resource LUT | dword_B438AEC @ 0xb438aec — 33 × int32, class → ResourceVector slot |
| Priced mask (JF) | 0x19FFC0821 — 16 classes priced; rest default 1 |
| Sentinel | dword_84A2F60 @ 0x84a2f60 = 0x7FFFFFFF (unsupported) |
| Broadcast scalars | dword_84A2B08 = 1, dword_84A2854 = 2, dword_84A2D0C = 8 |
| PerformanceJf ctor | 0x1d4930c0 — 116 stores → 419 distinct int32 cells |
| PerformanceDf ctor | 0x1d493060 — JF + one 0xD00000042 store at +0x28 |
| Grid width | 0xe00 bytes (Performance::Performance @ 0x1d492900) |
| Confidence | CONFIRMED (byte-anchored) for JF/DF; per-class PF+ CONFIRMED; full 2D grids PARTIAL |
Where The Constants Live
Every per-gen Performance object is one flat 0xe00-byte allocation. The base constructor Performance::Performance @ 0x1d492900 broadcast-fills the entire array with the sentinel 0x7FFFFFFF (vbroadcastss xmm0, dword_84A2F60), meaning "this cell is unsupported / not schedulable." The derived per-gen constructor then overwrites the supported subset with packed constants.
There are three idioms a reimplementer will see in every Performance constructor:
| Idiom | Disassembly | Effect |
|---|---|---|
| Broadcast fill | vbroadcastss xmm, dword_XXX + vmovups [base+off], xmm | writes 4 identical cells |
| 16-byte block copy | vmovaps xmm, xmmword_YYY + vmovups [base+off], xmm | writes 4 cells from a 4-int32 .rodata block |
| scalar / movabs store | mov [base+off], imm / movabs rax,imm64; mov [base+off],rax | writes 1 or 2 cells |
The three broadcast scalars carry the common cycle values: dword_84A2B08 = 1 (1-cycle ops), dword_84A2854 = 2 (2-cycle bucket), dword_84A2D0C = 8 (the 8-cycle MXU bucket). The distinctive per-instruction clusters come from 16-byte blocks; the 15 distinct blocks used by PerformanceJf are enumerated below.
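The three store idioms can be emulated in a few lines, which is enough to rebuild a grid image from a recorded store stream. The sketch below is a minimal illustration, not the constructor itself: the `PerfImage` type and the `perf_*` helper names are hypothetical, offsets are assumed to be 4-byte aligned and relative to the start of the 0xe00-byte allocation, and any header or vtable bytes are not modeled.

```c
#include <stdint.h>
#include <string.h>

#define GRID_BYTES 0xe00u      /* allocation size quoted on this page */
#define SENTINEL   0x7FFFFFFF  /* "unsupported / not schedulable" */

/* Hypothetical image type: the allocation viewed as int32 cells. */
typedef struct { int32_t cell[GRID_BYTES / 4]; } PerfImage;

/* Base-constructor step: every cell starts at the sentinel. */
static void perf_fill_sentinel(PerfImage *img) {
    for (size_t i = 0; i < GRID_BYTES / 4; ++i) img->cell[i] = SENTINEL;
}

/* Broadcast fill: one scalar written to 4 consecutive cells. */
static void perf_bcast4(PerfImage *img, uint32_t off, int32_t v) {
    for (int i = 0; i < 4; ++i) img->cell[off / 4 + i] = v;
}

/* 16-byte block copy: 4 cells taken from a 4-int32 source block. */
static void perf_block4(PerfImage *img, uint32_t off, const int32_t blk[4]) {
    memcpy(&img->cell[off / 4], blk, 16);
}

/* movabs store: a 64-bit immediate lands as two little-endian int32 cells. */
static void perf_store_q(PerfImage *img, uint32_t off, uint64_t imm) {
    img->cell[off / 4]     = (int32_t)(uint32_t)(imm & 0xFFFFFFFFu);
    img->cell[off / 4 + 1] = (int32_t)(uint32_t)(imm >> 32);
}

/* Read one cell back by byte offset. */
static int32_t perf_read(const PerfImage *img, uint32_t off) {
    return img->cell[off / 4];
}
```

Replaying stores in constructor order on top of `perf_fill_sentinel` leaves every untouched cell at the sentinel, which is how unsupported cells stay distinguishable from priced ones.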
Jellyfish / Dragonfish — The Flat Offset-LUT (byte-exact)
JF/DF do not use a (instruction, resource) switch. JfCycleTable::GetCyclesForThroughput reads the Performance grid directly through a 33-entry int64 offset LUT:
// xla::jellyfish::JfCycleTable::GetCyclesForThroughput @ 0x1c89dce0 (decompiled, exact)
int64_t GetCyclesForThroughput(JfCycleTable *this, uint32_t cls) {
    if (((cls < 0x21) & (uint8_t)(0x19FFC0821ULL >> cls)) == 1) // priced?
        return *(uint32_t *)(this->performance/*+0x10*/ + qword_B438B70[cls]);
    return 1; // default
}
The priced mask 0x19FFC0821 selects 16 of the 33 classes: {0x00, 0x05, 0x0b, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1f, 0x20}. The offset LUT (qword_B438B70 @ 0xb438b70) is 0 for every unpriced class — harmless, since the read is short-circuited. The priced rows, with the byte offset, the resolved value in the rebuilt PerformanceJf image, and the constructor idiom that wrote the cell:
| Class | offsetLUT[cls] | Resource | JF cyc | DF cyc | Source idiom |
|---|---|---|---|---|---|
| 0x00 | 0x910 | R[1] Matmul | 8 | 8 | block @0xa2da220 {8,8,8,1}[0] |
| 0x05 | 0x92c | R[0] Matpush | 8 | 8 | bcast 8 (@0x84a2d0c) |
| 0x0b | 0x92c | R[0] Matpush | 8 | 8 | bcast 8 (shares offset with 0x05) |
| 0x12 | 0x33c | R[4] VectorAlu1 | 1 | 1 | bcast 1 (@0x84a2b08) |
| 0x13 | 0x340 | R[4] VectorAlu1 | 1 | 1 | bcast 1 |
| 0x14 | 0x344 | R[3] VectorAlu0 | 1 | 1 | bcast 1 |
| 0x15 | 0x39c | R[5] VectorAluAny | 1 | 1 | bcast 1 |
| 0x16 | 0x398 | R[5] VectorAluAny | 1 | 1 | bcast 1 |
| 0x17 | 0x954 | R[2] Xlu | 8 | 8 | bcast 8 |
| 0x18 | 0x3f8 | R[6] VectorEup | 1 | 1 | imm 1 |
| 0x19 | 0x368 | R[5] VectorAluAny | 1 | 1 | bcast 1 |
| 0x1a | 0x3f4 | R[6] VectorEup | 1 | 1 | bcast 1 |
| 0x1b | 0x960 | R[2] Xlu | 8 | 8 | bcast 8 |
| 0x1c | 0x94c | R[2] Xlu | 8 | 8 | block @0xa2cf810 {8,1,1,8}[3] |
| 0x1f | 0x958 | R[2] Xlu | 8 | 8 | bcast 8 |
| 0x20 | 0x39c | R[5] VectorAluAny | 1 | 1 | bcast 1 (shares offset with 0x15) |
The seven 8-cycle cells are the MXU matprep/matmul/matrix-result throughput ports (0x00, 0x05, 0x0b, 0x17, 0x1b, 0x1c, 0x1f); the nine 1-cycle cells are the vector/EUP/cross-lane result stages. Two pairs share an offset (0x05/0x0b → 0x92c; 0x15/0x20 → 0x39c). None of the priced offsets is 0x28 or 0x2c, so the Dragonfish two-cell override never touches a throughput cell.
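The mask test and offset lookup can be exercised without the binary by transcribing the priced rows of the table above. The following is a sketch under stated assumptions: the `jf_*` names are hypothetical, and `jf_cell_value` is a stand-in for the real grid read that simply encodes what the table shows for the 16 listed offsets (the 0x9xx band holds 8, the 0x3xx band holds 1). It is not valid for any other offset.

```c
#include <stdint.h>

#define JF_PRICED_MASK 0x19FFC0821ULL
#define JF_NUM_CLASSES 0x21u /* 33 cycle classes */

/* Byte offsets of the 16 priced classes, copied from the table; 0 = unpriced. */
static const uint16_t kJfOffset[JF_NUM_CLASSES] = {
    [0x00] = 0x910, [0x05] = 0x92c, [0x0b] = 0x92c,
    [0x12] = 0x33c, [0x13] = 0x340, [0x14] = 0x344,
    [0x15] = 0x39c, [0x16] = 0x398, [0x17] = 0x954,
    [0x18] = 0x3f8, [0x19] = 0x368, [0x1a] = 0x3f4,
    [0x1b] = 0x960, [0x1c] = 0x94c, [0x1f] = 0x958,
    [0x20] = 0x39c,
};

/* Range check first, so the shift count is always below 64. */
static int jf_is_priced(uint32_t cls) {
    return cls < JF_NUM_CLASSES && ((JF_PRICED_MASK >> cls) & 1u);
}

/* Stand-in for the grid read: only meaningful for the 16 offsets above. */
static uint32_t jf_cell_value(uint32_t off) {
    return off >= 0x900 ? 8u : 1u;
}

/* Priced classes read their cell; everything else costs the default 1. */
static uint32_t jf_cycles_for_throughput(uint32_t cls) {
    if (jf_is_priced(cls)) return jf_cell_value(kJfOffset[cls]);
    return 1;
}

/* Number of set bits in the mask within the 33-class range. */
static unsigned jf_priced_count(void) {
    unsigned n = 0;
    for (uint32_t c = 0; c < JF_NUM_CLASSES; ++c) n += (unsigned)jf_is_priced(c);
    return n;
}
```

Checking the range before shifting is a deliberate difference from the decompiled form, which evaluates the shift unconditionally; the result is the same for every class below 0x21.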
NOTE — these values were re-derived, not transcribed. The full
PerformanceJf image was rebuilt by emulating the base sentinel fill plus every constructor store (verified vmovaps/vbroadcastss/movabs stream from 0x1d4930c0), then read back at offsetLUT[cls]. The MXU-band stores are explicit in the constructor: [+0x910] ← xmmword_A2DA220 {8,8,8,1}, [+0x920..+0x980] ← bcast 8, [+0x940] ← xmmword_A2CF810 {8,1,1,8}. The 16-byte source blocks were read directly from .rodata and all 15 match (see below).
The 15 PerformanceJf constructor source blocks
Each is four int32. Read byte-exact from .rodata:
| Block | Value | Block | Value | Block | Value |
|---|---|---|---|---|---|
| @0xa2c2f40 | {5,1,1,1} | @0xa2c5b90 | {1,1,1,4} | @0xa2c5ba0 | {1,1,5,5} |
| @0xa2c8a30 | {4,105,7,92} | @0xa2c8a40 | {2,1,1,1} | @0xa2cea00 | {2,2,1,1} |
| @0xa2cf810 | {8,1,1,8} | @0xa2d2090 | {1,1,4,4} | @0xa2d2df0 | {1,1,1,2} |
| @0xa2d3c30 | {8,1,1,1} | @0xa2d7660 | {1,1,8,1} | @0xa2da220 | {8,8,8,1} |
| @0xa2daf10 | {4,2,1,1} | @0xa2db650 | {1,1,2,2} | @0xa2dcd30 | {88,8,4,1} |
The constructor issues 116 store ops (17 block copies + 81 broadcast fills + 9 movabs qwords + 9 dword imms) writing exactly 419 distinct int32 cells with no overlap; the remaining cells stay at the 0x7FFFFFFF sentinel. Only ~31 of the 419 cells are semantically bound (16 throughput-LUT targets + ~15 latency-table cells); the rest are populated but their consumer is not yet mapped (LOW confidence on per-cell role).
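The store and cell totals follow directly from the per-idiom counts, given that no two stores overlap. A small sketch of that bookkeeping, with a hypothetical `StoreKind` record and helper names, makes the arithmetic checkable:

```c
#include <stddef.h>

/* One row per store idiom: how many stores, and how many int32 cells each writes. */
typedef struct {
    const char *idiom;
    unsigned    count;      /* stores of this kind in the JF constructor (from the text) */
    unsigned    cells_each; /* int32 cells written by one such store */
} StoreKind;

static const StoreKind kJfStores[] = {
    {"16-byte block copy", 17, 4},
    {"broadcast fill",     81, 4},
    {"movabs qword",        9, 2},
    {"dword immediate",     9, 1},
};
#define JF_STORE_KINDS (sizeof kJfStores / sizeof kJfStores[0])

/* Total number of store operations across all idioms. */
static unsigned jf_total_stores(void) {
    unsigned n = 0;
    for (size_t i = 0; i < JF_STORE_KINDS; ++i) n += kJfStores[i].count;
    return n;
}

/* Total cells written; equals the distinct-cell count only because stores do not overlap. */
static unsigned jf_total_cells(void) {
    unsigned n = 0;
    for (size_t i = 0; i < JF_STORE_KINDS; ++i)
        n += kJfStores[i].count * kJfStores[i].cells_each;
    return n;
}
```

The same two sums are a cheap sanity check when parsing another generation's constructor: a cell total larger than the distinct-cell count signals overlapping stores.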
The Dragonfish delta
PerformanceDf::PerformanceDf @ 0x1d493060 calls PerformanceJf::PerformanceJf, swaps the vtable, then issues exactly one store:
// PerformanceDf::PerformanceDf @ 0x1d493060 (decompiled, exact)
PerformanceJf::PerformanceJf(this, ids);
*(void **)this = /* Df vtable */;
*((uint64_t *)this + 5) = 0xD00000042ULL; // byte offset 0x28
*((uint64_t*)this + 5) is byte offset 5 × 8 = 0x28; the int64 0x0000000D_00000042 decodes little-endian to [+0x28] = 0x42 = 66 and [+0x2c] = 0x0d = 13 (JF holds 88 and 8 in the same cells, so the first value drops and the second rises). These two cells feed the matmul/matprep base-latency slots of the LatencyTableJellyfish, not the throughput LUT; they are the cost-model reflection of the v2→v3 MXU-count/clock change. Every other cell is identical (verified by diffing the two rebuilt images: exactly two cells change).
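The decoding of the single Dragonfish store can be checked mechanically. This sketch uses hypothetical helper names and assumes only what x86-64 guarantees: an 8-byte store places the low dword at the lower address.

```c
#include <stdint.h>

/* The two int32 cells produced by one 8-byte little-endian store. */
typedef struct {
    uint32_t lo_cell; /* cell at the store's byte offset */
    uint32_t hi_cell; /* cell at byte offset + 4 */
} QwordCells;

/* Split a 64-bit immediate into the low and high dwords. */
static QwordCells split_qword(uint64_t imm) {
    QwordCells c = { (uint32_t)(imm & 0xFFFFFFFFu), (uint32_t)(imm >> 32) };
    return c;
}

/* Byte offset of element idx when the object is indexed as uint64_t[]. */
static uint32_t qword_index_to_offset(uint32_t idx) {
    return idx * 8u;
}
```

Calling `split_qword(0xD00000042ULL)` yields the two Dragonfish cell values, and `qword_index_to_offset(5)` gives the 0x28 offset used in the decompiled store.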
Pufferfish — switch Over GetResourceUsage (per-class CONFIRMED)
From Pufferfish on, GetCyclesForThroughput is a switch over the cycle class; each case names a Performance instruction id and a resource and reads the grid through GetResourceUsage:
// xla::jellyfish::PfCycleTable::GetCyclesForThroughput @ 0x1c89de60 (decompiled, exact shape)
switch (cls) {
    case 0: return PufferfishPerformance::GetResourceUsage(perf, 125, 9);
    case 1: return PufferfishPerformance::GetResourceUsage(perf, 137, 9);
    case 5: case 11: return PufferfishPerformance::GetResourceUsage(perf, 220, 11);
    case 6: case 12: return PufferfishPerformance::GetResourceUsage(perf, 223, 11);
    case 9: case 15: return PufferfishPerformance::GetResourceUsage(perf, 224, 11);
    case 10: case 16: LogFatal("Unsupported PushGainsS4.", /*cycle_table.cc:575*/);
    case 23: return PufferfishPerformance::GetResourceUsage(perf, 260, 11);
    case 24: return PufferfishPerformance::GetResourceUsage(perf, 107, 3);
    case 26: return PufferfishPerformance::GetResourceUsage(perf, 106, 3);
    case 27: return PufferfishPerformance::GetResourceUsage(perf, 264, 11);
    case 28: return PufferfishPerformance::GetResourceUsage(perf, 244, 11);
    case 29: return PufferfishPerformance::GetResourceUsage(perf, 252, 11);
    case 31: return PufferfishPerformance::GetResourceUsage(perf, 262, 11);
    default: return 1;
}
The (instruction_id, resource) pairs are the primary constants. Resolving each against the Pufferfish Performance latency array gives the per-class cost:
| Class | (instr, res) | Perf value | Role |
|---|---|---|---|
| 0x00 | (125, 9) | 101 | bf16 matprep |
| 0x01 | (137, 9) | 101 | bf16 matprep' |
| 0x05/0x0b | (220, 11) | 7 | bf16 latch |
| 0x06/0x0c | (223, 11) | 7 | int8 latch |
| 0x09/0x0f | (224, 11) | 7 | fp8 latch |
| 0x0a/0x10 | — | fatal | PushGainsS4 (cycle_table.cc:575) |
| 0x17 | (260, 11) | 69 | matrix-result TC |
| 0x18 | (107, 3) | 7 | r/w transpose |
| 0x1a | (106, 3) | 7 | lane compare |
| 0x1b | (264, 11) | 79 | matrix-result primary |
| 0x1c | (244, 11) | 126 | matrix-result secondary |
| 0x1d | (252, 11) | 126 | EUP primary |
| 0x1f | (262, 11) | 69 | transcendental class |
The resource arguments expose the lane: 9 = matmul-issue (PF only — later gens use 11), 11 = XLU MRB, 3 = XLU input slot. The PushGainsS4 cases were declared in the enum but their cost is intentionally a fatal log, keeping those modes out of the schedulable set.
NOTE — the
(instr, res) value differs from the cross-gen throughput column for the same class. GetResourceUsage(125, 9) returns 101 (instruction 125's resource-9 cell, the bf16-matprep view), whereas the per-class throughput table below lists Pufferfish class 0x00 as 79. The two numbers are different columns of the 2D (instruction × resource) grid; the throughput row used by the scheduler reads a different resource slot than the res=9 matmul-issue cell. A reimplementer must not assume the two views are the same number. Both are byte-anchored against the PufferfishPerformance array; the per-resource cell that the bundle cost model actually consumes is flagged PARTIAL in the 2D-grid section.
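Because the (instr, res) pairs are the primary constants, a reimplementation can keep them as data and defer the grid read. The sketch below transcribes the switch cases into a table. The `PfKind`, `PfCase`, `PfLookup`, and `pf_lookup` names are hypothetical, and the grid indexing behind GetResourceUsage is deliberately not modeled, since the 2D layout is only partially extracted.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { PF_DEFAULT, PF_PRICED, PF_FATAL } PfKind;

/* One switch case: cycle class -> (instruction id, resource). */
typedef struct { uint8_t cls; uint16_t instr; uint8_t res; } PfCase;

/* Result of resolving a class; instr/res are meaningful only for PF_PRICED. */
typedef struct { PfKind kind; uint16_t instr; uint8_t res; } PfLookup;

/* Transcribed from the Pufferfish switch; shared cases are listed twice. */
static const PfCase kPfCases[] = {
    {0, 125, 9},   {1, 137, 9},
    {5, 220, 11},  {11, 220, 11},
    {6, 223, 11},  {12, 223, 11},
    {9, 224, 11},  {15, 224, 11},
    {23, 260, 11}, {24, 107, 3},  {26, 106, 3},
    {27, 264, 11}, {28, 244, 11}, {29, 252, 11}, {31, 262, 11},
};

static PfLookup pf_lookup(uint32_t cls) {
    PfLookup r = { PF_DEFAULT, 0, 0 };
    /* Classes 10 and 16 are the PushGainsS4 cases that log fatally. */
    if (cls == 10 || cls == 16) { r.kind = PF_FATAL; return r; }
    for (size_t i = 0; i < sizeof kPfCases / sizeof kPfCases[0]; ++i) {
        if (kPfCases[i].cls == cls) {
            r.kind  = PF_PRICED;
            r.instr = kPfCases[i].instr;
            r.res   = kPfCases[i].res;
            return r;
        }
    }
    return r; /* unlisted classes take the default cost of 1 */
}
```

Keeping the fatal cases as an explicit kind, rather than a cost, mirrors the switch: those classes are excluded from scheduling instead of being priced.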
Per-Class Throughput Across Generations
Decoded from each gen's GetCyclesForThroughput (or …Helper) switch, cross-referenced against the per-gen Performance latency arrays. These are per-bundle-issue throughput cycles, not per-instruction latency.
| Class | Role | JF/DF | Puff | Vip | Glite | 6acc60406 |
|---|---|---|---|---|---|---|
| 0x00 | Vector matprep, bf16 | 8 | 79 | 131 | 192 | 212 |
| 0x05 | Latch, bf16 | 8 | 79 | 131 | 192 | 212 |
| 0x0b | Transposed bf16 latch | 8 | 79 | 131 | 192 | 212 |
| 0x09 | Latch, fp8 | (dflt 1) | — | 114 | 192 | 204 |
| 0x12–0x16 | XLU rot / shuffle / bcast / reduce | 1 | 1 | 1 | 1 | 2 |
| 0x17 | Matrix-result read (TC) | 8 | 53 | 114 | 192 | 212 |
| 0x18/0x19/0x1a | RW-xpose / cross-lane / lane-cmp | 1 | 1 | 1 | 1 | 1 |
| 0x1b | Matrix-result, primary | 8 | 79 | 115 | 122 | 127 |
| 0x1c | Matrix-result, secondary | 8 | 30 | 30 | 49 | 49 |
| 0x1d | EUP unary primary | — | 1 | 1 | 13 | 10 |
| 0x1e | EUP unary secondary | — | — | — | 13 | 11 |
| sin/cos (scalar) | transcendental estimate | 198 | 198 | 154 | 142 | 142 |
| tan (scalar) | transcendental estimate | 219 | 219 | 170 | 151 | 151 |
— means the gen does not implement that class (the switch falls through to default 1). The matmul/matprep clusters match the per-format MXU cycles: bf16 Vf=131 / Gl=192 / Gf=212; fp8 Vf=114–115 / Gl=192 / Gf=204; fp32 shares bf16. These cross-check against the matmul mode modifiers and the per-gen MXU latency tables (JF/DF, and the per-gen Performance pages PF, VF).
Transcendental scalar costs (virtual overrides, CONFIRMED)
EstimateSinCosCost/EstimateTanCost are per-gen const virtuals returning a fixed estimate regardless of operand size:
| Gen | SinCos | Tan | Functions |
|---|---|---|---|
| Jellyfish / Dragonfish | 198 | 219 | 0x1c89dd20 / 0x1c89dd40 |
| Pufferfish | 198 | 219 | 0x1c89dfc0 / 0x1c89dfe0 |
| Viperfish | 154 | 170 | 0x1c89e480 / 0x1c89e4a0 |
| Ghostlite | 142 | 151 | 0x1c89ea00 / 0x1c89ea20 |
| 6acc60406 | 142 | 151 | 0x1c89f0e0 / 0x1c89f100 |
Each value was read from the function body (return <imm>;). The trend (sin/cos shrinking ~28 % from PF to GF, 198 → 142; tan ~31 %, 219 → 151) tracks the XLU pipeline / issue-throughput speedups across gens.
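Since the overrides return fixed immediates, they reduce to a per-generation lookup. A minimal sketch, assuming a hypothetical `Gen` enum in which `GEN_GF` stands for the generation this page labels 6acc60406 (abbreviated GF):

```c
/* Hypothetical generation index; GEN_GF is the 6acc60406 row of the table. */
typedef enum { GEN_JF_DF, GEN_PF, GEN_VF, GEN_GL, GEN_GF, GEN_COUNT } Gen;

typedef struct { int sincos; int tan; } TransCost;

/* Values copied from the per-gen table above. */
static const TransCost kTransCost[GEN_COUNT] = {
    [GEN_JF_DF] = {198, 219},
    [GEN_PF]    = {198, 219},
    [GEN_VF]    = {154, 170},
    [GEN_GL]    = {142, 151},
    [GEN_GF]    = {142, 151},
};

/* Fixed estimates, independent of operand size. */
static int estimate_sincos_cost(Gen g) { return kTransCost[g].sincos; }
static int estimate_tan_cost(Gen g)    { return kTransCost[g].tan; }

/* Relative reduction between two estimates, in percent. */
static double percent_drop(int from, int to) {
    return 100.0 * (from - to) / from;
}
```

`percent_drop(estimate_sincos_cost(GEN_PF), estimate_sincos_cost(GEN_GF))` quantifies the cross-generation trend from the table values alone.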
Per-Gen Latency-Array Distributions (CONFIRMED counts; full 2D PARTIAL)
The per-gen Performance object also holds the latency array the LatencyTable* reads. The value histograms below were extracted by parsing every mov dword [base+off], imm in each constructor. Constructor addresses and array sizes:
| Gen | Performance class | Ctor | num_instr | latency bytes | resource_count |
|---|---|---|---|---|---|
| Pufferfish | xla::pufferfish::PufferfishPerformance | 0x1c8be080 | 336 | 1344 | 20 |
| Viperfish | xla::viperfish::ViperfishPerformance | 0x1c8c4840 | 384 | 1536 | 28 |
| Ghostlite | xla::ghostlite::GhostlitePerformance | 0x1c8cbc80 | 476 | 1904 | 31 |
| 6acc60406 | (gxc::gfc Performance, unnamed sub_1C8D3740) | 0x1c8d3740 | 465 | 1860 | 31 |
Notable clusters in the latency histograms (full per-value tables omitted for brevity): Pufferfish concentrates at 1 (144 entries), 83/101 (48 each), 126 (16); Viperfish at 1/2 (147/113), 121 (25), 131 (27); Ghostlite at 1/2 (149/181), 182 (24), 192 (27); 6acc60406 at 1/2 (154/198), 204 (12), 212 (15). The 200+ cycle entries cluster at instruction ids 289..330 — the bf16/fp8 MXU latch/matmul/matres ops.
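The array sizes in the table are consistent with one 4-byte entry per instruction id. The check below is a sketch with a hypothetical `PerfShape` record; the 4-byte entry width is inferred from the table's own ratio, not read from the binary.

```c
#include <stddef.h>

/* Per-gen array shape, copied from the table above. */
typedef struct {
    const char *gen;
    unsigned    num_instr;
    unsigned    latency_bytes;
    unsigned    resource_count;
} PerfShape;

static const PerfShape kShapes[] = {
    {"Pufferfish", 336, 1344, 20},
    {"Viperfish",  384, 1536, 28},
    {"Ghostlite",  476, 1904, 31},
    {"6acc60406",  465, 1860, 31},
};
#define NUM_SHAPES (sizeof kShapes / sizeof kShapes[0])

/* Assumed layout: one int32 latency entry per instruction id. */
static int shape_is_consistent(const PerfShape *s) {
    return s->latency_bytes == s->num_instr * 4u;
}

/* True when every row of the table satisfies the 4-byte-entry relation. */
static int all_shapes_consistent(void) {
    for (size_t i = 0; i < NUM_SHAPES; ++i)
        if (!shape_is_consistent(&kShapes[i])) return 0;
    return 1;
}
```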
PARTIAL — the full
(instruction × resource) 2D grids are not enumerated here. Each Performance constructor issues 599 (PF) / 759 (VF) / 831 (GL) / 759 (GF) dword stores split between the latency array and the 2D resource_usage array; this page pins the per-class throughput constants (the named switch cases, CONFIRMED) and the latency-value distributions (CONFIRMED counts) but not every per-resource cell. Extraction is mechanical against Performance::GetResources()::kResources and is left as a follow-up. The per-gen Performance pages (PF, VF) track this.
How The Bundle-Latency Cost Model Sums Them
The cycle constants feed a two-layer accounting (the design is detailed in bundle-aware cost):
- Op level. For each LLO op, the emitter computes its cycle class (CycleTableInstruction for MXU ops; emitter Record* paths for vector/memory ops), reads GetCyclesForThroughput(class), and adds (double)cycles into the op's ResourceVector at slot GetResource(class). This is the AccumulateInstructionUsage → ResourceVector::Acc path ([rdi + Resource*8] += cycles).
- Bundle level. A VLIW bundle is one issue slot from the rate perspective, but multiple ops share it. The scheduler combines per-op resource vectors so that same-resource costs add (sequential on that functional unit) and different-resource costs overlap (the bundle's contribution is the per-lane max). The slowest occupied lane determines the bundle's issue cost.
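The two layers can be sketched as an accumulate step followed by a max over slots. This is an illustration of the rule, not the shipped code: the `rv_*` names are hypothetical, and the seven-slot width is an assumption taken from the R[0]..R[6] slots named in the JF table.

```c
#include <stddef.h>

#define NUM_RESOURCES 7 /* assumed width: R[0]..R[6] from the JF table */

/* One double per resource slot, matching the [base + Resource*8] stride. */
typedef struct { double slot[NUM_RESOURCES]; } ResourceVector;

static void rv_clear(ResourceVector *rv) {
    for (size_t i = 0; i < NUM_RESOURCES; ++i) rv->slot[i] = 0.0;
}

/* Op level: same-resource costs add. */
static void rv_acc(ResourceVector *rv, int resource, double cycles) {
    rv->slot[resource] += cycles;
}

/* Bundle level: different resources overlap, so the slowest lane wins. */
static double rv_bundle_cost(const ResourceVector *rv) {
    double worst = 0.0;
    for (size_t i = 0; i < NUM_RESOURCES; ++i)
        if (rv->slot[i] > worst) worst = rv->slot[i];
    return worst;
}
```

For a bundle holding one 8-cycle op on R[1], two 8-cycle ops on R[2], and one 1-cycle op on R[5], the accumulate step leaves 16 cycles on R[2], and that lane sets the bundle cost; a global sum would have charged 25.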
Memory-transfer ops are formula-based rather than a single constant: the cycle count per transfer window is bytes_per_window / (lane_count × dtype_bytes) × per-lane-cycle, composed in the cost_model_util::Record*MemXferCycles family. Matmul cost is K_tile × per-(format, mxu) cycle. Transcendentals add the scalar EstimateSinCosCost() / EstimateTanCost() on top of a per-step pipeline count. The absolute-time conversion uses Target::TensorCoreFrequencyInMegaHertz() (cycles → microseconds via the 1.0e6 factor at qword_A2E0208); the cost model itself works purely in cycles.
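The formulas in this paragraph translate directly into arithmetic. The functions below are a sketch with hypothetical names and parameters; they restate the formulas as written and do not reproduce how the binary composes them or where it applies the 1.0e6 constant. The time conversion relies only on the identity that a clock of f MHz completes f cycles per microsecond.

```c
/* Memory transfer: bytes_per_window / (lane_count * dtype_bytes) * per_lane_cycle. */
static double mem_xfer_cycles(double bytes_per_window, double lane_count,
                              double dtype_bytes, double per_lane_cycle) {
    return bytes_per_window / (lane_count * dtype_bytes) * per_lane_cycle;
}

/* Matmul: K_tile times the per-(format, mxu) cycle constant. */
static double matmul_cycles(double k_tile, double per_format_cycle) {
    return k_tile * per_format_cycle;
}

/* Cycles to microseconds at a clock given in MHz (f MHz = f cycles per microsecond). */
static double cycles_to_microseconds(double cycles, double freq_mhz) {
    return cycles / freq_mhz;
}
```

With illustrative inputs (not values from the binary), a 4096-byte window over 128 lanes of 2-byte elements at 1 cycle per lane step costs 16 cycles, and 2000 cycles at a 1000 MHz clock take 2 microseconds.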
NOTE — the shipped cost model is data-table driven, not ML. A
LearnedCostModelClientOptions proto exists for configuration, but no LearnedCostModelClient class instance ships in this binary (verified by symbol grep: only proto message methods). Every cost number above is a static .rodata constant; the "learned" code paths fall back to these tables. See learned cost-model client.
Cross-References
- CycleTable Family: the per-gen dispatcher that reads these constants; the cycle-class enum and resource LUT.
- JfCycleTable: the JF/DF flat offset-LUT read path in full.
- VfCycleTable: the Viperfish switch-over-GetResourceUsage path.
- Performance Family Overview · Performance JF/DF · PF · VF: the grids that hold these arrays.
- MXU Latency Overview · MXU Latency JF/DF: per-(MatmulModifier × Resource) matmul/matprep overrides.
- Matmul Mode Modifiers: the MatmulDataFormat/GainLatchMode modifiers that index the MXU cycle clusters.
- Resource Enum: the ResourceVector slots the per-op cycles accumulate into.
- Bundle-Aware Cost: the op-level vs bundle-level summation rule.
- Learned Cost-Model Client: why these static tables, not an ML predictor, drive the shipped cost model.
- Bundle Model Overview: the VLIW bundle layer.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part VII, Cost & Latency Model / Core model