Per-Opcode Cycle Constants

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ. All .rodata addresses are virtual addresses; for this binary .rodata VMA == file offset (section [11] at 0x84a0000).

Abstract

This page is the data sheet for the TensorCore cost model: the actual cycle-cost integers baked into libtpu.so's .rodata, grouped by generation and engine, plus the rule by which the bundle-latency cost model sums them. The numbers come from two places that a reimplementer must keep distinct. (1) The throughput cycle of a cycle class is the value a per-gen CycleTable returns from GetCyclesForThroughput — for Jellyfish/Dragonfish a flat read Performance[offsetLUT[class]], for Pufferfish and later a switch that calls Performance::GetResourceUsage(instruction_id, resource). (2) The latency of an op is a separate per-gen Performance array entry the LatencyTable* reads. Both arrays live in the same per-gen Performance object; the cycle constants below are the contents of those arrays.

The oldest gens (JF/DF) are fully byte-pinned here: a 33-entry int64 offset LUT, the 16-of-33 priced subset, the seven 8-cycle MXU cells, the nine 1-cycle vector cells, and the constructor source blocks that fill them. For Pufferfish through 6acc60406, the per-class throughput integers are pinned from the GetResourceUsage(instr,res) call constants in each gen's switch, and the per-gen latency-value distributions are pinned from the constructor store stream; the full per-(instruction × resource) 2D grids are partially extracted (flagged below).

The contract this page documents:

  • A cycle class's throughput cost is Performance[offset] (JF/DF) or GetResourceUsage(instr,res) (PF+). The default for an unpriced class is 1 cycle.
  • The baked grid defaults to the sentinel 0x7FFFFFFF ("unsupported / not schedulable") and is overwritten by the per-gen Performance constructor with a hand-crafted broadcast/block-copy pattern.
  • Dragonfish == Jellyfish except for two cells ([+0x28], [+0x2c]), neither a throughput-LUT target, so the JF and DF throughput tables are byte-identical.
  • Bundle issue cost is per-lane max, not a global sum. Per-op cycles accumulate into a ResourceVector keyed on GetResource(class); same-lane costs add, cross-lane costs overlap.
  • Transcendentals are constants from scalar virtual overrides, not from the grid.

| Item | Value |
| --- | --- |
| Offset LUT (JF/DF) | qword_B438B70 @ 0xb438b70: 33 × int64, class → Performance byte offset |
| Resource LUT | dword_B438AEC @ 0xb438aec: 33 × int32, class → ResourceVector slot |
| Priced mask (JF) | 0x19FFC0821: 16 classes priced; rest default 1 |
| Sentinel | dword_84A2F60 @ 0x84a2f60 = 0x7FFFFFFF (unsupported) |
| Broadcast scalars | dword_84A2B08 = 1, dword_84A2854 = 2, dword_84A2D0C = 8 |
| PerformanceJf ctor | 0x1d4930c0: 116 stores → 419 distinct int32 cells |
| PerformanceDf ctor | 0x1d493060: JF + one 0xD00000042 store at +0x28 |
| Grid width | 0xe00 bytes (Performance::Performance @ 0x1d492900) |
| Confidence | CONFIRMED (byte-anchored) for JF/DF; per-class PF+ CONFIRMED; full 2D grids PARTIAL |

Where The Constants Live

Every per-gen Performance object is one flat 0xe00-byte allocation. The base constructor Performance::Performance @ 0x1d492900 broadcast-fills the entire array with the sentinel 0x7FFFFFFF (vbroadcastss xmm0, dword_84A2F60), meaning "this cell is unsupported / not schedulable." The derived per-gen constructor then overwrites the supported subset with packed constants.

There are three idioms a reimplementer will see in every Performance constructor:

| Idiom | Disassembly | Effect |
| --- | --- | --- |
| Broadcast fill | vbroadcastss xmm, dword_XXX + vmovups [base+off], xmm | writes 4 identical cells |
| 16-byte block copy | vmovaps xmm, xmmword_YYY + vmovups [base+off], xmm | writes 4 cells from a 4-int32 .rodata block |
| scalar / movabs store | mov [base+off], imm / movabs rax,imm64; mov [base+off],rax | writes 1 or 2 cells |

The three broadcast scalars carry the common cycle values: dword_84A2B08 = 1 (1-cycle ops), dword_84A2854 = 2 (2-cycle bucket), dword_84A2D0C = 8 (the 8-cycle MXU bucket). The distinctive per-instruction clusters come from 16-byte blocks; the 15 distinct blocks used by PerformanceJf are enumerated below.


Jellyfish / Dragonfish — The Flat Offset-LUT (byte-exact)

JF/DF do not use a (instruction, resource) switch. JfCycleTable::GetCyclesForThroughput reads the Performance grid directly through a 33-entry int64 offset LUT:

// xla::jellyfish::JfCycleTable::GetCyclesForThroughput @ 0x1c89dce0 (decompiled, exact)
int64_t GetCyclesForThroughput(JfCycleTable *this, uint32_t cls) {
    if (((cls < 0x21) & (uint8_t)(0x19FFC0821ULL >> cls)) == 1)   // priced?
        return *(uint32_t *)(this->performance/*+0x10*/ + qword_B438B70[cls]);
    return 1;                                                      // default
}

The priced mask 0x19FFC0821 selects 16 of the 33 classes: {0x00, 0x05, 0x0b, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1f, 0x20}. The offset LUT (qword_B438B70 @ 0xb438b70) is 0 for every unpriced class — harmless, since the read is short-circuited. The priced rows, with the byte offset, the resolved value in the rebuilt PerformanceJf image, and the constructor idiom that wrote the cell:

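| Class | offsetLUT[cls] | Resource | JF cyc | DF cyc | Source idiom |
| --- | --- | --- | --- | --- | --- |
| 0x00 | 0x910 | R[1] Matmul | 8 | 8 | block @0xa2da220 {8,8,8,1}[0] |
| 0x05 | 0x92c | R[0] Matpush | 8 | 8 | bcast 8 (@0x84a2d0c) |
| 0x0b | 0x92c | R[0] Matpush | 8 | 8 | bcast 8 (shares offset with 0x05) |
| 0x12 | 0x33c | R[4] VectorAlu1 | 1 | 1 | bcast 1 (@0x84a2b08) |
| 0x13 | 0x340 | R[4] VectorAlu1 | 1 | 1 | bcast 1 |
| 0x14 | 0x344 | R[3] VectorAlu0 | 1 | 1 | bcast 1 |
| 0x15 | 0x39c | R[5] VectorAluAny | 1 | 1 | bcast 1 |
| 0x16 | 0x398 | R[5] VectorAluAny | 1 | 1 | bcast 1 |
| 0x17 | 0x954 | R[2] Xlu | 8 | 8 | bcast 8 |
| 0x18 | 0x3f8 | R[6] VectorEup | 1 | 1 | imm 1 |
| 0x19 | 0x368 | R[5] VectorAluAny | 1 | 1 | bcast 1 |
| 0x1a | 0x3f4 | R[6] VectorEup | 1 | 1 | bcast 1 |
| 0x1b | 0x960 | R[2] Xlu | 8 | 8 | bcast 8 |
| 0x1c | 0x94c | R[2] Xlu | 8 | 8 | block @0xa2cf810 {8,1,1,8}[3] |
| 0x1f | 0x958 | R[2] Xlu | 8 | 8 | bcast 8 |
| 0x20 | 0x39c | R[5] VectorAluAny | 1 | 1 | bcast 1 (shares offset with 0x15) |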

The seven 8-cycle cells are the MXU matprep/matmul/matrix-result throughput ports (0x00, 0x05, 0x0b, 0x17, 0x1b, 0x1c, 0x1f); the nine 1-cycle cells are the vector/EUP/cross-lane result stages. Two pairs share an offset (0x05/0x0b → 0x92c; 0x15/0x20 → 0x39c). None of the priced offsets is 0x28 or 0x2c, so the Dragonfish two-cell override never touches a throughput cell.
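The mask test and the 16-class list can be reproduced mechanically. The following is a minimal C sketch, not the binary's code: the helper names jf_is_priced and jf_priced_classes are hypothetical, and only the mask 0x19FFC0821 and the 0x21 class bound are taken from this page.

```c
#include <stdint.h>

#define JF_PRICED_MASK 0x19FFC0821ULL /* constant from the decompiled guard */
#define JF_NUM_CLASSES 0x21u          /* 33 cycle classes */

/* Returns 1 when the class has a Performance cell, 0 when it defaults to 1 cycle. */
static int jf_is_priced(uint32_t cls) {
    /* Bound check first, so the shift count is always below 64. */
    return cls < JF_NUM_CLASSES && ((JF_PRICED_MASK >> cls) & 1u) != 0;
}

/* Writes the priced class ids in ascending order and returns how many there are. */
static unsigned jf_priced_classes(uint32_t out[JF_NUM_CLASSES]) {
    unsigned n = 0;
    for (uint32_t cls = 0; cls < JF_NUM_CLASSES; ++cls) {
        if (jf_is_priced(cls)) {
            out[n++] = cls;
        }
    }
    return n;
}
```

Enumerating the set bits this way yields the same 16 classes listed for the mask, which is a quick guard against transcription errors when porting the mask to another tool.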

NOTE — these values were re-derived, not transcribed. The full PerformanceJf image was rebuilt by emulating the base sentinel fill plus every constructor store (verified vmovaps/vbroadcastss/movabs stream from 0x1d4930c0), then read back at offsetLUT[cls]. The MXU-band stores are explicit in the constructor: [+0x910] ← xmmword_A2DA220 {8,8,8,1}, [+0x920..+0x980] ← bcast 8, [+0x940] ← xmmword_A2CF810 {8,1,1,8}. The 16-byte source blocks were read directly from .rodata and all 15 match (see below).
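The rebuild procedure described in the note can be sketched as a small emulator. This is an illustration under stated assumptions, not the extraction tool itself: PerfGrid and the helper names are hypothetical, only the three MXU-band stores quoted in the note are replayed, the broadcast range is treated as half-open, and the 0x940 block copy is applied after the broadcast so the result does not depend on whether the fill skips that row.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRID_BYTES 0xe00u
#define SENTINEL   0x7FFFFFFF

typedef struct { int32_t cell[GRID_BYTES / 4]; } PerfGrid;

/* Base constructor: every cell starts as "unsupported". */
static void grid_init(PerfGrid *g) {
    for (size_t i = 0; i < GRID_BYTES / 4; ++i) g->cell[i] = SENTINEL;
}

/* Broadcast fill: one scalar repeated across a 16-byte row. */
static void store_bcast(PerfGrid *g, uint32_t off, int32_t v) {
    for (int i = 0; i < 4; ++i) g->cell[off / 4 + i] = v;
}

/* Block copy: four int32 taken from a .rodata source block. */
static void store_block(PerfGrid *g, uint32_t off, const int32_t b[4]) {
    memcpy(&g->cell[off / 4], b, 4 * sizeof(int32_t));
}

/* Reads one int32 cell at a byte offset (offset must be 4-byte aligned). */
static int32_t grid_read(const PerfGrid *g, uint32_t off) {
    return g->cell[off / 4];
}

/* Replays only the MXU-band stores quoted in the note. */
static void replay_jf_mxu_band(PerfGrid *g) {
    static const int32_t blk_a2da220[4] = {8, 8, 8, 1};
    static const int32_t blk_a2cf810[4] = {8, 1, 1, 8};
    grid_init(g);
    store_block(g, 0x910, blk_a2da220);
    /* Assumption: [+0x920..+0x980] is treated as a half-open range of rows. */
    for (uint32_t off = 0x920; off < 0x980; off += 0x10) store_bcast(g, off, 8);
    /* Assumption: the block copy is replayed last. */
    store_block(g, 0x940, blk_a2cf810);
}
```

Reading the replayed grid at the MXU offsets from the offset LUT (0x910, 0x92c, 0x94c, 0x954, 0x958, 0x960) returns 8 in each case, while any cell outside the replayed band still holds the sentinel.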

The 15 PerformanceJf constructor source blocks

Each is four int32. Read byte-exact from .rodata:

| Block | Value | Block | Value | Block | Value |
| --- | --- | --- | --- | --- | --- |
| @0xa2c2f40 | {5,1,1,1} | @0xa2c5b90 | {1,1,1,4} | @0xa2c5ba0 | {1,1,5,5} |
| @0xa2c8a30 | {4,105,7,92} | @0xa2c8a40 | {2,1,1,1} | @0xa2cea00 | {2,2,1,1} |
| @0xa2cf810 | {8,1,1,8} | @0xa2d2090 | {1,1,4,4} | @0xa2d2df0 | {1,1,1,2} |
| @0xa2d3c30 | {8,1,1,1} | @0xa2d7660 | {1,1,8,1} | @0xa2da220 | {8,8,8,1} |
| @0xa2daf10 | {4,2,1,1} | @0xa2db650 | {1,1,2,2} | @0xa2dcd30 | {88,8,4,1} |

The constructor issues 116 store ops (17 block copies + 81 broadcast fills + 9 movabs qwords + 9 dword imms) writing exactly 419 distinct int32 cells with no overlap; the remaining cells stay at the 0x7FFFFFFF sentinel. Only ~31 of the 419 cells are semantically bound (16 throughput-LUT targets + ~15 latency-table cells); the rest are populated but their consumer is not yet mapped (LOW confidence on per-cell role).
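The store and cell totals follow from the width of each idiom. A minimal sketch of that arithmetic, assuming (as stated above) that no two stores overlap; StoreMix and the function names are hypothetical:

```c
typedef struct {
    unsigned block_copies;  /* 16-byte copies, 4 cells each */
    unsigned bcast_fills;   /* 16-byte broadcasts, 4 cells each */
    unsigned movabs_qwords; /* 8-byte stores, 2 cells each */
    unsigned dword_imms;    /* 4-byte stores, 1 cell each */
} StoreMix;

/* Store mix reported for the PerformanceJf constructor. */
static const StoreMix kJfStoreMix = {17, 81, 9, 9};

/* Total number of store operations. */
static unsigned store_ops(const StoreMix *m) {
    return m->block_copies + m->bcast_fills + m->movabs_qwords + m->dword_imms;
}

/* Distinct int32 cells written, valid only when the stores do not overlap. */
static unsigned cells_written(const StoreMix *m) {
    return 4u * m->block_copies + 4u * m->bcast_fills
         + 2u * m->movabs_qwords + m->dword_imms;
}
```

With the reported mix this gives 116 stores and 419 cells, matching the totals quoted for the constructor.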

The Dragonfish delta

PerformanceDf::PerformanceDf @ 0x1d493060 calls PerformanceJf::PerformanceJf, swaps the vtable, then issues exactly one store:

// PerformanceDf::PerformanceDf @ 0x1d493060 (decompiled, exact)
PerformanceJf::PerformanceJf(this, ids);
*(void **)this = /* Df vtable */;
*((uint64_t *)this + 5) = 0xD00000042ULL;             // byte offset 0x28

*((uint64_t*)this + 5) is byte offset 5 × 8 = 0x28; the int64 0x0000000D_00000042 decodes (little-endian) to [+0x28] = 0x42 = 66 and [+0x2c] = 0x0d = 13. JF holds 88 and 8 in those cells, so the first drops (88 → 66) and the second rises (8 → 13). These two cells feed the matmul/matprep base-latency slots of the LatencyTableJellyfish, not the throughput LUT; they are the cost-model reflection of the v2→v3 MXU-count/clock change. Every other cell is identical (verified by diffing the two rebuilt images: exactly two cells change).
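The decode of the single Dragonfish store can be checked with a few lines of C. This is a sketch with hypothetical helper names; it assumes a little-endian target, where the low half of a 64-bit store lands at the lower address.

```c
#include <stdint.h>

typedef struct {
    int32_t lo; /* cell at the store address */
    int32_t hi; /* cell 4 bytes above the store address */
} CellPair;

/* Splits a 64-bit store into the two int32 cells it covers (little-endian). */
static CellPair split_qword_store(uint64_t q) {
    CellPair p;
    p.lo = (int32_t)(uint32_t)(q & 0xFFFFFFFFu);
    p.hi = (int32_t)(uint32_t)(q >> 32);
    return p;
}

/* Byte offset of element `index` when the object is viewed as a uint64_t array. */
static uint32_t qword_index_to_byte_offset(uint32_t index) {
    return index * 8u;
}
```

Applied to 0xD00000042 at index 5, the helpers give byte offset 0x28 with cells 66 and 13.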


Pufferfish — switch Over GetResourceUsage (per-class CONFIRMED)

From Pufferfish on, GetCyclesForThroughput is a switch over the cycle class; each case names a Performance instruction id and a resource and reads the grid through GetResourceUsage:

// xla::jellyfish::PfCycleTable::GetCyclesForThroughput @ 0x1c89de60 (decompiled, exact shape)
switch (cls) {
  case 0:        return PufferfishPerformance::GetResourceUsage(perf, 125, 9);
  case 1:        return PufferfishPerformance::GetResourceUsage(perf, 137, 9);
  case 5: case 11: return PufferfishPerformance::GetResourceUsage(perf, 220, 11);
  case 6: case 12: return PufferfishPerformance::GetResourceUsage(perf, 223, 11);
  case 9: case 15: return PufferfishPerformance::GetResourceUsage(perf, 224, 11);
  case 10: case 16: LogFatal("Unsupported PushGainsS4.");  // cycle_table.cc:575
  case 23:       return PufferfishPerformance::GetResourceUsage(perf, 260, 11);
  case 24:       return PufferfishPerformance::GetResourceUsage(perf, 107, 3);
  case 26:       return PufferfishPerformance::GetResourceUsage(perf, 106, 3);
  case 27:       return PufferfishPerformance::GetResourceUsage(perf, 264, 11);
  case 28:       return PufferfishPerformance::GetResourceUsage(perf, 244, 11);
  case 29:       return PufferfishPerformance::GetResourceUsage(perf, 252, 11);
  case 31:       return PufferfishPerformance::GetResourceUsage(perf, 262, 11);
  default:       return 1;
}

The (instruction_id, resource) pairs are the primary constants. Resolving each against the Pufferfish Performance latency array gives the per-class cost:

| Class | (instr, res) | Perf value | Role |
| --- | --- | --- | --- |
| 0x00 | (125, 9) | 101 | bf16 matprep |
| 0x01 | (137, 9) | 101 | bf16 matprep' |
| 0x05/0x0b | (220, 11) | 7 | bf16 latch |
| 0x06/0x0c | (223, 11) | 7 | int8 latch |
| 0x09/0x0f | (224, 11) | 7 | fp8 latch |
| 0x0a/0x10 | fatal | | PushGainsS4 (cycle_table.cc:575) |
| 0x17 | (260, 11) | 69 | matrix-result TC |
| 0x18 | (107, 3) | 7 | r/w transpose |
| 0x1a | (106, 3) | 7 | lane compare |
| 0x1b | (264, 11) | 79 | matrix-result primary |
| 0x1c | (244, 11) | 126 | matrix-result secondary |
| 0x1d | (252, 11) | 126 | EUP primary |
| 0x1f | (262, 11) | 69 | transcendental class |

The resource arguments expose the lane: 9 = matmul-issue (PF only — later gens use 11), 11 = XLU MRB, 3 = XLU input slot. The PushGainsS4 cases were declared in the enum but their cost is intentionally a fatal log, keeping those modes out of the schedulable set.
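For a reimplementation, the switch can be restated as a data table. The sketch below is one possible layout, not the binary's structure: PfCase, PfKind and pf_classify are hypothetical names, and only the class numbers and (instr, res) constants are copied from the decompiled switch. It stops short of the grid read itself, since the 2D grid is not fully enumerated on this page.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t cls; uint16_t instr; uint8_t res; } PfCase;

/* One row per case label in the Pufferfish switch. */
static const PfCase kPfCases[] = {
    {0, 125, 9},   {1, 137, 9},
    {5, 220, 11},  {11, 220, 11},
    {6, 223, 11},  {12, 223, 11},
    {9, 224, 11},  {15, 224, 11},
    {23, 260, 11}, {24, 107, 3},  {26, 106, 3},
    {27, 264, 11}, {28, 244, 11}, {29, 252, 11}, {31, 262, 11},
};

typedef enum { PF_DEFAULT_ONE, PF_GRID_READ, PF_FATAL } PfKind;

/* Classifies a cycle class; on PF_GRID_READ, *out points at the (instr, res) pair. */
static PfKind pf_classify(uint32_t cls, const PfCase **out) {
    *out = NULL;
    if (cls == 10 || cls == 16) return PF_FATAL; /* PushGainsS4 cases */
    for (size_t i = 0; i < sizeof kPfCases / sizeof kPfCases[0]; ++i) {
        if (kPfCases[i].cls == cls) {
            *out = &kPfCases[i];
            return PF_GRID_READ;
        }
    }
    return PF_DEFAULT_ONE; /* switch default: 1 cycle */
}
```

Keeping the fatal cases as a distinct result, rather than folding them into the default, preserves the property that those modes never receive a schedulable cost.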

NOTE — the (instr, res) value differs from the cross-gen throughput column for the same class. GetResourceUsage(125, 9) returns 101 (instruction 125's resource-9 cell, the bf16-matprep view), whereas the per-class throughput table below lists Pufferfish class 0x00 as 79. The two numbers are different columns of the 2D (instruction × resource) grid; the throughput row used by the scheduler reads a different resource slot than the res=9 matmul-issue cell. A reimplementer must not assume the two views are the same number. Both are byte-anchored against the Pufferfish Performance array; the per-resource cell that the bundle cost model actually consumes is flagged PARTIAL in the 2D-grid section.


Per-Class Throughput Across Generations

Decoded from each gen's GetCyclesForThroughput (or …Helper) switch, cross-referenced against the per-gen Performance latency arrays. These are per-bundle-issue throughput cycles, not per-instruction latency.

| Class | Role | JF/DF | Puff | Vip | Glite | 6acc60406 |
| --- | --- | --- | --- | --- | --- | --- |
| 0x00 | Vector matprep, bf16 | 8 | 79 | 131 | 192 | 212 |
| 0x05 | Latch, bf16 | 8 | 79 | 131 | 192 | 212 |
| 0x0b | Transposed bf16 latch | 8 | 79 | 131 | 192 | 212 |
| 0x09 | Latch, fp8 | (dflt 1) | - | 114 | 192 | 204 |
| 0x12..0x16 | XLU rot / shuffle / bcast / reduce | 1 | 1 | 1 | 1 | 2 |
| 0x17 | Matrix-result read (TC) | 8 | 53 | 114 | 192 | 212 |
| 0x18/0x19/0x1a | RW-xpose / cross-lane / lane-cmp | 1 | 1 | 1 | 1 | 1 |
| 0x1b | Matrix-result, primary | 8 | 79 | 115 | 122 | 127 |
| 0x1c | Matrix-result, secondary | 8 | 30 | 30 | 49 | 49 |
| 0x1d | EUP unary primary | 111310 (digit run; column split not recoverable) |
| 0x1e | EUP unary secondary | 1311 (digit run; column split not recoverable) |
| sin/cos (scalar) | transcendental estimate | 198 | 198 | 154 | 142 | 142 |
| tan (scalar) | transcendental estimate | 219 | 219 | 170 | 151 | 151 |

A dash (-) means the gen does not implement that class (the switch falls through to default 1). The matmul/matprep clusters match the per-format MXU cycles: bf16 Vf=131 / Gl=192 / Gf=212; fp8 Vf=114–115 / Gl=192 / Gf=204; fp32 shares bf16. These cross-check against the matmul mode modifiers and the per-gen MXU latency tables (JF/DF, and the per-gen Performance pages PF, VF).

Transcendental scalar costs (virtual overrides, CONFIRMED)

EstimateSinCosCost/EstimateTanCost are per-gen const virtuals returning a fixed estimate regardless of operand size:

| Gen | SinCos | Tan | Functions |
| --- | --- | --- | --- |
| Jellyfish / Dragonfish | 198 | 219 | 0x1c89dd20 / 0x1c89dd40 |
| Pufferfish | 198 | 219 | 0x1c89dfc0 / 0x1c89dfe0 |
| Viperfish | 154 | 170 | 0x1c89e480 / 0x1c89e4a0 |
| Ghostlite | 142 | 151 | 0x1c89ea00 / 0x1c89ea20 |
| 6acc60406 | 142 | 151 | 0x1c89f0e0 / 0x1c89f100 |

Each value was read from the function body (return <imm>;). The trend (sin/cos shrinking about 28 %, 198 → 142, from PF to GF; tan about 31 %, 219 → 151) tracks the XLU pipeline / issue-throughput speedups across gens.


Per-Gen Latency-Array Distributions (CONFIRMED counts; full 2D PARTIAL)

The per-gen Performance object also holds the latency array the LatencyTable* reads. The value histograms below were extracted by parsing every mov dword [base+off], imm in each constructor. Constructor addresses and array sizes:

| Gen | Performance class | Ctor | num_instr | latency bytes | resource_count |
| --- | --- | --- | --- | --- | --- |
| Pufferfish | xla::pufferfish::PufferfishPerformance | 0x1c8be080 | 336 | 1344 | 20 |
| Viperfish | xla::viperfish::ViperfishPerformance | 0x1c8c4840 | 384 | 1536 | 28 |
| Ghostlite | xla::ghostlite::GhostlitePerformance | 0x1c8cbc80 | 476 | 1904 | 31 |
| 6acc60406 | (gxc::gfc Performance, unnamed sub_1C8D3740) | 0x1c8d3740 | 465 | 1860 | 31 |

Notable clusters in the latency histograms (full per-value tables omitted for brevity): Pufferfish concentrates at 1 (144 entries), 83/101 (48 each), 126 (16); Viperfish at 1/2 (147/113), 121 (25), 131 (27); Ghostlite at 1/2 (149/181), 182 (24), 192 (27); 6acc60406 at 1/2 (154/198), 204 (12), 212 (15). The 200+ cycle entries cluster at instruction ids 289..330 — the bf16/fp8 MXU latch/matmul/matres ops.

PARTIAL — the full (instruction × resource) 2D grids are not enumerated here. Each Performance constructor issues 599 (PF) / 759 (VF) / 831 (GL) / 759 (GF) dword stores split between the latency array and the 2D resource_usage array; this page pins the per-class throughput constants (the named switch cases, CONFIRMED) and the latency-value distributions (CONFIRMED counts) but not every per-resource cell. Extraction is mechanical against Performance::GetResources()::kResources and is left as a follow-up. The per-gen Performance pages (PF, VF) track this.


How The Bundle-Latency Cost Model Sums Them

The cycle constants feed a two-layer accounting (the design is detailed in bundle-aware cost):

  1. Op level. For each LLO op, the emitter computes its cycle class (CycleTableInstruction for MXU ops; emitter Record* paths for vector/memory ops), reads GetCyclesForThroughput(class), and adds (double)cycles into the op's ResourceVector at slot GetResource(class). This is the AccumulateInstructionUsage → ResourceVector::Acc path ([rdi + Resource*8] += cycles).
  2. Bundle level. A VLIW bundle is one issue slot from the rate perspective, but multiple ops share it. The scheduler combines per-op resource vectors so that same-resource costs add (sequential on that functional unit) and different-resource costs overlap (the bundle's contribution is the per-lane max). The slowest occupied lane determines the bundle's issue cost.
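The two layers above can be sketched as follows. This is a simplified model under assumptions, not the scheduler's code: the struct layout, the slot count of 8, and the function names are hypothetical; only the rule (same slot adds, different slots overlap, the bundle costs the maximum slot) comes from the description above.

```c
#include <stddef.h>

#define NUM_RESOURCES 8 /* hypothetical slot count for this sketch */

typedef struct { double slot[NUM_RESOURCES]; } ResourceVector;

/* Resets every lane to zero cycles. */
static void rv_clear(ResourceVector *rv) {
    for (size_t i = 0; i < NUM_RESOURCES; ++i) rv->slot[i] = 0.0;
}

/* Op level: same-resource costs add up in that resource's slot. */
static void rv_accumulate(ResourceVector *rv, size_t resource, double cycles) {
    rv->slot[resource] += cycles;
}

/* Bundle level: different resources overlap, so the slowest lane sets the cost. */
static double rv_bundle_cost(const ResourceVector *rv) {
    double worst = 0.0;
    for (size_t i = 0; i < NUM_RESOURCES; ++i) {
        if (rv->slot[i] > worst) worst = rv->slot[i];
    }
    return worst;
}
```

As a worked case with the JF constants: two Xlu ops (slot 2, 8 cycles each), one Matmul op (slot 1, 8 cycles) and one VectorAlu1 op (slot 4, 1 cycle) accumulate to 16, 8 and 1 cycles per lane, so the bundle costs 16 rather than the global sum of 25.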

Memory-transfer ops are formula-based rather than a single constant: the per-byte cycle is bytes_per_window / (lane_count × dtype_bytes) × per-lane-cycle, composed in the cost_model_util::Record*MemXferCycles family. Matmul cost is K_tile × per-(format, mxu) cycle. Transcendentals add the scalar EstimateSinCosCost() / EstimateTanCost() on top of a per-step pipeline count. The absolute-time conversion uses Target::TensorCoreFrequencyInMegaHertz() (cycles → microseconds via the 1.0e6 factor at qword_A2E0208); the cost model itself works purely in cycles.
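The formulas in this paragraph can be written out directly. The sketch below is only the arithmetic as stated, with hypothetical function and parameter names; it does not reproduce the Record*MemXferCycles family, and the microsecond conversion uses the identity cycles / MHz = microseconds rather than the binary's exact sequence of operations.

```c
/* Memory transfer: bytes_per_window / (lane_count * dtype_bytes) * per_lane_cycle. */
static double mem_xfer_cycles(double bytes_per_window, double lane_count,
                              double dtype_bytes, double per_lane_cycle) {
    return bytes_per_window / (lane_count * dtype_bytes) * per_lane_cycle;
}

/* Matmul: number of K tiles times the per-(format, mxu) cycle constant. */
static double matmul_cycles(double k_tiles, double per_format_cycle) {
    return k_tiles * per_format_cycle;
}

/* Cycles to microseconds: a clock of f MHz completes f cycles per microsecond. */
static double cycles_to_microseconds(double cycles, double freq_mhz) {
    return cycles / freq_mhz;
}
```

For example, with illustrative inputs of a 4096-byte window, 128 lanes, 4-byte elements and 1 cycle per lane, the transfer costs 8 cycles; four K tiles at the Viperfish bf16 constant of 131 cost 524 cycles.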

NOTE — the shipped cost model is data-table driven, not ML. A LearnedCostModelClientOptions proto exists for configuration, but no LearnedCostModelClient class instance ships in this binary (verified by symbol grep — only proto message methods). Every cost number above is a static .rodata constant; the "learned" code paths fall back to these tables. See learned cost-model client.


Cross-References