JfCycleTable
Every offset, value, mask, and address on this page was read byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ. All .rodata addresses are virtual addresses; for this binary, .rodata VMA == file offset (section [11] at 0x84a0000), and .text VMA == file offset. The four LUTs below were dumped by struct.unpack against the raw .rodata bytes, and the lea displacements were re-resolved arithmetically against the disassembly.
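The dump procedure described above can be sketched in a few lines. This is a minimal sketch, not the tooling actually used for this page: `dump_lut` is a hypothetical helper name, it relies on the VMA == file offset property stated above, and it is exercised here on synthetic bytes rather than on the real libtpu.so.

```python
import struct

def dump_lut(image: bytes, vma: int, count: int, code: str) -> tuple:
    """Unpack `count` little-endian values of struct type `code` at `vma`.

    Assumes VMA == file offset, which the page states holds for .rodata in
    this one binary; another build would need a section-header lookup.
    """
    return struct.unpack_from(f"<{count}{code}", image, vma)

# Synthetic stand-in for the file: three int64 values placed at offset 0x10.
fake = bytes(0x10) + struct.pack("<3q", 0x910, 0x0, 0x92C)
print([hex(v) for v in dump_lut(fake, 0x10, 3, "q")])
```

Against the real file the same call shape would apply, for example `dump_lut(image, 0xb438b70, 33, "q")` for the int64 table and `"i"` for the int32 tables, with `image` holding the raw bytes of libtpu.so.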
Abstract
xla::jellyfish::JfCycleTable is the throughput half of the cost model for the two oldest TensorCore generations: Jellyfish (TpuVersion 0, v2) and Dragonfish (TpuVersion 1, v3). It is the only CycleTable subclass that reads its cycle numbers through a flat byte-offset LUT rather than a switch over Performance::GetResourceUsage; from Pufferfish onward the family changes shape (see CycleTable Family). This page is the byte-level transcription of the JF/DF read path: the 15-instruction GetCyclesForThroughput body, the 33-bit priced mask 0x19FFC0821, the full 33-entry int64 offsetLUT (@0xb438b70), the full 33-entry int32 resLUT (@0xb438aec), and the two MXU-modifier classifier LUTs (latchLUT @0xb4389f4, fmtLUT @0xb438ac4) that produce the outer index.
The single design fact to carry away: a JfCycleTable holds no cycle numbers. It holds a Performance* at +0x10 and a pair of .rodata LUTs that turn a cycle class (CycleTable::Instruction, a dense 0x00..0x20 enum, not an LLO opcode) into (a) a byte offset into that Performance grid and (b) a ResourceVector slot index. The grid itself — the 0xe00-byte PerformanceJf/PerformanceDf POD and its constructor source blocks — is documented in full on Performance: JF/DF; this page covers only the index logic over it and transcribes the four lookup tables verbatim.
The contract a reimplementer must honor:
- The throughput read is valid ? Performance[offsetLUT[cls]] : 1, where valid = (cls < 0x21) && ((0x19FFC0821 >> cls) & 1). Both the bound and the mask are required (see the GOTCHA below).
- Only 16 of the 33 classes are priced. The other 17 short-circuit to the default of 1 cycle. The offsetLUT slot of every unpriced class is literally 0x0.
- GetResource(cls) is a separate flat LUT (resLUT) returning 0..6. Those are direct slot indices into the 23-slot ResourceVector; the scheduler does ResourceVector[resLUT[cls]] += (double)cycles.
- The seven priced MXU ports cost 8 cycles; the nine priced vector/EUP stages cost 1. None of the priced offsets is +0x28/+0x2c, so the JF→DF two-cell delta never reaches the throughput LUT: the throughput table is byte-identical on JF and DF.
- Transcendentals are priced by scalar virtual overrides (EstimateSinCosCost = 198, EstimateTanCost = 219), not by the cycle-class LUT.
| Class | xla::jellyfish::JfCycleTable — serves TpuVersion 0 (Jellyfish v2) and 1 (Dragonfish v3) |
| Performance* | held at JfCycleTable+0x10 (read by GetCyclesForThroughput) |
| Throughput reader | JfCycleTable::GetCyclesForThroughput(Instruction) @0x1c89dce0 (vtable slot +0x10) |
| Throughput formula | valid = (cls < 0x21) && ((0x19FFC0821 >> cls) & 1); valid ? Performance[offsetLUT[cls]] : 1 |
| Priced mask | 0x19FFC0821 — 16 of 33 classes priced; rest default 1 |
| offsetLUT | qword_B438B70 @0xb438b70 — 33 × int64, class → Performance byte offset |
| Resource reader | CycleTable::GetResource(Instruction) @0x1c89ce20 (gen-invariant) |
| resLUT | dword_B438AEC @0xb438aec — 33 × int32, class → ResourceVector slot 0..6 |
| MXU classifier | CycleTableInstruction(LloInstruction*) @0x1c89ca80 (MXU band only) |
| latchLUT / fmtLUT | unk_B4389F4 @0xb4389f4 (52 × int32) / unk_B438AC4 @0xb438ac4 (10 × int32, indices 0..9) |
| Transcendentals | EstimateSinCosCost @0x1c89dd20 = 198; EstimateTanCost @0x1c89dd40 = 219 |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |
The Throughput Read Path — GetCyclesForThroughput
JfCycleTable::GetCyclesForThroughput is a short function: 15 instructions in this build. It bounds the cycle-class ordinal at < 0x21, tests it against the 33-bit valid mask, and on a hit indexes the Performance grid (held at JfCycleTable+0x10) by a byte offset pulled from the 33-entry int64 offsetLUT. On a miss it returns the default 1. Verbatim from the decompile:
// xla::jellyfish::JfCycleTable::GetCyclesForThroughput @0x1c89dce0 (decompiled, exact)
__int64 GetCyclesForThroughput(JfCycleTable *this, unsigned int cls) {
__int64 result = 1;
if ( ((cls < 0x21) & (unsigned __int8)(0x19FFC0821uLL >> cls)) == 1 )
return *(unsigned int *)( *(_QWORD *)(this + 0x10) // Performance*
+ qword_B438B70[cls] ); // offsetLUT[cls]
return result; // default 1
}
The raw machine code pins every constant (objdump -d -M intel):
1c89dce0: 89 f1 mov ecx,esi ; ecx = cls
1c89dce2: 83 fe 21 cmp esi,0x21 ; cls < 0x21 ?
1c89dce5: 0f 92 c0 (setb al) ; al = (cls < 0x21)
1c89dce8: 48 ba 21 08 fc 9f 01 movabs rdx,0x19ffc0821 ; the priced mask
1c89dcf2: 48 d3 ea shr rdx,cl ; rdx >>= cls
1c89dcf5: 20 c2 and dl,al ; dl = mask_bit & in_bound
1c89dcf7: b8 01 00 00 00 mov eax,0x1 ; default = 1
1c89dcfc: 80 fa 01 cmp dl,0x1 ; valid ?
( 75 .. jne 1c89dd15) ; -> ret 1
1c89dd01: 89 c8 mov eax,ecx ; eax = cls
1c89dd03: 48 8d 0d 66 ae b9 ee lea rcx,[rip+0xeeb9ae66] ; = 0xb438b70 (offsetLUT)
1c89dd0a: 48 8b 04 c1 mov rax,QWORD PTR [rcx+rax*8] ; rax = offsetLUT[cls]
1c89dd0e: 48 8b 4f 10 mov rcx,QWORD PTR [rdi+0x10] ; rcx = Performance*
1c89dd12: 8b 04 01 mov eax,DWORD PTR [rcx+rax*1] ; eax = Performance[offset]
1c89dd15: c3 ret
The lea displacement is RIP-relative to the address of the next instruction (0x1c89dd0a) and is a sign-extended 32-bit value. 0xeeb9ae66 has its top bit set, so it encodes -0x1146519a, and 0x1c89dd0a - 0x1146519a = 0xb438b70, the offsetLUT. (Adding the raw unsigned value gives 0x1_0b438b70; keeping the low 32 bits yields the same address. objdump annotates the target VA directly.) The Performance* is at [rdi+0x10] = JfCycleTable+0x10, and the cycle read is a 4-byte int32 load.
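The displacement arithmetic can be re-checked mechanically. The sketch below is an illustration only: `resolve_rip_relative` is a hypothetical helper, and the instruction address, length (7 bytes) and displacement are copied from the listing above.

```python
def resolve_rip_relative(insn_addr: int, insn_len: int, disp32: int) -> int:
    """Effective address of a RIP-relative operand.

    x86-64 sign-extends the 32-bit displacement and adds it to the address
    of the next instruction (insn_addr + insn_len).
    """
    next_ip = insn_addr + insn_len
    signed = disp32 - (1 << 32) if disp32 & 0x80000000 else disp32
    return (next_ip + signed) & 0xFFFFFFFFFFFFFFFF

# The 7-byte lea at 0x1c89dd03 with displacement 0xeeb9ae66.
print(hex(resolve_rip_relative(0x1C89DD03, 7, 0xEEB9AE66)))  # 0xb438b70
```

The same helper applies to any other `lea reg,[rip+disp32]` on this page, given the instruction's address and encoded length.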
GOTCHA: the bound and the mask are not redundant. A reimplementation must apply both cls < 0x21 and the 33-bit mask. The bound guards the 33-entry LUT against out-of-range reads; the mask selects the 16 priced classes inside the bound. Relying on the offsetLUT alone would read Performance[0] (the vtable pointer, not a cycle value) for every unpriced class, because the LUT slot of every unpriced class is literally 0x0. The mask short-circuits before that read is ever reached.
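One concrete way to see why the bound cannot be dropped: a 64-bit `shr` on x86-64 uses only the low 6 bits of the count in `cl`, so an out-of-range class can wrap onto a set mask bit. The sketch below models that behaviour; `mask_bit_x86` and `valid` are hypothetical names, and the model is an illustration rather than a transcription of the binary.

```python
PRICED_MASK = 0x19FFC0821

def mask_bit_x86(cls: int) -> int:
    """Model `shr rdx, cl`: a 64-bit shift uses only the low 6 bits of the count."""
    return (PRICED_MASK >> (cls & 63)) & 1

def valid(cls: int) -> bool:
    """Bound and mask together, as in the disassembly (cls is unsigned there)."""
    return cls < 0x21 and mask_bit_x86(cls) == 1

# 0x40 wraps to a shift of 0, so the mask alone would accept it;
# the cls < 0x21 bound is what rejects it.
print(mask_bit_x86(0x40), valid(0x40))
```

A reimplementation in a language with non-wrapping shifts (such as Python) would not hit this particular case, but it still needs the bound to keep the LUT index in range.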
The priced mask 0x19FFC0821
Reading the mask bit-for-bit ((0x19FFC0821 >> cls) & 1) partitions the 33-value enum into 16 priced and 17 unpriced classes:
mask = 0x1_9FFC_0821 = 0b1 1001 1111 1111 1100 0000 1000 0010 0001
bit: 32 ... 0
| | Classes | Count |
|---|---|---|
| Priced (read Performance[offsetLUT[cls]]) | 0x00, 0x05, 0x0b, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1f, 0x20 | 16 |
| Unpriced (return default 1) | 0x01, 0x02, 0x03, 0x04, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x1d, 0x1e | 17 |
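The partition in the table can be regenerated from the mask alone. A minimal sketch (the names `is_priced`, `priced` and `unpriced` are illustrative, not taken from the binary):

```python
PRICED_MASK = 0x19FFC0821
NUM_CLASSES = 0x21  # 33 cycle classes, 0x00..0x20

def is_priced(cls: int) -> bool:
    """True when cls is inside the bound and its mask bit is set."""
    return 0 <= cls < NUM_CLASSES and bool((PRICED_MASK >> cls) & 1)

priced = [c for c in range(NUM_CLASSES) if is_priced(c)]
unpriced = [c for c in range(NUM_CLASSES) if not is_priced(c)]

print([hex(c) for c in priced])
print(len(priced), len(unpriced))  # 16 17
```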
The offsetLUT — qword_B438B70 (33 × int64, byte-exact)
The class→byte-offset table, transcribed directly from .rodata @0xb438b70 as 33 little-endian int64. Each priced entry's resolved value is the cell read back from the reconstructed PerformanceJf in-memory image at that byte offset (every MXU-band cell is explicit in the PerformanceJf constructor; the 8-cycle cells come from xmmword_A2DA220 = {8,8,8,1} at [+0x910], the dword_84A2D0C = 8 broadcasts over [+0x920..+0x980], and xmmword_A2CF810 = {8,1,1,8} at [+0x940]).
| Class | offsetLUT[cls] | priced | JF/DF cyc | Source idiom (where the cell is written) |
|---|---|---|---|---|
| 0x00 | 0x910 | yes | 8 | block xmmword_A2DA220 = {8,8,8,1} [0] |
| 0x01 | 0x000 | no | 1 | - (mask short-circuits) |
| 0x02 | 0x000 | no | 1 | - |
| 0x03 | 0x000 | no | 1 | - |
| 0x04 | 0x000 | no | 1 | - |
| 0x05 | 0x92c | yes | 8 | bcast dword_84A2D0C = 8 ([+0x920..+0x930] run) |
| 0x06 | 0x000 | no | 1 | - |
| 0x07 | 0x000 | no | 1 | - |
| 0x08 | 0x000 | no | 1 | - |
| 0x09 | 0x000 | no | 1 | - |
| 0x0a | 0x000 | no | 1 | - |
| 0x0b | 0x92c | yes | 8 | bcast 8 (shares offset 0x92c with 0x05) |
| 0x0c | 0x000 | no | 1 | - |
| 0x0d | 0x000 | no | 1 | - |
| 0x0e | 0x000 | no | 1 | - |
| 0x0f | 0x000 | no | 1 | - |
| 0x10 | 0x000 | no | 1 | - |
| 0x11 | 0x000 | no | 1 | - |
| 0x12 | 0x33c | yes | 1 | bcast dword_84A2B08 = 1 ([+0x334] run) |
| 0x13 | 0x340 | yes | 1 | bcast 1 ([+0x344] run) |
| 0x14 | 0x344 | yes | 1 | bcast 1 |
| 0x15 | 0x39c | yes | 1 | bcast 1 ([+0x394] run) |
| 0x16 | 0x398 | yes | 1 | bcast 1 |
| 0x17 | 0x954 | yes | 8 | bcast 8 ([+0x950..+0x980] run) |
| 0x18 | 0x3f8 | yes | 1 | imm 1 (*((_DWORD*)this+254) = 1) |
| 0x19 | 0x368 | yes | 1 | bcast 1 ([+0x364] run) |
| 0x1a | 0x3f4 | yes | 1 | bcast 1 ([+0x3e8] run) |
| 0x1b | 0x960 | yes | 8 | bcast 8 |
| 0x1c | 0x94c | yes | 8 | block xmmword_A2CF810 = {8,1,1,8} [3] |
| 0x1d | 0x000 | no | 1 | - |
| 0x1e | 0x000 | no | 1 | - |
| 0x1f | 0x958 | yes | 8 | bcast 8 |
| 0x20 | 0x39c | yes | 1 | bcast 1 (shares offset 0x39c with 0x15) |
The seven 8-cycle cells are the MXU matprep/matmul/matrix-result throughput ports — classes 0x00, 0x05, 0x0b, 0x17, 0x1b, 0x1c, 0x1f. The nine 1-cycle priced cells are the vector-ALU / cross-lane / EUP result stages — classes 0x12, 0x13, 0x14, 0x15, 0x16, 0x18, 0x19, 0x1a, 0x20.
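The read path over this table can be modelled end to end. The sketch below is a hedged illustration: the grid is a synthetic 0xe00-byte buffer populated only with the priced cells from the table (not the real PerformanceJf image), the marker written at offset 0 merely stands in for the vtable pointer, and the function names are hypothetical.

```python
import struct

# offsetLUT as transcribed in the table above (class 0x00..0x20).
OFFSET_LUT = [
    0x910, 0, 0, 0, 0, 0x92C, 0, 0, 0, 0, 0, 0x92C, 0, 0, 0, 0, 0, 0,
    0x33C, 0x340, 0x344, 0x39C, 0x398, 0x954, 0x3F8, 0x368, 0x3F4,
    0x960, 0x94C, 0, 0, 0x958, 0x39C,
]
PRICED_MASK = 0x19FFC0821

def get_cycles_for_throughput(perf: bytes, cls: int) -> int:
    """Bound, then mask, then a 4-byte read at the LUT's byte offset."""
    if 0 <= cls < 0x21 and (PRICED_MASK >> cls) & 1:
        return struct.unpack_from("<i", perf, OFFSET_LUT[cls])[0]
    return 1  # default for unpriced or out-of-range classes

def make_synthetic_grid() -> bytearray:
    """Synthetic stand-in for the Performance grid (assumption, not a dump)."""
    grid = bytearray(0xE00)
    struct.pack_into("<i", grid, 0, 0x7FFFFFFF)  # marker where the vtable would sit
    for off in (0x910, 0x92C, 0x94C, 0x954, 0x958, 0x960):
        struct.pack_into("<i", grid, off, 8)
    for off in (0x33C, 0x340, 0x344, 0x368, 0x398, 0x39C, 0x3F4, 0x3F8):
        struct.pack_into("<i", grid, off, 1)
    return grid

grid = make_synthetic_grid()
print([get_cycles_for_throughput(grid, c) for c in (0x00, 0x01, 0x12, 0x1F, 0x21)])
# A read that skipped the mask would pick up the marker for class 0x01:
print(hex(struct.unpack_from("<i", grid, OFFSET_LUT[0x01])[0]))
```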
QUIRK: distinct classes alias a shared cell. The flat-cell model does not require a distinct Performance offset per class. 0x05/0x0b both read 0x92c (the transposed-bf16 latch and the plain latch land on the same throughput cell), and 0x15/0x20 both read 0x39c. The seven 8-cycle classes collapse to six distinct offsets (0x910, 0x92c, 0x94c, 0x954, 0x958, 0x960). The throughput cell is also decoupled from the resLUT resource column: nothing in the lookup forces two classes that share a cell to block the same lane, although both aliased pairs here happen to match (0x05/0x0b → R[0], 0x15/0x20 → R[5]). A reimplementation that keys a cache on the offset alone will silently merge these classes.
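The aliasing can be listed by grouping the priced classes by offset. A small sketch, with the class-to-offset pairs copied from the offsetLUT table and illustrative variable names:

```python
from collections import defaultdict

# Priced class -> Performance byte offset, copied from the offsetLUT table.
PRICED_OFFSETS = {
    0x00: 0x910, 0x05: 0x92C, 0x0B: 0x92C, 0x12: 0x33C, 0x13: 0x340,
    0x14: 0x344, 0x15: 0x39C, 0x16: 0x398, 0x17: 0x954, 0x18: 0x3F8,
    0x19: 0x368, 0x1A: 0x3F4, 0x1B: 0x960, 0x1C: 0x94C, 0x1F: 0x958,
    0x20: 0x39C,
}

by_offset = defaultdict(list)
for cls, off in sorted(PRICED_OFFSETS.items()):
    by_offset[off].append(cls)

# Offsets read by more than one class.
aliased = {off: classes for off, classes in by_offset.items() if len(classes) > 1}
print({hex(o): [hex(c) for c in cs] for o, cs in aliased.items()})
print(len(PRICED_OFFSETS), len(by_offset))  # 16 classes, 14 distinct offsets
```

Keying any per-class cache on `cls` rather than on the offset avoids merging the aliased pairs.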
The resLUT — CycleTable::GetResource / dword_B438AEC (33 × int32, byte-exact)
CycleTable::GetResource is gen-invariant (shared by all six CycleTable subclasses, not a JfCycleTable override) and is a single flat lookup:
// xla::jellyfish::CycleTable::GetResource @0x1c89ce20 (decompiled, exact)
__int64 GetResource(CycleTable *this, int cls) {
return dword_B438AEC[cls]; // resLUT @0xb438aec, 33 × int32, values 0..6
}
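The displacement check matches: the lea rcx,[rip+0xeeb9bcc3] at 0x1c89ce22 is relative to the next instruction at 0x1c89ce29, and the sign-extended displacement 0xeeb9bcc3 = -0x1146433d gives 0x1c89ce29 - 0x1146433d = 0xb438aec (equivalently, the low 32 bits of 0x1_0b438aec). The full 33 entries, transcribed from .rodata: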
resLUT @0xb438aec (cls 0x00 .. 0x20):
1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 6 4 4 3 5 5 2 6 5 6 2 2 2 2 2 5
| Class band | resLUT[cls] | ResourceVector slot |
|---|---|---|
| 0x00–0x04 | 1 | R[1] Matmul |
| 0x05–0x10 | 0 | R[0] Matpush |
| 0x11 | 6 | R[6] VectorEup |
| 0x12, 0x13 | 4 | R[4] VectorAlu1 |
| 0x14 | 3 | R[3] VectorAlu0 |
| 0x15, 0x16 | 5 | R[5] VectorAluAny |
| 0x17 | 2 | R[2] Xlu |
| 0x18 | 6 | R[6] VectorEup |
| 0x19 | 5 | R[5] VectorAluAny |
| 0x1a | 6 | R[6] VectorEup |
| 0x1b–0x1f | 2 | R[2] Xlu |
| 0x20 | 5 | R[5] VectorAluAny |
Why these values are ResourceVector slot indices
GetResource returns a raw integer, but the only consumer treats it directly as a slot index into the 23-slot ResourceVector. The cost lambda AccumulateInstructionUsage @0x144fd720 is the binding proof:
// xla::jellyfish::sdc_checker::...::AccumulateInstructionUsage::operator() @0x144fd720 (decompiled)
__int64 operator()(CycleTable &ct, ResourceVector &rv, unsigned int cls) {
unsigned int Resource = CycleTable::GetResource(&ct, cls); // resLUT[cls]
__int64 cyc = (*(vtable + 0x10))(&ct, cls); // GetCyclesForThroughput(cls)
ResourceVector::Acc(&rv, Resource, (double)cyc); // rv[Resource] += cyc
return 1;
}
and ResourceVector::Acc @0x1c89adc0 is a bounds-checked += keyed on that integer:
// xla::jellyfish::ResourceVector::Acc @0x1c89adc0 (decompiled, exact)
__int64 Acc(ResourceVector *this, unsigned int resource, double cycles) {
if ( resource >= 0x17 ) __ud1(); // hard bound 0x17 = 23 slots
this[resource] += cycles; // vaddsd [rdi + resource*8]
return resource;
}
The cmp esi, 0x17 / jae ud1 bound fixes the vector at 23 double slots; the JF/DF resLUT emits only the values 0..6, i.e. the MXU/vector head of the 23-slot accumulator. The remaining slots R[7..22] (vector load/store, the four memory-transfer terms, six ICI links, three SparseCore engines) are deposited into by other cost paths, never by this flat LUT — see the Resource Enum for the whole vector and the MaxResourceCycles overlap reduction.
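The accumulator behaviour described above can be modelled as follows. This is a sketch under stated assumptions: the Python class and method names are illustrative, and `IndexError` stands in for the `ud1` trap the binary executes on an out-of-range slot.

```python
NUM_SLOTS = 0x17  # 23 double slots, from the cmp esi, 0x17 bound

class ResourceVector:
    """Toy model of the 23-slot accumulator; not the binary's layout."""

    def __init__(self) -> None:
        self.slots = [0.0] * NUM_SLOTS

    def acc(self, resource: int, cycles: float) -> int:
        """Bounds-checked += keyed on the raw resource integer."""
        if not 0 <= resource < NUM_SLOTS:
            raise IndexError(f"resource {resource} out of range")  # binary: ud1
        self.slots[resource] += cycles
        return resource

rv = ResourceVector()
rv.acc(1, 8.0)   # e.g. class 0x00: resLUT 1 (Matmul), 8 cycles
rv.acc(1, 8.0)   # a second op in the same column adds
rv.acc(6, 1.0)   # e.g. class 0x18: resLUT 6 (VectorEup), 1 cycle
print(rv.slots[:7])
```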
The 7 JF Resource Columns — Named
The seven distinct resLUT values are not a private cost enum; they are the first seven ResourceVector::Resource ordinals. The names come straight from the printf template inside ResourceVectorToString @0x1c89bde0, which formats the 22 leading slots in physical-offset order, so the string order is the enum order. The template head (read byte-for-byte):
RV[Matpush: %.0f, Matmul: %.0f, Xlu: %.0f, VectorAlu0: %.0f, VectorAlu1: %.0f,
VectorAluAny: %.0f, VectorEup: %.0f, VectorLoad: %.0f, VectorStore: %.0f, ...]
The first seven %.0f slots map one-for-one onto the resLUT values 0..6:
| Res | ResourceVector slot | Name | JF/DF occupant classes | Throughput |
|---|---|---|---|---|
| 0 | R[0] +0x00 | Matpush | 0x05–0x10 (latch / push-gains band) | 0x05,0x0b = 8; rest dflt 1 |
| 1 | R[1] +0x08 | Matmul | 0x00–0x04 (matprep band) | 0x00 = 8; rest dflt 1 |
| 2 | R[2] +0x10 | Xlu | 0x17, 0x1b–0x1f (matrix-result / cross-lane) | 0x17,0x1b,0x1c,0x1f = 8 |
| 3 | R[3] +0x18 | VectorAlu0 | 0x14 | 1 |
| 4 | R[4] +0x20 | VectorAlu1 | 0x12, 0x13 | 1 |
| 5 | R[5] +0x28 | VectorAluAny | 0x15, 0x16, 0x19, 0x20 | 1 |
| 6 | R[6] +0x30 | VectorEup | 0x11, 0x18, 0x1a | 0x18,0x1a = 1; 0x11 dflt 1 |
This is how the JF/DF cost model expresses resource conflict: two classes mapped to the same column add (they serialize on that functional-unit lane); two classes in different columns overlap (the bundle's contribution is the per-lane max, not the sum — MaxResourceCycles). Note the column index is decoupled from the matmul/matprep opcode intuition — resLUT maps the matprep band (0x00–0x04) to R[1] Matmul and the latch band (0x05–0x10) to R[0] Matpush, which is the matmul-pipeline ordering, not a naive opcode-to-name pairing.
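The add-versus-overlap rule can be illustrated with a toy bundle. The sketch below assumes the overlap reduction is a plain maximum across columns; the real MaxResourceCycles reduction is documented elsewhere and may differ in detail, and `bundle_cost` is a hypothetical name.

```python
# resLUT as transcribed above (class 0x00..0x20).
RES_LUT = [1, 1, 1, 1, 1,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           6, 4, 4, 3, 5, 5, 2, 6, 5, 6, 2, 2, 2, 2, 2, 5]
NAMES = ["Matpush", "Matmul", "Xlu", "VectorAlu0",
         "VectorAlu1", "VectorAluAny", "VectorEup"]

def bundle_cost(ops):
    """ops: iterable of (cls, cycles). Sum within a column, max across columns."""
    columns = [0.0] * len(NAMES)
    for cls, cycles in ops:
        columns[RES_LUT[cls]] += cycles
    return max(columns), columns

# Two latch classes serialize on Matpush; the matprep and vector ops overlap.
cost, columns = bundle_cost([(0x05, 8), (0x0B, 8), (0x00, 8), (0x14, 1)])
print(cost, dict(zip(NAMES, columns)))
```

Here the two Matpush ops contribute 16 cycles to one column, while the Matmul (8) and VectorAlu0 (1) contributions hide underneath that maximum.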
NOTE: R[k] names are binding-confirmed; micro-port semantics are not. The seven names are CONFIRMED from the ResourceVectorToString template and the Acc slot index. The deeper physical mapping (which MXU sub-stage each column reserves) has no ToString for the CycleTable::Resource or CycleTable::Instruction cost enums and remains functional/INFERRED.
Producing the Outer Index — CycleTableInstruction and the two MXU LUTs
The GetCyclesForThroughput/GetResource argument is a cycle class, not an opcode. For the MXU band, the class is produced by CycleTableInstruction(LloInstruction*) @0x1c89ca80, the only LLO→class classifier in the binary. It classifies exactly two opcode bands and is fatal on anything else:
// xla::jellyfish::CycleTableInstruction @0x1c89ca80 (decompiled, exact shape)
uint32_t CycleTableInstruction(const LloInstruction *insn) {
uint32_t op = insn->opcode;
if ((uint16_t)(op - 141) <= 9) { // opcodes 0x8d..0x96 = matmul / latch band
uint8_t lm = insn->latch_mode();
if (lm >= 0x34 || !bittest64(0xF000003FFFC3F, lm))
LogFatal("Unsupported gain latch mode ", /*cycle_table.cc:431*/);
return unk_B4389F4[lm]; // latchLUT @0xb4389f4, 52 × int32
}
if ((uint16_t)(op - 155) <= 0xA) { // opcodes 0x9b..0xa5 = matprep / matpush band
uint8_t f = insn->matmul_data_format() - 1;
if (f >= 0xA)
LogFatal("Unsupported matmul data format ", /*cycle_table.cc:464*/);
return unk_B438AC4[f]; // fmtLUT @0xb438ac4, indices 0..9 (10 × int32)
}
LogFatal("Unsupported instruction ", /*cycle_table.cc:470*/);
}
The classifier covers only the MXU band (classes 0x00–0x10). The vector/EUP/matrix-result classes 0x11–0x20 are not produced here — they are emitted by the HLO-level cost model as direct ordinal immediates keyed on HLO opcode / PrimitiveType / memory-transfer role (see IARS Per TensorCore); their producing LLO opcode set is not decoded (MEDIUM).
latchLUT — unk_B4389F4 (52 × int32, valid mask 0xF000003FFFC3F)
GainLatchMode → CycleTable::Instruction, indexed by the raw latch_mode() byte. Only the modes set in the valid mask are reachable; all others LogFatal before the read. Transcribed byte-exact (only the non-zero / reachable rows shown; every other index reads 0):
| latch_mode() | latchLUT[lm] → class | latch_mode() | latchLUT[lm] → class |
|---|---|---|---|
| 0x00, 0x02, 0x04 | 5 | 0x12, 0x14, 0x16, 0x18 | 9 |
| 0x01, 0x03, 0x05 | 11 | 0x13, 0x15, 0x17, 0x19 | 15 |
| 0x0a | 12 | 0x30 | 7 |
| 0x0b, 0x0e, 0x10 | 6 | 0x31 | 13 |
| 0x0c | 9 | 0x32 | 8 |
| 0x0d | 15 | 0x33 | 14 |
| 0x0f, 0x11 | 12 | (all others, mask-rejected) | 0 |
The valid mask 0xF000003FFFC3F is checked by _bittest64 before the lookup; the four high bits (0x30–0x33, the 0xF at bit 48..51) map to classes {7, 13, 8, 14} (CONFIRMED from the .rodata dump) and are only reachable on the gens that emit those latch modes. Their attribution to a specific later generation is not byte-anchored (INFERRED).
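The latch branch of the classifier can be sketched from the mask and the table above. Names such as `classify_latch` are hypothetical, `ValueError` stands in for the binary's LogFatal, and only the reachable rows are carried in the dictionary.

```python
LATCH_VALID_MASK = 0xF000003FFFC3F

# Reachable latch_mode() -> class rows, copied from the table above.
LATCH_LUT = {}
for modes, cls in [
    ((0x00, 0x02, 0x04), 5), ((0x01, 0x03, 0x05), 11), ((0x0A,), 12),
    ((0x0B, 0x0E, 0x10), 6), ((0x0C,), 9), ((0x0D,), 15), ((0x0F, 0x11), 12),
    ((0x12, 0x14, 0x16, 0x18), 9), ((0x13, 0x15, 0x17, 0x19), 15),
    ((0x30,), 7), ((0x31,), 13), ((0x32,), 8), ((0x33,), 14),
]:
    for mode in modes:
        LATCH_LUT[mode] = cls

def classify_latch(lm: int) -> int:
    """Bound at 0x34, test the valid mask, then look up the class."""
    if lm >= 0x34 or not ((LATCH_VALID_MASK >> lm) & 1):
        raise ValueError(f"Unsupported gain latch mode {lm:#x}")
    return LATCH_LUT[lm]

valid_modes = [m for m in range(0x34) if (LATCH_VALID_MASK >> m) & 1]
print([hex(m) for m in valid_modes])
print(classify_latch(0x00), classify_latch(0x01), classify_latch(0x30))
```

The set of modes accepted by the mask (0x00..0x05, 0x0a..0x19, 0x30..0x33) coincides with the rows shown in the table, which is a useful sanity check on the transcription.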
fmtLUT - unk_B438AC4 (10 × int32, indices 0..9)
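MatmulDataFormat → CycleTable::Instruction, indexed by matmul_data_format() - 1. The JF reader validates f < 0xA, so it reads only indices 0..9, i.e. 10 entries; those 40 bytes end exactly at 0xb438aec, where resLUT begins. Transcribed byte-exact (the dword at index 10 is shown for reference only: it sits at the resLUT[0] address and the classifier never reads it):
fmtLUT @0xb438ac4 (index = matmul_data_format()-1):
index: 0 1 2 3 4 5 6 7 8 9 10
class: 0 1 1 1 4 4 4 4 2 3 1
i.e. fmt 1 → class 0, fmt 2/3/4 → class 1, fmt 5/6/7/8 → class 4, fmt 9 → class 2, fmt 10 → class 3. A format value of 11 would index slot 10 and is rejected by the f >= 0xA check. The MXU-modifier semantics (which numeric format is bf16 / fp8 / int8) are documented with the matmul mode modifiers.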
NOTE: the format enum is wider than the JF reader uses. Beyond the indices the JF check admits, additional packed-int8 / int4 format values exist in .rodata, but the JF classifier's < 0xA validity check rejects them (they are used by later-gen paths). They are flagged INFERRED for JF/DF here because the JF classifier LogFatals on them.
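The format branch can be sketched the same way, following the `f >= 0xA` check in the decompile. `classify_format` is a hypothetical name, `ValueError` stands in for LogFatal, and the `& 0xFF` models the `uint8_t` subtraction shown in the decompile.

```python
# Indices 0..9 of the dump above: the entries the f < 0xA check admits.
FMT_LUT = [0, 1, 1, 1, 4, 4, 4, 4, 2, 3]

def classify_format(fmt: int) -> int:
    """Map matmul_data_format() to a cycle class, mirroring the decompile."""
    f = (fmt - 1) & 0xFF  # uint8_t wraparound: fmt 0 becomes 0xFF
    if f >= 0xA:
        raise ValueError(f"Unsupported matmul data format {fmt}")
    return FMT_LUT[f]

print([classify_format(fmt) for fmt in range(1, 11)])
```

Note that a format value of 0 wraps to 0xFF and is rejected by the same comparison, so no separate lower-bound test is needed.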
JF vs DF — The Throughput Table Is Identical
PerformanceDf::PerformanceDf @0x1d493060 builds the full PerformanceJf image, swaps the vtable, and issues exactly one quadword store:
// platforms_deepsea::jellyfish::isa::PerformanceDf::PerformanceDf @0x1d493060 (decompiled, exact)
PerformanceJf::PerformanceJf(this, dev);
*(_QWORD *)this = off_21CC7478; // swap vtable → PerformanceDf
*((_QWORD *)this + 5) = 0xD00000042uLL; // store at this+0x28
this + 5 (qword) is Performance[+0x28]; the int64 0x0000000D_00000042 writes [+0x28] = 0x42 = 66 (down from JF's 88) and [+0x2c] = 0x0D = 13 (up from JF's 8). Neither 0x28 nor 0x2c appears in the offsetLUT (the priced offsets are {0x910, 0x92c, 0x94c, 0x954, 0x958, 0x960, 0x33c, 0x340, 0x344, 0x368, 0x398, 0x39c, 0x3f4, 0x3f8}), so GetCyclesForThroughput returns the same 16 values on JF and DF. The two changed cells are matmul / matprep base latencies consumed by LatencyTableJellyfish, not throughput cells — the entire v2→v3 cost difference lives on the latency axis (MXU Latency: JF/DF), and the throughput table on this page is byte-identical for both generations.
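The quadword split and the disjointness claim can both be checked directly. A small sketch, with the priced offset set copied from the paragraph above and illustrative variable names:

```python
import struct

DF_PATCH = 0xD00000042  # the single quadword stored at this+0x28

# A little-endian qword at +0x28 covers the int32 cells at +0x28 and +0x2c.
cell_28, cell_2c = struct.unpack("<2i", struct.pack("<Q", DF_PATCH))

PRICED_OFFSETS = {0x910, 0x92C, 0x94C, 0x954, 0x958, 0x960, 0x33C, 0x340,
                  0x344, 0x368, 0x398, 0x39C, 0x3F4, 0x3F8}
PATCHED_OFFSETS = {0x28, 0x2C}

print(cell_28, cell_2c)                   # 66 13
print(PATCHED_OFFSETS & PRICED_OFFSETS)   # set()
```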
Transcendentals — Scalar Virtual Overrides
Transcendental cost on JF/DF does not go through the cycle-class LUT. The JfCycleTable vtable carries two scalar const virtual overrides (slots +0x18 / +0x20) that return a fixed estimate regardless of operand size:
__int64 JfCycleTable::EstimateSinCosCost(...) @0x1c89dd20 { return 198; } // sin / cos
__int64 JfCycleTable::EstimateTanCost(...) @0x1c89dd40 { return 219; } // tan
These are added on top of a per-step pipeline count by the cost model's transcendental path. The values match Pufferfish (also 198 / 219) and then shrink across the later gens (VF 154 / 170, GL/GF 142 / 151) as the XLU pipeline speeds up; the full per-gen table is on Per-Opcode Cycle Constants.
Reimplementation Recipe
A faithful JfCycleTable needs only the four LUTs above plus the Performance grid:
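#include <stdint.h>

typedef struct Performance Performance;       // opaque 0xe00-byte grid
extern const int64_t offsetLUT_B438B70[33];   // class -> Performance byte offset
extern const int32_t resLUT_B438AEC[33];      // class -> ResourceVector slot

int GetCyclesForThroughput(const Performance *perf, uint32_t cls) {
    // Bound first, then mask: only the 16 priced classes reach the grid.
    if (cls < 0x21 && ((0x19FFC0821ULL >> cls) & 1))
        return *(const int32_t *)((const char *)perf + offsetLUT_B438B70[cls]);
    return 1;                                   // default
}
int GetResource(uint32_t cls) {
    // No bound check, matching the binary: callers must pass cls < 0x21.
    return resLUT_B438AEC[cls];                 // 0..6 = ResourceVector slot
}
// per-op accumulation (the only consumer):
//   resource_vector[GetResource(cls)] += (double)GetCyclesForThroughput(perf, cls);
// transcendentals bypass the LUT: sin/cos -> 198, tan -> 219.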
The four .rodata tables to embed verbatim:
| Table | Address | Shape | Maps |
|---|---|---|---|
| offsetLUT | 0xb438b70 | 33 × int64 | class → Performance byte offset (this page) |
| resLUT | 0xb438aec | 33 × int32 | class → ResourceVector slot 0..6 (this page) |
| latchLUT | 0xb4389f4 | 52 × int32 (mask 0xF000003FFFC3F) | GainLatchMode → class |
| fmtLUT | 0xb438ac4 | 10 × int32 (index fmt-1, indices 0..9) | MatmulDataFormat → class |
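A quick consistency check on the addresses in this table: sorting the three int32 tables by address and dividing each gap by 4 gives the number of entries that fit before the next table starts. The sketch assumes the tables sit back to back with no padding, which the arithmetic suggests but which is not stated elsewhere on the page.

```python
ADDRS = {
    "latchLUT": 0xB4389F4,
    "fmtLUT": 0xB438AC4,
    "resLUT": 0xB438AEC,
    "offsetLUT": 0xB438B70,
}

order = sorted(ADDRS, key=ADDRS.get)
# int32 entries that fit between each table and the one that follows it.
room = {a: (ADDRS[b] - ADDRS[a]) // 4 for a, b in zip(order, order[1:])}
print(room)
```

The gaps come out as 52 entries for latchLUT and 33 for resLUT, matching the shapes given on this page, and 10 for fmtLUT, which is consistent with the classifier's f < 0xA check.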
Build PerformanceJf per Performance: JF/DF (the 0xe00-byte POD, sentinel 0x7FFFFFFF, 419 ctor stores), apply the one DF cell patch for Dragonfish, and the throughput half is complete.
Cross-References
- CycleTable Family: the abstract base, the six-factory registry, the per-version dispatch, and the shared cycle-class enum framing.
- Performance: JF/DF: the 0xe00-byte PerformanceJf/PerformanceDf POD that this page's offsetLUT indexes into (the layout, the 419 constructor stores, the 15 .rodata source blocks, and the 2-cell DF delta).
- Per-Opcode Cycle Constants: the cross-gen throughput integers and the transcendental scalar table; the JF/DF flat-cell model in context.
- VfCycleTable: the Viperfish switch-over-GetResourceUsage path the model evolves into.
- Resource Enum: the 23-slot ResourceVector whose head R[0]..R[6] the resLUT emits, and the MaxResourceCycles overlap reduction.
- MXU Latency: JF/DF: the orthogonal latency axis; the LatencyTableJellyfish copy map and the JF→DF 88→66 / 8→13 base-latency delta.
- Matmul Mode Modifiers: the GainLatchMode/MatmulDataFormat modifiers that latchLUT/fmtLUT fold into cycle classes.
- IARS Per TensorCore: the non-MXU emitter path that produces cycle classes 0x11..0x20 as direct ordinal immediates.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)