MXU Latency: JF / DF
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5
89edbbe81c5b328a958fe628a9f2207d, not stripped, full C++ symbols). Other versions differ. Every integer below was read out of .rodata by hand and cross-checked against the IDA decompile; the verification status is in the Confidence columns.
Abstract
This page documents the oldest-generation TensorCore MXU cost model: the Jellyfish (TPU v2) and Dragonfish (TPU v3) latency/throughput tables. Unlike every later generation, JF and DF do not use a heap-allocated latency[] array plus a 2-D Instruction × Resource reservation grid (the model that begins at Pufferfish — see Performance: PF). They use a single fixed-size 0xe00-byte inline POD struct — platforms_deepsea::jellyfish::isa::Performance — whose per-instruction throughput is a flat one-cell-per-Instruction lookup, and a separate LatencyTableJellyfish whose conflict-edge model is fifteen integers copied out of that same struct.
The reimplementer's view is two tables driven by one struct:
- Throughput: JfCycleTable::GetCyclesForThroughput(Instruction) reads Performance[offsetLUT[Instruction]], gated by a 33-bit valid-instruction mask. Sixteen of the 33 CycleTable::Instruction ordinals are priced from the struct; the rest return the default 1. The MXU matprep/matmul/matrix-result ports cost 8 cycles; the vector/EUP result stages cost 1.
- Latency: LatencyTableJellyfish copies fifteen cells out of the singleton Performance into its own [+0x18..+0x50] field block. Two of those, the matmul base latency and the matprep base latency, are the only two integers DF changes relative to JF: PerformanceDf overwrites Performance[+0x28] 88→66 and [+0x2c] 8→13. Everything else in the JF and DF cost model is byte-identical.
Both tables are served by the same JfCycleTable (xla::jellyfish::JfCycleTable, vtable 0x21c1ffb8); CycleTable::Create registers one JfCycleTable for both v2 and v3. The JfCycleTable vtable carries no GetLatency slot — the latency axis lives entirely in LatencyTableJellyfish. The two halves join in the matmul-rewrite / MRB-accumulation consumers, which read both the throughput cell and the copied latency fields.
If you have read the LLVM scheduling itineraries (InstrItineraryData), the closest analog is a per-itinerary-class operand-cycle table plus a separate forwarding/bypass latency table — except here both are baked as .rodata constants and the "itinerary class" is the CycleTable::Instruction enum, an MXU-modifier expansion rather than an opcode.
| Item | Value |
|---|---|
| Throughput reader | JfCycleTable::GetCyclesForThroughput(Instruction) 0x1c89dce0 |
| Throughput formula | valid = (I < 0x21) && ((0x19FFC0821 >> I) & 1); valid ? Performance[offsetLUT[I]] : 1 |
| offsetLUT | 0xb438b70 (33 × int64, Instruction → Performance byte-offset) |
| resLUT | 0xb438aec (33 × int32, Instruction → Resource column 0..6) |
| Performance struct | 0xe00 bytes, inline POD; base sentinel 0x7fffffff; PerformanceJf ctor 0x1d4930c0, PerformanceDf ctor 0x1d493060 |
| Latency table | LatencyTableJellyfish ctor 0x1c8a0c20 — copies 15 Performance cells into [+0x18..+0x50] |
| MXU issue (throughput) | 8 cycles/op for priced matprep/matmul/matres formats |
| MXU matmul base latency | JF 88 / DF 66 (Performance[+0x28] → LatencyTable[+0x50]) |
| MXU matprep base latency | JF 8 / DF 13 (Performance[+0x2c] → LatencyTable[+0x4c]) |
| EUP push→pop edge | 4 cycles (Performance[+0x30] → LatencyTable[+0x1c]) |
| JF→DF delta | exactly 2 int32 cells (+0x28, +0x2c); throughput table byte-identical |
The Throughput Reader
GetCyclesForThroughput
JfCycleTable::GetCyclesForThroughput is a four-line function. It bounds the Instruction ordinal, tests it against a 33-bit valid mask, and on a hit indexes the Performance struct (held at JfCycleTable+0x10) by a byte-offset taken from a 33-entry int64 LUT. On a miss it returns the default of 1. This is verbatim from the decompile:
// xla::jellyfish::JfCycleTable::GetCyclesForThroughput @ 0x1c89dce0 (verified)
__int64 JfCycleTable::GetCyclesForThroughput(this, unsigned int instr) {
if ( ((instr < 0x21) & (uint8_t)(0x19FFC0821uLL >> instr)) == 1 )
return *(uint32_t*)( *(uint64_t*)(this + 0x10) // Performance*
+ qword_B438B70[instr] ); // offsetLUT[instr]
return 1; // default
}
The valid mask 0x19FFC0821 selects sixteen priced ordinals; the other seventeen short-circuit to 1. The offsetLUT slot for every unpriced ordinal is literally 0x0, which is harmless because the mask test always fails before the read is reached.
NOTE: the bound is < 0x21 (33 ordinals, 0x00..0x20) and the mask is exactly 33 bits wide. Ordinals ≥ 0x21 are never priced here; they belong to other CycleTable paths. A reimplementation must apply both the bound and the mask: relying on the offsetLUT alone would read Performance[0] (the vtable) for unpriced ordinals.
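The bound-then-mask gate can be sketched in portable C++ as below. This is a minimal sketch, not the binary's code: the names IsPriced, CyclesForThroughput and CountPriced are hypothetical, and the Performance image and offsetLUT are passed in by the caller as a raw byte buffer and a 33-entry array. Only the mask value, the bound, and the default of 1 come from the decompile.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Valid-instruction mask and bound, both taken from the decompile above.
constexpr std::uint64_t kValidMask = 0x19FFC0821ULL;
constexpr std::uint32_t kNumOrdinals = 0x21;  // 33 ordinals, 0x00..0x20

// True when the ordinal is priced from the Performance struct.
// The bound is tested first, so the shift count never reaches 64.
inline bool IsPriced(std::uint32_t instr) {
  return instr < kNumOrdinals && ((kValidMask >> instr) & 1u) != 0;
}

// Hypothetical reader: `perf` is a raw Performance image and `offset_lut`
// holds the 33 byte offsets; both are supplied by the caller.
inline std::uint32_t CyclesForThroughput(const std::vector<std::uint8_t>& perf,
                                         const std::uint64_t* offset_lut,
                                         std::uint32_t instr) {
  if (!IsPriced(instr)) return 1;  // default for unpriced or out-of-range
  std::uint32_t cell = 0;
  assert(offset_lut[instr] + sizeof cell <= perf.size());
  std::memcpy(&cell, perf.data() + offset_lut[instr], sizeof cell);
  return cell;
}

// Number of ordinals the mask selects.
inline int CountPriced() {
  int n = 0;
  for (std::uint32_t i = 0; i < kNumOrdinals; ++i) n += IsPriced(i) ? 1 : 0;
  return n;
}
```

Testing the bound before shifting also sidesteps the undefined behaviour of a 64-bit shift by 64 or more, which the x86 code gets away with only because the hardware masks the shift count.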
The Resource Column
CycleTable::GetResource is a single flat lookup — each Instruction maps to one of seven Resource columns:
// xla::jellyfish::CycleTable::GetResource @ 0x1c89ce20 (verified)
__int64 CycleTable::GetResource(this, int instr) {
return dword_B438AEC[instr]; // resLUT[instr], 33 x int32, values 0..6
}
The seven distinct values 0..6 are not a private enum — they are slot indices into the 23-slot ResourceVector. The only consumer, AccumulateInstructionUsage, calls ResourceVector::Acc(GetResource(I), (double)GetCyclesForThroughput(I)), and Acc indexes [ResourceVector + Resource*8] with a hard bound of 23:
// xla::jellyfish::ResourceVector::Acc @ 0x1c89adc0 (verified)
__int64 ResourceVector::Acc(this, unsigned int resource, double cycles) {
if ( resource >= 0x17 ) __ud1(); // bound 0x17 = 23 ResourceVector slots
this[resource] += cycles; // vaddsd [rdi + resource*8]
return resource;
}
So the seven JF/DF Resource columns are the first seven ResourceVector slots: R[0] Matpush, R[1] Matmul, R[2] Xlu, R[3] VectorAlu0, R[4] VectorAlu1, R[5] VectorAluAny, R[6] VectorEup. The JF cost model populates only the MXU/vector head of the 23-slot accumulator.
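A minimal sketch of the accumulator and the seven named columns follows. The enum and struct layout are assumptions made for illustration; only the slot names, the 23-slot bound, and the add-into-slot behaviour come from the text above. Where the binary traps with ud1 on an out-of-range index, this sketch throws instead.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// The seven JF/DF columns, named after the slot list above.
enum Resource : unsigned {
  kMatpush = 0,
  kMatmul = 1,
  kXlu = 2,
  kVectorAlu0 = 3,
  kVectorAlu1 = 4,
  kVectorAluAny = 5,
  kVectorEup = 6,
};

constexpr std::size_t kResourceVectorSlots = 23;  // bound 0x17 in Acc

// Hypothetical accumulator: 23 double slots with a hard bound on the index.
struct ResourceVector {
  std::array<double, kResourceVectorSlots> slots{};

  // Adds `cycles` to one slot and returns the slot index, as Acc does.
  unsigned Acc(unsigned resource, double cycles) {
    if (resource >= kResourceVectorSlots) throw std::out_of_range("resource");
    slots[resource] += cycles;
    return resource;
  }
};
```

On JF/DF only slots 0..6 ever receive cycles; the remaining sixteen stay at zero.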
The Integer Throughput Table
This is the full reconstruction of JfCycleTable::GetCyclesForThroughput over all 33 CycleTable::Instruction ordinals. The offsetLUT (0xb438b70) and resLUT (0xb438aec) columns were read byte-for-byte out of .rodata; the cycle value is the priced cell resolved in the reconstructed PerformanceJf image. JF and DF are identical for every cell (none of the priced offsets is +0x28 or +0x2c, the only two DF overrides).
| Instr | offsetLUT[I] | Res | ResourceVector slot | priced | JF cyc | DF cyc |
|---|---|---|---|---|---|---|
| 0x00 | 0x910 | r1 | Matmul | yes | 8 | 8 |
| 0x01 | 0x000 | r1 | Matmul | no | 1 | 1 |
| 0x02 | 0x000 | r1 | Matmul | no | 1 | 1 |
| 0x03 | 0x000 | r1 | Matmul | no | 1 | 1 |
| 0x04 | 0x000 | r1 | Matmul | no | 1 | 1 |
| 0x05 | 0x92c | r0 | Matpush | yes | 8 | 8 |
| 0x06 | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x07 | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x08 | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x09 | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x0a | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x0b | 0x92c | r0 | Matpush | yes | 8 | 8 |
| 0x0c | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x0d | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x0e | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x0f | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x10 | 0x000 | r0 | Matpush | no | 1 | 1 |
| 0x11 | 0x000 | r6 | VectorEup | no | 1 | 1 |
| 0x12 | 0x33c | r4 | VectorAlu1 | yes | 1 | 1 |
| 0x13 | 0x340 | r4 | VectorAlu1 | yes | 1 | 1 |
| 0x14 | 0x344 | r3 | VectorAlu0 | yes | 1 | 1 |
| 0x15 | 0x39c | r5 | VectorAluAny | yes | 1 | 1 |
| 0x16 | 0x398 | r5 | VectorAluAny | yes | 1 | 1 |
| 0x17 | 0x954 | r2 | Xlu | yes | 8 | 8 |
| 0x18 | 0x3f8 | r6 | VectorEup | yes | 1 | 1 |
| 0x19 | 0x368 | r5 | VectorAluAny | yes | 1 | 1 |
| 0x1a | 0x3f4 | r6 | VectorEup | yes | 1 | 1 |
| 0x1b | 0x960 | r2 | Xlu | yes | 8 | 8 |
| 0x1c | 0x94c | r2 | Xlu | yes | 8 | 8 |
| 0x1d | 0x000 | r2 | Xlu | no | 1 | 1 |
| 0x1e | 0x000 | r2 | Xlu | no | 1 | 1 |
| 0x1f | 0x958 | r2 | Xlu | yes | 8 | 8 |
| 0x20 | 0x39c | r5 | VectorAluAny | yes | 1 | 1 |
The sixteen priced ordinals are exactly {0x00, 0x05, 0x0b, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1f, 0x20}. The seven 8-cycle cells are the MXU matprep/matmul/matrix-result throughput ports (0x00, 0x05, 0x0b, 0x17, 0x1b, 0x1c, 0x1f); the nine 1-cycle priced cells are vector-ALU / EUP result stages.
QUIRK: two ordinals share one cell. Instr 0x15 and Instr 0x20 both read offsetLUT = 0x39c (VectorAluAny, value 1). The table shows the same sharing for Instr 0x05 and Instr 0x0b, which both read 0x92c (Matpush, value 8). The flat-cell model does not require distinct offsets per ordinal; reservation columns and throughput cells are decoupled.
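The table above can be transcribed as data and cross-checked against the valid mask. The sketch below is illustrative only: the array and function names are hypothetical, and it assumes an ordinal is priced exactly when its offsetLUT entry is non-zero, which holds for every row of the table. Cell values are hard-coded from the JF/DF cycle columns rather than read from a real Performance image.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// offsetLUT and resLUT columns transcribed from the table (33 entries each).
constexpr std::array<std::uint16_t, 33> kOffsetLut = {{
    0x910, 0, 0, 0, 0, 0x92c, 0, 0, 0, 0, 0, 0x92c, 0, 0, 0, 0, 0,
    0, 0x33c, 0x340, 0x344, 0x39c, 0x398, 0x954, 0x3f8, 0x368, 0x3f4,
    0x960, 0x94c, 0, 0, 0x958, 0x39c}};
constexpr std::array<std::uint8_t, 33> kResLut = {{
    1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    6, 4, 4, 3, 5, 5, 2, 6, 5, 6, 2, 2, 2, 2, 2, 5}};

// Rebuild the valid mask from the LUT (assumption: priced <=> offset != 0).
inline std::uint64_t MaskFromLut() {
  std::uint64_t mask = 0;
  for (std::size_t i = 0; i < kOffsetLut.size(); ++i)
    if (kOffsetLut[i] != 0) mask |= std::uint64_t{1} << i;
  return mask;
}

// Cell value at a priced offset, per the JF/DF cycle columns.
inline std::uint32_t CellAt(std::uint16_t offset) {
  switch (offset) {
    case 0x910: case 0x92c: case 0x94c: case 0x954: case 0x958: case 0x960:
      return 8;  // the 8-cycle cells in the table
    default:
      return 1;  // every other priced offset in the table resolves to 1
  }
}

inline std::uint32_t Cycles(std::uint32_t instr) {
  if (instr >= kOffsetLut.size() || kOffsetLut[instr] == 0) return 1;
  return CellAt(kOffsetLut[instr]);
}

// Count priced ordinals whose cell holds `cycles`.
inline int CountPricedWith(std::uint32_t cycles) {
  int n = 0;
  for (std::uint32_t i = 0; i < kOffsetLut.size(); ++i)
    if (kOffsetLut[i] != 0 && Cycles(i) == cycles) ++n;
  return n;
}
```

Rebuilding the mask from the LUT is a cheap consistency check on a hand transcription: a single mistyped row changes the result.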
The Instr 0x00..0x10 band (the MXU ordinals) is produced by the MXU classifier CycleTableInstruction (0x1c89ca80), which maps matmul opcodes 0x8d..0x96 through a GainLatchMode → Instruction LUT and matprep/matpush opcodes 0x9b..0xa5 through a MatmulDataFormat → Instruction LUT (both decoded on JfCycleTable). The non-MXU band 0x11..0x20 is emitted by the HLO-level cost model as direct ordinal immediates — see IARs Per TensorCore for the full producer breakdown.
The Performance Struct and the JF→DF Delta
Layout
Performance is a 0xe00-byte (3584-byte) inline POD with the vtable at +0x00, DeviceIdentifiers at +0x08..+0x14, and the per-Instruction cost buffer over [+0x18 .. +0xe00) (890 int32 slots, the last at +0xdfc). The base constructor Performance::Performance (0x1d492900) memset-fills the whole buffer with the sentinel 0x7fffffff (= INT_MAX, not 0xffffffff), broadcast from .rodata @0x84a2f60 (read byte-exact as 0x7fffffff). PerformanceJf (0x1d4930c0) then overwrites cells in three ways: by copying pre-baked 16-byte .rodata blocks, by broadcast fills (scalars @0x84a2b08 = 1, @0x84a2854 = 2, @0x84a2d0c = 8, all read byte-exact), and by immediate stores. In total, 116 store ops touch 419 distinct int32 slots; the other 471 stay at the sentinel.
The throughput-relevant constant blocks, read directly out of .rodata:
| .rodata block | bytes (4 × int32) | lands at | role |
|---|---|---|---|
| @0xa2c8a30 | {4, 105, 7, 92} | Performance[+0x18] | RPU producer / matres-self conflict floors |
| @0xa2dcd30 | {88, 8, 4, 1} | Performance[+0x28] | matmul base (+0x28), matprep base (+0x2c), EUP edge (+0x30) |
| @0xa2da220 | {8, 8, 8, 1} | Performance[+0x910] | MXU throughput band head (Instr 0x00) |
| @0xa2cf810 | {8, 1, 1, 8} | Performance[+0x940] | MXU throughput cell [+0x94c] (Instr 0x1c) |
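The fill-then-overwrite construction can be sketched with just the four blocks listed above. This is a partial, illustrative image: the real PerformanceJf constructor populates 419 slots, whereas this sketch populates 16. The type and function names are hypothetical, and the header slots (vtable, DeviceIdentifiers) are simply left at zero.

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kStructBytes = 0xe00;  // 3584-byte inline POD
constexpr std::size_t kCostBase = 0x18;      // first int32 cost slot
constexpr std::int32_t kSentinel = INT_MAX;  // 0x7fffffff

// Hypothetical image type: the struct viewed as int32 slots (slot = byte offset / 4).
using Image = std::vector<std::int32_t>;

// Copy one pre-baked 16-byte block (4 x int32) to a byte offset.
inline void StoreBlock(Image& img, std::size_t byte_off,
                       const std::array<std::int32_t, 4>& block) {
  assert(byte_off % 4 == 0 && byte_off + 16 <= kStructBytes);
  for (std::size_t i = 0; i < 4; ++i) img[byte_off / 4 + i] = block[i];
}

// Sentinel fill followed by the four documented blocks only.
inline Image BuildPartialJfImage() {
  Image img(kStructBytes / 4, 0);
  for (std::size_t i = kCostBase / 4; i < img.size(); ++i) img[i] = kSentinel;
  StoreBlock(img, 0x18, {{4, 105, 7, 92}});
  StoreBlock(img, 0x28, {{88, 8, 4, 1}});
  StoreBlock(img, 0x910, {{8, 8, 8, 1}});
  StoreBlock(img, 0x940, {{8, 1, 1, 8}});
  return img;
}

// Count cost slots still holding the sentinel.
inline std::size_t CountSentinels(const Image& img) {
  std::size_t n = 0;
  for (std::size_t i = kCostBase / 4; i < img.size(); ++i)
    n += (img[i] == kSentinel) ? 1 : 0;
  return n;
}
```

Using INT_MAX as the fill value means any cell a reimplementation forgets to populate produces an obviously wrong cost rather than a plausible small one.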
The DF Override
PerformanceDf is PerformanceJf plus a vtable swap and exactly one quadword store. From the decompile:
// platforms_deepsea::jellyfish::isa::PerformanceDf::PerformanceDf @ 0x1d493060 (verified)
PerformanceJf::PerformanceJf(this, dev); // build the full JF image first
*(uint64_t*)this = off_21CC7478; // swap the vtable
*((uint64_t*)this + 5) = 0xD00000042uLL; // qword store at this+0x28 (= int32[10..11])
this + 5 (qword) is Performance[+0x28]. The qword 0xD00000042 writes [+0x28] = 0x42 = 66 and [+0x2c] = 0x0D = 13. So:
| cell | JF | DF | role |
|---|---|---|---|
| Performance[+0x28] | 88 | 66 | MXU matmul base latency |
| Performance[+0x2c] | 8 | 13 | MXU matprep base latency |
Every other one of the 419 populated slots is byte-identical JF vs DF. Because neither +0x28 nor +0x2c is an offsetLUT target, GetCyclesForThroughput returns the same value on JF and DF for all 16 priced ordinals — the entire v2→v3 cost difference is these two latency-table integers.
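The single quadword store decomposes into two int32 cells as sketched below. The struct and function names are hypothetical. The split is done arithmetically so the result does not depend on the host's byte order, and it assumes the target is little-endian x86-64, where the low half of a qword sits at the lower address.

```cpp
#include <cassert>
#include <cstdint>

struct BaseLatencies {
  std::int32_t matmul_base;   // Performance[+0x28]
  std::int32_t matprep_base;  // Performance[+0x2c]
};

constexpr BaseLatencies kJfBase = {88, 8};
constexpr std::uint64_t kDfQword = 0xD00000042ULL;  // immediate from the DF ctor

// Split a qword stored at +0x28 on a little-endian target: the low 32 bits
// land at +0x28 and the high 32 bits at +0x2c.
inline BaseLatencies SplitQwordStore(std::uint64_t qword) {
  return BaseLatencies{static_cast<std::int32_t>(qword & 0xffffffffu),
                       static_cast<std::int32_t>(qword >> 32)};
}
```

Applied to the DF immediate this yields matmul base 66 and matprep base 13; every other cell keeps its JF value.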
NOTE: the v2→v3 MXU change shows up only in the two base latencies, not in throughput. Dragonfish doubles the MXU count (1→2) and raises the TensorCore clock. The throughput cell stays at 8 cycles/op; the speedup is encoded as a lower matmul base latency (88→66) with a slightly higher matprep base (8→13). A reimplementation that scales the throughput cell for DF would double-count the v3 advantage.
The LatencyTableJellyfish Copy Map
LatencyTableJellyfish is the conflict-edge / forwarding latency model. Its constructor copies fifteen cells from the singleton Performance (built once by Performance::CreateTensorCore and cached as the file-scope TpuPerformanceTable(version) table) into its own [+0x18..+0x50] field block. These fields are what LatencyTableJellyfish::LatencyBetweenInternal (0x1c8a0d60) reads as the per-edge latency floor — r15 is the LatencyTable*, not the Performance*. The copy is verbatim from the constructor decompile; the Performance source is indexed as int32[] (v2[k] = Performance[k*4]):
// xla::jellyfish::LatencyTableJellyfish::LatencyTableJellyfish @ 0x1c8a0c20 (verified)
v2 = TpuPerformanceTable(version)::table; // = Performance::CreateTensorCore(...)
LT[+0x18] = *(int32*)(v2+0x44); // Performance[+0x44] (raw byte add; same cell as v2[17])
LT[+0x1c] = v2[12]; // Performance[+0x30] ← EUP push→pop edge
LT[+0x20] = v2[9]; // Performance[+0x24]
LT[+0x24] = v2[7]; // Performance[+0x1c]
LT[+0x28] = v2[8]; // Performance[+0x20]
LT[+0x2c] = v2[597]; // Performance[+0x954] ← MXU-result throughput cell (Instr 0x17)
LT[+0x30] = v2[595]; // Performance[+0x94c] ← MXU-result throughput cell (Instr 0x1c)
LT[+0x34] = v2[580]; // Performance[+0x910] ← matprep throughput cell (Instr 0x00)
LT[+0x38] = v2[587]; // Performance[+0x92c] ← matmul throughput cell (Instr 0x05)
LT[+0x3c] = v2[456]; // Performance[+0x720] ← xpose-result cell
LT[+0x40] = v2[455]; // Performance[+0x71c] ← xpose-result B cell
LT[+0x44] = v2[262]; // Performance[+0x418]
LT[+0x48] = v2[264]; // Performance[+0x420]
LT[+0x4c] = v2[11]; // Performance[+0x2c] ← MXU matprep base (JF 8 / DF 13)
LT[+0x50] = v2[10]; // Performance[+0x28] ← MXU matmul base (JF 88 / DF 66)
The resolved values (JF / DF), with the role inferred from the LatencyBetweenInternal predicate that reads each field:
| LatencyTable off | ← Performance off | JF | DF | role |
|---|---|---|---|---|
| +0x18 | +0x44 | 1 | 1 | matres-result FIFO floor (UNVERIFIED role) |
| +0x1c | +0x30 | 4 | 4 | EUP push→pop edge |
| +0x20 | +0x24 | 92 | 92 | RPU-result floor |
| +0x24 | +0x1c | 105 | 105 | RPU op→op / matres-self conflict |
| +0x28 | +0x20 | 7 | 7 | UsesRpu producer floor |
| +0x2c | +0x954 | 8 | 8 | MXU-result cell (Instr 0x17) |
| +0x30 | +0x94c | 8 | 8 | MXU-result cell (Instr 0x1c) |
| +0x34 | +0x910 | 8 | 8 | matprep throughput cell (Instr 0x00) |
| +0x38 | +0x92c | 8 | 8 | matmul throughput cell (Instr 0x05) |
| +0x3c | +0x720 | 8 | 8 | xpose-result cell |
| +0x40 | +0x71c | 8 | 8 | xpose-result B cell |
| +0x44 | +0x418 | 8 | 8 | branch-op cell |
| +0x48 | +0x420 | 8 | 8 | branch-op cell |
| +0x4c | +0x2c | 8 | 13 | MXU matprep base |
| +0x50 | +0x28 | 88 | 66 | MXU matmul base |
Three of the first five fields (+0x20/+0x24/+0x28 ← Performance[+0x24]/[+0x1c]/[+0x20]) come from the head block @0xa2c8a30 = {4,105,7,92}. Note the 92/105/7 reorder through the copy indices; the leading 4 at Performance[+0x18] is left uncopied. +0x18 (← Performance[+0x44] = 1) is the second cell of the @0xa2c8a40 = {2,1,1,1} block, not the head block. The last two (+0x4c/+0x50) come from @0xa2dcd30 = {88,8,4,1} and are the only two the DF override touches. The version guard at the top of the constructor CHECKs tpu_version_ ∈ {kJellyfish, kDragonfish}, confirming this table serves only v2/v3.
GOTCHA: the matmul base and matprep base are swapped in field order vs struct order. Performance[+0x28] (matmul base) copies to the higher LatencyTable[+0x50], and Performance[+0x2c] (matprep base) to the lower LatencyTable[+0x4c]. A reimplementation that copies the {88,8,...} block linearly into the latency-table tail will transpose the matmul/matprep base latencies.
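One way to avoid the transposition is to drive the copy from an explicit (destination, source) offset table rather than a linear block copy. The sketch below encodes the fifteen pairs from the constructor listing. The names CopyEntry, kCopyMap, FieldIndex and CopyLatencyFields are hypothetical, and the Performance image is modelled as a plain vector of int32 slots.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CopyEntry {
  std::uint16_t lt_off;    // destination byte offset in LatencyTableJellyfish
  std::uint16_t perf_off;  // source byte offset in Performance
};

// The fifteen copies, in constructor order.
constexpr std::array<CopyEntry, 15> kCopyMap = {{
    {0x18, 0x44},  {0x1c, 0x30},  {0x20, 0x24},  {0x24, 0x1c},  {0x28, 0x20},
    {0x2c, 0x954}, {0x30, 0x94c}, {0x34, 0x910}, {0x38, 0x92c}, {0x3c, 0x720},
    {0x40, 0x71c}, {0x44, 0x418}, {0x48, 0x420}, {0x4c, 0x2c},  {0x50, 0x28},
}};

// Index of a latency-table field within the [+0x18..+0x50] block.
inline std::size_t FieldIndex(std::uint16_t lt_off) {
  return static_cast<std::size_t>(lt_off - 0x18) / 4;
}

// Copy the fifteen cells out of a Performance image held as int32 slots.
inline std::array<std::int32_t, 15> CopyLatencyFields(
    const std::vector<std::int32_t>& perf_slots) {
  std::array<std::int32_t, 15> fields{};
  for (const CopyEntry& e : kCopyMap) {
    assert(static_cast<std::size_t>(e.perf_off / 4) < perf_slots.size());
    fields[FieldIndex(e.lt_off)] = perf_slots[e.perf_off / 4];
  }
  return fields;
}
```

With the DF values in place, the field at +0x50 reads 66 and the field at +0x4c reads 13, the reverse of their order in the source block.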
The EUP Push→Pop Edge
Performance[+0x30] = 4 (the third element of @0xa2dcd30 = {88,8,4,1}) copies to LatencyTable[+0x1c]. LatencyBetweenInternal raises the edge latency to this 4 when the producer is an EUP push (opcode 0x128..0x13a) and the consumer is the 0x14e EUP-result pop (or a pseudo-EUP). The function also applies an independent hardcoded minimum floor of 4 (mov r12d, 0x4, raised to 5 on a SetIar-followed-by-indexed-load). Both 4s coincide on JF/DF. See EUP Latency Overview and EUP Per-Gen Integers for the per-gen progression 4 (JF/DF) → 7 (PF) → 6 (VF) → 13/14 (GL).
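The interaction of the two floors can be sketched as below, under loud assumptions: the function and parameter names are hypothetical, the opcode range 0x128..0x13a is treated as inclusive, the pseudo-EUP consumer case is omitted, and the two floors are combined with a plain max. The text above states the values and conditions but not the exact combining logic, so this is only one plausible reading.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kEupPushFirst = 0x128;  // assumed inclusive range start
constexpr std::uint32_t kEupPushLast = 0x13a;   // assumed inclusive range end
constexpr std::uint32_t kEupResultPop = 0x14e;
constexpr int kHardcodedFloor = 4;              // the `mov r12d, 0x4` floor

inline bool IsEupPush(std::uint32_t opcode) {
  return opcode >= kEupPushFirst && opcode <= kEupPushLast;
}

// `eup_edge` stands for LatencyTable[+0x1c]. `set_iar_then_indexed_load` is a
// hypothetical flag for the SetIar-followed-by-indexed-load case.
inline int EdgeLatencyFloor(std::uint32_t producer, std::uint32_t consumer,
                            int eup_edge, bool set_iar_then_indexed_load) {
  int floor = set_iar_then_indexed_load ? 5 : kHardcodedFloor;
  if (IsEupPush(producer) && consumer == kEupResultPop)
    floor = std::max(floor, eup_edge);  // raise to the copied EUP edge
  return floor;
}
```

On JF/DF the copied edge equals the hardcoded floor, so the EUP branch never changes the result; it would only matter with a larger edge value such as the 7 quoted for PF.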
Transcendental Estimate Costs
The JfCycleTable vtable carries two transcendental estimate slots beyond the throughput reader. Both are constant-return functions, read verbatim:
| Function | vtable slot | value |
|---|---|---|
| JfCycleTable::EstimateSinCosCost 0x1c89dd20 | +0x18 | 198 (0xc6) |
| JfCycleTable::EstimateTanCost 0x1c89dd40 | +0x20 | 219 (0xdb) |
These price the Payne–Hanek-style range-reduced transcendentals on JF/DF; DF inherits them (one JfCycleTable registration serves both gens). See EUP Payne–Hanek.
How JF Differs From Every Later Gen
The flat one-cell-per-Instruction model is unique to v2/v3. From Pufferfish onward, Performance becomes a heap latency[] array plus a 2-D GetResourceUsage(Instruction, Resource) grid, and PfCycleTable::GetCyclesForThroughput (0x1c89de60) wraps GetResourceUsage calls rather than a flat offset-LUT read. Viperfish goes further: VfCycleTable::GetCyclesForThroughput (0x1c89e2c0) dispatches into viperfish::MxuLatencyTable::GetResourceUsage (confirmed in the decompile), a FlatHashMap<Modifier, array<int,19>> reservation table keyed by {MatmulDataFormat, is_transpose, Msr} — the matprep stages there are not flat cells at all but four of nineteen MxuResource ports.
| Gen | Codename | TpuVer | throughput model | resource cols | matmul base | matprep base | EUP edge |
|---|---|---|---|---|---|---|---|
| JF | Jellyfish | v2 | flat inline POD + offset-LUT | 7 | 88 | 8 | 4 |
| DF | Dragonfish | v3 | = JF + 2 cells | 7 | 66 | 13 | 4 |
| PF | Pufferfish | v4 | heap latency[] + 2-D grid | 20 | (grid) | (grid) | 7 |
| VF | Viperfish | v5p | Modifier → array<19> reservation | 28 | 131 | reservation | 6 |
| GL | Ghostlite | v6e | Modifier → array<11> reservation | 31 | (grid) | fixed rows | 13/14 |
| GF | 6acc60406 | v7 | array<11> reservation | 31 | (grid) | fixed rows | 12 |
The systolic contract — weight-stationary 128×128 array, matpush 8×128/4×256 tiles, latch / push / matmul / matres sequence — never changes; only the cost-model encoding widens. See MXU Latency Overview for the cross-gen reservation-matrix concept and the per-gen pages MXU Latency: PF, VF, GL, GF.
Related Components
| Component | Relationship |
|---|---|
| JfCycleTable::GetCyclesForThroughput 0x1c89dce0 | the throughput reader (offsetLUT → Performance cell) |
| CycleTable::GetResource 0x1c89ce20 | the Instruction → Resource column LUT (0xb438aec) |
| ResourceVector::Acc 0x1c89adc0 | the consumer that accumulates throughput into the 23-slot vector |
| PerformanceJf ctor 0x1d4930c0 / PerformanceDf ctor 0x1d493060 | the inline-POD fill and the 2-cell DF override |
| LatencyTableJellyfish ctor 0x1c8a0c20 | the 15-field Performance → LatencyTable copy map |
| LatencyTableJellyfish::LatencyBetweenInternal 0x1c8a0d60 | the per-edge latency floor reader (EUP edge, matmul/matprep base) |
| CycleTableInstruction 0x1c89ca80 | the MXU classifier that produces Instr 0x00..0x10 |
Cross-References
- MXU Latency Overview - the MxuResource enum and the cross-gen reservation-matrix concept this page is the JF/DF specialization of.
- JfCycleTable - the full offsetLUT/resLUT byte transcription and the MXU-modifier GainLatchMode/MatmulDataFormat → Instruction classifier LUTs.
- Performance: JF / DF - the full 419-slot Performance constant source-block dump and the inline-POD layout.
- CycleTable Family - the CycleTable::Create(TpuVersion) factory that registers one JfCycleTable for v2/v3.
- Resource Enum - the 23-slot ResourceVector whose first 7 slots are the JF/DF Resource columns.
- IARs Per TensorCore - the non-MXU Instr 0x11..0x20 band producer and the per-gen register/IAR counts.
- EUP Latency Overview and EUP Per-Gen Integers - the EUP push→pop edge and the per-gen progression.
- MXU Latency: PF, VF, GL, GF - the later-gen reservation models JF/DF predate.
- MXU Slot - the MXU slot whose opcodes feed the CycleTableInstruction classifier.
- Matprep / IAR / Latch - the matprep/latch/IAR sub-slots and the per-gen matprep cost divergence.