VfCycleTable
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ. The binary is not stripped — every symbol below is a demangled C++ name.
.text VMA == file offset; .rodata VMA == file offset (section 0x84a0000); .data.rel.ro VMA − 0x200000 == file offset.
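The address mapping above can be sketched as a small helper. This is a minimal illustration, not part of the binary: the `Section` enum and `FileOffset` name are hypothetical, and which section a given address belongs to is something the reader has to determine from the section headers.

```cpp
#include <cstdint>

// Hypothetical section tags for this sketch only.
enum class Section { kText, kRodata, kDataRelRo };

// Convert a virtual address to a file offset using the mapping stated above:
// .text and .rodata map one-to-one, .data.rel.ro is shifted by 0x200000.
constexpr std::uint64_t FileOffset(Section s, std::uint64_t vma) {
  return s == Section::kDataRelRo ? vma - 0x200000 : vma;
}
```

For example, an address such as 0x21C200D8, if it lives in .data.rel.ro (an assumption), would sit at file offset 0x21A200D8, while the .text address 0x1c89e2c0 is its own file offset.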
Abstract
xla::jellyfish::VfCycleTable is the throughput half of the cost model for Viperfish (TpuVersion 3, v5e). It is the first CycleTable subclass where the family changes shape: where JfCycleTable reads its cycle numbers through a flat byte-offset LUT into one Performance grid, VfCycleTable routes the MXU classes through a second object — a viperfish::MxuLatencyTable — and only the vector/matrix-result classes touch the ViperfishPerformance grid. The single GetCyclesForThroughput body is therefore not a LUT read but a 32-arm switch that adapts each cycle class into one of three call shapes against two collaborating sub-tables. This page is the byte-level transcription of that switch: the full 32-entry cycle-class → (instr, resource, transpose) table, how each arm wraps MxuLatencyTable::GetResourceUsage or ViperfishPerformance::GetResourceUsage, and the StatusOr unwrap that surfaces the cycle integer.
The design fact to carry away: a VfCycleTable holds two sub-table pointers, not one. Its constructor (@0x1c89e120) allocates a ViperfishPerformance (new 0x30) at +0x10 and a MxuLatencyTable (new 0xA0) at +0x18, both owned. The priced arms of the matmul/matpush band (0x00–0x10) call MxuLatencyTable::GetResourceUsage(instr, res, transpose) at +0x18; the matrix-result / cross-lane classes (0x17, 0x1b–0x1d, 0x1f) call ViperfishPerformance::GetResourceUsage(instr, 0xe)+1 at +0x10. This is the structural inversion versus JF: there the matmul cycles came out of the flat Performance grid; here they come out of the per-(MatmulModifier × MxuResource) reservation matrix, and the throughput integer is the same number the MXU-latency reservation model exposes, read through a res → array-index remap rather than as a flat cell.
The contract a reimplementer must honor:
- GetCyclesForThroughput(cls) is a switch, default return 1. cls >= 0x20 and every unpriced arm return the default 1 cycle (except 0x0a / 0x10, which abort); only the matmul/matpush and matrix-result arms read a sub-table.
- The matmul/matpush "resource" argument is a resource-index remap, not a cycle seed. MxuLatencyTable::GetResourceUsage maps res 3 → array index 15 (MatmulAccA) and res 0xb → array index 0 (MatpushPushPort); any other res is InvalidArgument. The returned cycle is array[remapped_index] of the per-modifier std::array<int,19> reservation row.
- The throughput integer is array[15] (matmul) or array[0] (matpush) of the matched MXU-latency reservation row. The dispatch (res 3 → array[15], res 0xb → array[0]) is byte-confirmed, and the throughput cells are byte-anchored to the ctor (matmul array[15]={8,16,32}, matpush array[0]=4); the full per-MatmulModifier matrix is owned by the MXU-latency / Performance: VF pages (see NOTE (VF-CT-1) below).
- GetResource(cls) is not overridden by VF. It is the gen-invariant CycleTable::GetResource (@0x1c89ce20, flat LUT @0xb438aec) shared with JF and all other gens.
- Transcendentals bypass the switch via scalar virtual overrides: EstimateSinCosCost = 154, EstimateTanCost = 170.
| Class | xla::jellyfish::VfCycleTable — serves TpuVersion 3 (Viperfish v5e) |
| Throughput reader | VfCycleTable::GetCyclesForThroughput(Instruction) @0x1c89e2c0 (vtable slot +0x10) |
| Dispatch | 32-arm switch (jump table @0xb438490, 32 × i32 self-relative); cls >= 0x20 → 1 |
| ViperfishPerformance* | held at VfCycleTable+0x10 (new 0x30); read by mxres classes |
| MxuLatencyTable* | held at VfCycleTable+0x18 (new 0xA0); read by matmul/matpush classes |
| MXU read | viperfish::MxuLatencyTable::GetResourceUsage(instr, res, transpose) @0x1c8ae5c0 → StatusOr<int> |
| mxres read | viperfish::ViperfishPerformance::GetResourceUsage(instr, 0xe) @0x1c8cbc40, +1 |
| res→index remap | res 3 → array[15] (MatmulAccA), res 0xb → array[0] (MatpushPushPort); else InvalidArgument |
| Reservation array | std::array<int,19> — bound index >= 0x13 → BUG() |
| Resource reader | CycleTable::GetResource(Instruction) @0x1c89ce20 (gen-invariant; no VF override) |
| Transcendentals | EstimateSinCosCost @0x1c89e480 = 154; EstimateTanCost @0x1c89e4a0 = 170 |
| Confidence | CONFIRMED (byte-anchored, decompile-verified) unless a row says otherwise |
Object Layout — Two Owned Sub-Tables
The VfCycleTable constructor (@0x1c89e120) is the clearest statement of the v4→v5 shape change. It stores the vtable and the Target*, then news and owns two distinct sub-tables:
// xla::jellyfish::VfCycleTable::VfCycleTable(const Target&) @0x1c89e120 (decompiled, exact)
void VfCycleTable(VfCycleTable *this, const Target *target) {
*((_QWORD *)this + 1) = target; // this+0x08 = Target*
*(_QWORD *)this = off_21C200D8; // this+0x00 = vtable
memset(this + 0x10, 0, 0x10); // zero the two sub-table ptrs
auto *perf = (ViperfishPerformance *)operator new(0x30);
ViperfishPerformance::ViperfishPerformance(perf);
swap_and_free(this + 0x10, perf); // this+0x10 = ViperfishPerformance*
auto *mxu = (MxuLatencyTable *)operator new(0xA0);
MxuLatencyTable::MxuLatencyTable(mxu);
swap_and_free(this + 0x18, mxu); // this+0x18 = MxuLatencyTable*
}
| Offset | Field | Owner of | Read by |
|---|---|---|---|
| +0x00 | vtable off_21C200D8 | — | virtual dispatch |
| +0x08 | const Target* | borrowed | device-detail accessors |
| +0x10 | ViperfishPerformance* (new 0x30) | owned (default_delete) | the mxres classes 0x17, 0x1b–0x1d, 0x1f |
| +0x18 | MxuLatencyTable* (new 0xA0) | owned (~MxuLatencyTable + free) | the matmul/matpush classes 0x00–0x10 |
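The layout table can be mirrored as a plain struct whose offsets are checked at compile time. This is a sketch under assumptions: the struct and field names are hypothetical, the sub-table pointers are shown as raw `void*` rather than the owning smart pointers the real class uses, and a 64-bit target is assumed.

```cpp
#include <cstddef>

// Hypothetical mirror of the documented object layout (names illustrative).
struct VfCycleTableLayout {
  const void* vtable;  // +0x00
  const void* target;  // +0x08, borrowed Target*
  void* perf;          // +0x10, owned ViperfishPerformance (0x30-byte allocation)
  void* mxu;           // +0x18, owned MxuLatencyTable (0xA0-byte allocation)
};

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");
static_assert(offsetof(VfCycleTableLayout, target) == 0x08, "Target* slot");
static_assert(offsetof(VfCycleTableLayout, perf) == 0x10, "ViperfishPerformance* slot");
static_assert(offsetof(VfCycleTableLayout, mxu) == 0x18, "MxuLatencyTable* slot");
```

A reimplementation would normally hold the two sub-tables through owning pointers; the raw pointers here only pin down the offsets.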
QUIRK — the matmul cycle no longer lives in the Performance grid. A reimplementer porting from JF will look for the MXU throughput cells inside ViperfishPerformance and not find them as throughput. On VF the matmul/matpush throughput is the array[15] / array[0] reservation cell of the separate MxuLatencyTable at +0x18. The ViperfishPerformance grid at +0x10 is consulted only for the matrix-result / cross-lane (Xlu) classes. Two objects, two read strategies, one GetCyclesForThroughput.
The Throughput Read Path — GetCyclesForThroughput
VfCycleTable::GetCyclesForThroughput (@0x1c89e2c0) is a switch over the cycle-class ordinal (cls < 0x20 via the jump table @0xb438490; anything else falls to default and returns 1). Each priced arm dispatches into one of three shapes:
- (a) MXU — MxuLatencyTable::GetResourceUsage(instr, res, transpose) at +0x18. Returns an absl::StatusOr<int>; the cycle is the payload.
- (b) mxres / Xlu — ViperfishPerformance::GetResourceUsage(instr, 0xe) at +0x10, plus 1.
- (c) default / fatal — return 1, or LogMessageFatal("Unsupported PushGainsS4.") for classes 0x0a / 0x10.
The decompiled body, normalized to the three shapes (hex class ordinals; the decompile prints them decimal):
// xla::jellyfish::VfCycleTable::GetCyclesForThroughput(Instruction cls) @0x1c89e2c0
// (decompiled, exact dispatch; instr/res shown in hex)
int64 GetCyclesForThroughput(VfCycleTable *this, int cls) {
switch (cls) {
// ---- (a) matmul: MxuLatencyTable @ this+0x18, res 3 -> array[15] = MatmulAccA ----
case 0x00: return mxu_thru(this, 0xd4, /*res*/3, /*xpose*/0); // matmul' rate = THRU(CT0)
case 0x01: return mxu_thru(this, 0xda, 3, 0); // matmul'' rate
case 0x04: return mxu_thru(this, 0xe6, 3, 0); // matmul base rate
// ---- (a) matpush: MxuLatencyTable @ this+0x18, res 0xb -> array[0] = MatpushPushPort ----
case 0x05: return mxu_thru(this, 0x10b, /*res*/0xb, 0); // matpush rate
case 0x06: return mxu_thru(this, 0x10f, 0xb, 0); // matpush' rate
case 0x09: return mxu_thru(this, 0x115, 0xb, 0); // matpush'' rate
case 0x0b: return mxu_thru(this, 0x10b, 0xb, /*xpose*/1); // matpush rate (transposed)
case 0x0c: return mxu_thru(this, 0x10f, 0xb, 1); // matpush' rate (transposed)
case 0x0f: return mxu_thru(this, 0x115, 0xb, 1); // matpush'' rate (transposed)
// ---- (c) fatal ----
case 0x0a:
case 0x10:
LogFatal("Unsupported PushGainsS4.", /*cycle_table.cc:682*/);
// ---- (b) mxres / Xlu: ViperfishPerformance @ this+0x10, res 0xe, +1 ----
case 0x17: return VfPerf_GetResourceUsage(this, 0x128, 0xe) + 1; // mxres-class
case 0x1b: return VfPerf_GetResourceUsage(this, 0x12e, 0xe) + 1; // reduce-window Xlu
case 0x1c: return VfPerf_GetResourceUsage(this, 0x11f, 0xe) + 1; // THE conv MXRES/Xlu
case 0x1d: return VfPerf_GetResourceUsage(this, 0x123, 0xe) + 1;
case 0x1f: return VfPerf_GetResourceUsage(this, 0x12a, 0xe) + 1;
// ---- (c) default ----
default: return 1; // 0x02,0x03,0x07,0x08,0x0d,0x0e,0x11-0x16,0x18-0x1a,0x1e, cls>=0x20
}
}
mxu_thru wraps the StatusOr unwrap that follows each MxuLatencyTable::GetResourceUsage call in the decompile: the call writes a {status_qword, int_payload} pair into a stack temporary; the reader tests status != 1 (status 1 is the OK sentinel) and on a bad status sets the code 55 and calls absl::internal_statusor::ThrowBadStatusOrAccess. On the OK path it returns the int payload.
// the StatusOr<int> unwrap repeated after every MxuLatencyTable::GetResourceUsage (decompiled, exact)
int mxu_thru(VfCycleTable *this, uint16_t instr, int res, bool xpose) {
StatusOrPair tmp; // {qword status, dword int}
MxuLatencyTable::GetResourceUsage(&tmp, *(MxuLatencyTable**)(this + 0x18), instr, res, xpose);
if (tmp.status != 1) // status 1 == OK
ThrowBadStatusOrAccess(/*code*/55); // bad-status path
return tmp.int_payload;
}
NOTE — status sentinel 1 is OK, not failure. The if (v12 != 1) branch in the decompile is the bad-status branch, because GetResourceUsage writes *(_QWORD*)result = 1 to mark a valid StatusOr payload. A reimplementer who reads != 1 as "ok" inverts the unwrap. In practice the bad path is unreachable for the priced classes (the modifier maps always contain the keys these arms build); the throw exists to surface a missing reservation as a hard failure rather than a silent zero.
GOTCHA — the +1 is on the mxres arms only. The matrix-result / cross-lane classes (0x17, 0x1b–0x1d, 0x1f) return ViperfishPerformance::GetResourceUsage(instr, 0xe) + 1, that is, the grid cell plus one. The matmul/matpush arms have no +1; they return the reservation cell verbatim. A reimplementation that factors the +1 out to a common tail will over-count every MXU class by one cycle. (The +1 is the issue-slot overhead the mxres path folds into the throughput; the MXU reservation already includes its own occupancy.)
The 32-Entry Cycle-Class Table
The full jump table (@0xb438490, 32 × i32 self-relative) decoded to (handler, call, role). Class ordinals are CycleTable::Instruction, the dense 0x00..0x20 enum shared across all gens (see CycleTable Family); the role labels are the gen-stable class roles. Every row below is verified against the decompile at @0x1c89e2c0.
| CT | handler | dispatch | role / value |
|---|---|---|---|
| 0x00 | 1c89e2e5 | MxuLat.GetResourceUsage(0xd4, res3, false) | matmul' rate = array[15] = THRU(CT0) |
| 0x01 | 1c89e380 | MxuLat.GetResourceUsage(0xda, res3, false) | matmul'' rate = array[15] |
| 0x02 | 1c89e310 | return 1 | matprep default |
| 0x03 | 1c89e310 | return 1 | matprep default |
| 0x04 | 1c89e363 | MxuLat.GetResourceUsage(0xe6, res3, false) | matmul base rate = array[15] |
| 0x05 | 1c89e394 | MxuLat.GetResourceUsage(0x10b, res0xb, false) | matpush rate = array[0] |
| 0x06 | 1c89e3a3 | MxuLat.GetResourceUsage(0x10f, res0xb, false) | matpush' rate = array[0] |
| 0x07 | 1c89e310 | return 1 | default |
| 0x08 | 1c89e310 | return 1 | default |
| 0x09 | 1c89e325 | MxuLat.GetResourceUsage(0x115, res0xb, false) | matpush'' rate = array[0] |
| 0x0a | 1c89e42c | LogMessageFatal (cycle_table.cc:682) | unsupported (PushGainsS4 abort) |
| 0x0b | 1c89e354 | MxuLat.GetResourceUsage(0x10b, res0xb, TRUE) | matpush rate (transposed) |
| 0x0c | 1c89e3d1 | MxuLat.GetResourceUsage(0x10f, res0xb, TRUE) | matpush' rate (transposed) |
| 0x0d | 1c89e310 | return 1 | default |
| 0x0e | 1c89e310 | return 1 | default |
| 0x0f | 1c89e334 | MxuLat.GetResourceUsage(0x115, res0xb, TRUE) | matpush'' rate (transposed) |
| 0x10 | 1c89e42c | LogMessageFatal | unsupported (PushGainsS4t abort) |
| 0x11 | 1c89e310 | return 1 | vector µop default |
| 0x12 | 1c89e310 | return 1 | default |
| 0x13 | 1c89e310 | return 1 | default |
| 0x14 | 1c89e310 | return 1 | default |
| 0x15 | 1c89e310 | return 1 | default |
| 0x16 | 1c89e310 | return 1 | default (the reduce-window R[5] CT 0x16) |
| 0x17 | 1c89e346 | VfPerf.GetResourceUsage(0x128, res0xe) + 1 | mxres-class throughput |
| 0x18 | 1c89e310 | return 1 | default |
| 0x19 | 1c89e310 | return 1 | default |
| 0x1a | 1c89e310 | return 1 | default |
| 0x1b | 1c89e410 | VfPerf.GetResourceUsage(0x12e, res0xe) + 1 | reduce-window Xlu (CT 0x1b) |
| 0x1c | 1c89e317 | VfPerf.GetResourceUsage(0x11f, res0xe) + 1 | THE conv MXRES/Xlu (CT 0x1c) |
| 0x1d | 1c89e405 | VfPerf.GetResourceUsage(0x123, res0xe) + 1 | mxres-class |
| 0x1e | 1c89e310 | return 1 | default |
| 0x1f | 1c89e372 | VfPerf.GetResourceUsage(0x12a, res0xe) + 1 | mxres-class |
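Decoding a self-relative jump table is simple arithmetic, sketched below. The assumption here, labeled in the code, is that each i32 entry is an offset from the table base, which is the usual shape for position-independent switch tables; the helper names are hypothetical, and the entry values are derived from the handler addresses in the table rather than read from the binary.

```cpp
#include <cstdint>

constexpr std::uint64_t kJumpTableVma = 0xb438490;  // table base from the text

// Assumption: handler = table base + sign-extended i32 entry.
constexpr std::uint64_t HandlerVma(std::int32_t entry) {
  return kJumpTableVma +
         static_cast<std::uint64_t>(static_cast<std::int64_t>(entry));
}

// Inverse: the entry a documented handler address would need under that
// assumption. Useful for cross-checking raw table bytes against the table.
constexpr std::int32_t EntryFor(std::uint64_t handler_vma) {
  return static_cast<std::int32_t>(
      static_cast<std::int64_t>(handler_vma - kJumpTableVma));
}
```

Under this assumption the shared default handler 1c89e310 would appear in the table as the same entry value at every unpriced ordinal, which gives a quick visual check when dumping the 32 entries.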
QUIRK — 0x0a / 0x10 are fatal on VF but not on later gens. The PushGainsS4 / PushGainsS4t classes hit LogMessageFatal("Unsupported PushGainsS4.", cycle_table.cc:682) on Viperfish. The Ghostlite (GlcCycleTable) helper maps the same ordinals to a fourth matpush variant (instr 0x166) and the 6acc60406 (GfcCycleTable) helper turns them into a recoverable error string ("Unsupported Matrix Operand type:") instead of a hard abort. A reimplementation that ports the VF fatal verbatim to a v6 table will crash on a class those gens legitimately price.
The 14 priced arms fall into three bands: matmul (0x00/0x01/0x04), matpush forward + transposed (0x05/0x06/0x09/0x0b/0x0c/0x0f), and matrix-result/Xlu (0x17/0x1b/0x1c/0x1d/0x1f). Two more arms (0x0a/0x10) are fatal, and the remaining 16 (including the entire vector-µop band 0x11–0x16 and 0x18–0x1a) return the default 1. The conv cost emitter pulls exactly three of the priced arms: thru(CT 0) (the matmul rate), thru(CT 5) (the matpush rate), and thru(CT 0x1c) (the Xlu / matrix-result-read rate).
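Tallying the arms of the table is easy to get wrong by hand, so here is a small compile-time classifier that reproduces the per-shape counts. The `Arm` enum and function names are hypothetical; the case lists are copied from the table above.

```cpp
#include <cstdint>

enum class Arm { kMxu, kMxres, kFatal, kDefault };

// Classify a cycle-class ordinal into the call shape its switch arm uses.
constexpr Arm Classify(std::uint32_t cls) {
  switch (cls) {
    case 0x00: case 0x01: case 0x04:  // matmul
    case 0x05: case 0x06: case 0x09:  // matpush
    case 0x0b: case 0x0c: case 0x0f:  // matpush, transposed
      return Arm::kMxu;
    case 0x17: case 0x1b: case 0x1c: case 0x1d: case 0x1f:
      return Arm::kMxres;             // ViperfishPerformance read, +1
    case 0x0a: case 0x10:
      return Arm::kFatal;             // PushGainsS4 abort
    default:
      return Arm::kDefault;           // returns 1
  }
}

// Count how many of the 32 jump-table ordinals use a given shape.
constexpr int Count(Arm want) {
  int n = 0;
  for (std::uint32_t cls = 0; cls < 0x20; ++cls)
    if (Classify(cls) == want) ++n;
  return n;
}
```

With these case lists the split is 9 MXU arms, 5 mxres arms, 2 fatal arms, and 16 default arms, which sums to the 32 jump-table entries.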
The Matmul/Matpush Bridge — MxuLatencyTable::GetResourceUsage
The matmul/matpush arms are the architectural heart of the page: they show that the VF throughput number is the same integer the MXU-latency reservation model exposes. viperfish::MxuLatencyTable::GetResourceUsage(instr, res, transpose) (@0x1c8ae5c0) is not a flat lookup keyed on res. The res argument only selects which cell of the per-modifier reservation array to return (3 → index 15, 0xb → index 0); the row itself is found through the modifier key built from instr and transpose:
// xla::viperfish::MxuLatencyTable::GetResourceUsage(Instruction instr, Resource res, bool xpose)
// @0x1c8ae5c0 (decompiled, exact dispatch)
StatusOr<int> GetResourceUsage(MxuLatencyTable *this, uint16_t instr, int res, uint8_t xpose) {
uint8_t array_index;
if (res == 3) array_index = 15; // res 3 -> MatmulAccA
else if (res == 0xb) array_index = 0; // res 0xb -> MatpushPushPort
else return InvalidArgument("Unsupported kind of resource", // mxu_latency_table_vf.cc
/*line 547*/);
Modifier key;
switch (instr) {
// --- matmul: MatmulModifier{[0] = format}, find in map @ this+0x20 ---
case 0xd4: key = MatmulModifier{ .fmt = 1 }; break; // bf16
case 0xda: key = MatmulModifier{ .fmt = 2 }; break;
case 0xe6: key = MatmulModifier{ .fmt = 6 }; break; // int8 / x8
// --- matpush: MatpushModifier via GLM transform, find in map @ this+0x00 ---
case 0x10b: key = MatpushModifier(GLMToFormat(xpose), LatchModeIsTranspose(xpose),
LatchOpcodeToMsr(0x8F)); break; // direct GLM
case 0x10f: key = MatpushModifier(GLMToFormat(xpose ^ 0xB), ...); break; // XOR 0xb
case 0x115: key = MatpushModifier(GLMToFormat(xpose | 0x14), ...); break; // OR 0x14
default: LogFatal("Unsupported opcode", /*mxu_latency_table_vf.cc:578*/);
}
array<int,19> row;
if (!map.find(key, &row)) ThrowStdOutOfRange("raw_hash_map<>::at");
if (array_index >= 0x13) BUG(); // array<int,19> bound
return StatusOr<int>{ ok, row[array_index] }; // <-- the throughput cycle
}
The matmul map lives at this+0x20, keyed by MatmulModifier{format}; the matpush map at this+0x00, keyed by MatpushModifier after the per-instr GainLatchMode transform (0x10b direct, 0x10f XOR 0xb, 0x115 OR 0x14) and LatchOpcodeToMsr(0x8F). The returned cycle is row[array_index] — row[15] (MatmulAccA) for matmul, row[0] (MatpushPushPort) for matpush.
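The per-instr transform applied to the transpose flag before the matpush key is built can be written out directly. This sketch only reproduces the three bit operations named above; the function name is hypothetical, the meaning of the resulting GainLatchMode values is not modeled, and the fatal default of the original is replaced by an empty optional.

```cpp
#include <cstdint>
#include <optional>

// Value handed to the GLM helpers for each matpush opcode, per the text:
// 0x10b passes the flag directly, 0x10f XORs it with 0xb, 0x115 ORs in 0x14.
constexpr std::optional<std::uint8_t> LatchModeArg(std::uint16_t instr,
                                                   bool xpose) {
  const std::uint8_t x = xpose ? 1 : 0;
  switch (instr) {
    case 0x10b: return x;
    case 0x10f: return static_cast<std::uint8_t>(x ^ 0x0b);
    case 0x115: return static_cast<std::uint8_t>(x | 0x14);
    default:    return std::nullopt;  // the original aborts here instead
  }
}
```

For the six priced matpush arms this yields 0x00, 0x0b, 0x14 for the forward classes and 0x01, 0x0a, 0x15 for the transposed ones.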
The throughput integers — the same numbers as the reservation matrix
The route is byte-confirmed: GetResourceUsage returns row[15] (matmul, MatmulAccA) or row[0] (matpush, MatpushPushPort) of the per-MatmulModifier / per-MatpushModifier reservation array<int,19> it finds in the MXU-latency maps. The value in each cell is per-MatmulDataFormat:
| thru(CT 0) = matmul rate (array[15] = MatmulAccA) | value |
|---|---|
| bf16 (instr 0xd4, fmt 1) | 8 |
| fmt 2 (instr 0xda) | 16 |
| int8 / x8 (instr 0xe6, fmt 6) | 32 |

| thru(CT 5) = matpush rate (array[0] = MatpushPushPort) | value |
|---|---|
| all formats (standard) | 4 |
| f32, no-transpose (half-cost) | 2 |
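For a quick cost-model stub that does not yet embed the full reservation matrix, the two tables above reduce to a pair of lookups. This is a simplification under stated assumptions: the function names are hypothetical, and the matpush selector is collapsed to a single boolean because only the f32 no-transpose case is documented as differing.

```cpp
#include <cstdint>
#include <optional>

// Matmul throughput (array[15] = MatmulAccA) keyed by the opcode the
// switch arm passes; values are the ones listed in the table above.
constexpr std::optional<int> MatmulThroughput(std::uint16_t instr) {
  switch (instr) {
    case 0xd4: return 8;   // fmt 1, bf16
    case 0xda: return 16;  // fmt 2
    case 0xe6: return 32;  // fmt 6, int8 / x8
    default:   return std::nullopt;
  }
}

// Matpush throughput (array[0] = MatpushPushPort): flat 4, except the
// f32 no-transpose half-cost case.
constexpr int MatpushThroughput(bool f32_no_transpose) {
  return f32_no_transpose ? 2 : 4;
}
```

A full reimplementation should still read these cells out of the reservation rows, since the stub discards the other 18 cells of each row.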
NOTE (VF-CT-1) — the cell magnitudes are byte-anchored. The reservation rows are built as {Modifier, array<int,19>} pairs insert_range'd by the MxuLatencyTable constructor (@0x1c8a52c0), not a flat rodata table — but the relevant cells were traced through the ctor and confirmed: the matmul array[15] (MatmulAccA) entries are written 15 → 8 (fmt 1, ctor L1466/1467), 15 → 16 (fmt 2, L1946/1947), and 15 → 32 (fmt 6, L2855/2856 and L3287/3288), so thru(CT 0) = {8,16,32} per MatmulDataFormat is CONFIRMED. The same {8,16,32} triple is mirrored in the ViperfishPerformance grid column r3 (see Performance: VF); the throughput route here consumes the MxuLatencyTable copy via res 3 → array[15]. The matpush array[0] (MatpushPushPort, read by thru(CT 5)) is a flat 4 for every format (ctor L683-688 / L722-726 build {key0→4, …}), dropping to 2 only for the f32-no-transpose half-cost case (L650-655) — i.e. the matpush reservation is the flat {4,3,2} row, not a {2,4,8} per-format ramp. The byte-confirmed dispatch this page owns is thru(CT 0) reads row[15], thru(CT 5) reads row[0]; the full per-MatmulModifier matrix lives on MXU-latency: VF.
GOTCHA — res 3 does not index slot 3. The res argument names the concept (matmul-accumulate vs matpush-push-port), and GetResourceUsage translates it to the concrete sub-port column (3 → 15, 0xb → 0). A reimplementation that uses res as a direct array index reads the wrong column (row[3] and row[11] instead of row[15] / row[0]) and silently returns a different reservation cell. The only two legal res values into this function are 3 and 0xb; anything else is InvalidArgument.
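The remap pitfall can be made concrete with a toy row. In this sketch the names are hypothetical, the error path is an empty optional rather than an InvalidArgument status, and the demo row is mostly placeholder: only cell 15 (the bf16 matmul value 8) comes from the text, and every other cell is set to -1 to mark it as unknown.

```cpp
#include <array>
#include <cstddef>
#include <optional>

using ReservationRow = std::array<int, 19>;

// res -> array index remap: 3 -> 15 (MatmulAccA), 0xb -> 0 (MatpushPushPort).
constexpr std::optional<std::size_t> RemapResource(int res) {
  if (res == 3) return 15;
  if (res == 0xb) return 0;
  return std::nullopt;  // the original returns InvalidArgument here
}

// Read the throughput cell through the remap, never through res directly.
constexpr std::optional<int> ReadCell(const ReservationRow& row, int res) {
  const auto idx = RemapResource(res);
  if (!idx) return std::nullopt;
  return row[*idx];
}

// Demo row: cell 15 is the documented bf16 value; all others are
// placeholders (-1), not real reservation data.
constexpr ReservationRow MakeDemoRow() {
  ReservationRow row{};
  for (auto& cell : row) cell = -1;
  row[15] = 8;
  return row;
}
```

Reading `MakeDemoRow()[3]` directly gives the placeholder, while `ReadCell(MakeDemoRow(), 3)` gives 8, which is the difference the GOTCHA describes.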
The Resource Side — CycleTable::GetResource (Gen-Invariant)
VfCycleTable does not override GetResource. The decompile contains no VfCycleTable::GetResource symbol; the resource lane for a class comes from the gen-invariant CycleTable::GetResource (@0x1c89ce20), a flat LUT identical to the one JF uses:
// xla::jellyfish::CycleTable::GetResource(Instruction cls) @0x1c89ce20 (decompiled, exact, shared)
int GetResource(CycleTable *this, int cls) {
return dword_B438AEC[cls]; // resLUT @0xb438aec, 33 × int32, values 0..6
}
The cost lambda then deposits ResourceVector[GetResource(cls)] += (double)GetCyclesForThroughput(cls) — exactly as on JF. The resLUT maps the matmul band 0x00–0x04 to R[1] Matmul, the latch band 0x05–0x10 to R[0] Matpush, and the matrix-result / cross-lane classes (0x17, 0x1b–0x1f) to R[2] Xlu. So the VF switch prices the cycle differently from JF (two sub-tables instead of one flat grid), but it places the cycle into the identical 23-slot ResourceVector lane. The throughput-pricing logic is gen-private; the resource-conflict bookkeeping is shared.
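The deposit step can be sketched as follows. Assumptions are marked in the code: the lane map covers only the three bands named in the paragraph above and returns an empty optional elsewhere (the real resLUT has an entry for every class), and the function names are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <optional>

using ResourceVector = std::array<double, 23>;

// Partial stand-in for the resLUT: only the bands described in the text.
constexpr std::optional<int> LaneFor(std::uint32_t cls) {
  if (cls <= 0x04) return 1;  // R[1] Matmul
  if (cls <= 0x10) return 0;  // R[0] Matpush (latch band)
  if (cls == 0x17 || (cls >= 0x1b && cls <= 0x1f)) return 2;  // R[2] Xlu
  return std::nullopt;        // lane not described on this page
}

// ResourceVector[GetResource(cls)] += (double)GetCyclesForThroughput(cls),
// written as a pure function so it can be checked at compile time.
constexpr ResourceVector Deposit(ResourceVector rv, std::uint32_t cls,
                                 int cycles) {
  if (const auto lane = LaneFor(cls)) rv[*lane] += static_cast<double>(cycles);
  return rv;
}
```

Depositing the bf16 matmul rate for class 0x00 adds 8 to slot 1, and a subsequent matpush deposit for class 0x05 adds 4 to slot 0 without touching the matmul lane.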
NOTE — the MxuLatencyTable array<int,19> and the ResourceVector (23 slots) are different vectors. The 19-int reservation row indexed by MatmulAccA / MatpushPushPort is the MXU-latency intra-op micro-port array (one int per MXU sub-pipeline port). The 23-slot ResourceVector is the bundle-issue accumulator the scheduler reduces with MaxResourceCycles. GetResourceUsage reads the former to produce a number; GetResource names a slot in the latter to deposit that number. Do not conflate the two index spaces.
Transcendentals — Scalar Virtual Overrides
Transcendental cost on VF bypasses the switch entirely. The VfCycleTable vtable carries two scalar const virtual overrides returning a fixed estimate independent of operand size:
int64 VfCycleTable::EstimateSinCosCost(...) @0x1c89e480 { return 154; } // sin / cos
int64 VfCycleTable::EstimateTanCost(...) @0x1c89e4a0 { return 170; } // tan
These are the Viperfish entries in the cross-gen transcendental table (JF/PF = 198/219, VF = 154/170, GL/GF = 142/151) — the XLU pipeline speeding up across generations. The full per-gen table is on Per-Opcode Cycle Constants.
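The cross-gen constants quoted above fit in a small lookup, sketched here with hypothetical enum and struct names; the generation abbreviations follow the parenthetical in the text.

```cpp
// Generations as abbreviated in the text (names are illustrative).
enum class Gen { kJF, kPF, kVF, kGL, kGF };

struct TranscendentalCost {
  int sin_cos;  // EstimateSinCosCost
  int tan;      // EstimateTanCost
};

// Fixed per-generation estimates, independent of operand size.
constexpr TranscendentalCost CostFor(Gen g) {
  switch (g) {
    case Gen::kJF: case Gen::kPF: return {198, 219};
    case Gen::kVF:                return {154, 170};
    case Gen::kGL: case Gen::kGF: return {142, 151};
  }
  return {0, 0};  // not reached for valid enumerators
}
```

On VF the tan estimate exceeds the sin/cos estimate by 16 cycles (170 versus 154).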
VF vs JF — The Shape Change
The two tables answer the same two questions (GetCyclesForThroughput, GetResource) and feed the same ResourceVector, but they read entirely differently:
| Axis | JfCycleTable (TpuVersion 0/1) | VfCycleTable (TpuVersion 3) |
|---|---|---|
| Sub-tables held | one Performance* at +0x10 | ViperfishPerformance* at +0x10 and MxuLatencyTable* at +0x18 |
| GetCyclesForThroughput form | flat offsetLUT[cls] byte-offset read | 32-arm switch over three call shapes |
| MXU matmul/matpush cycle | a flat cell in the Performance grid (8/1) | MxuLatencyTable::GetResourceUsage reservation cell via res→index remap |
| matrix-result / Xlu cycle | a flat cell in the Performance grid | ViperfishPerformance::GetResourceUsage(instr, 0xe) + 1 |
| priced-class selector | a literal 33-bit mask 0x19FFC0821 | the switch arms themselves |
| unsupported class | falls to default 1 | 0x0a/0x10 are LogMessageFatal |
| GetResource | shared CycleTable::GetResource | shared CycleTable::GetResource (no override) |
| transcendentals | 198 / 219 | 154 / 170 |
| matmul rate source | flat cell in Performance grid | array[15] of the matched MxuLatencyTable reservation row |
What changed is the route — VF moved the matmul throughput out of the flat grid and into the per-MatmulModifier reservation matrix (read at array[15]) so it can vary by MatmulDataFormat without a separate flat cell per format. The concrete per-format magnitudes (array[15]={8,16,32}, matpush array[0]=4) are byte-anchored to the ctor and cross-owned with MXU-latency: VF (see NOTE (VF-CT-1)). Pufferfish (TpuVersion 2) is the intermediate form that first introduced the switch-over-GetResourceUsage shape; VF refines it with the MxuLatencyTable split. The Ghostlite (GlcCycleTable, res4→3/res9→9) and 6acc60406 (GfcCycleTable, res3→3/res8→8, distinct opcode set 0x121..0x147) helpers reuse the VF structure with per-gen remaps and opcode bases.
Reimplementation Recipe
A faithful VfCycleTable needs the two sub-tables plus the switch:
int GetCyclesForThroughput(const VfCycleTable *t, uint32_t cls) {
switch (cls) {
// matmul (res 3 -> MatmulAccA), matpush (res 0xb -> MatpushPushPort)
case 0x00: return mxu(t->mxu /*+0x18*/, 0xd4, 3, 0);
case 0x01: return mxu(t->mxu, 0xda, 3, 0);
case 0x04: return mxu(t->mxu, 0xe6, 3, 0);
case 0x05: return mxu(t->mxu, 0x10b, 0xb, 0);
case 0x06: return mxu(t->mxu, 0x10f, 0xb, 0);
case 0x09: return mxu(t->mxu, 0x115, 0xb, 0);
case 0x0b: return mxu(t->mxu, 0x10b, 0xb, 1);
case 0x0c: return mxu(t->mxu, 0x10f, 0xb, 1);
case 0x0f: return mxu(t->mxu, 0x115, 0xb, 1);
case 0x0a: case 0x10: fatal("Unsupported PushGainsS4."); // cycle_table.cc:682
// mxres / Xlu (ViperfishPerformance @ +0x10, res 0xe, +1)
case 0x17: return vfperf(t->perf /*+0x10*/, 0x128, 0xe) + 1;
case 0x1b: return vfperf(t->perf, 0x12e, 0xe) + 1;
case 0x1c: return vfperf(t->perf, 0x11f, 0xe) + 1; // conv Xlu
case 0x1d: return vfperf(t->perf, 0x123, 0xe) + 1;
case 0x1f: return vfperf(t->perf, 0x12a, 0xe) + 1;
default: return 1; // everything else
}
}
// mxu(): MxuLatencyTable::GetResourceUsage — remap res (3->15, 0xb->0), find modifier, return row[idx].
// GetResource is the gen-invariant CycleTable::GetResource (resLUT @0xb438aec) — no VF override.
// transcendentals bypass the switch: sin/cos -> 154, tan -> 170.
The data to embed: the MxuLatencyTable reservation matrix (the four flat_hash_map<Modifier, array<int,19>> maps and the res→index remap), the ViperfishPerformance Instruction × Resource grid (read at column 0xe), and the gen-invariant resLUT (@0xb438aec). With those three tables and the switch above, the VF throughput half is complete.
Cross-References
- CycleTable Family — the abstract base, the six-factory registry, the per-version dispatch, and the shared 33-value cycle-class enum.
- JfCycleTable — the flat offset-LUT predecessor; the page this one is the MxuLatencyTable-routed analog of, and the source of the gen-invariant GetResource / resLUT framing.
- MXU Latency: VF — the viperfish::MxuLatencyTable reservation matrix this page reads through GetResourceUsage: the four modifier maps, the array<int,19> rows, the res→index remap, and the matmul/matpush cycle values.
- Performance: VF — the ViperfishPerformance Instruction × Resource grid the mxres classes (0x17, 0x1b–0x1d, 0x1f) read at column 0xe, plus the GetResourceUsage read path.
- Resource Enum — the 23-slot ResourceVector whose head R[0]..R[6] the gen-invariant resLUT emits, and the MaxResourceCycles overlap reduction.
- Per-Opcode Cycle Constants — the cross-gen throughput integers and the transcendental scalar table (VF = 154/170).
- Performance: VF and MXU Latency: VF together hold the two sub-tables a VfCycleTable owns at +0x10 and +0x18.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part VII — Cost & Latency Model / CycleTable