Performance: VF

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped — every symbol below is a demangled C++ name. Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset.
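The section map above reduces to a small address helper. The following is a minimal sketch, assuming the section containing an address is already known; the function name and the delta dictionary are hypothetical and are not symbols from the binary.

```python
# Hypothetical helper: convert a VMA to a file offset using the section map
# stated above. The dictionary and function names are illustrative only.
SECTION_VMA_DELTA = {
    ".text": 0x0,              # VMA == file offset
    ".rodata": 0x0,            # VMA == file offset
    ".data.rel.ro": 0x200000,  # VMA - 0x200000 == file offset
}

def vma_to_file_offset(section: str, vma: int) -> int:
    """Return the file offset of a VMA that lies in the given section."""
    return vma - SECTION_VMA_DELTA[section]

# Assumption: the ctor address is code and therefore sits in .text.
print(hex(vma_to_file_offset(".text", 0x1C8C4840)))
```

For .text and .rodata addresses the helper is the identity, so only .data.rel.ro addresses need adjusting before seeking in the file.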

Abstract

This page dumps the Viperfish (v5/v5e) ViperfishPerformance object: the per-instruction latency array plus the Instruction × Resource occupancy grid that prices how many cycles each LLO instruction holds each intra-op micro-pipeline port. It is the VF concretization of the grid family documented in performance-overview — the heap latency-array + heap 2D vector<int> grid that Pufferfish, Viperfish, Ghostlite, and 6acc60406 all share, read by a byte-identical GetResourceUsage. Viperfish sits in the middle of the resource-count progression (7 → 7 → 20 → 28 → 31 → 31): it widens Pufferfish's 20-column grid by 8 columns, chiefly the 4-stage matprep throughput group and the transpose-binary result sub-stages.

The reference frame is an LLVM SchedMachineModel: GetLatency(instr) returns the instruction's pipeline depth (the value the scheduler raises a true-dependency edge to), and GetResourceUsage(instr, res) returns how many cycles instr occupies functional-unit port res. The VF grid is reconstructed by reading the constructor that fills it — ViperfishPerformance::ViperfishPerformance @0x1c8c4840, a new 0x600 latency array (384 int32) and a new 0x2400 grid (384 rows × 24-byte vector<int>, each row new 0x70 = 28 int32) — not from a .td file. Both the latency array (memset to 0xff, then every slot overwritten) and the 378 populated grid cells are byte-exact from the ctor's 762 DWORD-immediate stores (384 latency + 378 grid).

The page documents the object layout and the GetResourceUsage read path (the 24-byte row stride, the two bounds checks), the 28 resource columns named by occupant LLO class, the deposit column (res 0x0e, the conv R[2] Xlu term, read by GetXluPathReservation), the latency-array histogram (the 121/131 matmul/matprep base latencies), and how an LLO opcode reaches a grid row through GetViperfishInstruction and the MXU latch-mode fan-out. It closes by relating this grid to the separate MxuLatencyTable reservation matrix that co-exists with it.

For reimplementation, the contract is:

  • The ViperfishPerformance object layout: heap latency array (384 int32) + heap 2D grid (384 × 28), with exact sizes and the [this+0x18]/[this+0x20] grid pointer/count.
  • The GetResourceUsage(instr, res) read path: two bounds checks and the 24-byte row stride, computed as (3 × instr) << 3 (instr is a2 in the decompile).
  • The 28 resource columns named by occupant, and the Xlu/matrix-result deposit column (res 0x0e) with its GetXluPathReservation accessor.
  • How an LLO opcode reaches a grid row (GetViperfishInstruction, the matmul/matprep latch-mode WORD-table fan-out) and how the throughput cells co-exist with the MxuLatencyTable.

| Item | Value |
|---|---|
| Class | xla::viperfish::ViperfishPerformance |
| Ctor | ViperfishPerformance::ViperfishPerformance @0x1c8c4840 (new 0x600 latency, new 0x2400 grid, row new 0x70) |
| Read path | ViperfishPerformance::GetResourceUsage @0x1c8cbc40 · GetLatency @0x1c8cbc20 · GetResources @0x1c8cbc00 |
| Resource count | 28 (row width new 0x70 = 28 int32; GetResources count 0x1c) |
| Latency rows | 384 (new 0x600; memset 0xff, all overwritten) |
| Grid | 384 rows × 28-wide; 378 populated cells across 128 rows |
| Deposit column | res 0x0e (14) — conv R[2] Xlu term, cell = 8 for matrix-result ops |
| kResources order | @0xb43cda8 (28 B permutation of {0..27}) |
| Opcode → row | GetViperfishInstruction @0x1c8a3300 (jt @0xb43a104, idx = op−1) |
| Xlu accessor | LatencyTableViperfish::GetXluPathReservation @0x1c8a3200 (res = 0xe) |
| EUP push / pop latency | push (instr 0xcc..0xd2) = 6, pop (instr 0x168) = 1 |

Object Layout

Purpose

ViperfishPerformance answers the scheduler's two per-instruction questions — pipeline depth (GetLatency) and per-port occupancy (GetResourceUsage) — from two heap objects: a flat latency array indexed by Instruction, and a 2D grid of one vector<int> per instruction.

Structure

The layout is the shared grid-family layout (performance-overview), with VF's specific sizes:

struct ViperfishPerformance {       // built by ctor @0x1c8c4840
    int32*  latency;                // +0x00 ; new 0x600 (384 int32), memset 0xff
    u64     latency_size;           // +0x08 ; = 384
    u64     latency_cap;            // +0x10 ; = 384
    vector<int>* grid;              // +0x18 ; new 0x2400 (384 × 24-byte vector<int>)
    u64     grid_outer_count;       // +0x20 ; = 384  (the GetResourceUsage outer bound)
    u64     grid_outer_cap;         // +0x28 ; = 384
    // each row (24-byte std::vector<int>): { int* data (new 0x70 = 28 int32, zero-init), size=28, cap=28 }
};

The ctor emits exactly 762 DWORD-immediate stores = 384 latency + 378 grid cells. The store-count is the integrity proof: every mov DWORD PTR [rax+off], imm is classified by the base it loaded — [rbx] (latency base) or [r12] (grid base, r12 = [rbx+0x18]) — with zero non-rax-base DWORD-imm stores. Unlike Ghostlite/6acc60406, where the 0xff memset default survives on unpriced rows, every VF latency slot (0..383) is explicitly overwritten; the 0xff = 255 default does not survive.
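The allocation sizes and the store-count identity can be rechecked with plain arithmetic. This sketch only restates numbers given above; the constant names are illustrative.

```python
# Recompute the object shape from the three allocation sizes in the ctor.
INT32_BYTES = 4        # one latency slot or one grid cell
ROW_HEADER_BYTES = 24  # one row header: {data pointer, size, cap}, 8 bytes each

LATENCY_ALLOC = 0x600  # new 0x600
GRID_ALLOC = 0x2400    # new 0x2400
ROW_ALLOC = 0x70       # new 0x70, once per row

latency_rows = LATENCY_ALLOC // INT32_BYTES    # 384
grid_rows = GRID_ALLOC // ROW_HEADER_BYTES     # 384
row_width = ROW_ALLOC // INT32_BYTES           # 28

POPULATED_CELLS = 378
total_stores = latency_rows + POPULATED_CELLS  # 762 DWORD-immediate stores

print(latency_rows, grid_rows, row_width, total_stores)  # 384 384 28 762
```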

QUIRK — the VF Performance ctor @0x1c8c4840 does not produce Hex-Rays pseudocode (it is too large / mixes VEX stores), so the cell values here are read from the byte-exact objdump decode of its 762 DWORD-immediate stores, cross-checked by the store-count identity (762 = 384 + 378). The read path GetResourceUsage and GetResources decompile cleanly and confirm the 28-wide / 384-row shape; the cell integers are objdump-grade, not Hex-Rays-grade — HIGH confidence by the store-count method.


The GetResourceUsage Read Path

Purpose

GetResourceUsage(instr, res) is the single accessor the throughput model and the Xlu-reservation accessor go through to read a grid cell. It is byte-identical across PF/VF/GL/GF — two bounds checks and a lea-computed 24-byte row stride.

Algorithm

Byte-confirmed in the decompile:

function ViperfishPerformance::GetResourceUsage(perf, instr, res):   // @0x1c8cbc40
    if perf.grid_outer_count <= instr:        // [perf+0x20] = 384 ; outer bound
        BUG()                                  // ud2
    grid = perf.grid                           // [perf+0x18], viewed as a u64 array
    v4 = 3 * instr                             // qword index of row instr (3 qwords = 24 bytes per row)
    if grid[v4 + 1] <= res:                    // [grid + 8*v4 + 8] = row.size ; inner bound = 28 (row width)
        BUG()
    return *(int*)(grid[v4] + 4*res)           // [grid + 8*v4] = row.data ; data[res] = grid[instr][res]

The 3 * instr followed by the 8 * indexing is the 24-byte std::vector<int> stride ({int* data, u64 size, u64 cap}). GetLatency(instr) @0x1c8cbc20 is the simpler sibling — latency[instr] bounded by [perf+0x8]. GetResources() @0x1c8cbc00 returns &kResources (the .rodata byte array @0xb43cda8 listing the 28 column indices in fill order).
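The read path can be modelled over a packed buffer of 24-byte row headers. This is a minimal sketch, not the binary's implementation: Python has no raw pointers, so the data field is modelled as an index into a list of rows, and GridBoundsError stands in for the ud2 trap. Both names are hypothetical.

```python
import struct

ROWS, WIDTH = 384, 28   # grid shape stated in the text
HEADER_BYTES = 24       # one row header: three 8-byte fields

class GridBoundsError(IndexError):
    """Stands in for the ud2 trap taken on a failed bounds check."""

def build_grid(cells):
    """Return (headers, rows) for a {(instr, res): cycles} mapping."""
    rows = [[0] * WIDTH for _ in range(ROWS)]   # zero-initialised rows
    for (instr, res), cycles in cells.items():
        rows[instr][res] = cycles
    # Assumption: the 'data pointer' field holds the row's index into `rows`.
    headers = b"".join(struct.pack("<QQQ", i, WIDTH, WIDTH) for i in range(ROWS))
    return headers, rows

def get_resource_usage(headers, rows, instr, res):
    outer_count = len(headers) // HEADER_BYTES  # plays the role of [perf+0x20]
    if not 0 <= instr < outer_count:            # outer bound (unsigned compare in the binary)
        raise GridBoundsError("instr out of range")
    v4 = 3 * instr                              # qword index of the row header
    data, size, _cap = struct.unpack_from("<QQQ", headers, 8 * v4)
    if not 0 <= res < size:                     # inner bound, 28 on VF
        raise GridBoundsError("res out of range")
    return rows[data][res]

# Illustrative single cell: row 0x169, column 22, value 8.
headers, rows = build_grid({(0x169, 22): 8})
print(get_resource_usage(headers, rows, 0x169, 22))
```

The byte offset 8 * (3 * instr) is the same 24-byte stride the decompile shows; the header buffer is 0x2400 bytes, matching the grid allocation.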

GOTCHA — the OUTER index is the per-gen ViperfishPerformance::Instruction, not the raw LLO opcode. The mapping is GetViperfishInstruction @0x1c8a3300 (jt @0xb43a104, idx = op−1, bound 0x1a9), and the MXU band fans a single matmul/matprep opcode out to many ordinals via a secondary latch-mode WORD table (@0xb43b140, valid-latch mask 0x3ffcc03). A reimplementation that indexes the grid directly by LLO opcode mis-reads every MXU row.

The matrix-result / Xlu deposit column

One column holds the matrix-result (Xlu) throughput that the convolution cost model reads as its R[2] term. On Viperfish it is res 0x0e (14), confirmed by the dedicated accessor:

function LatencyTableViperfish::GetXluPathReservation(this, value):   // @0x1c8a3200 (verified)
    if value.opcode == 139:                        // kVectorSetPermutePattern, handled directly
        return (value[+0x40] != 0) ? 8 : 1
    instr = GetViperfishInstruction(value)
    return ViperfishPerformance::GetResourceUsage(this.perf /* [this+0x1d0] */, instr, 14)   // res 0x0e

The res = 0x0e immediate is byte-confirmed (GetResourceUsage(*(this+58*8), instr, 14)), and the permute-pattern opcode 139 (kVectorSetPermutePattern) returns a hardcoded 8/1. The Xlu deposit column tracks MXU geometry across gens — res 6 PF → res 0x0e VF → res 0x0f GL → res 0x10 GF — and the populated cell is 8 for the VF matrix-result/transpose-result class (matching the kVectorMatres r22 cell, below).
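The accessor's control flow is short enough to restate as a runnable sketch. The classifier and grid reader are passed in as callables; the toy stand-ins at the bottom (an identity classifier and a one-cell grid) are hypothetical and exist only to exercise the two branches.

```python
K_VECTOR_SET_PERMUTE_PATTERN = 139  # opcode answered without a grid read
XLU_DEPOSIT_RES_VF = 0x0E           # deposit column on Viperfish

def xlu_path_reservation(opcode, field_0x40, classify, get_resource_usage):
    """Return the Xlu-path reservation for one instruction.

    classify and get_resource_usage are caller-supplied stand-ins for
    GetViperfishInstruction and GetResourceUsage.
    """
    if opcode == K_VECTOR_SET_PERMUTE_PATTERN:
        return 8 if field_0x40 != 0 else 1      # hardcoded 8/1
    instr = classify(opcode)
    return get_resource_usage(instr, XLU_DEPOSIT_RES_VF)

# Toy stand-ins (hypothetical): identity classifier and a one-cell grid.
toy_grid = {(0x123, 0x0E): 8}
toy_classify = lambda opcode: opcode
toy_read = lambda instr, res: toy_grid.get((instr, res), 0)

print(xlu_path_reservation(139, 1, toy_classify, toy_read))    # permute-pattern branch
print(xlu_path_reservation(0x123, 0, toy_classify, toy_read))  # grid branch
```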


The 28 Resource Columns

Purpose

The grid's INNER axis is the ViperfishPerformance::Resource enum: the 28 intra-op EUP/MXU/Xlu micro-pipeline reservation ports. There is no Resource::ToString in the binary, so the columns are named functionally — by the LLO-instruction class that deposits cycles into each, the OUTER index being the opcode resolved through GetViperfishInstruction cross-joined to LloOpcodeName::opcode_name @0x21ccfef0. The names are reimplementation-grade in meaning (which physical port each column reserves), not literal symbol names.

The columns

kResources @0xb43cda8 is the 28-byte traversal order 13 18 1a 19 08 02 03 16 09 0a 04 05 06 07 0b 12 0f 14 1b 0c 10 01 15 0d 11 17 00 0e (all of {0..27} exactly once). Naming each column by dominant occupant:

| col | cells | val(s) | occupant LLO band (classifier-named) | physical reservation port |
|---|---|---|---|---|
| r0 | 2 | 2 | low-ordinal MXU/setup band (ins 0x0, 0x2) | MXU/setup address port A |
| r1 | 2 | 2,3 | kDmaGeneral.. (ins 0x38) | DMA address/issue port |
| r2 | 27 | 7 | MXU matmul band 0xd4..0x106 (MatmulPackedMsk/Lmr) | MXU matmul prep/issue port |
| r3 | 51 | 8,16,32 | MXU matmul/matprep band | MXU matmul throughput port (×51) |
| r4 | 38 | 4,5 | matprep band + DoneWithGains/LoadLmr (0xd5..0x10a) | MXU matprep throughput stage A |
| r5 | 38 | 12,13 | same band | MXU matprep throughput stage B |
| r6 | 38 | 20,21 | same band | MXU matprep throughput stage C |
| r7 | 38 | 28,29 | same band | MXU matprep throughput stage D |
| r8 | 2 | 32 | kVectorLoadLmr (0x109..0x10a) | LMR / gain-load port |
| r9 | 16 | 2,6 | result-pop FIFO band (0x10b..0x11a) | MXU/EUP result-pop FIFO stage A |
| r10 | 16 | 1,5 | same | result-pop FIFO stage B |
| r11 | 16 | 3,7 | same | result-pop FIFO stage C |
| r12 | 9 | 41,48 | kVectorPermute/Rotate/BroadcastLane (0x11b..) | cross-lane result stage A |
| r13 | 14 | 52,57,59 | same + reduce band | cross-lane / reduce result stage B |
| r14 | 23 | 8,16 | TransposeBinary(0x123)/Permute/Rotate/Broadcast | Xlu / matrix-result deposit (conv R[2]) |
| r15 | 5 | 109,117 | kVectorTransposeBinary (0x11f..0x127) | transpose-binary result sub-stage A |
| r16 | 5 | 112,120 | same | transpose-binary result sub-stage B |
| r17 | 3 | 7,15 | same | transpose-binary result sub-stage C |
| r18 | 5 | 24 | kVector{Add,Max,Min,..}ReduceF32 (0x12e..0x132) | reduce-result port |
| r19 | 1 | 3 | kVectorCcfPush (0x134) | CCF push port |
| r20 | 1 | 1 | kVectorPrng (0x13e) | PRNG port |
| r21 | 14 | 1 | kVectorSyncFlag*/kVectorWait*/SfrfPush (0x13f..) | sync-flag / wait / SFRF port |
| r22 | 1 | 8 | kVectorMatres (0x169) | matrix-result (Matres) port |
| r23 | 1 | 8 | kVectorXlaneResult/Transpose-result (0x16a) | cross-lane / transpose result port |
| r24 | 1 | 3 | kVectorCcfPop (0x16b) | CCF pop port |
| r25 | 4 | 5 | kVectorStoreEvenOddSublanes (0x174..0x177) | sublane-store port |
| r26 | 6 | 6 | barnacore/store band (0x178..0x17d) | sublane-store / scatter port |
| r27 | 1 | 3 | kVectorSetRngSeed (0x17f) | RNG seed port |

Column population (cells/col), summing to 378: r0:2 r1:2 r2:27 r3:51 r4:38 r5:38 r6:38 r7:38 r8:2 r9:16 r10:16 r11:16 r12:9 r13:14 r14:23 r15:5 r16:5 r17:3 r18:5 r19:1 r20:1 r21:14 r22:1 r23:1 r24:1 r25:4 r26:6 r27:1.
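Two integrity claims in this section are mechanical and can be rechecked directly: that kResources is a permutation of {0..27}, and that the per-column populations sum to 378. The sketch below uses only the bytes and counts quoted above; the variable names are illustrative.

```python
# kResources traversal order, as quoted from .rodata @0xb43cda8.
K_RESOURCES = bytes.fromhex(
    "13 18 1a 19 08 02 03 16 09 0a 04 05 06 07 "
    "0b 12 0f 14 1b 0c 10 01 15 0d 11 17 00 0e"
)

# Populated cells per column, r0..r27, as listed above.
CELLS_PER_COL = [2, 2, 27, 51, 38, 38, 38, 38, 2, 16, 16, 16, 9, 14,
                 23, 5, 5, 3, 5, 1, 1, 14, 1, 1, 1, 4, 6, 1]

is_permutation = sorted(K_RESOURCES) == list(range(28))
total_cells = sum(CELLS_PER_COL)

# Position of each column in the traversal order (0-based).
position_of = {col: i for i, col in enumerate(K_RESOURCES)}

print(is_permutation, total_cells, position_of[0x0E])  # True 378 27
```

The deposit column 0x0e is the last entry of the traversal order, and column 0x13 is the first.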

NOTE — the ViperfishPerformance::Resource enum (28 columns) is a different, lower-level enum than the 23-slot per-bundle ResourceVector (resource-enum) and a third enum apart from the 19-value MxuResource of mxu-latency-vf. The grid here prices intra-op micro-pipeline-stage holds; ResourceVector is the per-bundle accumulator; MxuResource is the MXU-internal reservation port set. Three resource axes in the same cost model — conflating them is the central trap. The kResources byte array gives this grid's column traversal order, not the ResourceVector slot order.

How VF widens over Pufferfish

Viperfish adds 8 columns over Pufferfish's 20: the 4-stage matprep throughput group (r4..r7, carrying the {4,12,20,28} & {5,13,21,29} step-8 holds) and the transpose-binary result sub-stages (r15..r17), plus CCF push/pop (r19/r24). The matmul-throughput representation itself widened — a single column on PF (res 9, 96 cells) became res r3 (51 cells) plus the 4-stage matprep group on VF.

QUIRK — the r4..r7 matprep stages carry the {4,5}/{12,13}/{20,21}/{28,29} value pairs — the step-8 ramp {4,12,20,28} (and its +1 companion {5,13,21,29}). This same {5,13,21,29} ramp appears as the MxuResource overrun-check insertion in mxu-latency-vf (AddOverrunCheckReservations), reflecting that the matprep throughput stages and the MSR overrun-check ports are two cost-model views of the same per-K-tile latch pipeline. They are different enums in different tables — do not index one with the other.
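The value pairs in r4..r7 follow directly from the step-8 ramp. A short sketch that regenerates them (the names are illustrative):

```python
STAGE_COLS = ["r4", "r5", "r6", "r7"]  # matprep throughput stages A..D

base_ramp = [4 + 8 * stage for stage in range(4)]  # [4, 12, 20, 28]
companion = [value + 1 for value in base_ramp]     # [5, 13, 21, 29]

# Pair each stage column with its (base, base + 1) values.
stage_pairs = dict(zip(STAGE_COLS, zip(base_ramp, companion)))
print(stage_pairs)
```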


The Latency Array

Purpose

latency[instr] is the instruction's pipeline depth — the true-dependency edge weight. The VF ctor memsets the array to 0xff then overwrites all 384 slots (the default does not survive).

Histogram

The 384 entries take 19 distinct values:

  1: 148    2: 113    131: 27    121: 25    7: 18    8: 14    6: 8    3: 5
  164: 5    115: 5    114: 4    9: 3    5: 2    4: 2    30: 1    122: 1    0: 1
  36: 1    49: 1

The meaningful clusters:

  • 1/2 (148+113) — cheap vector/scalar/sync ops.
  • 131 / 121 — the MXU matmul / matprep base latencies (the systolic pipeline depth; the MxuLatencyTable reservation arrays say how many of these cycles each MXU sub-port is held).
  • 7 — matprep; 8 — result-pop; 6 — the EUP push latency (instr 0xcc..0xd2; pop instr 0x168 = 1).
  • 164 — transpose-binary; 115 — reduce; 114/122 — permute/rotate/broadcast.
  • 36 — CCF push; 0 — one MXU-setup ordinal; 49 — a single op.
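The histogram is self-checking: the counts must sum to the 384 latency slots, there must be 19 distinct values, and the 0xff memset default must not appear. A minimal sketch over the figures listed above (the dictionary name is illustrative):

```python
# latency value -> number of instruction slots holding it
LATENCY_HISTOGRAM = {
    1: 148, 2: 113, 131: 27, 121: 25, 7: 18, 8: 14, 6: 8, 3: 5,
    164: 5, 115: 5, 114: 4, 9: 3, 5: 2, 4: 2, 30: 1, 122: 1, 0: 1,
    36: 1, 49: 1,
}

total_slots = sum(LATENCY_HISTOGRAM.values())
distinct_values = len(LATENCY_HISTOGRAM)
default_survives = 0xFF in LATENCY_HISTOGRAM          # memset default
cheap_slots = LATENCY_HISTOGRAM[1] + LATENCY_HISTOGRAM[2]

print(total_slots, distinct_values, default_survives, cheap_slots)
```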

NOTE — the EUP push→pop edge on VF is latency 6 (push) / 1 (pop), and VF prices the EUP push via the latency array alone — the push rows reserve no grid cells (ViperfishTarget::VectorEupReservationCycles @0x1d49b060 = 1, full-rate). This contrasts with Pufferfish, where the EUP push additionally reserves grid ports r2/r3 and runs at half-rate (VectorEupReservationCycles = 2). A reimplementation must not assume the EUP push is grid-priced on VF.


Opcode → Grid Row

The classifier

GetViperfishInstruction @0x1c8a3300 maps an LLO opcode to a grid-row Instruction ordinal via a jump table @0xb43a104 (idx = op−1, bound 0x1a9). Most opcodes map directly (mov ax, IMM16); the MXU band is the exception. The matmul arm @0x1c8a3373 reads a secondary latch-mode WORD table @0xb43b140 (valid-latch mask 0x3ffcc03) that fans the single matmul/matprep opcode out to the ~0x60 band ordinals (VF Instruction 0xd4..0x106 + 0x10b..0x11a). So one vmatmul/vmatprep LLO opcode becomes many grid rows, one per (opcode, latch_mode/MatmulDataFormat) pair.

Every priced grid row corresponds to an LLO opcode (every populated row resolves to a coherent classifier-named LLO opcode), but the OUTER index is the ordinal assigned by the per-gen classifier, not a value in the raw opcode space. The MXU band's per-ordinal (opcode, latch_mode) decode was read as the latch-mode classifier mechanism, not enumerated ordinal-by-ordinal; the band identity (matmul/matprep), its base latencies (121/131), and its throughput cells (r3 {8,16,32}, r4..r7 4-stage) are byte-exact.
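The two scalar facts of the classifier, the idx = op−1 jump-table index and the valid-latch mask, can be sketched as follows. This assumes that bit i of the mask marks latch mode i as valid, which is an interpretation and is not confirmed by the text; the function names are hypothetical, and the WORD-table contents are not reproduced.

```python
VALID_LATCH_MASK = 0x3FFCC03  # valid-latch mask used by the matmul arm

def jump_table_index(opcode: int) -> int:
    """Index into the classifier jump table (idx = op - 1)."""
    return opcode - 1

def is_valid_latch(latch_mode: int) -> bool:
    """Assumption: bit `latch_mode` of the mask marks that mode as valid."""
    return latch_mode >= 0 and ((VALID_LATCH_MASK >> latch_mode) & 1) == 1

# Enumerate the modes the mask admits under that assumption.
valid_modes = [m for m in range(VALID_LATCH_MASK.bit_length()) if is_valid_latch(m)]
print(valid_modes)
```

Under that reading the mask admits 16 modes: 0, 1, 10, 11, 14, 15 and 16 through 25.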

Co-existence with the MxuLatencyTable

The matmul throughput cells in this grid (res r3 = {8,16,32} per MatmulDataFormat — fmt1=8, fmt2=16, fmt6/int8-x8=32; r4..r7 the 4-stage matprep holds) duplicate the matmul-rate magnitudes that also live in the separate MxuLatencyTable reservation matrix. The two cost tables co-exist and price different things: this grid prices the intra-op micro-pipeline-port holds keyed on the Instruction ordinal; the MxuLatencyTable prices the MXU sub-resource occupancy keyed on the MatmulDataFormat/GainLatchMode modifier. The base op latency (121/131) lives in this latency array; the per-MxuResource hold-cycle vector lives in the array<int,19>.

NOTE — the {8,16,32} matmul rate is byte-confirmed in both tables, and the throughput route reads the MxuLatencyTable, not this grid. The grid col-r3 cells {8,16,32} are byte-anchored in the ViperfishPerformance ctor @0x1c8c4840 (mov DWORD PTR [data+0xc], 8/0x10/0x20, 51 cells across the matmul band). The same triple is independently byte-anchored in the MxuLatencyTable ctor @0x1c8a52c0 as {MxuResource 15 (MatmulAccA) → 8/16/32} per format (lines emitting key=15, value=8, then =16, then =32). The cost model's matmul-throughput route, VfCycleTable::GetCyclesForThroughput(CT 0) → MxuLatencyTable::GetResourceUsage(0xd4, res 3, 0), remaps res 3 → array[15] and returns that reservation cell — so the consumed {8,16,32} is read from the MxuLatencyTable, while this grid's r3 holds a mirror that the matmul/matprep throughput path does not read for CT-class 0/1/4. A reimplementer must populate both, but wire the matmul-rate consumer to the MxuLatencyTable array[15].
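The wiring rule, populate both tables but read the matmul rate from the MxuLatencyTable, can be illustrated with a toy model. Everything here is a hypothetical stand-in: the real tables are keyed differently and hold many more cells, and only the res 3 → slot 15 remap and the per-format 8/16/32 values come from the text.

```python
MXU_RESOURCE_COUNT = 19                 # width of array<int,19>
MATMUL_ACC_A = 15                       # MxuResource slot holding the rate
RATE_BY_FORMAT = {1: 8, 2: 16, 6: 32}   # fmt1=8, fmt2=16, fmt6=32

def build_mxu_reservations():
    """Toy MxuLatencyTable: one 19-wide reservation row per format."""
    table = {}
    for fmt, rate in RATE_BY_FORMAT.items():
        row = [0] * MXU_RESOURCE_COUNT
        row[MATMUL_ACC_A] = rate
        table[fmt] = row
    return table

def build_grid_r3_mirror():
    """Toy grid column r3: holds the same rates but is not the consumer's source."""
    return dict(RATE_BY_FORMAT)

def matmul_throughput(mxu_table, fmt, res=3):
    """Read the rate via the res 3 -> slot 15 remap described above."""
    slot = MATMUL_ACC_A if res == 3 else res
    return mxu_table[fmt][slot]

mxu_table = build_mxu_reservations()
grid_r3 = build_grid_r3_mirror()
print([matmul_throughput(mxu_table, fmt) for fmt in (1, 2, 6)])
```

Keeping the mirror in a separate structure makes it harder to wire the consumer to the grid by accident.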

NOTE — the parallel opcode_produced_register_type table (@0x223a16c0, byte[461]) and the convolution-window DMA-level merge gate it drives are gen-invariant — they are documented on performance-overview and apply to VF unchanged, since they key on the raw LLO opcode, not the per-gen Instruction ordinal.


Resource-Count Context

VF's 28 columns sit in the cross-gen progression:

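| Gen | Codename | Resource cols | latency rows | grid cells | populated rows | EUP push lat | Xlu deposit col |
|---|---|---|---|---|---|---|---|
| PF | Pufferfish | 20 | 336 | 265 | 180 | 7 | res 6 (conflict-penalty) |
| VF | Viperfish | 28 | 384 | 378 | 128 | 6 | res 0x0e (GetXluPathReservation) |
| GL | Ghostlite | 31 | 476 | 358 | 132 | 13/14 | res 0x0f |
| GF | 6acc60406 | 31 | 465 | 285 | 92 | 12 | res 0x10 |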

The enum widened 20 → 28 → 31 across PF → VF → GL: VF added the 4-stage matprep group (r4..r7) + transpose-binary result sub-stages (r15..r17) + CCF push/pop; GL/GF added the BF16-EUP AndPop FIFO depth + the BarnaCore tail. The matmul throughput port moved (res 9 PF single col → res 3 VF over a 4-stage group), and the Xlu deposit column tracks MXU geometry (res 6 → 0x0e → 0x0f → 0x10).


| Name | Relationship |
|---|---|
| performance-overview | the grid-family object layout, the shared GetResourceUsage read path, and the opcode_produced_register_type gate |
| mxu-latency-vf | the separate MxuLatencyTable array<int,19> reservation matrix that co-exists with this grid |
| resource-enum | the 23-slot per-bundle ResourceVector — distinct from this 28-column Resource enum |
| performance-pf / -gl-ghperf / -gf-ghperf | the 20 / 31 / 31-column sibling grids |
| slot-mxu | the LLO MXU opcodes the matmul/matprep band rows price |

Cross-References