EUP Per-Gen Latency Integers

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). .text/.rodata VMA == file offset. Other versions will differ.

Abstract

The Extended Unary Pipeline (EUP) — libtpu's transcendental approximator — is a deep, FIFO-buffered unit driven by a push in one bundle and drained by a pop one or more bundles later. The scheduler prices the push→pop dependency edge with a single integer: the push latency, the minimum number of bundles the pop may trail the push. That integer is the depth the compiler's VALU-correction software pipeline must hide, so it is the highest-entropy number in the whole transcendental cost model. This page is the byte-level transcription of that integer for every generation that carries a heap-grid Performance object: the exact latencies[Instruction] store the constructor emits, the Instruction-ordinal index it writes, and the resulting push→pop edge weight.

The mechanism is uniform across the newer gens and is the libtpu analog of reading one cell out of an LLVM SchedMachineModel. LatencyTable<Gen>::LatencyBetweenInternal resolves the push opcode to a per-gen Performance::Instruction ordinal (Get<Gen>Instruction), then reads <Gen>Performance::GetLatency(ordinal) = latencies[ordinal] — a flat int32 heap array whose pointer is at Performance[0] and whose element count is at Performance[+0x8]. The constructor fills that array element-wise with mov dword ptr [array + Instruction*4], value stores, after a leading memset(_, 0xff, _) that sentinels every unwritten slot. Recovering the per-gen EUP latency is therefore exactly recovering the value at the EUP-push ordinal in each constructor's fill — which this page does, store by store. The legacy Jellyfish/Dragonfish path does not use this array; it clamps the edge to a flat constant (Performance[+0x30] = 4) and is documented only as a baseline here.

The page is organized one ## unit per Performance class — Pufferfish TensorCore (variant 0), Pufferfish BarnaCore (variant 1), Viperfish, Ghostlite, and the 6acc60406 GhPerf twin — each giving the constructor address, the new size that fixes the array cardinality, the EUP push/pop store offsets, and the integer. The grids these same constructors fill (the GetResourceUsage occupancy matrix) and the opcode→ordinal classifier tables live on the performance-* siblings and the slot page; this page is only the latency rows.

For reimplementation, the contract is:

  • The per-gen EUP push→pop edge integer: PF 7, VF 6, GL 13 (F32) / 14 (BF16), pop 1 — and the legacy clamp (Jf/Df 4) it replaces.
  • The exact latencies[] store: constructor address, new size → array cardinality, Instruction ordinal, byte offset (ordinal*4), and value.
  • That the EUP push value is uniform across all classified EUP functions on each gen (one transcendental latency per datatype), so the cost model uses a single constant, not a per-function table.
  • The Pufferfish variant<TensorCore, BarnaCore> split: the EUP edge is the TensorCore array (7); BarnaCore is a separate 134-entry array with its own 6-cycle EUP block.
  • That this latency edge is orthogonal to VectorEupReservationCycles (the EUP issue rate) — it is never multiplied by it.

| Item | Value |
| --- | --- |
| Read path | <Gen>Performance::GetLatency(instr) = latencies[instr] (int32, ptr@Perf[0], count@Perf[+0x8]) |
| Edge selector | LatencyTable<Gen>::LatencyBetweenInternal → Get<Gen>Instruction(push) → GetLatency |
| PF push / pop | 7 (Instr 0x67..0x6c) / 1 (Instr 0x76); PufferfishPerformance ctor @0x1c8be080 |
| VF push / pop | 6 (Instr 0xcc..0xd2) / 1 (Instr 0x168); ViperfishPerformance ctor @0x1c8c4840 |
| GL push / pop | 13 F32 (Instr 0x106..0x10f) · 14 BF16 (Instr 0x110..0x118) / 1 (Instr 0x1c4); ctor @0x1c8cbc80 |
| Legacy clamp | Jf/Df push→pop = 4 (Performance[+0x30], not a per-instruction array) |
| Reservation | VectorEupReservationCycles JF/VF/GL = 1, PF = 2; orthogonal to the latency (never multiplied) |

The Lookup and Its Two Anchors

Purpose

Every per-gen latency integer on this page is read by one tiny accessor and written by one constructor. Pinning both is what makes the integer reimplementation-grade: the accessor proves how the ordinal indexes the array, and the constructor store proves the value at that ordinal.

The Accessor

<Gen>Performance::GetLatency is byte-identical across the three newer gens — a bounds-checked int32 array load:

int32 GetLatency(Performance* p, u32 instr):   // VF @0x1c8cbc20, PF-TC @0x1c8c3860,
    if (p->count <= instr)                      //   GL @0x1c8d36e0, PF-BC @0x1c8c47e0
        BUG()                                   // p->count is Performance[+0x8]
    return p->latency[instr]                    // p->latency is Performance[0]; int32 stride

So the EUP push edge is latency[Get<Gen>Instruction(push_opcode)]. The classifier Get<Gen>Instruction maps the LLO push opcode (0x128..0x13a) and the 0x14e pop to a per-gen Instruction ordinal; those ordinals are the indices this page reads. The classifier tables (VF/PF dense jump tables, GL's 258-entry sorted (u16,u16) pair table) are transcribed on the slot page; here only the resulting ordinals matter.
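The accessor contract can be sketched in a few lines of Python. This is a minimal model, not the library's code: the names `PerformanceModel` and `BugError` are hypothetical, the heap array is modeled as a plain list, and the fill replays only the Viperfish values quoted on this page.

```python
class BugError(RuntimeError):
    """Stand-in for the BUG() trap in the pseudocode (hypothetical name)."""


class PerformanceModel:
    """Hypothetical model of a Performance object's latency view."""

    def __init__(self, latency):
        self.latency = list(latency)    # Performance[0]: flat int32 array
        self.count = len(self.latency)  # Performance[+0x8]: element count

    def get_latency(self, instr):
        # Same guard as the pseudocode (count <= instr traps). Python ints are
        # signed, so negatives are rejected too; the original takes a u32.
        if instr < 0 or self.count <= instr:
            raise BugError(f"ordinal {instr:#x} out of range (count {self.count})")
        return self.latency[instr]


SENTINEL = -1  # 0xffffffff viewed as a signed int32

# Viperfish-shaped fill using only the values quoted on this page.
vf = PerformanceModel([SENTINEL] * 384)
for ordinal in range(0xCC, 0xD2 + 1):   # seven EUP push ordinals
    vf.latency[ordinal] = 6
vf.latency[0x168] = 1                   # EUP pop ordinal

print(vf.get_latency(0xCC), vf.get_latency(0x168))
```

The bounds check comes before the load, so an ordinal equal to the element count traps rather than reading past the array, which matches the `count <= instr` comparison in the accessor.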

The Constructor Store

Each Performance constructor allocates the latency array with operator new(size) (so size/4 is the cardinality), memsets it to 0xff, then overwrites slots with mov dword ptr [array + ordinal*4], value. The EUP rows are a contiguous run of identical-value stores. Reading the EUP latency is reading the value immediate at the EUP-push offset:

ctor:
  rax = operator new(SIZE)        ; SIZE/4 = Instruction cardinality
  Perf[0]   = rax                 ; latency array pointer
  Perf[+8]  = Perf[+0x10] = COUNT ; element count (== SIZE/4)
  memset(rax, 0xff, SIZE)         ; sentinel = 0xffffffff
  ...
  cmp [Perf+8], ORDINAL ; jbe BUG ; rax = [Perf] ; mov [rax + ORDINAL*4], VALUE

NOTE — the byte offset in a store (hex in the disassembly, e.g. [rax+0x418]; decimal in the decompiler output, e.g. [rax+412LL]) is always ordinal*4. Throughout this page the offset is given in hex and the ordinal in the same row, so a reimplementer can cross-check either way: 0x418 = 0x106*4, 0x19c = 0x67*4 (decimal 412), 0x330 = 0xcc*4.

QUIRK — on PF the memset sentinel (0xff → 0xffffffff) never reaches a consumer because all 336 slots are overwritten. On GL/GF the 0xff survives on unpriced rows. Either way the EUP rows are explicit stores, never the sentinel, so a reimplementation must write them.
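The allocate, sentinel, store sequence and the offset = ordinal*4 cross-check can be replayed on a byte buffer. The sketch below is illustrative only: `build_latency_array` is a hypothetical helper, and the store list replays just the Ghostlite EUP rows quoted on this page, so every other slot stays at the sentinel here regardless of what the real constructor writes.

```python
import struct


def build_latency_array(size_bytes, stores):
    """Replay a ctor fill: allocate, memset 0xff, then apply dword stores.

    `stores` maps byte offset -> value, as read off the mov instructions.
    Returns the array as a list of signed int32 values.
    """
    if size_bytes % 4:
        raise ValueError("size must be a whole number of int32 slots")
    count = size_bytes // 4                          # SIZE/4 = cardinality
    raw = bytearray(b"\xff" * size_bytes)            # memset(rax, 0xff, SIZE)
    for offset, value in stores.items():
        ordinal, rem = divmod(offset, 4)
        if rem or not 0 <= ordinal < count:          # mirrors the cmp/jbe guard
            raise ValueError(f"bad store offset {offset:#x}")
        struct.pack_into("<i", raw, offset, value)   # mov [rax+ORDINAL*4], VALUE
    return list(struct.unpack(f"<{count}i", raw))


# Ghostlite EUP rows only (offsets and values as quoted on this page).
gl_stores = {off: 13 for off in range(0x418, 0x43C + 1, 4)}        # F32 run
gl_stores.update({off: 14 for off in range(0x440, 0x460 + 1, 4)})  # BF16 run
gl_stores[0x710] = 1                                               # pop
gl = build_latency_array(0x770, gl_stores)

print(len(gl), gl[0x106], gl[0x110], gl[0x1C4])
```

Indexing the result by ordinal (`gl[0x106]`) while the stores are keyed by byte offset (`0x418`) exercises the same cross-check the NOTE describes.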


Pufferfish — TensorCore Variant (push 7)

Purpose

PufferfishPerformance is the TensorCore arm (variant 0) of the variant<PufferfishPerformance, PufferfishBarnaCorePerformance> that LatencyTablePufferfish prices. Every EUP opcode classifies to variant 0, so the Pufferfish EUP push→pop edge is the TensorCore array value.

Edge Integer

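| Role | LLO opcode | Instruction | Byte offset | Value |
| --- | --- | --- | --- | --- |
| EUP push (rsqrt) | 0x12c | 0x67 | 0x19c | 7 |
| EUP push (pow2) | 0x129 | 0x68 | 0x1a0 | 7 |
| EUP push (log2) | 0x12b | 0x69 | 0x1a4 | 7 |
| EUP push (tanh) | 0x128 | 0x6a | 0x1a8 | 7 |
| EUP push (recip) | 0x12a | 0x6b | 0x1ac | 7 |
| EUP push (pushErf) | 0x131 | 0x6c | 0x1b0 | 7 |
| EUP pop | 0x14e | 0x76 | 0x1d8 | 1 |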

PufferfishPerformance ctor @0x1c8be080: operator new(0x540) = 1344 B = 336 int32; Perf[+8] = Perf[+0x10] = 336; memset(array, 0xff, 1344). The six EUP-push stores are contiguous (mov [rax+412LL]=7 … [rax+432LL]=7; decimal 412 = 0x19c … 432 = 0x1b0), and the pop store is mov [rax+472LL]=1 (472 = 0x1d8). The value is uniform 7 across all six classified F32 EUP functions.

NOTE — the Pufferfish kVectorSigShftF32 (0x12d) push falls to the classifier default — GetPufferfishInstruction emits no variant-0 ordinal for it, unlike the other six F32 pushes. Its effective edge is most plausibly 7 by the uniform-EUP pattern, but the ordinal was not pinned (INFERRED). All six classified PF EUP functions are 7.


Pufferfish — BarnaCore Variant (push 6)

Purpose

Pufferfish is the last generation that ships a BarnaCore embedding engine. LatencyTablePufferfish holds two grids in a std::variant; the 2-arm __fmatrix LatencyFromInstruction visitor (@0x21c203d0) selects variant 1 — PufferfishBarnaCorePerformance — for the legacy embedding-engine ops. Variant 1 is a separate 134-entry array with its own, lower, EUP block.

Variant Dispatch

The variant index is the high 16 bits of GetPufferfishInstruction's return; LatencyTablePufferfish::LatencyBetweenInternal (@0x1c8a2aa0) shifts it (shr r14d,0x10) and calls fmatrix[variant]. The two dispatch arms differ in how they read the Instruction:

dispatch<0>(holder):   // @0x1c8a3140 — TensorCore
    return PufferfishPerformance::GetLatency(holder[0], (u16)*instr)   // 16-bit ordinal

dispatch<1>(holder):   // @0x1c8a3160 — BarnaCore
    return PufferfishBarnaCorePerformance::GetLatency(holder[8], (u8)*instr)  // 8-bit ordinal

The BarnaCore arm reads the Instruction as a byte (unsigned __int8 *) and loads the BarnaCore holder from [holder+8] (vs [holder+0] for TensorCore) — byte-confirmed in the <1ul> dispatcher body. Every EUP push opcode emits variant 0, so the Pufferfish EUP edge is 7; the BarnaCore block below is reached through the PufferfishBarnaCoreChannelEmitter, not through the LLO-level EUP push, and prices the legacy embedding-engine transcendentals.
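A rough Python model of the two-arm dispatch follows. It assumes the classifier result packs the variant index in the high 16 bits and the ordinal in the low bits; the text confirms only the high-half variant index and the u16/u8 reads, so the packing of the ordinal is an assumption of this sketch. The function names are hypothetical, and the tables hold only the EUP rows quoted on this page.

```python
def split_instruction(packed):
    """Split a packed classifier result into (variant, low 16 bits)."""
    variant = (packed >> 16) & 0xFFFF   # shr r14d, 0x10
    return variant, packed & 0xFFFF


def latency_from_instruction(packed, tensorcore, barnacore):
    """Select the variant arm, narrow the ordinal, and read the latency."""
    variant, low = split_instruction(packed)
    if variant == 0:
        table, ordinal = tensorcore, low & 0xFFFF   # TensorCore: u16 ordinal
    elif variant == 1:
        table, ordinal = barnacore, low & 0xFF      # BarnaCore: u8 ordinal
    else:
        raise ValueError(f"unknown variant {variant}")
    if ordinal >= len(table):                       # GetLatency bounds check
        raise IndexError(f"ordinal {ordinal:#x} out of range")
    return table[ordinal]


SENTINEL = -1
tensorcore = [SENTINEL] * 336
for ordinal in range(0x67, 0x6C + 1):
    tensorcore[ordinal] = 7                         # TensorCore EUP push
tensorcore[0x76] = 1                                # TensorCore EUP pop

barnacore = [SENTINEL] * 134
for ordinal in range(0x77, 0x7C + 1):
    barnacore[ordinal] = 6                          # BarnaCore EUP block

print(latency_from_instruction((0 << 16) | 0x67, tensorcore, barnacore))
print(latency_from_instruction((1 << 16) | 0x77, tensorcore, barnacore))
```

Because the two arms index different arrays, the same low bits can mean different instructions: ordinal 0x77 is inside the BarnaCore EUP block but is an unrelated slot in the TensorCore array.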

Edge Integer

| Role | Instruction | Byte offset | Value |
| --- | --- | --- | --- |
| BarnaCore EUP/transcendental block | 0x77..0x7c (6 entries) | 0x1dc..0x1f0 | 6 |
| kBarnaCoreScalarSyncDoneRead | 0x3d | 0xf4 | 3 |
| kBarnaCoreVectorStore (memory op) | 0x85 | 0x214 | 12 |
| no-op / null slot | 0x01 | 0x04 | 0 |

PufferfishBarnaCorePerformance ctor @0x1c8c38c0: operator new(0x218) = 536 B = 134 int32; memset(array, 0xff, 536); Perf[+8] = 134. The EUP block is six contiguous stores, mov [rax+476LL]=6 … [rax+496LL]=6 (decimal 476 = 0x1dc … 496 = 0x1f0); the VectorStore store is [rax+532LL]=12 (532 = 0x214). The 134-entry array is dominated by 1-cycle scalar/sync ops; the recovered non-default values are six 6 (the EUP block), one 12 (VectorStore), two 3, two 4, and one 0. Together these form the legacy embedding-engine latency model.

QUIRK — the BarnaCore EUP block is 6 cycles, one less than the TensorCore EUP (7). The 6-entry width matches the 6-function TensorCore EUP block shape, and a cheaper EUP is structurally consistent with the smaller 134-entry BarnaCore ISA. The exact BarnaCore-Instruction→function name for 0x77..0x7c was not pinned — those ordinals are reached via the channel emitter, not GetPufferfishInstruction, and there is no BarnaCoreInstructionToString (INFERRED that the block is the embedding-engine transcendentals).


Viperfish (push 6)

Purpose

ViperfishPerformance is the first single-grid newer gen (no variant, no BarnaCore). LatencyTableViperfish::LatencyBetweenInternal (@0x1c8a4ac0) runs GetViperfishInstruction(push) → GetLatency directly.

Edge Integer

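| Role | LLO opcode | Instruction | Byte offset | Value |
| --- | --- | --- | --- | --- |
| EUP push (rsqrt) | 0x12c | 0xcc | 0x330 | 6 |
| EUP push (pow2) | 0x129 | 0xcd | 0x334 | 6 |
| EUP push (log2) | 0x12b | 0xce | 0x338 | 6 |
| EUP push (tanh) | 0x128 | 0xcf | 0x33c | 6 |
| EUP push (sigshft) | 0x12d | 0xd0 | 0x340 | 6 |
| EUP push (recip) | 0x12a | 0xd1 | 0x344 | 6 |
| EUP push (pushErf) | 0x131 | 0xd2 | 0x348 | 6 |
| EUP pop | 0x14e | 0x168 | 0x5a0 | 1 |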

ViperfishPerformance ctor @0x1c8c4840: operator new(0x600) = 1536 B = 384 int32; Perf[+8] = Perf[+0x10] = 0x180 (384); memset(array, 0xff, 1536). The seven EUP-push stores are contiguous, mov dword ptr [rax+0x330],6 … [rax+0x348],6 (@0x1c8c5d4a..@0x1c8c5dec), and the pop store is mov dword ptr [rax+0x5a0],1 (@0x1c8cadff). Viperfish classifies all seven F32 EUP pushes (including sigshft, unlike PF) and the value is uniform 6.

NOTE — Viperfish has SupportsBf16AluInstructions FALSE, so the late decomposer widens a BF16 transcendental to the F32 push; the BF16 EUP opcodes (0x132..0x13a) fall to the classifier default and there is no distinct BF16 latency slot. VF therefore has a single EUP latency (6), where Ghostlite splits it.


Ghostlite (push 13 F32 / 14 BF16)

Purpose

GhostlitePerformance is the production V5e/V6e-class table. Ghostlite has a native BF16 ALU (SupportsBf16AluInstructions TRUE), so its classifier maps both the F32 pushes and the BF16 pushes to distinct latency slots — the only gen that splits the EUP latency by datatype. LatencyTableGhostlite::LatencyBetweenInternal (@0x1c8b22e0) → GetGhostliteInstruction (sorted (u16,u16) pair table @0x4067dc8) → GetLatency.

The Full 9-Transcendental Block

The constructor fills a contiguous F32 run (Instruction 0x106..0x10f, all 13) and a contiguous BF16 run (0x110..0x118, all 14). There is no per-function deviation: tanh = pow2 = log2 = rsqrt = sigshft = recip = pushErf = sinq = cosq = erf, identical per datatype. The cost model can size the VALU-correction window to a single GL EUP constant (13/14), not a per-function table.

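| Function | F32 push opc | F32 Instr | F32 lat | BF16 push opc | BF16 Instr | BF16 lat |
| --- | --- | --- | --- | --- | --- | --- |
| rsqrt | 0x12c | 0x106 | 13 | 0x136 | 0x110 | 14 |
| pow2 (2^x) | 0x129 | 0x107 | 13 | 0x133 | 0x111 | 14 |
| log2 | 0x12b | 0x108 | 13 | 0x135 | 0x112 | 14 |
| tanh | 0x128 | 0x109 | 13 | 0x132 | 0x113 | 14 |
| sigshft | 0x12d | 0x10a | 13 | 0x137 | 0x114 | 14 |
| recip | 0x12a | 0x10b | 13 | 0x134 | 0x115 | 14 |
| pushErf | 0x131 | 0x10c | 13 | | | |
| sinq | 0x12e | 0x10d | 13 | 0x138 | 0x116 | 14 |
| cosq | 0x12f | 0x10e | 13 | 0x139 | 0x117 | 14 |
| erf | 0x130 | 0x10f | 13 | 0x13a | 0x118 | 14 |
| EUP pop | 0x14e | 0x1c4 | 1 | (same) | (same) | 1 |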

All rows CERTAIN. F32 has 10 entries (0x130 kVectorErfF32 and 0x131 kVectorPushErf are both classified); BF16 has 9 (no separate push-form erf).
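The table reduces to two opcode-to-ordinal maps and one constant per datatype. The following sketch transcribes the rows above into Python dictionaries and checks the structural claims (contiguous ordinal runs, full coverage of the 0x128..0x13a push range). The dictionaries stand in for the sorted pair table and are not its actual layout; `gl_push_latency` is a hypothetical helper name.

```python
# LLO push opcode -> Ghostlite Instruction ordinal, transcribed from the table.
GL_F32 = {
    0x12C: 0x106, 0x129: 0x107, 0x12B: 0x108, 0x128: 0x109, 0x12D: 0x10A,
    0x12A: 0x10B, 0x131: 0x10C, 0x12E: 0x10D, 0x12F: 0x10E, 0x130: 0x10F,
}
GL_BF16 = {
    0x136: 0x110, 0x133: 0x111, 0x135: 0x112, 0x132: 0x113, 0x137: 0x114,
    0x134: 0x115, 0x138: 0x116, 0x139: 0x117, 0x13A: 0x118,
}
GL_POP_OPCODE, GL_POP_ORDINAL = 0x14E, 0x1C4
F32_LATENCY, BF16_LATENCY, POP_LATENCY = 13, 14, 1


def gl_push_latency(opcode):
    """Return the push->pop edge for a Ghostlite EUP push opcode."""
    if opcode in GL_F32:
        return F32_LATENCY
    if opcode in GL_BF16:
        return BF16_LATENCY
    raise KeyError(f"{opcode:#x} is not a classified EUP push")


# Structural checks: each datatype occupies one contiguous ordinal run, and
# the two maps together cover every opcode in 0x128..0x13a exactly once.
f32_contiguous = sorted(GL_F32.values()) == list(range(0x106, 0x10F + 1))
bf16_contiguous = sorted(GL_BF16.values()) == list(range(0x110, 0x118 + 1))
covered = sorted(list(GL_F32) + list(GL_BF16)) == list(range(0x128, 0x13A + 1))

print(f32_contiguous, bf16_contiguous, covered)
```

Because the latency depends only on which map the opcode falls in, a reimplementation can keep one constant per datatype and skip any per-function latency table.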

Edge Integer

GhostlitePerformance ctor @0x1c8cbc80 (clean symbol _ZN3xla9ghostlite20GhostlitePerformanceC1Ev): operator new(0x770) = 1904 B = 476 int32; Perf[+8] = Perf[+0x10] = 0x1dc (476); memset(array, 0xff, 1904). Three contiguous store runs:

| Block | Ordinals | Byte offsets | Value | Store window |
| --- | --- | --- | --- | --- |
| F32 EUP push (10) | 0x106..0x10f | 0x418..0x43c | 0xd (13) | @0x1c8cd80f..@0x1c8cd902 |
| BF16 EUP push (9) | 0x110..0x118 | 0x440..0x460 | 0xe (14) | @0x1c8cd91d..@0x1c8cd9f5 |
| EUP pop | 0x1c4 | 0x710 | 1 | @0x1c8d2864 |

NOTE — The GL F32 EUP fill is ten contiguous slots 0x106..0x10f, all 13, including sinq/cosq/erf F32 (0x10d..0x10f), byte-confirmed at offsets 0x418..0x43c; the 9 BF16 slots 0x110..0x118 are all 14. Every EUP function shares one per-precision latency — there is no per-function outlier (no erf/sigshft deviation), so GL needs no per-function EUP latency table.

NOTE — the BF16 push costs one extra cycle (14 vs 13). This is the latency-level reflection of GL keeping the 16-bit lane (SupportsBf16AluInstructions TRUE): GL pays a distinct BF16 EUP latency where VF/PF unpack to F32 and pay the single F32 latency. The 192/182 and 212/204 figures on the performance-gl / performance-gf pages are the EUP-prep grid-band cost magnitudes — a different quantity from this push→pop latency edge.


6acc60406 GhPerf Twin (push ~10 F32 / ~11 BF16 — LOW)

Purpose

The 6acc60406-line (GF) GhPerf object is built by a distinct constructor sub_1C8D3740, structurally the same GhostlitePerformance layout but with a 465-row instruction set (vs GL's 476). Its cycle-table twin GfcCycleTable is registered immediately after GlcCycleTable (ghostlite) in the cycle_table.cc FunctionRegistry, keyed to the post-ghostlite TpuVersion. It fills its own latency array — it does not share GL's instance — and the EUP-shaped block carries different integers from GL.

Edge Integer

| Block | Byte offsets (same as GL) | Value |
| --- | --- | --- |
| F32 EUP-shaped run (head, 3 slots) | 0x410..0x418 | 2 |
| F32 EUP-shaped run (rest, 9 slots) | 0x41c..0x43c | 0xa (10) |
| BF16 EUP-shaped run (9 slots) | 0x440..0x460 | 0xb (11) |
| post-BF16 tail | 0x464..0x46c | 1 |
| pop-position slot | 0x710 | 2 |

sub_1C8D3740: operator new(0x744) = 1860 B = 465 int32 (eleven rows short of GL's 476); Perf[+8] = Perf[+0x10] = 0x1d1 (465); memset(_, 0xff, 0x744). The latency fill differs from GL's. The stores at 0x410..0x460 are byte-exact and contiguous: three mov [rax+off],2 at 0x410/0x414/0x418 (@0x1c8d5367..@0x1c8d539d), nine mov [rax+off],0xa at 0x41c..0x43c (@0x1c8d53b8..@0x1c8d5490), nine mov [rax+off],0xb at 0x440..0x460 (@0x1c8d54ab..@0x1c8d5583), then 1s from 0x464; the pop-position slot 0x710 is 2 (@0x1c8d97d1).

GOTCHA — the GF values are byte-exact at the offsets, but the GF opcode→Instruction classifier was not traced in this analysis, and the F32 EUP-shaped run is not uniform: the head is three slots (0x410/0x414/0x418) of value 2, then nine slots of 10 (0x41c..0x43c). That split breaks the single-value-per-datatype shape GL has, which is evidence the GF Instruction enum is not 1:1 with GL's (465 vs 476 rows shifts the mapping), so reading 0x418 as "GF rsqrt" is unsound. Do not assume the GF EUP push edge is 10/11 by offset analogy. Until the GF classifier is decoded, the binding of these offsets to the EUP transcendentals is LOW; only LatencyTableGhostlite (one instance, @0x1c8b22e0) is confirmed to route the Ghostlite push→pop edge through GhostlitePerformance::GetLatency.


Legacy Baseline — Jellyfish / Dragonfish (clamp 4)

The pre-Pufferfish family does not use a per-instruction latency array for the EUP edge. LatencyTableJellyfish::LatencyBetweenInternal (@0x1c8a0d60), when the first opcode is a bare push (∈ [0x128, 0x13a]) and the second is the 0x14e pop, clamps the edge to this+0x1c, which the constructor copies from Performance[+0x30] = 4. Dragonfish inherits the same clamp. So the EUP push→pop edge is a flat 4 on Jf/Df — the same notion (minimum push→pop bundle gap) expressed through a single per-table field instead of a per-Instruction array. This is the only EUP latency on these gens; the array mechanism is a Pufferfish-and-later construct.
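The legacy rule is small enough to state as a predicate. The sketch below models only the push→pop case and treats the clamp as returning the flat field, which is how the paragraph above summarizes the outcome; any other opcode pair is out of scope and yields None. The function name and the `clamp` parameter are hypothetical.

```python
EUP_PUSH_FIRST, EUP_PUSH_LAST = 0x128, 0x13A   # bare-push opcode range
EUP_POP = 0x14E


def legacy_eup_edge(first_opcode, second_opcode, clamp=4):
    """Return the flat push->pop edge, or None when the pair is not push->pop.

    `clamp` models the per-table field copied from Performance[+0x30].
    """
    is_push = EUP_PUSH_FIRST <= first_opcode <= EUP_PUSH_LAST
    if is_push and second_opcode == EUP_POP:
        return clamp
    return None


print(legacy_eup_edge(0x12C, 0x14E))   # rsqrt push followed by the pop
print(legacy_eup_edge(0x12C, 0x129))   # push followed by another push
```

Unlike the array-based generations, the result does not depend on which push opcode is used, so there is nothing to look up per instruction.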


Per-Gen Consolidated Table

| Gen | EUP push latency | Pop latency | Mechanism | Constructor | Push byte anchor |
| --- | --- | --- | --- | --- | --- |
| Jellyfish | 4 | (clamp) | Performance[+0x30] = 4 clamp | LatencyTableJellyfish | n/a (flat field) |
| Dragonfish | 4 (inherits Jf) | (clamp) | PerformanceDf keeps +0x30 = 4 | (inherits) | n/a |
| Pufferfish (TensorCore) | 7 | 1 | latency[0x67..0x6c] | @0x1c8be080 | [rax+0x19c..0x1b0]=7 |
| Pufferfish (BarnaCore) | 6 | (via channel emitter) | latency[0x77..0x7c] | @0x1c8c38c0 | [rax+0x1dc..0x1f0]=6 |
| Viperfish | 6 | 1 | latency[0xcc..0xd2] | @0x1c8c4840 | [rax+0x330..0x348]=6 |
| Ghostlite | 13 (F32) / 14 (BF16) | 1 | latency[0x106..0x118] | @0x1c8cbc80 | [rax+0x418..0x43c]=0xd / [0x440..0x460]=0xe |
| 6acc60406 (GF, LOW) | ~10 / ~11 (offset-only) | ~2 | latency array at byte offsets 0x410..0x460 (ordinals not bound) | sub_1C8D3740 | [rax+0x41c..0x43c]=0xa / [0x440..0x460]=0xb |

Every value above is the push latency — the push→pop dependency-graph edge weight. The pop's own latency (1 on PF/VF/GL) is what the drained EUP result carries downstream once popped.


Latency vs Reservation — Not a Product

The EUP push→pop edge is bounded by two independent quantities read from two different sources (the latency array and a Target accessor), and a reimplementer who multiplies them gets the wrong schedule:

| Quantity | Bounds | Source | PF | VF | GL (F32/BF16) | JF |
| --- | --- | --- | --- | --- | --- | --- |
| push→pop data latency | min bundles push → its drain (pop) | latency[Get<Gen>Instr(push)] | 7 | 6 | 13 / 14 | 4 (clamp) |
| pop latency | latency the drained value carries | latency[pop Instr] | 1 | 1 | 1 | |
| VectorEupReservationCycles | min bundles push → next push | Target accessor (vtable +0x480) | 2 | 1 | 1 | 1 |

LatencyTable::LatencyBetween (@0x1c89f820) calls the per-gen LatencyBetweenInternal (call [rax+0x18]), optionally adds a uniform-random jitter, special-cases only the matres/transpose opcodes (0x82/0x84) with an MXU floor, and returns the edge unchanged for the EUP push. There is no multiply by any reservation field anywhere on the path. The per-gen LatencyBetweenInternal main paths (GL general arm, VF wrapper @0x1c8a4480) apply a transpose-latch halving (latency/2, with a (lat & 0x80000001)==1 round-up correction) gated on LatchModeIsTranspose(operand): if (!LatchModeIsTranspose(...)) latency = latency/2 + .... The EUP push is not a transpose/latch-mode op, so the halving never applies to it and latency[Instruction] passes through verbatim.

VectorEupReservationCycles is the orthogonal EUP-unit issue occupancy: how many bundles the EUP resource stays reserved after a push, applied by the per-instruction resource model (GetResourceUsage matrix + the SlotTracker reservation), not by the latency edge. The composition is max(latency-deadline, resource-availability), never a product: a pop is placed no earlier than push_bundle + latency; consecutive pushes are no closer than reservation bundles.

GOTCHA — The latency edge is not scaled by the reservation. Pufferfish's half-rate EUP (reservation 2) bounds the issue rate (one push per 2 bundles), not the push→pop window — the push returns latency[Instruction] unmodified. The VALU-correction software-pipeline depth the decomposer must fill equals the latency (PF 7), independent of the reservation. A chain of N independent transcendentals on PF is EUP-bound at ≈ 2·(N−1) + 7 bundles (issue-rate-limited spacing plus the final drain), not 7·2.
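The ≈ 2·(N−1) + 7 figure can be reproduced with a toy placement loop. This is a sketch under simplifying assumptions: pushes are independent, each issues as early as the reservation allows, and pop-slot contention and every other resource are ignored. The function names are hypothetical.

```python
def schedule_chain(n, latency, reservation):
    """Place n independent push/pop pairs; return a list of (push, pop) bundles.

    A pop is placed no earlier than push + latency; consecutive pushes are
    no closer than `reservation` bundles. The two bounds are never multiplied.
    """
    placements = []
    next_push = 0
    for _ in range(n):
        push = next_push
        pop = push + latency             # latency deadline for this pair
        placements.append((push, pop))
        next_push = push + reservation   # resource availability for next push
    return placements


def chain_length(n, latency, reservation):
    """Bundle index of the final pop, i.e. reservation*(n-1) + latency."""
    if n == 0:
        return 0
    return schedule_chain(n, latency, reservation)[-1][1]


# Pufferfish: latency 7, reservation 2. Viperfish: latency 6, reservation 1.
print(schedule_chain(4, 7, 2))
print(chain_length(4, 7, 2), chain_length(4, 6, 1))
```

For four pushes on Pufferfish the last pop lands at bundle 2·3 + 7 = 13, whereas treating the two quantities as a product (4 · 7 · 2 = 56) would overstate the chain by more than a factor of four.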


Function Map

| Function | Address | Role |
| --- | --- | --- |
| LatencyTable::LatencyBetween | 0x1c89f820 | dispatcher; returns EUP edge unmodified |
| LatencyTableJellyfish::LatencyBetweenInternal | 0x1c8a0d60 | Jf/Df EUP clamp to Performance[+0x30] = 4 |
| LatencyTablePufferfish::LatencyBetweenInternal | 0x1c8a2aa0 | variant select via shr r14d,0x10 → fmatrix |
| LatencyTableViperfish::LatencyBetweenInternal | 0x1c8a4ac0 | VF EUP edge via GetViperfishInstruction → GetLatency |
| LatencyTableGhostlite::LatencyBetweenInternal | 0x1c8b22e0 | GL/GF EUP edge via GetGhostliteInstruction → GetLatency |
| __fmatrix LatencyFromInstruction visitor | 0x21c203d0 | 2-arm variant<TensorCore,BarnaCore> dispatch |
| dispatch<0ul> / dispatch<1ul> | 0x1c8a3140 / 0x1c8a3160 | TensorCore (u16 ordinal) / BarnaCore (u8 ordinal) |
| PufferfishPerformance ctor | 0x1c8be080 | fills TensorCore EUP push = 7, pop = 1 |
| PufferfishBarnaCorePerformance ctor | 0x1c8c38c0 | fills BarnaCore EUP block = 6, VectorStore = 12 |
| ViperfishPerformance ctor | 0x1c8c4840 | fills EUP push = 6, pop = 1 |
| GhostlitePerformance ctor | 0x1c8cbc80 | fills F32 EUP = 13, BF16 = 14, pop = 1 |
| sub_1C8D3740 (GF GhPerf ctor) | 0x1c8d3740 | fills 465-row array; EUP-shaped block 10/11 (binding LOW) |
| <Gen>Performance::GetLatency | 0x1c8cbc20 (VF), 0x1c8c3860 (PF-TC), 0x1c8d36e0 (GL), 0x1c8c47e0 (PF-BC) | latency[Instruction] heap lookup |
| GetGhostliteInstruction | 0x1c8b1740 | sorted (u16,u16) pair table @0x4067dc8 (258 entries) |

Cross-References