EUP Latency Overview
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.
Abstract
The EUP — libtpu's transcendental approximator, the "XLU" of the VPU — is a deep, FIFO-buffered pipeline that a push enters in one bundle and a pop drains some bundles later. Because a TensorCore bundle is the issue packet and there is no runtime hazard interlock, the compiler must know exactly how many bundles separate a push from its drainable result, and it must place a kVectorEupResultValue pop no earlier than that. That separation is the EUP push→pop data latency, and it is the single number the bundle scheduler raises the push→pop dependency edge to. This page is the cost-model framing of that number: where it is stored, how it is read on each generation, how it composes with the orthogonal EUP issue reservation, and what a reimplementer must reproduce to schedule a transcendental correctly.
The latency model is the cost-side dual of the ISA-side push/pop encoding. The EUP / Transcendental Slot page owns how a transcendental is encoded (the Alu3 slot-3 push, the function selector, the 0x14e pop, the fused-AndPop-vs-split duality); this page owns when the pop may be scheduled. The mechanism is a per-generation cost table read by the dependency graph: LatencyTable::LatencyBetween(push, pop) calls a per-gen LatencyBetweenInternal, which on the v4+ generations classifies the LLO push opcode to a per-gen Performance::Instruction ordinal and reads Performance::GetLatency(Instruction) = latencies[Instruction] from a heap int32 array. The legacy Jellyfish/Dragonfish path does not have that array; it clamps the edge to a fixed Performance[+0x30] field instead. Either way, the value returned is the minimum bundle gap from the push to its pop — exactly the software-pipeline depth the interleaved VALU correction arithmetic must fill.
The number a reimplementer is tempted to confuse it with is VectorEupReservationCycles, the EUP unit's issue rate (how many bundles the unit stays reserved after a push, bounding back-to-back pushes). The two are read from different places, bound different gaps, and do not multiply: the latency is the push→pop deadline, the reservation is the push→push throughput, and the scheduler composes them as max(latency-deadline, resource-availability). Pufferfish is the one generation where they diverge — a 7-cycle latency with a half-rate (2-cycle) reservation — which is precisely the case that breaks a naive latency × reservation model. The per-gen latency integers, the correction polynomials, the Payne-Hanek reduction, and the lane-width fan-out each get their own page; this one explains the edge they all plug into.
For reimplementation, the contract is:
- The two-array model. Each v4+ Performance object holds a flat latencies[Instruction] heap array (the data-latency edge) and a resourceUsage[Instruction][Resource] 2D grid (the issue reservation). GetLatency reads the first; the reservation comes from the second plus the VectorEupReservationCycles accessor. They are physically distinct arrays.
- The read path. LatencyBetween → per-gen LatencyBetweenInternal → Get<Gen>Instruction(push) → GetLatency(Instruction), with the EUP push not in the transpose/MXU floor set, so the raw value passes through unmodified.
- The per-gen latency integers (PF 7, VF 6, GL 13 F32 / 14 BF16, JF/DF 4-clamp) and the uniform-across-functions, split-by-datatype structure, owned in full by EUP Per-Gen Integers.
- The reservation orthogonality. VectorEupReservationCycles (PF 2, VF/GL/JF 1) bounds push→push spacing, not the push→pop window; the composition is a max, not a product.
| Edge dispatcher | LatencyTable::LatencyBetween @0x1c89f820 — virtual LatencyBetweenInternal at vtable +0x18 |
| v4+ read path | Get<Gen>Instruction(push) → <Gen>Performance::GetLatency(Instr) = latencies[Instr] |
| GetLatency | VF @0x1c8cbc20, PF @0x1c8c3860, GL @0x1c8d36e0 — identical bounds-checked latencies[idx] |
| Push→pop latency | JF/DF 4 (clamp); PF 7; VF 6; GL 13 (F32) / 14 (BF16); pop drains at latency 1 |
| JF/DF clamp | LatencyTableJellyfish::LatencyBetweenInternal @0x1c8a0d60 — from∈[0x128,0x13a] ∧ to==0x14e → field +0x1c |
| Issue reservation | VectorEupReservationCycles (Target vtable +0x480): JF/VF/GL = 1, PF = 2 (half-rate) |
| Composition | max(push_bundle + latency, prev_push + reservation) — latency ⊥ reservation, never a product |
| Software-pipeline depth | the latency (PF 7), independent of the reservation |
The Push→Pop Latency Edge
Purpose
A transcendental is a two-instruction sequence: a push that reserves the EUP and a pop that retrieves the result N bundles later. The scheduler must not place the pop before push_bundle + N, or it would read an in-flight (not-yet-drained) result. N is the push→pop data latency — the EUP pipeline depth — and it is the dependency-graph edge weight from push to pop. The intervening N bundles are where the late decomposer interleaves VALU correction arithmetic (Newton steps, range reconstruction) or unrelated work, exactly as a CPU back end hides a long-latency divide. Because the bundle is the issue packet, this latency is not a runtime hazard the hardware stalls on; it is a compile-time scheduling constraint the cost model encodes.
The Two Mechanisms
There are two ways the edge is computed, split at the Jellyfish/Pufferfish silicon boundary. The legacy path clamps; the modern path reads a per-instruction array.
function LatencyBetweenInternal_jellyfish(from, to):  // @0x1c8a0d60 (JF/DF)
    edge = 1
    // ... RPU / IndexedStore / matres floors elided ...
    // unsigned subtract-and-compare, so opcodes below 0x128 wrap and fail the test
    is_eup_push = unsigned(from.opcode - 0x128) <= 0x12  // from ∈ [0x128, 0x13a]
    if (is_eup_push && to.opcode == 0x14e)  // 0x14e = kVectorEupResultValue (pop)
       || LloOpcodeIsPseudoEupInstruction(from.opcode):  // or a fused AndPop op
        if edge <= this[+0x1c]:  // clamp UP to the EUP-latency field
            edge = this[+0x1c]  // = Performance[+0x30] = 4
    return edge
The clamp field this+0x1c is copied by the LatencyTableJellyfish ctor from Performance[+0x30] = 4 — the flat-POD generations have no per-function transcendental latency, just one constant. (Performance[+0x30] lives in the JF/DF inline-POD model of the Performance Family Overview; on Jellyfish it is one of the .rodata-block-copied cells, not a scalar store.)
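The clamp above can be exercised as a small runnable model. This is a minimal sketch, not the binary's code: `jellyfish_edge`, its `eup_latency` parameter, and the `is_pseudo_eup` stub (standing in for LloOpcodeIsPseudoEupInstruction) are hypothetical names, and the elided RPU / IndexedStore / matres floors are left out.

```python
EUP_PUSH_FIRST = 0x128   # first EUP push opcode in the range check
EUP_PUSH_SPAN = 0x12     # range width, so the last push opcode is 0x13a
EUP_POP = 0x14E          # kVectorEupResultValue


def jellyfish_edge(from_op, to_op, eup_latency=4, is_pseudo_eup=lambda op: False):
    """Sketch of the JF/DF clamp; other floors in the real function are omitted."""
    edge = 1
    # Emulate the unsigned 32-bit subtract-and-compare: opcodes below 0x128
    # wrap to a large value and therefore fail the range test.
    is_eup_push = ((from_op - EUP_PUSH_FIRST) & 0xFFFFFFFF) <= EUP_PUSH_SPAN
    if (is_eup_push and to_op == EUP_POP) or is_pseudo_eup(from_op):
        # "if edge <= field: edge = field" is a clamp upward, i.e. a max.
        edge = max(edge, eup_latency)
    return edge


if __name__ == "__main__":
    for op in (0x127, 0x128, 0x13A, 0x13B):
        print(hex(op), jellyfish_edge(op, EUP_POP))
```

Only opcodes inside [0x128, 0x13a] paired with the 0x14e pop are raised to the clamp value; everything else keeps the base edge of 1 in this simplified model.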
The v4+ generations route through a per-instruction heap array instead. The structure is uniform across Pufferfish, Viperfish, and Ghostlite:
function LatencyBetweenInternal_modern(from, to):  // PF @0x1c8a2aa0 / VF @0x1c8a4ac0 / GL @0x1c8b22e0
    instr = Get<Gen>Instruction(from)  // LLO opcode → per-gen Instruction ordinal
    base = <Gen>Performance::GetLatency(instr)  // latencies[instr] from the heap array
    // a transpose-latch HALVING (base/2) applies ONLY to the matmul-latch region
    // (GL/VF opcodes 0x8d..0x96 / 0x73..0x7c) — the EUP push (0x128..0x13a) is excluded
    // XLU / MXU max-combines only RAISE base on a conflict — the EUP push has none
    return base  // EUP push: passes through unmodified
GetLatency is the same four-instruction lookup on every modern gen — a bounds check against the element count at Performance[+0x8], then latencies[instr] from the pointer at Performance[+0x0]:
function <Gen>Performance::GetLatency(perf, instr):  // VF @0x1c8cbc20, PF @0x1c8c3860, GL @0x1c8d36e0
    if perf.latency_count <= instr:  // [perf+0x8]
        trap()  // ud2 / BUG
    return perf.latency_ptr[instr]  // *(int32*)([perf+0x0] + instr*4)
GOTCHA: the EUP push is the latency-carrying instruction; the pop is a 1-cycle drain. The push→pop dependency edge is computed as LatencyBetweenInternal(first = push, second = pop), and the per-gen path evaluates GetLatency on the first operand, the push. So the edge weight equals the push latency (6/7/13/14). The 0x14e pop's own latencies[pop] = 1 is not the push→pop gap; it is the latency the drained value carries to whatever instruction consumes it downstream. A reimplementer who reads GetLatency(pop) for the push→pop edge gets 1 and schedules the pop one bundle after the push, where it reads a result that has not drained yet.
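The difference between reading the push cell and the pop cell can be made concrete with a toy model of the modern path. This is a sketch under assumptions: the `Performance` class, `push_pop_edge`, and the array size 0x80 are hypothetical; the Pufferfish ordinals and values (push 0x67..0x6c = 7, pop 0x76 = 1) are the ones quoted under Per-Gen Latency Values.

```python
class Performance:
    """Hypothetical stand-in for a v4+ per-gen Performance object."""

    def __init__(self, count, default=0):
        self.latencies = [default] * count   # flat latencies[Instruction] array

    def get_latency(self, instr):
        # Bounds check against the element count; the binary traps here instead.
        if not 0 <= instr < len(self.latencies):
            raise IndexError("instruction ordinal out of range")
        return self.latencies[instr]


# Pufferfish cells quoted on this page; 0x80 is an arbitrary size for the sketch.
pf = Performance(count=0x80)
for ordinal in range(0x67, 0x6D):
    pf.latencies[ordinal] = 7
pf.latencies[0x76] = 1

PUSH_ORDINAL = 0x67
POP_ORDINAL = 0x76


def push_pop_edge(perf, push_instr, pop_instr):
    # The edge weight comes from the FIRST operand (the push).
    # pop_instr is deliberately unused for the push->pop gap.
    return perf.get_latency(push_instr)


if __name__ == "__main__":
    print("correct edge:", push_pop_edge(pf, PUSH_ORDINAL, POP_ORDINAL))
    print("wrong read  :", pf.get_latency(POP_ORDINAL))
```

Reading the pop cell yields the downstream-consumer latency, which is why the wrong read collapses the window to a single bundle.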
Per-Gen Latency Values
The push value is uniform across every classified EUP function on a given generation — tanh = pow2 = recip = log2 = rsqrt = sigshft = sinq = cosq = erf — so the EUP unit has a single transcendental latency per datatype, not a per-function table. Ghostlite is the only generation that splits F32 from BF16 (it keeps the 16-bit BF16 lane, so the BF16 transcendental costs one extra cycle); Viperfish and Pufferfish classify only the F32 pushes and widen BF16 EUP to the F32 push, so they have one latency.
| Gen | TpuVersion | EUP push latency | Pop latency | Mechanism | Byte anchor |
|---|---|---|---|---|---|
| Jellyfish | v2 | 4 | (clamp) | Performance[+0x30] = 4, clamped in LatencyBetweenInternal | clamp field +0x1c |
| Dragonfish | v3 | 4 | (clamp) | inherits JF +0x30 = 4 | (= JF) |
| Pufferfish | v4 | 7 | 1 | PufferfishPerformance latencies[] | ctor [ptr+0x19c..0x1b0]=7, [+0x1d8]=1 |
| Viperfish | v5p | 6 | 1 | ViperfishPerformance latencies[] | ctor [ptr+0x330..0x348]=6, [+0x5a0]=1 |
| Ghostlite | v6e | 13 (F32) / 14 (BF16) | 1 | GhostlitePerformance latencies[] | ctor [ptr+0x418..0x43c]=0xd, [+0x440..0x460]=0xe, [+0x710]=1 |
The byte anchors above were confirmed cell-by-cell against the constructors: Pufferfish's PufferfishPerformance::PufferfishPerformance @0x1c8be080 stores latencies[0x67..0x6c] = 7 (six EUP-push ordinals) and latencies[0x76] = 1 (the pop ordinal); Ghostlite's ctor stores latencies[0x106..0x10f] = 13 (ten F32 pushes), latencies[0x110..0x118] = 14 (nine BF16 pushes), latencies[0x1c4] = 1; Viperfish's stores latencies[0xcc..0xd2] = 6 and latencies[0x168] = 1. The per-gen opcode→Instruction classifier that turns an LLO push opcode into the array index, and the full nine-function block, are owned by EUP Per-Gen Integers.
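The byte anchors and the ordinal ranges are related by the int32 element size: byte offset = ordinal × 4. The following sketch only re-derives the numbers already quoted in the table and paragraph above; `byte_offset` and the `blocks` dictionary are illustrative names, not symbols from the binary.

```python
LATENCY_CELL_BYTES = 4   # latencies[] holds int32 elements


def byte_offset(ordinal):
    """Byte offset of latencies[ordinal] from the array base pointer."""
    return ordinal * LATENCY_CELL_BYTES


# (first ordinal, last ordinal) of each EUP push block, as quoted above.
blocks = {
    "PF push": (0x67, 0x6C),
    "VF push": (0xCC, 0xD2),
    "GL F32 push": (0x106, 0x10F),
    "GL BF16 push": (0x110, 0x118),
}

# Pop ordinals per generation, as quoted above.
pops = {"PF": 0x76, "VF": 0x168, "GL": 0x1C4}

if __name__ == "__main__":
    for name, (first, last) in blocks.items():
        span = f"[{byte_offset(first):#x}..{byte_offset(last):#x}]"
        print(f"{name}: {span}, {last - first + 1} ordinals")
    for gen, ordinal in pops.items():
        print(f"{gen} pop: [{byte_offset(ordinal):#x}]")
```

Running the same multiplication on any other ordinal gives the store offset to look for in a constructor's disassembly.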
QUIRK: Pufferfish's latency table is a std::variant<PufferfishPerformance, PufferfishBarnaCorePerformance> dispatched by a 2-arm __fmatrix visitor (LatencyFromInstruction @0x21c203d0); the high 16 bits of GetPufferfishInstruction's return select the variant (shr r14d,0x10 then call [fmatrix + idx*8]). Every EUP opcode emits variant 0 (TensorCore), so the EUP edge is the TensorCore array = 7, not the BarnaCore variant, which carries its own lower-latency EUP block (= 6 at ordinals 0x77..0x7c, byte-confirmed in the PufferfishBarnaCorePerformance ctor @0x1c8c38c0) for the legacy embedding engine. Pufferfish ships both because BarnaCore retires only after Pufferfish; the TensorCore transcendental never reaches the 6-cycle BarnaCore block.
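The variant selection can be sketched as a two-entry dispatch table indexed by the high half of the classifier's return value. Treat this as illustrative only: the source confirms that the high 16 bits select the variant, but the assumption that the low 16 bits carry the Instruction ordinal is mine, and `split_instruction`, `latency_from_instruction`, and the two dictionaries are hypothetical names.

```python
def split_instruction(ret):
    """Split a 32-bit classifier return into (variant, ordinal)."""
    variant = (ret >> 16) & 0xFFFF   # mirrors `shr r14d, 0x10`
    ordinal = ret & 0xFFFF           # ASSUMPTION: ordinal lives in the low half
    return variant, ordinal


# EUP blocks quoted on this page: TensorCore = 7, BarnaCore = 6.
tensorcore_latencies = {o: 7 for o in range(0x67, 0x6D)}
barnacore_latencies = {o: 6 for o in range(0x77, 0x7D)}

# Two-arm visitor table: index 0 = TensorCore, index 1 = BarnaCore.
visitors = [tensorcore_latencies.__getitem__, barnacore_latencies.__getitem__]


def latency_from_instruction(ret):
    variant, ordinal = split_instruction(ret)
    return visitors[variant](ordinal)


if __name__ == "__main__":
    print(latency_from_instruction(0x0000_0067))   # TensorCore EUP push
    print(latency_from_instruction(0x0001_0077))   # BarnaCore EUP push
```

Because every EUP opcode on the TensorCore path yields variant 0, only the first table entry is ever reached for the push→pop edge discussed here.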
Function Map
| Function | Address | Role |
|---|---|---|
| LatencyTable::LatencyBetween | 0x1c89f820 | edge dispatcher; calls virtual LatencyBetweenInternal, floors only 0x82/0x84 |
| LatencyTableJellyfish::LatencyBetweenInternal | 0x1c8a0d60 | JF/DF EUP clamp to field +0x1c (= Performance[+0x30] = 4) |
| LatencyTablePufferfish::LatencyBetweenInternal | 0x1c8a2aa0 | PF path; __fmatrix variant dispatch (EUP = variant 0) |
| LatencyTableViperfish::LatencyBetweenInternal | 0x1c8a4ac0 | VF path; GetViperfishInstruction → GetLatency |
| LatencyTableGhostlite::LatencyBetweenInternal | 0x1c8b22e0 | GL path; GetGhostliteInstruction → GetLatency(this+0x1d0) |
| <Gen>Performance::GetLatency | 0x1c8c3860 (PF), 0x1c8cbc20 (VF), 0x1c8d36e0 (GL) | bounds-checked latencies[Instruction] |
| PufferfishPerformance ctor | 0x1c8be080 | fills EUP=7 (TensorCore variant 0), pop=1 |
| ViperfishPerformance ctor | 0x1c8c4840 | fills EUP=6, pop=1 |
| GhostlitePerformance ctor | 0x1c8cbc80 | fills F32 EUP=13, BF16=14, pop=1 |
Latency vs Reservation: Why They Do Not Multiply
Purpose
The push→pop pair is bounded by two independent quantities read from two different arrays. Conflating them is the single most common way to mis-schedule a transcendental, so the cost model keeps them physically separate: the data latency (push→pop deadline) lives in latencies[Instruction]; the issue reservation (push→push throughput) comes from VectorEupReservationCycles and the per-instruction resourceUsage grid. A reimplementer who multiplies them over-counts every EUP-bound schedule on Pufferfish.
The Edge Passes Through Unmodified
LatencyBetween does not scale the edge by any reservation field. It calls the per-gen LatencyBetweenInternal, optionally adds uniform-random jitter (a noise model gated on a flag at [this+0x10]), then special-cases only the matres/transpose opcodes — 0x82 (matres) clamps an edge below 3 up to 2, and 0x84 (transpose) takes an MXU AutoProto floor — and otherwise returns the internal value verbatim. The EUP push is neither 0x82 nor 0x84, so its edge passes through untouched:
function LatencyTable::LatencyBetween(this, from, to):  // @0x1c89f820
    edge = this->LatencyBetweenInternal(from, to)  // virtual, vtable +0x18
    if noise_model_enabled([this+0x10]):
        edge += UniformRandom(0, 101)  // optional scheduling jitter
    if from.opcode == 0x84 && to.opcode == 0x84:  // transpose-only MXU floor
        edge = max(edge, MxuProtoFloor(from))
    else if from.opcode == 0x82 && (to.opcode - 0x82) <= 2 && edge < 3:
        edge = 2  // matres-only floor
    return edge  // EUP push: returned unchanged
    // NO multiply by VectorEupReservationCycles anywhere on this path
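A runnable rendering of the dispatcher makes the pass-through property easy to test. This is a sketch under stated assumptions: `latency_between`, the `internal_edge` callback, and the `mxu_proto_floor` stub are hypothetical; the jitter is modelled as an integer in the half-open range [0, 101), which is a guess at what UniformRandom(0, 101) means; and the `to.opcode - 0x82 <= 2` test is read as an unsigned range check.

```python
import random

MATRES = 0x82
TRANSPOSE = 0x84


def latency_between(internal_edge, from_op, to_op, *, noise=False,
                    mxu_proto_floor=lambda op: 0, rng=random):
    """Sketch of the edge dispatcher; internal_edge plays LatencyBetweenInternal."""
    edge = internal_edge(from_op, to_op)
    if noise:
        edge += rng.randrange(0, 101)          # ASSUMPTION: half-open integer jitter
    if from_op == TRANSPOSE and to_op == TRANSPOSE:
        edge = max(edge, mxu_proto_floor(from_op))
    elif from_op == MATRES and 0 <= to_op - MATRES <= 2 and edge < 3:
        edge = 2                               # matres-only floor
    return edge                                # no reservation factor anywhere


if __name__ == "__main__":
    pufferfish_eup = lambda f, t: 7            # push latency from the per-gen table
    print(latency_between(pufferfish_eup, 0x128, 0x14E))
```

With the noise model off, an EUP push (0x128..0x13a) to the 0x14e pop returns exactly what the per-gen function produced, since neither special case matches.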
The Composition
The reservation is the EUP unit's issue occupancy — how many bundles the unit stays reserved after a push, applied by the bundle packer's per-instruction resource model (the resourceUsage grid plus the SlotTracker), not by the latency edge. The two constraints compose as a max:
| Quantity | Bounds | Source | PF | VF | GL (F32/BF16) | JF |
|---|---|---|---|---|---|---|
| push→pop data latency | min bundles push → its drain | latencies[Get<Gen>Instr(push)] | 7 | 6 | 13 / 14 | 4 (clamp) |
| pop drain latency | latency the drained value carries | latencies[pop Instr] | 1 | 1 | 1 | — |
| VectorEupReservationCycles | min bundles push → next push | Target accessor (+0x480) | 2 | 1 | 1 | 1 |
bundle(pop) >= bundle(push) + latency // deadline (e.g. push+7 on PF)
bundle(push[i+1]) >= bundle(push[i]) + VectorEupReservationCycles // throughput (e.g. +2 on PF)
// → both constraints hold independently; the binding one is a max(...), never a product
QUIRK: Pufferfish's half-rate EUP (reservation = 2) does not double the 7-cycle push→pop window. It halves the issue rate: one new push may enter the EUP every 2 bundles. The VALU-correction software-pipeline depth the late decomposer must fill is the latency (7), independent of the reservation. For a chain of N independent transcendentals on Pufferfish the EUP-bound schedule length is ≈ 2·(N−1) + 7 bundles (throughput-bound spacing between pushes, plus one final 7-cycle drain), rather than anything built from the product 7·2. On Viperfish and Ghostlite, where reservation = 1, the two constraints coincide for back-to-back pushes and only the latency tail (6 / 13 / 14) matters.
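The schedule-length estimate can be checked with a greedy placement of N independent push/pop pairs. This is a minimal sketch that models only the two constraints discussed here (push→push reservation and push→pop latency); it ignores slot conflicts, FIFO limits, and every other resource, and `schedule_independent` and `schedule_length` are illustrative names.

```python
def schedule_independent(n, latency, reservation):
    """Earliest-bundle placement of n independent push/pop pairs on one EUP."""
    if n < 1:
        raise ValueError("need at least one transcendental")
    pushes, pops = [], []
    for i in range(n):
        # Throughput constraint: each push waits out the previous reservation.
        push = 0 if i == 0 else pushes[-1] + reservation
        pushes.append(push)
        # Deadline constraint: the pop may not precede push + latency.
        pops.append(push + latency)
    return pushes, pops


def schedule_length(n, latency, reservation):
    """Bundle index of the last pop, i.e. reservation*(n-1) + latency."""
    _, pops = schedule_independent(n, latency, reservation)
    return pops[-1]


if __name__ == "__main__":
    for gen, lat, res in (("PF", 7, 2), ("VF", 6, 1), ("GL F32", 13, 1)):
        print(gen, [schedule_length(n, lat, res) for n in (1, 2, 4, 8)])
```

For Pufferfish with N = 4 the pushes land in bundles 0, 2, 4, 6 and the last pop in bundle 13, matching 2·(N−1) + 7.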
The VectorEupReservationCycles Accessor
VectorEupReservationCycles is a pure-virtual on Target at vtable slot +0x480. The four concrete generations return a single constant each — confirmed by reading the accessor bodies directly:
| Gen | Accessor | Returns |
|---|---|---|
| Jellyfish | JellyfishTarget::VectorEupReservationCycles @0x1d490660 | 1 |
| Pufferfish | PufferfishTarget::VectorEupReservationCycles @0x1d494cc0 | 2 |
| Viperfish | ViperfishTarget::VectorEupReservationCycles @0x1d49b060 | 1 |
| Ghostlite | GhostliteTarget::VectorEupReservationCycles @0x1d497ee0 | 1 |
NOTE: the +0x480 vtable slot collides with ~160 unrelated vtables (LLVM TTI cost functions, proto facades), so the single bundle-resource call site that consumes the reservation was not isolated to one instruction (HIGH that it is the resource-model caller, not the latency edge). The orthogonality itself (that the latency edge is provably not scaled by the reservation) is byte-confirmed from the LatencyBetween / LatencyBetweenInternal disassembly. Which of the per-gen Performance::Resource columns is the EUP unit, and whether its resourceUsage[push][EUP] cell equals VectorEupReservationCycles, was read structurally only; the EUP resource row was not isolated from the ctor's template-fill (LOW), so the reservation→grid binding is the one unverified link in the model.
The Result FIFO
Purpose
The latency edge bounds where a single pop may be placed; the result FIFO bounds how many pushed-but-not-yet-popped results may be in flight at once. The two together determine the maximum software-pipeline width the scheduler can open.
The Model
The EUP result FIFO is not a fixed compile-time depth. The hardware-state simulator proto holds it as a repeated EupResultFifoEntry eup_result_fifo_entries field — a runtime snapshot list of in-flight pushed results, each entry the per-lane result vector — so the depth is whatever the schedule has placed, not a libtpu literal. At schedule time the number of in-flight pushes is bounded by the latency edge plus the FIFO push/pop ordering (BaseFifoTracker<LloValue*>::FindBlockingPushesAndPops @0x14442f60, which calls LatencyTable::LatencyBetween for the FIFO-ordering edges), and at the v5+ bundle level by HasEupRestrictions forcing each push and its pop into separate bundles. The silicon FIFO's physical depth is a chip parameter and is not recoverable from this binary.
NOTE: the push and pop are FIFO-ordered: LloInstructionsPopAndThenPushSameFifo @0x1d4f3c80 head-checks that the pop drains the matching push. On the v5+ generations HasEupRestrictions is TRUE (Viperfish, Ghostlite), so the push and pop cannot co-issue; the late decomposer must place them in separate bundles with the latency gap between them filled by VALU correction or unrelated work. On Jellyfish and Pufferfish the restriction is FALSE, so a matching push+pop can re-fuse into a single AndPop bundle when the schedule allows; see EUP / Transcendental Slot.
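Since the FIFO depth is a property of the placed schedule rather than a constant, one way to reason about it is to count pushed-but-not-yet-popped results bundle by bundle. The sketch below is a toy model, not the tracker's algorithm: `peak_in_flight` is a hypothetical helper, and it assumes that when a pop and a push share a bundle the pop drains first.

```python
def peak_in_flight(push_bundles, pop_bundles):
    """Largest number of pushed-but-not-popped results for a given schedule."""
    if len(push_bundles) != len(pop_bundles):
        raise ValueError("every push needs exactly one pop")
    if any(pop < push for push, pop in zip(push_bundles, pop_bundles)):
        raise ValueError("a pop cannot precede its push")
    # Key 0 sorts pops before pushes inside the same bundle (ASSUMPTION).
    events = [(b, 1, +1) for b in push_bundles] + [(b, 0, -1) for b in pop_bundles]
    events.sort()
    depth = peak = 0
    for _, _, delta in events:
        depth += delta
        peak = max(peak, depth)
    return peak


if __name__ == "__main__":
    # Pufferfish-style steady state: push every 2 bundles, pop 7 bundles later.
    pushes = list(range(0, 12, 2))
    pops = [b + 7 for b in pushes]
    print(peak_in_flight(pushes, pops))
```

In this toy model a steady stream with latency 7 and reservation 2 keeps at most four results in flight, and latency 6 with reservation 1 keeps at most six; whether the silicon FIFO can hold that many is exactly the chip parameter the binary does not expose.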
Function Map
| Function | Address | Role |
|---|---|---|
| BaseFifoTracker<LloValue*>::FindBlockingPushesAndPops | 0x14442f60 | FIFO push/pop ordering edges (calls LatencyBetween) |
| LloInstructionsPopAndThenPushSameFifo | 0x1d4f3c80 | FIFO-ordering head-check pairing a pop to its push |
| EupResultFifoEntry (proto ctor) | 0x0e7a6cc0 | runtime repeated-message FIFO list (not a fixed depth) |
How the Cost Model Consumes the Edge
The EUP latency feeds two consumers. The dependency-graph scheduler uses LatencyBetween(push, pop) as the true-dependency edge weight, placing the pop no earlier than push_bundle + latency and opening exactly that many bundles for correction arithmetic. The bundle-cost transcendental estimate (the per-op cost the higher-level cost model attributes to a sin/cos/tanh/etc.) is sized to the same latency: a generation's transcendental estimate scales with its EUP push latency, so the cost model and the scheduler agree on the software-pipeline depth a transcendental opens. The two numerics a reimplementer needs for that estimate — the per-gen latency integer and the correction-window contents — are owned by the per-gen and coefficient pages.
The latency value is the depth the correction window fills, and the correction window is where every polynomial on the transcendental path lives. A tanh on a generation with a hardware EUP pushes raw and the result is taken directly; a recip/rsqrt may wrap a Newton refinement around the push; a sin/cos first runs Payne-Hanek range reduction to produce a pushable argument; an exp/ln on a generation or datatype lacking the hardware push falls back to a full VALU polynomial. Each of those occupies some of the N-bundle latency window, and the per-gen latency is the budget that window has. The numerics:
- The per-gen push→pop integers, the per-opcode → Performance::Instruction classifier, and the full nine-function block: EUP Per-Gen Integers.
- The Newton/rational/*NoEupF32 correction polynomials that fill the latency window: EUP Correction Coefficients.
- The trig argument reduction that precedes a sin/cos push: EUP Payne-Hanek Range Reduction.
- The packed-sub-lane unpack that determines the 1:N EUP fan-out: EUP Lane-Width Unpack.
Related Components
| Name | Relationship |
|---|---|
| Performance family | the per-gen object holding both latencies[] (this edge) and the resourceUsage[] grid (the reservation) |
| LatencyTable | the per-gen dispatcher that reads GetLatency for the push→pop edge |
| Target accessors | VectorEupReservationCycles (the orthogonal issue rate), HasEupRestrictions (forces push/pop apart) |
| Bundle packer / SlotTracker | applies the reservation as an EUP-resource occupancy, distinct from the latency edge |
Cross-References
- EUP / Transcendental Slot: the ISA-side push/pop encoding, the fused-vs-split duality, and the latency table from the encoding view
- EUP Per-Gen Integers: the per-gen push→pop latency integers, the opcode→Instruction classifier, and the full nine-function block
- EUP Correction Coefficients: the Newton/rational/*NoEupF32 polynomials that fill the latency window
- EUP Payne-Hanek Range Reduction: the 1/(2π) trig reduction that precedes a sin/cos push
- EUP Lane-Width Unpack: the packed-sub-lane 1:N fan-out that multiplies the number of pushes
- Performance Family Overview: the Performance object layout, GetLatency / GetResourceUsage read paths, and the JF/DF inline-POD vs heap-grid split
- VPU Slot: the VALU slot family the EUP push is restricted to (slot 3)