EUP Latency Overview

Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.

Abstract

The EUP — libtpu's transcendental approximator, the "XLU" of the VPU — is a deep, FIFO-buffered pipeline that a push enters in one bundle and a pop drains some bundles later. Because a TensorCore bundle is the issue packet and there is no runtime hazard interlock, the compiler must know exactly how many bundles separate a push from its drainable result, and it must place a kVectorEupResultValue pop no earlier than that. That separation is the EUP push→pop data latency, and it is the single number the bundle scheduler raises the push→pop dependency edge to. This page is the cost-model framing of that number: where it is stored, how it is read on each generation, how it composes with the orthogonal EUP issue reservation, and what a reimplementer must reproduce to schedule a transcendental correctly.

The latency model is the cost-side dual of the ISA-side push/pop encoding. The EUP / Transcendental Slot page owns how a transcendental is encoded (the Alu3 slot-3 push, the function selector, the 0x14e pop, the fused-AndPop-vs-split duality); this page owns when the pop may be scheduled. The mechanism is a per-generation cost table read by the dependency graph: LatencyTable::LatencyBetween(push, pop) calls a per-gen LatencyBetweenInternal, which on the v4+ generations classifies the LLO push opcode to a per-gen Performance::Instruction ordinal and reads Performance::GetLatency(Instruction) = latencies[Instruction] from a heap int32 array. The legacy Jellyfish/Dragonfish path does not have that array; it clamps the edge to a fixed Performance[+0x30] field instead. Either way, the value returned is the minimum bundle gap from the push to its pop — exactly the software-pipeline depth the interleaved VALU correction arithmetic must fill.

The number a reimplementer is tempted to confuse it with is VectorEupReservationCycles, the EUP unit's issue rate (how many bundles the unit stays reserved after a push, bounding back-to-back pushes). The two are read from different places, bound different gaps, and do not multiply: the latency is the push→pop deadline, the reservation is the push→push throughput, and the scheduler composes them as max(latency-deadline, resource-availability). Pufferfish is the one generation where they diverge — a 7-cycle latency with a half-rate (2-cycle) reservation — which is precisely the case that breaks a naive latency × reservation model. The per-gen latency integers, the correction polynomials, the Payne-Hanek reduction, and the lane-width fan-out each get their own page; this one explains the edge they all plug into.

For reimplementation, the contract is:

  • The two-array model. Each v4+ Performance object holds a flat latencies[Instruction] heap array (the data-latency edge) and a resourceUsage[Instruction][Resource] 2D grid (the issue reservation). GetLatency reads the first; the reservation comes from the second plus the VectorEupReservationCycles accessor. They are physically distinct arrays.
  • The read path. LatencyBetween → per-gen LatencyBetweenInternal → Get<Gen>Instruction(push) → GetLatency(Instruction), with the EUP push not in the transpose/MXU floor set, so the raw value passes through unmodified.
  • The per-gen latency integers (PF 7, VF 6, GL 13 F32 / 14 BF16, JF/DF 4-clamp) and the uniform-across-functions, split-by-datatype structure — owned in full by EUP Per-Gen Integers.
  • The reservation orthogonality. VectorEupReservationCycles (PF 2, VF/GL/JF 1) bounds push→push spacing, not the push→pop window; the composition is a max, not a product.

| Item | Value |
|---|---|
| Edge dispatcher | LatencyTable::LatencyBetween @0x1c89f820; virtual LatencyBetweenInternal at vtable +0x18 |
| v4+ read path | Get<Gen>Instruction(push) → <Gen>Performance::GetLatency(Instr) = latencies[Instr] |
| GetLatency | VF @0x1c8cbc20, PF @0x1c8c3860, GL @0x1c8d36e0; identical bounds-checked latencies[idx] |
| Push→pop latency | JF/DF 4 (clamp); PF 7; VF 6; GL 13 (F32) / 14 (BF16); pop drains at latency 1 |
| JF/DF clamp | LatencyTableJellyfish::LatencyBetweenInternal @0x1c8a0d60; from ∈ [0x128, 0x13a] ∧ to == 0x14e → field +0x1c |
| Issue reservation | VectorEupReservationCycles (Target vtable +0x480): JF/VF/GL = 1, PF = 2 (half-rate) |
| Composition | max(push_bundle + latency, prev_push + reservation); latency ⊥ reservation, never a product |
| Software-pipeline depth | the latency (PF 7), independent of the reservation |

The Push→Pop Latency Edge

Purpose

A transcendental is a two-instruction sequence: a push that reserves the EUP and a pop that retrieves the result N bundles later. The scheduler must not place the pop before push_bundle + N, or it would read an in-flight (not-yet-drained) result. N is the push→pop data latency — the EUP pipeline depth — and it is the dependency-graph edge weight from push to pop. The intervening N bundles are where the late decomposer interleaves VALU correction arithmetic (Newton steps, range reconstruction) or unrelated work, exactly as a CPU back end hides a long-latency divide. Because the bundle is the issue packet, this latency is not a runtime hazard the hardware stalls on; it is a compile-time scheduling constraint the cost model encodes.

The Two Mechanisms

There are two ways the edge is computed, split at the Jellyfish/Pufferfish silicon boundary. The legacy path clamps; the modern path reads a per-instruction array.

function LatencyBetweenInternal_jellyfish(from, to):   // @0x1c8a0d60 (JF/DF)
    edge = 1
    // ... RPU / IndexedStore / matres floors elided ...
    is_eup_push = (from.opcode - 0x128) <= 0x12          // from ∈ [0x128, 0x13a]
    if (is_eup_push && to.opcode == 0x14e)               // 0x14e = kVectorEupResultValue (pop)
       || LloOpcodeIsPseudoEupInstruction(from.opcode):  // or a fused AndPop op
        if edge <= this[+0x1c]:                          // clamp UP to the EUP-latency field
            edge = this[+0x1c]                            // = Performance[+0x30] = 4
    return edge

The clamp field this+0x1c is copied by the LatencyTableJellyfish ctor from Performance[+0x30] = 4 — the flat-POD generations have no per-function transcendental latency, just one constant. (Performance[+0x30] lives in the JF/DF inline-POD model of the Performance Family Overview; on Jellyfish it is one of the .rodata-block-copied cells, not a scalar store.)
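The clamp above can be restated as a minimal Python sketch. It models only the EUP branch: the elided RPU / IndexedStore / matres floors are not reproduced, and the `is_pseudo_eup` flag is a hypothetical stand-in for the result of LloOpcodeIsPseudoEupInstruction, whose opcode set is not listed on this page.

```python
EUP_PUSH_FIRST = 0x128   # first opcode of the EUP push range quoted above
EUP_PUSH_LAST = 0x13A    # last opcode of the EUP push range
EUP_POP = 0x14E          # kVectorEupResultValue
JF_EUP_LATENCY = 4       # Performance[+0x30], copied into the table's +0x1c field


def jellyfish_edge(from_opcode, to_opcode, base_edge=1, is_pseudo_eup=False):
    """Legacy JF/DF push->pop edge; every non-EUP floor is deliberately omitted."""
    is_eup_push = EUP_PUSH_FIRST <= from_opcode <= EUP_PUSH_LAST
    if (is_eup_push and to_opcode == EUP_POP) or is_pseudo_eup:
        # "if edge <= field: edge = field" is a clamp upward, i.e. a max
        return max(base_edge, JF_EUP_LATENCY)
    return base_edge
```

Under these assumptions `jellyfish_edge(0x128, 0x14E)` yields 4, while any pair outside the push/pop pattern keeps the default edge of 1.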

The v4+ generations route through a per-instruction heap array instead. The structure is uniform across Pufferfish, Viperfish, and Ghostlite:

function LatencyBetweenInternal_modern(from, to):   // PF @0x1c8a2aa0 / VF @0x1c8a4ac0 / GL @0x1c8b22e0
    instr = Get<Gen>Instruction(from)                // LLO opcode → per-gen Instruction ordinal
    base  = <Gen>Performance::GetLatency(instr)      // latencies[instr] from the heap array
    // a transpose-latch HALVING (base/2) applies ONLY to the matmul-latch region
    //   (GL/VF opcodes 0x8d..0x96 / 0x73..0x7c) — the EUP push (0x128..0x13a) is excluded
    // XLU / MXU max-combines only RAISE base on a conflict — the EUP push has none
    return base                                      // EUP push: passes through unmodified

GetLatency is the same four-instruction lookup on every modern gen — a bounds check against the element count at Performance[+0x8], then latencies[instr] from the pointer at Performance[+0x0]:

function <Gen>Performance::GetLatency(perf, instr):   // VF @0x1c8cbc20, PF @0x1c8c3860, GL @0x1c8d36e0
    if perf.latency_count <= instr:    // [perf+0x8]
        trap()                          // ud2 / BUG
    return perf.latency_ptr[instr]      // *(int32*)([perf+0x0] + instr*4)

GOTCHA — the EUP push is the latency-carrying instruction; the pop is a 1-cycle drain. The push→pop dependency edge is computed as LatencyBetweenInternal(first = push, second = pop), and the per-gen path evaluates GetLatency on the first operand — the push. So the edge weight equals the push latency (6/7/13/14). The 0x14e pop's own latencies[pop] = 1 is not the push→pop gap; it is the latency the drained value carries to whatever instruction consumes it downstream. A reimplementer who reads GetLatency(pop) for the push→pop edge gets 1 and schedules the pop one bundle after the push — reading garbage.
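To make the push-versus-pop distinction concrete, here is a small Python sketch of the lookup using the Pufferfish ordinals quoted later on this page (pushes at 0x67..0x6c, pop at 0x76). The array size and the zero placeholders in every other cell are assumptions made for illustration only; they are not values from the binary.

```python
PF_PUSH_ORDINALS = range(0x67, 0x6D)   # six EUP-push ordinals, 0x67..0x6c
PF_POP_ORDINAL = 0x76                  # ordinal of the 0x14e pop


def build_pufferfish_latencies(size=0x80):
    """Toy latencies[] array; size and non-EUP cells are placeholders."""
    table = [0] * size
    for ordinal in PF_PUSH_ORDINALS:
        table[ordinal] = 7
    table[PF_POP_ORDINAL] = 1
    return table


def get_latency(table, instr):
    """Bounds-checked read, standing in for the count check plus trap."""
    if not 0 <= instr < len(table):
        raise IndexError(f"instruction ordinal {instr:#x} out of range")
    return table[instr]


def push_pop_edge(table, push_ordinal, pop_ordinal):
    """Edge weight is keyed on the first operand; pop_ordinal is ignored on purpose."""
    return get_latency(table, push_ordinal)
```

With this table, `push_pop_edge(table, 0x67, 0x76)` returns 7, whereas `get_latency(table, 0x76)` returns 1, which is the mistaken value a pop-keyed lookup would produce.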

Per-Gen Latency Values

The push value is uniform across every classified EUP function on a given generation — tanh = pow2 = recip = log2 = rsqrt = sigshft = sinq = cosq = erf — so the EUP unit has a single transcendental latency per datatype, not a per-function table. Ghostlite is the only generation that splits F32 from BF16 (it keeps the 16-bit BF16 lane, so the BF16 transcendental costs one extra cycle); Viperfish and Pufferfish classify only the F32 pushes and widen BF16 EUP to the F32 push, so they have one latency.

| Gen | TPU version | EUP push latency | Pop latency | Mechanism | Byte anchor |
|---|---|---|---|---|---|
| Jellyfish | v2 | 4 | (clamp) | Performance[+0x30] = 4, clamped in LatencyBetweenInternal | clamp field +0x1c |
| Dragonfish | v3 | 4 | (clamp) | inherits JF +0x30 = 4 | (= JF) |
| Pufferfish | v4 | 7 | 1 | PufferfishPerformance latencies[] | ctor [ptr+0x19c..0x1b0]=7, [+0x1d8]=1 |
| Viperfish | v5p | 6 | 1 | ViperfishPerformance latencies[] | ctor [ptr+0x330..0x348]=6, [+0x5a0]=1 |
| Ghostlite | v6e | 13 (F32) / 14 (BF16) | 1 | GhostlitePerformance latencies[] | ctor [ptr+0x418..0x43c]=0xd, [+0x440..0x460]=0xe, [+0x710]=1 |

The byte anchors above were confirmed cell-by-cell against the constructors: Pufferfish's PufferfishPerformance::PufferfishPerformance @0x1c8be080 stores latencies[0x67..0x6c] = 7 (six EUP-push ordinals) and latencies[0x76] = 1 (the pop ordinal); Ghostlite's ctor stores latencies[0x106..0x10f] = 13 (ten F32 pushes), latencies[0x110..0x118] = 14 (nine BF16 pushes), latencies[0x1c4] = 1; Viperfish's stores latencies[0xcc..0xd2] = 6 and latencies[0x168] = 1. The per-gen opcode→Instruction classifier that turns an LLO push opcode into the array index, and the full nine-function block, are owned by EUP Per-Gen Integers.
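The byte anchors and the ordinal ranges are related by the int32 element size: ordinal = byte offset / 4. The short Python check below redoes that arithmetic for the ranges quoted above. The helper name `ordinal_range` is illustrative and not a symbol from the binary.

```python
def ordinal_range(first_byte, last_byte, elem_size=4):
    """Map an inclusive byte-offset range onto (first_ordinal, last_ordinal, count)."""
    if first_byte % elem_size or last_byte % elem_size:
        raise ValueError("offsets must be multiples of the element size")
    first, last = first_byte // elem_size, last_byte // elem_size
    return first, last, last - first + 1


# Inclusive byte ranges copied from the byte-anchor column of the table.
ANCHORS = {
    "PF push": (0x19C, 0x1B0),
    "VF push": (0x330, 0x348),
    "GL F32 push": (0x418, 0x43C),
    "GL BF16 push": (0x440, 0x460),
}

for name, (lo, hi) in ANCHORS.items():
    first, last, count = ordinal_range(lo, hi)
    print(f"{name}: latencies[{first:#x}..{last:#x}], {count} cells")
```

The same division maps the pop anchors 0x1d8, 0x5a0, and 0x710 to ordinals 0x76, 0x168, and 0x1c4, matching the constructor stores described above.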

QUIRK — Pufferfish's latency table is a std::variant<PufferfishPerformance, PufferfishBarnaCorePerformance> dispatched by a 2-arm __fmatrix visitor (LatencyFromInstruction @0x21c203d0); the high 16 bits of GetPufferfishInstruction's return select the variant (shr r14d,0x10 then call [fmatrix + idx*8]). Every EUP opcode emits variant 0 (TensorCore), so the EUP edge is the TensorCore array = 7 — not the BarnaCore variant, which carries its own lower-latency EUP block (= 6 at ordinals 0x77..0x7c, byte-confirmed in the PufferfishBarnaCorePerformance ctor @0x1c8c38c0) for the legacy embedding engine. Pufferfish ships both because BarnaCore retires only after Pufferfish; the TensorCore transcendental never reaches the 6-cycle BarnaCore block.
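A rough Python sketch of that two-arm dispatch follows. Only the high-16-bit variant selector is taken from the text above; treating the low 16 bits as the array ordinal is an assumption, and the dictionaries hold just the EUP cells named on this page rather than the full arrays.

```python
TENSORCORE, BARNACORE = 0, 1   # variant indices, in std::variant declaration order

# Only the EUP push cells are modeled; everything else is left out.
TABLES = {
    TENSORCORE: {ordinal: 7 for ordinal in range(0x67, 0x6D)},
    BARNACORE: {ordinal: 6 for ordinal in range(0x77, 0x7D)},
}


def split_instruction(packed):
    """High 16 bits pick the variant arm; low 16 bits as ordinal is an assumption."""
    return packed >> 16, packed & 0xFFFF


def pufferfish_latency(packed, tables=TABLES):
    """Dispatch on the variant index, then read that variant's table."""
    variant, ordinal = split_instruction(packed)
    return tables[variant][ordinal]
```

Because every EUP opcode is described as emitting variant 0, a packed value such as `(TENSORCORE << 16) | 0x67` reads 7; the 6-cycle cell is reachable only through the BarnaCore arm.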

Function Map

| Function | Address | Role |
|---|---|---|
| LatencyTable::LatencyBetween | 0x1c89f820 | edge dispatcher; calls virtual LatencyBetweenInternal, floors only 0x82/0x84 |
| LatencyTableJellyfish::LatencyBetweenInternal | 0x1c8a0d60 | JF/DF EUP clamp to field +0x1c (= Performance[+0x30] = 4) |
| LatencyTablePufferfish::LatencyBetweenInternal | 0x1c8a2aa0 | PF path; __fmatrix variant dispatch (EUP = variant 0) |
| LatencyTableViperfish::LatencyBetweenInternal | 0x1c8a4ac0 | VF path; GetViperfishInstruction → GetLatency |
| LatencyTableGhostlite::LatencyBetweenInternal | 0x1c8b22e0 | GL path; GetGhostliteInstruction → GetLatency(this+0x1d0) |
| <Gen>Performance::GetLatency | 0x1c8c3860 (PF), 0x1c8cbc20 (VF), 0x1c8d36e0 (GL) | bounds-checked latencies[Instruction] |
| PufferfishPerformance ctor | 0x1c8be080 | fills EUP=7 (TensorCore variant 0), pop=1 |
| ViperfishPerformance ctor | 0x1c8c4840 | fills EUP=6, pop=1 |
| GhostlitePerformance ctor | 0x1c8cbc80 | fills F32 EUP=13, BF16=14, pop=1 |

Latency vs Reservation: Why They Do Not Multiply

Purpose

The push→pop pair is bounded by two independent quantities read from two different arrays. Conflating them is the single most common way to mis-schedule a transcendental, so the cost model keeps them physically separate: the data latency (push→pop deadline) lives in latencies[Instruction]; the issue reservation (push→push throughput) comes from VectorEupReservationCycles and the per-instruction resourceUsage grid. A reimplementer who multiplies them over-counts every EUP-bound schedule on Pufferfish.

The Edge Passes Through Unmodified

LatencyBetween does not scale the edge by any reservation field. It calls the per-gen LatencyBetweenInternal, optionally adds uniform-random jitter (a noise model gated on a flag at [this+0x10]), and then special-cases only the matres/transpose opcodes: a 0x82 (matres) producer whose consumer opcode lies in 0x82..0x84 has any edge below 3 set to 2, and a 0x84 (transpose) pair takes an MXU AutoProto floor. Every other edge is returned as computed. The EUP push is neither 0x82 nor 0x84, so its edge passes through untouched:

function LatencyTable::LatencyBetween(this, from, to):   // @0x1c89f820
    edge = this->LatencyBetweenInternal(from, to)        // virtual, vtable +0x18
    if noise_model_enabled([this+0x10]):
        edge += UniformRandom(0, 101)                    // optional scheduling jitter
    if from.opcode == 0x84 && to.opcode == 0x84:         // transpose-only MXU floor
        edge = max(edge, MxuProtoFloor(from))
    else if from.opcode == 0x82 && (to.opcode - 0x82) <= 2 && edge < 3:
        edge = 2                                          // matres-only floor
    return edge                                          // EUP push: returned unchanged
    // NO multiply by VectorEupReservationCycles anywhere on this path
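The same control flow, as a hedged Python sketch that takes the internal edge as an input instead of performing the virtual call. The `mxu_floor` parameter is a hypothetical stand-in for the MXU AutoProto floor, and treating UniformRandom(0, 101) as a half-open integer range is an assumption about its semantics.

```python
import random

MATRES = 0x82      # matres opcode
TRANSPOSE = 0x84   # transpose opcode


def latency_between(internal_edge, from_opcode, to_opcode,
                    noise=False, mxu_floor=0, rng=random):
    """Dispatcher sketch: optional jitter, two special cases, else pass-through."""
    edge = internal_edge
    if noise:
        edge += rng.randrange(0, 101)          # assumed half-open [0, 101)
    if from_opcode == TRANSPOSE and to_opcode == TRANSPOSE:
        edge = max(edge, mxu_floor)            # transpose-only floor
    elif from_opcode == MATRES and 0 <= to_opcode - MATRES <= 2 and edge < 3:
        edge = 2                               # matres-only case
    return edge                                # no reservation term anywhere
```

With noise disabled, an EUP push at 0x128 paired with the 0x14e pop returns whatever the per-gen table supplied, for example `latency_between(7, 0x128, 0x14E) == 7`.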

The Composition

The reservation is the EUP unit's issue occupancy — how many bundles the unit stays reserved after a push, applied by the bundle packer's per-instruction resource model (the resourceUsage grid plus the SlotTracker), not by the latency edge. The two constraints compose as a max:

| Quantity | Bounds | Source | PF | VF | GL (F32/BF16) | JF |
|---|---|---|---|---|---|---|
| push→pop data latency | min bundles push → its drain | latencies[Get<Gen>Instr(push)] | 7 | 6 | 13 / 14 | 4 (clamp) |
| pop drain latency | latency the drained value carries | latencies[pop Instr] | 1 | 1 | 1 | |
| VectorEupReservationCycles | min bundles push → next push | Target accessor (+0x480) | 2 | 1 | 1 | 1 |

bundle(pop)      >= bundle(push)   + latency                  // deadline  (e.g. push+7 on PF)
bundle(push_i+1) >= bundle(push_i) + VectorEupReservationCycles // throughput (e.g. +2 on PF)
                                                              // → effective gap = max(...), not product

QUIRK — Pufferfish's half-rate EUP (reservation = 2) does not double the 7-cycle push→pop window. It halves the issue rate: one new push may enter the EUP every 2 bundles. The VALU-correction software-pipeline depth the late decomposer must fill is the latency (7), independent of the reservation. For a chain of N independent transcendentals on Pufferfish the EUP-bound schedule length is ≈ 2·(N−1) + 7 bundles (throughput-bound spacing between pushes, plus one final 7-cycle drain), not 7·2. On Viperfish and Ghostlite, where reservation = 1, the two constraints coincide for back-to-back pushes and only the latency tail (6 / 13 / 14) matters.
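The schedule-length arithmetic can be checked with a few lines of Python. This sketch models only the two EUP constraints (push spacing and the push→pop deadline) and assumes the first push issues in bundle 0; slot pressure, VALU correction work, and FIFO limits are ignored.

```python
def eup_bound_schedule(n, latency, reservation):
    """Earliest push and pop bundles for n independent transcendentals."""
    pushes = [i * reservation for i in range(n)]   # throughput constraint
    pops = [p + latency for p in pushes]           # deadline constraint
    return pushes, pops


def schedule_length(n, latency, reservation):
    """Bundle index of the last pop, i.e. reservation*(n-1) + latency."""
    if n == 0:
        return 0
    _, pops = eup_bound_schedule(n, latency, reservation)
    return pops[-1]


print(schedule_length(4, latency=7, reservation=2))   # Pufferfish: 2*3 + 7 = 13
print(schedule_length(4, latency=6, reservation=1))   # Viperfish:  1*3 + 6 = 9
```

For four pushes on Pufferfish the last pop lands at bundle 13, well short of what multiplying latency by reservation for each push would suggest.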

The VectorEupReservationCycles Accessor

VectorEupReservationCycles is a pure-virtual on Target at vtable slot +0x480. The four concrete generations return a single constant each — confirmed by reading the accessor bodies directly:

| Gen | Accessor | Returns |
|---|---|---|
| Jellyfish | JellyfishTarget::VectorEupReservationCycles @0x1d490660 | 1 |
| Pufferfish | PufferfishTarget::VectorEupReservationCycles @0x1d494cc0 | 2 |
| Viperfish | ViperfishTarget::VectorEupReservationCycles @0x1d49b060 | 1 |
| Ghostlite | GhostliteTarget::VectorEupReservationCycles @0x1d497ee0 | 1 |

NOTE — the +0x480 vtable slot collides with ~160 unrelated vtables (LLVM TTI cost functions, proto facades), so the single bundle-resource call site that consumes the reservation was not isolated to one instruction (HIGH that it is the resource-model caller, not the latency edge). The orthogonality itself — that the latency edge is provably not scaled by the reservation — is byte-confirmed from the LatencyBetween/LatencyBetweenInternal disassembly. Which of the per-gen Performance::Resource columns is the EUP unit, and whether its resourceUsage[push][EUP] cell equals VectorEupReservationCycles, was read structurally only; the EUP resource row was not isolated from the ctor's template-fill (LOW), so the reservation→grid binding is the one unverified link in the model.


The Result FIFO

Purpose

The latency edge bounds where a single pop may be placed; the result FIFO bounds how many pushed-but-not-yet-popped results may be in flight at once. The two together determine the maximum software-pipeline width the scheduler can open.

The Model

The EUP result FIFO is not a fixed compile-time depth. The hardware-state simulator proto holds it as a repeated EupResultFifoEntry eup_result_fifo_entries field — a runtime snapshot list of in-flight pushed results, each entry the per-lane result vector — so the depth is whatever the schedule has placed, not a libtpu literal. At schedule time the number of in-flight pushes is bounded by the latency edge plus the FIFO push/pop ordering (BaseFifoTracker<LloValue*>::FindBlockingPushesAndPops @0x14442f60, which calls LatencyTable::LatencyBetween for the FIFO-ordering edges), and at the v5+ bundle level by HasEupRestrictions forcing each push and its pop into separate bundles. The silicon FIFO's physical depth is a chip parameter and is not recoverable from this binary.

NOTE — the push and pop are FIFO-ordered: LloInstructionsPopAndThenPushSameFifo @0x1d4f3c80 head-checks that the pop drains the matching push. On the v5+ generations HasEupRestrictions is TRUE (Viperfish, Ghostlite), so the push and pop cannot co-issue — the late decomposer must place them in separate bundles with the latency gap between them filled by VALU correction or unrelated work. On Jellyfish and Pufferfish the restriction is FALSE, so a matching push+pop can re-fuse into a single AndPop bundle when the schedule allows; see EUP / Transcendental Slot.
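As an illustration of how FIFO ordering and the latency edge jointly bound the in-flight count, the Python sketch below validates a toy schedule. The event format and the function name are invented for this example and do not correspond to the BaseFifoTracker interface; the HasEupRestrictions co-issue rule is not modeled.

```python
from collections import deque


def check_eup_schedule(events, latency):
    """Check FIFO order and the push->pop deadline for a toy schedule.

    events: (bundle, kind, tag) tuples sorted by bundle, kind is "push" or "pop".
    Returns (problems, max_in_flight).
    """
    fifo = deque()
    problems = []
    max_in_flight = 0
    for bundle, kind, tag in events:
        if kind == "push":
            fifo.append((tag, bundle))
            max_in_flight = max(max_in_flight, len(fifo))
            continue
        if not fifo:
            problems.append(f"pop {tag} at bundle {bundle}: FIFO is empty")
            continue
        head_tag, push_bundle = fifo.popleft()   # a pop always drains the head
        if head_tag != tag:
            problems.append(f"pop {tag} drained {head_tag}: out of FIFO order")
        if bundle < push_bundle + latency:
            problems.append(f"pop {tag} at bundle {bundle}: before deadline")
    return problems, max_in_flight
```

Feeding it three Pufferfish-style pushes at bundles 0, 2, 4 with pops at 7, 9, 11 and `latency=7` reports no problems and a peak of three results in flight, a depth that falls out of the schedule rather than from a fixed constant.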

Function Map

| Function | Address | Role |
|---|---|---|
| BaseFifoTracker<LloValue*>::FindBlockingPushesAndPops | 0x14442f60 | FIFO push/pop ordering edges (calls LatencyBetween) |
| LloInstructionsPopAndThenPushSameFifo | 0x1d4f3c80 | FIFO-ordering head-check pairing a pop to its push |
| EupResultFifoEntry (proto ctor) | 0x0e7a6cc0 | runtime repeated-message FIFO list (not a fixed depth) |

How the Cost Model Consumes the Edge

The EUP latency feeds two consumers. The dependency-graph scheduler uses LatencyBetween(push, pop) as the true-dependency edge weight, placing the pop no earlier than push_bundle + latency and opening exactly that many bundles for correction arithmetic. The bundle-cost transcendental estimate (the per-op cost the higher-level cost model attributes to a sin/cos/tanh/etc.) is sized to the same latency: a generation's transcendental estimate scales with its EUP push latency, so the cost model and the scheduler agree on the software-pipeline depth a transcendental opens. The two numerics a reimplementer needs for that estimate — the per-gen latency integer and the correction-window contents — are owned by the per-gen and coefficient pages.

The latency value is the depth the correction window fills, and the correction window is where every polynomial on the transcendental path lives. A tanh on a generation with a hardware EUP pushes raw and the result is taken directly; a recip/rsqrt may wrap a Newton refinement around the push; a sin/cos first runs Payne-Hanek range reduction to produce a pushable argument; an exp/ln on a generation or datatype lacking the hardware push falls back to a full VALU polynomial. Each of those occupies some of the N-bundle latency window, and the per-gen latency is the budget that window has.


| Name | Relationship |
|---|---|
| Performance family | the per-gen object holding both latencies[] (this edge) and the resourceUsage[] grid (the reservation) |
| LatencyTable | the per-gen dispatcher that reads GetLatency for the push→pop edge |
| Target accessors | VectorEupReservationCycles (the orthogonal issue rate), HasEupRestrictions (forces push/pop apart) |
| Bundle packer / SlotTracker | applies the reservation as an EUP-resource occupancy, distinct from the latency edge |

Cross-References