EUP Lane-Width and Unpack Count
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped). .text and .rodata VMAs equal their file offsets; .data.rel.ro VMA minus 0x200000 equals its file offset. Other libtpu builds will differ.
Abstract
A transcendental on the SparseCore is not one cost — it is N copies of one cost, where N is the unpack fan-out: how many lane-width pieces a packed operand vector splits into before each piece can be pushed into the Extended Unary Pipeline (EUP). This page is the cost-model half of the EUP datapath. The EUP / Transcendental Slot page documents how a push is encoded; this page documents how many pushes a given operand costs, and how the push→pop latency is hidden by a software-pipelining split rather than charged inline. Both quantities are decided before the bundle scheduler ever runs, by two MLIR-level mechanisms a reimplementer's cost model must reproduce.
The first mechanism is the lane-width / unpack-count model. The SparseCore lowering registers two competing patterns for every elementwise float op: a 1:1 UnaryFloatVector pattern that emits a single fused EUP macro, and a 1:N AluEp pattern that unpacks a packed operand into lane-width sub-elements, runs one compute op per piece, and repacks. IsDynamicallyLegal (@0x135ddd20) decides which fires; when it fires the 1:N path, UnpackOperand<UnpackFOp> (@0x1360fac0) recursively halves the packed element down to the lane sub-element width and returns a std::deque<Value> whose length is N. The lane width itself is a single per-generation boolean: SupportsBf16AluInstructions (Target vtable +0x780) — FALSE means a 32-bit (F32) lane, TRUE means a native 16-bit (BF16) lane. The cost consequence is concrete and counter-intuitive: a generation with the native BF16 ALU (Ghostlite) processes a packed bf16 transcendental at half the push count of a generation without it (Viperfish), because it never has to unpack to F32. The unpack count is the multiplier the per-piece transcendental cost is charged N times against.
The second mechanism is the bare-push / tpu_eup_pop software-pipelining split. MLIR emits a transcendental as a fused kVector{fn}{F32,Bf16}AndPop pseudo-op (opcodes 0x13b..0x14d). On a v5+ generation the EUP push and its pop cannot co-issue (HasEupRestrictions is TRUE on Viperfish and Ghostlite), so LloLateDecomposer rewrites the fused op into a bare hardware push (CreateVectorEup, 0x128..0x13a) and a deferred pop (CreateVectorEupResult, hardcoded 0x14e kVectorEupResultValue) in a separate bundle. The cost-model implication is that a transcendental's latency is not a single bundle's stall — it is a push→pop window the scheduler fills with unrelated VALU work, and the EUP throughput is governed by an orthogonal issue-reservation rate (VectorEupReservationCycles), not by the latency. A cost model that multiplies latency by reservation, or that charges the push and pop as one indivisible op, gets the schedule wrong.
For reimplementation, the cost model's contract is:
- The lane sub-element width. SupportsBf16AluInstructions (vtable +0x780) as the 16-vs-32-bit selector, its per-gen values, and the optional<int> override that lets a caller force a non-default width.
- The 1:1-vs-1:N decision. The four-arm IsDynamicallyLegal truth table that picks the EUP-macro path (cost 1) versus the AluEp unpack path (cost N).
- The unpack count N. UnpackOperand's recursive halving: the per-step piece count result_bw / sub_bw, the loop exit at lane width, and how the vector shape folds in.
- The bare-push / deferred-pop split. That the V*Decomposed builders emit a bare push + bare pop with no inline correction, that the split is mandated by HasEupRestrictions, and that the latency-hiding work is unrelated VALU instructions the scheduler interleaves, not part of the EUP op.
- The latency / reservation orthogonality. VectorEupReservationCycles (push→push spacing) as a separate quantity from the push→pop latency the correction window fills.
| Quantity | Source |
|---|---|
| Lane-width selector | Target::SupportsBf16AluInstructions (vtable +0x780); VF 0x1d49c0e0 → FALSE (32-bit), GL 0x1d498ce0 → TRUE (16-bit) |
| 1:1-vs-1:N gate | IsDynamicallyLegal (@0x135ddd20) — LEGAL ⇒ 1:1 macro (cost 1), ILLEGAL ⇒ AluEp unpack (cost N) |
| Unpack-count engine | UnpackOperand<UnpackFOp> (@0x1360fac0) — recursive halving; deque length = N |
| Sub-element type | GetUnpackResultElementType (@0x1360ff20) — 16→BF16, 32→F32, sub-8-bit → int(2·src_bw) |
| Split trigger | HasEupRestrictions — VF 0x1c458620, GL 0x1c458d80 = TRUE; JF 0x1c457b80, PF 0x1c4580c0 = FALSE |
| Decomposer | LloLateDecomposer (@0x1269cb20) → DecomposeEupInstruction (@0x126a0340) → 10 V*Decomposed builders across 19 cases |
| Bare push / deferred pop | CreateVectorEup (@0x1d4d78a0, 0x128..0x13a) + CreateVectorEupResult (@0x1d4d9820, hardcoded 0x14e) |
| Issue reservation | VectorEupReservationCycles (vtable +0x480): JF/VF/GL = 1, PF = 2 (half-rate EUP) |
The Lane Sub-Element Width
Purpose
The lane width is the floor the unpack model halves down to: a piece narrower than the lane is staged no further; a piece wider must be split. It is the single per-generation input that turns the abstract "unpack a packed operand" into a concrete count. A reimplementer's cost model reads it once per target and uses it everywhere the AluEp fan-out is computed.
Algorithm
The width is one virtual call. Target::SupportsBf16AluInstructions is vtable slot +0x780 (decimal 1920, the immediate that appears in every consumer). The decompile of the two v5+ overrides is trivial and decisive:
function ViperfishTarget::SupportsBf16AluInstructions(): // 0x1d49c0e0
return 0 // FALSE → 32-bit (F32) lane
function GhostliteTarget::SupportsBf16AluInstructions(): // 0x1d498ce0
return 1 // TRUE → 16-bit (BF16) lane
The base Target::SupportsBf16AluInstructions (@0x1d61f580) is a LogFatal pure-virtual sentinel — every concrete generation must override, so a reimplementer cannot leave it unset. The boolean is converted to a width by the same arithmetic in both unpack consumers: width = 16 * (supports ^ 1) + 16, i.e. TRUE → 16, FALSE → 32.
// from GetUnpackResultElementType @0x1360ff20 and UnpackOperand @0x1360fac0:
lane_bw = 16 * (target.SupportsBf16AluInstructions() ^ 1) + 16 // *0x780 → {16, 32}
if (override & 0x100000000) // optional<int> engaged-bit
lane_bw = override // caller-forced width (f8 / sub-byte staging)
NOTE: the Target const& the lowering passes is the base xla::jellyfish::Target (a per-gen ViperfishTarget / GhostliteTarget), not the derived …SparseCoreTarget. The +0x780 slot on the gen-target vtable is SupportsBf16AluInstructions; on the SparseCoreTarget vtable the same offset is a different method (NumVsSlots), a different class layout that is not consulted here. A reimplementer dispatching +0x780 on the wrong object reads the wrong field.
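The boolean-to-width arithmetic and the override test above can be sketched in Python. This is a minimal illustration under stated assumptions, not the library's code: the names lane_bitwidth and ENGAGED_BIT are hypothetical, and the sketch assumes the forced width occupies the low 32 bits of the optional<int> with the engaged flag at bit 32. Only the 16 * (supports ^ 1) + 16 formula and the 0x100000000 test are taken from the pseudocode.

```python
# Hypothetical sketch of the lane-width selection described above.
ENGAGED_BIT = 1 << 32  # mirrors the `override & 0x100000000` engaged-bit test


def lane_bitwidth(supports_bf16_alu: bool, override: int = 0) -> int:
    """Return the lane sub-element width in bits (16 or 32 by default)."""
    # TRUE -> 16-bit (BF16) lane, FALSE -> 32-bit (F32) lane.
    lane_bw = 16 * (int(supports_bf16_alu) ^ 1) + 16
    if override & ENGAGED_BIT:
        # Assumption: the caller-forced width is the low 32 bits of the optional.
        lane_bw = override & 0xFFFFFFFF
    return lane_bw


viperfish_lane = lane_bitwidth(False)               # no native BF16 ALU
ghostlite_lane = lane_bitwidth(True)                # native BF16 ALU
forced_lane = lane_bitwidth(True, ENGAGED_BIT | 8)  # engaged override of 8 bits
```

An override value whose engaged bit is clear is ignored, so the per-generation default applies.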
Function Map
| Function | Address | Role |
|---|---|---|
| ViperfishTarget::SupportsBf16AluInstructions | 0x1d49c0e0 | returns FALSE → 32-bit (F32) lane |
| GhostliteTarget::SupportsBf16AluInstructions | 0x1d498ce0 | returns TRUE → 16-bit (BF16) lane |
| Target::SupportsBf16AluInstructions (base) | 0x1d61f580 | LogFatal pure-virtual; must be overridden |
Considerations
The optional<int> override (the bt $0x20 / cmovb engaged-bit test, byte-confirmed in both UnpackOperand and GetUnpackResultElementType) lets a caller force a non-default lane width for f8 / sub-byte staging instead of the SupportsBf16AluInstructions-derived 16/32. The default path (no override) is byte-confirmed; which AluEp callers pass a non-default override was not enumerated, so the override-driven fan-out variants are LOW confidence. See Pack/Unpack Precision for the full VpackFormat model these widths interact with.
The 1:1-vs-1:N Decision
Purpose
Before any unpack count exists, the lowering decides whether to unpack at all. A transcendental source op (e.g. sparse_core::RsqrtOp) is registered with both a 1:1 UnaryFloatVector pattern (one fused EUP macro, cost 1) and a 1:N AluEp pattern (unpack → N compute ops → repack, cost N). IsDynamicallyLegal is the per-op dynamic-legality predicate that arbitrates: if the op is LEGAL it stays as the macro (1:1); if ILLEGAL the conversion runs the AluEp unpack (1:N). The cost model must reproduce this predicate exactly, because it is the difference between charging a transcendental once and charging it N times.
Algorithm
IsDynamicallyLegal (@0x135ddd20) is a four-arm truth table, byte-confirmed in the decompile:
function IsDynamicallyLegal(op, target, operand_idx): // 0x135ddd20
forced = ForceBF16ALUOperationsToUnpack(op, ty) // @0x135dd6e0 — force-UNPACK (1:N) flag
if ty.typeID != mlir::VectorType::id: return LEGAL // scalar operand → 1:1
if !IsPackedVectorType(ty, op): return LEGAL // vector but not packed → 1:1
fmt = GetVpackFormat(ty) // @0x13dad800: 0=no-pack,1=bf16,0xb=f16,0x7=sub-byte
if fmt == 0 || forced: return (fmt == 0) // not packable → 1:1; forced+packable → !fmt = ILLEGAL (1:N)
if !target.SupportsBf16AluInstructions(): // *0x780 FALSE (VF) → 1:N
return ILLEGAL
return (ty.element.bitwidth == 16) ? LEGAL : ILLEGAL // native bf16-16 → 1:1, else (f8/s8) → 1:N
The four arms that matter for cost, as a truth table:
| operand type | IsPackedVectorType | GetVpackFormat | SupportsBf16Alu | elem bw | result | EUP cost |
|---|---|---|---|---|---|---|
| not a vector / not packed | - / false | - | - | - | LEGAL | 1 (macro) |
| packed, fmt == 0 | true | 0 (no-pack) | - | - | LEGAL | 1 (macro) |
| packed, fmt != 0 | true | != 0 | FALSE (VF) | any | ILLEGAL | N (unpack) |
| packed, fmt != 0 | true | != 0 | TRUE (GL) | 16 (bf16) | LEGAL | 1 (macro) |
| packed, fmt != 0 | true | != 0 | TRUE (GL) | != 16 (f8/s8) | ILLEGAL | N (unpack) |
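The truth table can be restated as a small Python predicate for use in a cost model. This is a sketch, not the decompiled function: the Operand dataclass, its field names, and the forced parameter (standing in for the ForceBF16ALUOperationsToUnpack result) are hypothetical; the branch order follows the pseudocode above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    """Hypothetical summary of the operand type the predicate inspects."""
    is_vector: bool
    is_packed: bool
    vpack_format: int   # 0 = no-pack; nonzero = packable (1 = bf16, ...)
    elem_bitwidth: int


def is_dynamically_legal(operand: Operand, supports_bf16_alu: bool,
                         forced: bool = False) -> bool:
    """True = LEGAL (1:1 macro, cost 1); False = ILLEGAL (1:N unpack)."""
    if not operand.is_vector:
        return True                       # scalar operand
    if not operand.is_packed:
        return True                       # vector but not packed
    if operand.vpack_format == 0 or forced:
        return operand.vpack_format == 0  # forced + packable -> ILLEGAL
    if not supports_bf16_alu:
        return False                      # 32-bit lane: always unpack
    return operand.elem_bitwidth == 16    # native bf16 only


packed_bf16 = Operand(is_vector=True, is_packed=True, vpack_format=1, elem_bitwidth=16)
packed_f8 = Operand(is_vector=True, is_packed=True, vpack_format=7, elem_bitwidth=8)
```

With these inputs, the packed bf16 operand is ILLEGAL without the BF16 ALU and LEGAL with it, unless the force flag is set.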
QUIRK: the cheapest path is the newest generation on its native datatype. A packed bf16 op is LEGAL (cost 1) only when the generation has the BF16 ALU and the element is exactly 16-bit. Viperfish, which lacks the BF16 ALU, marks the same op ILLEGAL and pays the N-way unpack to F32. A reimplementer who assumes "packed always unpacks" over-charges Ghostlite by a factor of N; one who assumes "transcendentals are always 1:1" under-charges Viperfish by the same factor.
GOTCHA: ForceBF16ALUOperationsToUnpack (@0x135dd6e0) forces the ILLEGAL / 1:N unpack path, not the 1:1 macro; its name is literal. When the op carries the sc.emit_vectorized_alu_operation_in_f32_precision BoolAttr set to true and its element isBF16, the function returns 1, and the (fmt == 0) | forced short-circuit in the decompile (if ((VpackFormat == 0) | v4) return !VpackFormat) returns !fmt. For a packed bf16 op fmt == 1, so !fmt == 0 == ILLEGAL: the op is forced to unpack to F32 and run the vectorized ALU in F32 precision. The flag therefore only takes effect on a generation that would otherwise mark the bf16 op LEGAL (Ghostlite, native BF16 ALU): it overrides that to a 1:N F32 unpack.
Function Map
| Function | Address | Role |
|---|---|---|
| IsDynamicallyLegal | 0x135ddd20 | 1:1-vs-1:N selector (4-arm truth table) |
| ForceBF16ALUOperationsToUnpack | 0x135dd6e0 | force-1:N flag; on the f32-precision attr drives a packed bf16 op to ILLEGAL (unpack to F32) |
| IsPackedVectorType | 0x13611720 | sub-element-packed vector test |
| GetVpackFormat | 0x13dad800 | pack-format enum (0=none,1=bf16,0xb=f16,0x7=sub-byte) |
Considerations
Both pattern families are registered for the same source ops (sc::RsqrtOp has a 1:1 UnaryFloatVector at 0x1357e540 and a 1:N AluEp at 0x135e1c80); IsDynamicallyLegal is the dynamic-legality predicate that decides which applies, and the conversion framework rewrites the ILLEGAL op through the AluEp pattern. The PatternBenefit ordering between the two co-registered patterns was read structurally, not byte-decoded, so the claim that dynamic legality is the sole arbiter (with no benefit tie) is HIGH, not CERTAIN.
The Unpack Count N
Purpose
This is the cost multiplier. When IsDynamicallyLegal marks an op ILLEGAL, the AluEp body calls UnpackOperand<UnpackFOp> to split the packed operand into lane-width pieces, emits exactly one compute op per piece, and repacks. The number of pieces — the returned std::deque<Value> length — is N, and the AluEp body's per-piece ComputeOp::create loop runs N times. A reimplementer's elementwise / transcendental cost is N × (per-piece op cost).
Algorithm
UnpackOperand<UnpackFOp> (@0x1360fac0) seeds a deque with the wide operand, then halves repeatedly until each piece fits the lane width:
function UnpackOperand(loc, packBits, operand, vecTy, builder, target, override): // 0x1360fac0
lane_bw = 16 * (target.SupportsBf16AluInstructions() ^ 1) + 16 // *0x780 → {16, 32}
if (override & 0x100000000): lane_bw = override // engaged optional<int>
elem = vecTy.getElementType()
deque = [operand] // seed with the wide value
do: // outer loop @0x1360fb70
sub_ty = GetUnpackResultElementType(elem, builder, target, override) // @0x1360ff20
sub_vec = GetVectorType(sub_ty, target, ...) // CHECK: status OK
        count   = bitwidth(sub_ty) / bitwidth(elem)       // pieces this step (idiv): result_bw / sub_bw, e.g. 32/16 = 2
        tuple   = UnpackFOp::create(...)                  // one unpack → a tuple
        for i in 0 .. count-1:                            // inner loop @0x1360fb80
            deque.push( ExtractTupleElementOp::create(tuple, i) )   // one piece per tuple slot
        elem = sub_ty                                     // continue from the unpacked type on the next iteration
while (lane_bw > bitwidth(elem)) // exit when piece fits lane
return deque // length == N
The sub-element type per step is GetUnpackResultElementType (@0x1360ff20):
function GetUnpackResultElementType(elem, builder, target, override): // 0x1360ff20
lane_bw = 16 * (target.SupportsBf16AluInstructions() ^ 1) + 16 // *0x780
if (override & 0x100000000): lane_bw = override
src_bw = elem.getIntOrFloatBitWidth()
out_bw = (src_bw >= 8) ? lane_bw : 2 * src_bw // ≥8-bit jumps to lane; sub-8-bit doubles
if elem is a FloatType:
if out_bw == 16: return builder.getBF16Type()
if out_bw == 32: return builder.getF32Type()
return builder.getIntegerType(out_bw)
So a bf16 (16-bit) element on a 32-bit lane unpacks in one step (count = 32/16 = 2 pieces, then lane_bw == elem_bw exits); a sub-8-bit element doubles its width each step (out_bw = 2·src_bw) until it reaches the lane width, multiplying the piece count by 2 per step. The total N is the product over halving steps of result_bw / sub_bw, folded with the vector shape's lane tiling.
NOTE: the per-step piece count result_bw / sub_bw is byte-exact (the idiv at 0x1360fbfc). The total N for a multi-element vector folds in the getShape product divided by the lane-tiling capacity (the same LaneCount = 128 sub-lane geometry the per-piece cost uses); that shape-folding factor was applied structurally, not re-dumped, so the per-step count is CERTAIN and the absolute N for a given tensor shape is INFERRED from the lane-tiling geometry.
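The per-step arithmetic can be sketched in Python as a product of result_bw / sub_bw over the unpack steps, following the reading in the prose above (a 16-bit element on a 32-bit lane gives 32/16 = 2 pieces). This is an illustration only: the function names are hypothetical, the shape-folding factor is deliberately left out, and the guard against elements wider than the lane is an added assumption rather than something observed in the binary.

```python
def unpack_result_bitwidth(src_bw: int, lane_bw: int) -> int:
    """Width of the next unpack result: >= 8-bit jumps to lane, sub-8-bit doubles."""
    return lane_bw if src_bw >= 8 else 2 * src_bw


def per_vector_unpack_count(elem_bw: int, lane_bw: int) -> int:
    """Pieces produced from one packed vector, excluding shape folding."""
    if elem_bw > lane_bw:
        # Assumption: the unpack path is only reached for elements that fit the lane.
        raise ValueError("element wider than lane")
    pieces = 1
    while True:  # do/while: the body runs at least once
        out_bw = unpack_result_bitwidth(elem_bw, lane_bw)
        pieces *= out_bw // elem_bw   # per-step count result_bw / sub_bw
        elem_bw = out_bw
        if not lane_bw > elem_bw:     # exit once the piece reaches lane width
            break
    return pieces


bf16_on_f32_lane = per_vector_unpack_count(16, 32)  # one step: 32/16
s4_on_f32_lane = per_vector_unpack_count(4, 32)     # two steps: 8/4, then 32/8
```

The absolute N for a tensor still needs the lane-tiling factor that the note above marks as INFERRED.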
The AluEp Body Charges N Compute Ops
The representative AluEp body, AluEp<math::ExpOp> (@0x135df200), makes the cost concrete:
AluEp<math::ExpOp>::matchAndRewrite: // 0x135df200
if !target.SupportsSparseCore(): bail // *0x260 — ENTRY GUARD, not a width input
if IsDynamicallyLegal(op, target):
ExpOp::create(...) // 1:1 — a single fresh op (cost 1)
replaceOp; return
deque = UnpackOperand<UnpackFOp>(...) // N narrow pieces @0x135df55b
for piece in deque: // walk the deque @0x135df8f0
results.push( ExpOp::create(piece) ) // ONE compute op per piece (N total)
PackResults<PackFOp>(results) // repack the N results @0x135df7ba
replaceOp
The compute-op count equals the deque length N. For a transcendental this is N independent EUP push/pop pairs (each piece is its own push); for a plain arithmetic op it is N VALU ops. This is the N the per-piece transcendental / elementwise cost is charged against.
Function Map
| Function | Address | Role |
|---|---|---|
| UnpackOperand<UnpackFOp> | 0x1360fac0 | recursive halving; returns the N-deque |
| GetUnpackResultElementType | 0x1360ff20 | next-narrower sub-element type per lane width |
| AluEp<math::ExpOp> (representative) | 0x135df200 | N ComputeOp::create between one unpack and one pack |
| PackResults<PackFOp> | 0x13610940 | reassemble the N results into the wide output |
| Target::SupportsSparseCore | 0x1d48fd40 | AluEp entry guard (vtable +0x260), not a width input |
Considerations
SupportsSparseCore (vtable +0x260, topology[+0x3b8][+0x98] > 0) is the AluEp matchAndRewrite entry guard — the lowering bails if the target has no SparseCore — and is easy to conflate with the +0x780 width selector because both gate the same lowering. It is not an input to the unpack count; the cost model must read width from +0x780 only. The unpack helpers come in float / signed-int / unsigned-int flavors (UnpackOperand<UnpackFOp/SIOp/UIOp>); only the UnpackFOp (float) path is decoded here, and the int paths are assumed to follow the same halving by structural symmetry.
The Bare-Push / Deferred-Pop Split
Purpose
The unpack count fixes how many EUP pushes a transcendental costs; the split fixes how the latency of each push is paid. MLIR emits a transcendental as a single fused kVector{fn}{F32,Bf16}AndPop pseudo-op. On a v5+ generation the push and pop cannot live in one bundle, so the late decomposer splits the fused op into a bare hardware push and a deferred pop placed one or more bundles later. The cost-model consequence: a transcendental's latency is a window the scheduler fills with unrelated work, not a stall — and the EUP throughput is the orthogonal push→push reservation, not the latency. This is the classic software-pipelining of a long-latency divide, exposed in the instruction stream because a TPU bundle is the issue packet and there is no runtime hazard interlock.
Algorithm
The fused pseudo-ops occupy opcodes 0x13b..0x14d and are identified by LloOpcodeIsPseudoEupInstruction (@0x1d60c880):
function LloOpcodeIsPseudoEupInstruction(op): // 0x1d60c880
return (op - 0x13b) < 0x13 // op in [0x13b, 0x14d] (19 AndPop ops)
&& (0x7fdff >> (op - 0x13b)) & 1 // bitmask clears bit 9 → excludes 0x144 PushErfAndPop
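The range-and-mask test above can be checked with a few lines of Python. The constant and function names here are hypothetical; the base, span, and mask values are the ones in the pseudocode. Python integers are signed, so the sketch spells out the lower bound that the unsigned compare in the original provides implicitly.

```python
PSEUDO_EUP_BASE = 0x13B   # first fused AndPop opcode
PSEUDO_EUP_SPAN = 0x13    # 19 opcodes: 0x13b..0x14d
PSEUDO_EUP_MASK = 0x7FDFF # bit 9 clear


def is_pseudo_eup(opcode: int) -> bool:
    """Range check followed by a per-opcode bitmask lookup."""
    idx = opcode - PSEUDO_EUP_BASE
    if not 0 <= idx < PSEUDO_EUP_SPAN:
        return False
    return bool((PSEUDO_EUP_MASK >> idx) & 1)


# Enumerate a window around the band to see which opcodes are accepted.
accepted = [op for op in range(0x100, 0x200) if is_pseudo_eup(op)]
excluded_in_band = [op for op in range(0x13B, 0x14E) if not is_pseudo_eup(op)]
```

Enumerating the band this way is a cheap cross-check that a reimplemented mask excludes exactly one opcode.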
The 0x7fdff mask (binary 0111 1111 1101 1111 1111, bit 9 clear) excludes exactly one of the 19 — 0x144 kVectorPushErfAndPop — leaving 18 fused ops treated as pseudo-EUP. LloLateDecomposer (@0x1269cb20) walks these and calls DecomposeEupInstruction (@0x126a0340), a switch on the pseudo-opcode (19 cases, 0x13b..0x14d) that dispatches to one of ten V*Decomposed builders — Vpow2, Vrecp, Vlog2, Vrsqrt, Vsigshft, Vsinq, Vcosq, Vtanh, VpushErf, Verf — each chosen by the opcode and handed a PrimitiveType of 11 (the BF16-source F32 form) or 16 (the BF16 form). Each builder is a bare push + bare pop, selecting the F32 or BF16 push opcode from the PrimitiveType argument:
function VtanhDecomposed(builder, prim_type, value): // 0x1d555040 (23 lines, byte-exact)
    push_opcode = (prim_type == 0xb) ? 0x128 : 0x132   // 0xb (11, the F32 form) → F32 push 0x128; else (16, BF16) → BF16 push 0x132
push = CreateVectorEup(push_opcode, value, region) // @0x1d4d78a0 — bare push, 1 operand
AppendInstruction(region, push)
pop = CreateVectorEupResult(push, region) // @0x1d4d9820 — deferred pop, hardcodes 0x14e
AppendInstruction(region, pop)
// NO inline correction polynomial, NO refinement, NO second push/pop pair
VrsqrtDecomposed (@0x1d557b60) is byte-identical in structure (push_opcode = (prim_type == 0xb) ? 0x12c : 0x136), also 23 lines, also bare push + bare pop. CreateVectorEup asserts the push opcode is a vector-EUP opcode ((opcode - 0x128) < 0x13, i.e. inclusive 0x128..0x13a, the LloOpcodeIsVectorEup check) and that the operand ProducesVreg (its opcode_produced_register_type[] entry is the EUP class 4); CreateVectorEupResult asserts the same push range and emits New(0x14e, {push}, 1) — the pop carries no function or width, only the push handle.
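The shape of a decomposed builder, one bare push followed by one pop that references it, can be modeled in Python. This is a toy model under stated assumptions: instructions are represented as plain tuples, the decompose function and the PUSH_OPCODES table are hypothetical, and only the tanh and rsqrt opcode pairs quoted above are included.

```python
PRIM_F32_FORM = 0xB    # PrimitiveType 11 as used on this page
PRIM_BF16_FORM = 16    # PrimitiveType 16 as used on this page
EUP_RESULT = 0x14E     # kVectorEupResultValue, the deferred pop

# (F32 push opcode, BF16 push opcode) for the two builders quoted above.
PUSH_OPCODES = {"tanh": (0x128, 0x132), "rsqrt": (0x12C, 0x136)}


def decompose(fn: str, prim_type: int, value: str) -> list:
    """Return [push, pop]; the pop carries only a handle to the push."""
    f32_push, bf16_push = PUSH_OPCODES[fn]
    push_opcode = f32_push if prim_type == PRIM_F32_FORM else bf16_push
    # Mirrors the vector-EUP range assertion (inclusive 0x128..0x13a).
    assert 0x128 <= push_opcode <= 0x13A
    push = (push_opcode, value)
    pop = (EUP_RESULT, push)   # no function selector, no width
    return [push, pop]


rsqrt_bf16 = decompose("rsqrt", PRIM_BF16_FORM, "v0")
tanh_f32 = decompose("tanh", PRIM_F32_FORM, "v1")
```

The returned list has exactly two entries, which is the property a cost model relies on: no correction arithmetic is emitted between the push and the pop.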
GOTCHA: The V*Decomposed builders (0x1d555040, 0x1d557b60) are bare push + bare pop only: 23 lines each, with no CreateVectorBinop, no VfastTwoSum, and no second push/pop pair. The latency-hiding work in the push→pop window is therefore unrelated VALU instructions the scheduler interleaves, not correction arithmetic emitted by the decomposer. The transcendental's own correction math (where it exists) lives in the *NoEupF32 software fallbacks and the shared Newton/rational helpers (see EUP Correction Coefficients). The VfastTwoSum helper (@0x1d5550a0) is a stand-alone Dekker two-sum that merely sits physically adjacent to VtanhDecomposed in .text; it is not part of the split.
The split is mandatory on v5+ because HasEupRestrictions is TRUE on Viperfish (@0x1c458620) and Ghostlite (@0x1c458d80): push and pop cannot co-issue, so the decomposer must place them in separate bundles. On Jellyfish (@0x1c457b80) and Pufferfish (@0x1c4580c0) the restriction is FALSE, so the fused AndPop form can survive — the inverse Simplify{Tanh,Reciprocal,Sinq,Cosq}AndPop simplifiers (0x1d593c60..) re-fuse a matching push+pop when the schedule allows co-location.
LLO Opcode Bands (cost-relevant)
| Band | Range | Cost role |
|---|---|---|
| Bare push | 0x128..0x13a | one EUP push (the unit of the N-way fan-out); CreateVectorEup emits one |
| Fused AndPop | 0x13b..0x14d | MLIR-emitted; the late decomposer splits it on v5+ (0x144 PushErfAndPop excluded from pseudo-EUP) |
| Deferred pop | 0x14e | kVectorEupResultValue; drains the push one+ bundle later (CreateVectorEupResult hardcodes it) |
Function Map
| Function | Address | Role |
|---|---|---|
| LloLateDecomposer | 0x1269cb20 | splits fused AndPop → bare push + deferred pop |
| DecomposeEupInstruction | 0x126a0340 | switch dispatch (19 cases) to 10 V*Decomposed builders |
| LloOpcodeIsPseudoEupInstruction | 0x1d60c880 | classifies fused AndPop (range [0x13b,0x14d], mask 0x7fdff) |
| VtanhDecomposed / VrsqrtDecomposed | 0x1d555040 / 0x1d557b60 | bare push + bare pop builders (23 lines each) |
| CreateVectorEup | 0x1d4d78a0 | bare push (0x128..0x13a, 1 operand, asserts ProducesVreg) |
| CreateVectorEupResult | 0x1d4d9820 | deferred pop, hardcodes 0x14e, asserts push ∈ 0x128..0x13a |
| HasEupRestrictions (VF/GL/JF/PF) | 0x1c458620/0x1c458d80/0x1c457b80/0x1c4580c0 | v5+ separate-bundle constraint (TRUE/TRUE/FALSE/FALSE) |
| DecomposeEupOperationsForBarnacore | 0x1269c5c0 | BarnaCore EUP split (separate result-drain path) |
Considerations
DecomposeEupOperationsForBarnacore (@0x1269c5c0) also calls DecomposeEupInstruction, but drains the EUP result through a BarnaCore EupResultRead address-handler rather than the TensorCore VectorResult pop — whether its push→pop distance differs from the late decomposer's was not separated (HIGH). The pseudo-opcode 0x144 kVectorPushErfAndPop, excluded from the pseudo-EUP set by the 0x7fdff mask, is not split by the late decomposer; its handling was not traced and is out of scope for the cost model here.
Latency vs Reservation Orthogonality
Purpose
The cost model bounds an EUP push by two independent quantities, read from two different mechanisms, and a reimplementer who conflates them gets the schedule wrong. The push→pop latency is the depth of the window the deferred pop sits behind — the work the scheduler interleaves to hide it. The issue reservation is how many bundles the EUP unit stays busy after a push, bounding back-to-back pushes. They compose as a max, not a product.
Algorithm
VectorEupReservationCycles (Target vtable +0x480) is the issue-occupancy — the minimum bundles from one push to the next:
function JellyfishTarget::VectorEupReservationCycles(): return 1 // 0x1d490660
function PufferfishTarget::VectorEupReservationCycles(): return 2 // 0x1d494cc0 — half-rate EUP
function ViperfishTarget::VectorEupReservationCycles(): return 1 // 0x1d49b060
function GhostliteTarget::VectorEupReservationCycles(): return 1 // 0x1d497ee0
This reservation is applied by the per-instruction resource model (the GetResourceUsage matrix + slot tracker), not by the push→pop latency edge. The latency edge — the minimum bundles from a push to its drain — is a separate dependency-graph weight; the two are orthogonal.
| quantity | meaning | source | JF | PF | VF | GL |
|---|---|---|---|---|---|---|
| push→pop latency | min bundles push → drain (the correction window) | dependency-graph edge weight | (clamp) | (heap array) | (heap array) | (heap array) |
| VectorEupReservationCycles | min bundles push → next push | Target accessor (+0x480) | 1 | 2 | 1 | 1 |
GOTCHA: Pufferfish's reservation of 2 (half-rate EUP) does not double the push→pop latency. The latency is the window depth; the reservation is the issue rate. For a chain of M independent transcendentals on Pufferfish, the EUP-bound schedule length is roughly 2·(M−1) + latency bundles: the reservation spaces consecutive pushes two bundles apart, and the final pop drains latency bundles after the last push. A cost model that multiplies the two (latency × reservation) over-counts the chain badly.
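The chain estimate generalizes to reservation·(M−1) + latency, which the following Python sketch compares against the incorrect product model. The function names are hypothetical, the formula is the rough EUP-bound estimate from the text rather than a full scheduler model, and the example uses the Pufferfish reservation of 2 with the push→pop latency of 7 quoted for PF on this page.

```python
def eup_chain_length(m: int, reservation: int, latency: int) -> int:
    """Rough EUP-bound schedule length, in bundles, for m independent pushes."""
    if m <= 0:
        return 0
    # Pushes are spaced `reservation` bundles apart; the last pop drains
    # `latency` bundles after the last push.
    return reservation * (m - 1) + latency


def naive_product_model(m: int, reservation: int, latency: int) -> int:
    """The incorrect model that multiplies latency by reservation per op."""
    return m * reservation * latency


pf_chain = eup_chain_length(4, reservation=2, latency=7)
pf_naive = naive_product_model(4, reservation=2, latency=7)
```

For four pushes the additive estimate is 13 bundles while the product model gives 56, which illustrates how far the conflated model drifts even on a short chain.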
The absolute per-generation push→pop latency integers are documented on the EUP Latency Overview and EUP / Transcendental Slot pages (PF 7, VF 6, GL 13/14, JF clamp 4); they are read from per-instruction heap arrays, not from a single immediate. What this page fixes is the shape of the composition — max(latency-deadline, reservation-availability) — and that the unpack count N multiplies the number of push/pop pairs but not the per-pair latency.
Considerations
The minimum push→pop bundle gap (the latency-edge weight) was confirmed structurally — the decomposer builds the region and the scheduler places the pop after the deferred work — but the per-gen latency integer is not a single mov immediate in the split path; it derives from the EUP result-FIFO depth plus the dependency-graph latency model. The numeric latency per generation is therefore documented on the latency-overview / slot pages from the <Gen>Performance arrays, and is referenced here, not re-derived. The orthogonality of latency and reservation, and the N-fold push-count multiplier, are CERTAIN; the absolute push→pop distance is HIGH.
Worked Example — Packed bf16 rsqrt, Viperfish vs Ghostlite
A sc.rsqrt on vector<…×bf16> (16-bit packed sub-lane). Both generations register both the 1:1 UnaryFloatVector and the 1:N AluEp pattern; IsDynamicallyLegal picks, and the unpack count follows:
VIPERFISH (SupportsBf16AluInstructions = FALSE → 32-bit lane):
IsDynamicallyLegal: fmt(bf16)=1 (!=0), SupportsBf16Alu=FALSE → ILLEGAL → AluEp 1:N fires.
UnpackOperand: lane_bw = 32 > 16 (bf16) → one halving step:
count = bitwidth(result=32) / bitwidth(sub=16) = 2 pieces; loop exits (32 == 32).
⇒ N ≈ 2 × (shape / lane-tiling). Each F32 piece → its own bare 0x12c rsqrt push + 0x14e pop.
COST: N × (one EUP push/pop). The push count is DOUBLED by the unpack to F32.
GHOSTLITE (SupportsBf16AluInstructions = TRUE → 16-bit lane, elem bw == 16):
IsDynamicallyLegal: fmt!=0, SupportsBf16Alu=TRUE, elem==16 → LEGAL → NO unpack, 1:1 macro fires.
tpu_rsqrt_macro → fused 0x149 kVectorRsqrtBf16AndPop; LloLateDecomposer (HasEupRestrictions=TRUE)
splits it into a bare 0x136 BF16 rsqrt push + 0x14e pop. ONE push/pop; the BF16 ALU runs the
per-lane op natively.
COST: 1 × (one EUP push/pop). No unpack.
⇒ The newer generation (GL) processes the packed bf16 rsqrt at HALF the EUP push count of VF,
because the native BF16 lane lets it skip the unpack to F32 entirely. The unpack count N is the
cost-model multiplier; SupportsBf16AluInstructions is the single boolean that sets it.
Cross-References
- EUP / Transcendental Slot: the push/pop encoding (VALU slot 3, 5-bit function selector, 0x14e pop); this page is the cost (unpack count + split) half
- EUP Latency Overview: the per-generation push→pop latency integers this page treats as orthogonal to the reservation
- EUP Correction Coefficients: where the transcendental correction math actually lives (*NoEupF32 fallbacks, Newton/rational helpers), since the V*Decomposed split carries none
- Pack/Unpack Precision: the VpackFormat model and the kVectorUnpack / kVectorPack LLO ops the AluEp UnpackFOp / PackFOp lower to
- VPU (Vector-ALU) Slot: the VALU slot family; the EUP push is Alu3-only and the interleaved latency-hiding work runs on these VALU slots