Reduce-Window / Pooling Cost
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped). .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset. The binary retains full demangled C++ symbols. Other libtpu builds will differ.
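The address note above amounts to a small per-section translation. A minimal sketch, assuming only the two mappings stated there; the helper name and the error handling are illustrative, not part of libtpu:

```python
# Section-to-file-offset mapping as stated in the note above (this build only).
DATA_REL_RO_DELTA = 0x200000


def vma_to_file_offset(vma: int, section: str) -> int:
    """Translate a virtual address into a file offset for this specific build."""
    if section in (".text", ".rodata"):
        return vma                       # VMA == file offset
    if section == ".data.rel.ro":
        return vma - DATA_REL_RO_DELTA   # VMA - 0x200000 == file offset
    raise ValueError(f"no mapping documented for section {section!r}")


# RecordReduceWindowCycles is a function address, assumed here to live in .text.
entry = 0x130b5ec0
print(hex(vma_to_file_offset(entry, ".text")))
```

Any other section would need its own delta, which this page does not give, so the sketch raises rather than guessing.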
Abstract
RecordReduceWindowCycles is the cost-model emitter that prices an HLO kReduceWindow — max-pool, average-pool, or any windowed reduction. It is the pooling sibling of the convolution emitter RecordConvolutionCycles: both reuse the same per-call ConvState dim-product struct, the same Target geometry divisors (SublaneCount, ChunksPerTile), and the same four-ResourceVector output protocol. The difference is the destination. A convolution lands its compute cost on the MXU pipes (Matmul/Matpush, priced through MxuLatencyTable); a reduce-window is a vector reduction, not a matrix multiply, so every compute cycle lands on the vector / cross-lane pipes — VectorLoad, VectorAluAny, and Xlu. There is no MXU traffic.
The familiar reference frame is an LLVM TargetTransformInfo cost hook for a reduction intrinsic, except that this model is throughput-shaped, not latency-shaped: the cost is "how many cycles does the vector unit stay busy reading and combining the window," computed as a per-output-element count multiplied by a per-op throughput drawn from the per-generation CycleTable. The window area enters the cost through the ConvState dim products, and the combiner — the to_apply subcomputation that XLA attaches to every reduce-window (the maximum of a max-pool, the add of an avg-pool) — is priced separately by walking its instructions and charging each modeled op once per window-element per output.
The shape of the cost is set by which axis the window reduces over. GetReduceWindowType classifies the reduction into Lane (most-minor axis), Sublane (the sublane axis), or Major (a slow/outer axis), and each gets a distinct leaf emitter with a distinct deposit pattern: Lane adds a cross-lane Xlu drain, Sublane adds a sublane-shuffle term, and Major is purely read-plus-combine. This page documents the dispatch, the ConvState-driven window-area formula each leaf uses, the per-op throughput it multiplies by, and the combiner-cost helper they share.
For reimplementation, the contract is:
- The four-ResourceVector entry protocol and the trivial-zero short-circuit when the cost model is not live.
- The ConvState build (shared with conv) and the three named ConvState dim products each leaf reads: output-volume, window-extent, and kernel-spatial-iteration count.
- The GetReduceWindowType 3-way axis dispatch and the slot deposits each leaf makes (VectorLoad, VectorAluAny, Xlu), including the Target+0x4b0 Xlu-rate divisor.
- The combiner cost: walk to_apply, classify each modeled combiner opcode → a CycleTable::Instruction, and charge throughput(CT) × count where count is the per-axis window-element count (vol·(window−1) for Lane, base·(window−1) plus a 4·base term for Sublane, the whole vol for Major).
| Class | xla::jellyfish::CostModel (cost_model.cc) |
| Entry (4-RV) | RecordReduceWindowCycles @0x130b5ec0 |
| Axis dispatch | fusion_util::GetReduceWindowType @0x1454d4a0 → −1 / 0 / 1 / 2 |
| Lane leaf | RecordLaneReduceWindowCycles @0x130c97e0 (cost_model.cc:4931) |
| Sublane leaf | RecordSublaneReduceWindowCycles @0x130c9c60 (cost_model.cc:4978) |
| Major leaf | RecordMajorReduceWindowCycles @0x130c9f00 (cost_model.cc:5033) |
| Combiner cost | UpdateCostBasedOnReductionFunction @0x130c9a20 (fatal @ cost_model.cc:6870) |
| Deposit | ResourceVector::Acc(Resource, double) @0x1c89adc0 |
| Throughput source | per-gen CycleTable::GetCyclesForThroughput(Instruction) (via [CostModel+0x8]=CycleTable, then vtable slot [+0x10]) |
| Compute pipes | R[7] VectorLoad, R[5] VectorAluAny, R[2] Xlu — no MXU |
| Input DMA | RecordOperandCycles @0x130ca140 → R[9..12] MemXfer |
The Entry Point and Output Protocol
Purpose
RecordReduceWindowCycles is invoked from the output-fusion cost path as one of several priced sub-emitters (the conv/dot/reduce-window family). It takes the reduce-window HLO, three Window descriptors (activations / a second window / output), and four ResourceVector out-parameters, and deposits the pooling cost across them. The four-vector form mirrors RecordConvolutionCycles: the caller later folds the four sub-vectors with MaxResourceCycles' overlap rules (see resource-enum).
Entry Point
RecordReduceWindowCycles (4-RV) @0x130b5ec0 ── prices a kReduceWindow
├─ convolution_util::GetConvLikeProperties ── derive ConvolutionDimensionNumbers + shapes
├─ window_util::HasBaseDilation ── dilation gate (blocks Lane/Sublane)
├─ (inline) build ConvState at rbp-0x1d0 ── Product(dim)/SublaneCount/ChunksPerTile
├─ fusion_util::GetReduceWindowType @0x1454d4a0 ── 0=Lane 1=Sublane 2=Major (−1=not lowerable)
├─ RecordLaneReduceWindowCycles @0x130c97e0 ── axis 0
├─ RecordSublaneReduceWindowCycles @0x130c9c60 ── axis 1
├─ RecordMajorReduceWindowCycles @0x130c9f00 ── axis 2
├─ RecordOperandCycles @0x130ca140 ── input DMA → R[9..12]
└─ RecordTopLevelConvolutionOutputCycles @0x130bcb80 ── output DMA (when top-level)
Algorithm
function RecordReduceWindowCycles(rw, act_win, win2, out_win, rv0, rv1, rv2, rv3, fusion, top): // @0x130b5ec0
CHECK(rw->opcode() == kReduceWindow) // opcode byte +12 == 0x5e; cost_model.cc:5068
if (this->cost_model_live == 0): // CostModel byte +0x14 (the [a1+20] gate)
// TRIVIAL path: not modellable — deposit zero into the four vector/Xlu slots and return.
rv2.Acc(VectorLoad=7, 0.0) // 0x130b5fc3..6007 (vpxor → 0.0)
rv2.Acc(VectorAlu0=3, 0.0)
rv2.Acc(VectorAlu1=4, 0.0)
rv2.Acc(Xlu=2, 0.0)
return Ok
// REAL path
props = GetConvLikeProperties(rw) // ConvolutionDimensionNumbers + operand/result shapes
dilated = HasBaseDilation(act_win, dim=0) // window_util::HasBaseDilation → blocks Lane/Sublane
CHECK(rw_state.input_batch_is_most_minor) // cost_model.cc:5093
CHECK(!rw_state.output_feature_is_most_minor) // cost_model.cc:5094
state = BuildConvState(props, out_win, ...) // dim products / SublaneCount / ChunksPerTile (rbp-0x1d0)
// incl. the 64-bit window-volume multiply loop @0x130b6290
switch (GetReduceWindowType(rw)): // @0x1454d4a0
case 0: CHECK(!dilated) // cost_model.cc:5142
RecordLaneReduceWindowCycles(rw, axis_minor, state, rv2)
case 1: CHECK(!dilated) // cost_model.cc:5149
RecordSublaneReduceWindowCycles(rw, axis_minor, state, rv2)
case 2: RecordMajorReduceWindowCycles(rw, axis_minor, state, rv2)
// case −1 (not conv-lowerable) is filtered upstream by the fusion sentinel; not reached here.
if (top): // top-level (non-fused) reduce-window
RecordOperandCycles(rw->operand(0), op_state, ...) // input DMA → rv (R[9..12])
RecordTopLevelConvolutionOutputCycles(rw, out_win, ...) // output DMA → R[11..12]
return Ok
Function Map
| Function | Address | Role |
|---|---|---|
| CostModel::RecordReduceWindowCycles (4-RV) | 0x130b5ec0 | top emitter: CHECK, ConvState build, axis dispatch |
| fusion_util::GetReduceWindowType | 0x1454d4a0 | axis classifier: −1 / 0 / 1 / 2 |
| CostModel::RecordLaneReduceWindowCycles | 0x130c97e0 | lane-axis leaf: VectorLoad + combine + Xlu drain |
| CostModel::RecordSublaneReduceWindowCycles | 0x130c9c60 | sublane-axis leaf: adds sublane shuffle |
| CostModel::RecordMajorReduceWindowCycles | 0x130c9f00 | major-axis leaf: read-heavy, no Xlu |
| CostModel::UpdateCostBasedOnReductionFunction | 0x130c9a20 | combiner-op cost (shared by all three leaves) |
| CostModel::RecordOperandCycles | 0x130ca140 | input-DMA deposit → R[9..12] MemXfer |
| CostModel::RecordTopLevelConvolutionOutputCycles | 0x130bcb80 | output-DMA deposit (top-level only) |
| ResourceVector::Acc(Resource, double) | 0x1c89adc0 | the slot deposit primitive (bound < 23) |
| convolution_util::GetConvLikeProperties | 0x13190bc0 | shared conv/RW dim-number + shape extraction |
| window_util::HasBaseDilation | (call @0x130b62..) | dilation gate blocking Lane/Sublane |
NOTE: the trivial-zero short-circuit (CostModel+0x14 == 0) is not a no-op: it still deposits four explicit 0.0 cycles into VectorLoad / VectorAlu0 / VectorAlu1 / Xlu. The deposit count must match a real pooling op even when the cost is zero, because downstream the four ResourceVectors are folded positionally; a reimplementation that simply returns without the four zero-Acc calls leaves those slots uninitialized for the fold. The deposit goes into the third out-vector (a9 / rv2), which is also where every real-path leaf deposits.
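The trivial path described in the note can be sketched as follows. This is a hedged illustration: the ResourceVector class below is a hypothetical stand-in (a flat list of 23 floats with an accumulating `acc`), since the real in-memory layout is not documented on this page; only the slot ordinals, the bound of 23, and the deposit order come from the text above.

```python
NUM_RESOURCES = 23  # Acc bound quoted in the function map (resource < 23)

# Slot ordinals as quoted on this page.
XLU, VECTOR_ALU0, VECTOR_ALU1, VECTOR_LOAD = 2, 3, 4, 7


class ResourceVector:
    """Hypothetical stand-in: one float of cycles per resource slot."""

    def __init__(self):
        self.cycles = [0.0] * NUM_RESOURCES
        self.deposits = []  # illustration only: ordered log of (slot, cycles)

    def acc(self, resource, cycles):
        # Assumption: Acc accumulates into the slot and rejects out-of-range slots.
        if not 0 <= resource < NUM_RESOURCES:
            raise IndexError(f"resource {resource} out of range")
        self.cycles[resource] += cycles
        self.deposits.append((resource, cycles))


def record_trivial_reduce_window(rv2):
    """Trivial path: four explicit zero deposits, in the order shown in the algorithm."""
    for slot in (VECTOR_LOAD, VECTOR_ALU0, VECTOR_ALU1, XLU):
        rv2.acc(slot, 0.0)


rv2 = ResourceVector()
record_trivial_reduce_window(rv2)
print(rv2.deposits)
```

The deposit log exists only to make the "four calls even at zero cost" behaviour observable; a real reimplementation would keep just the per-slot totals.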
ConvState — the Shared Dim-Product Struct
Purpose
A reduce-window does not build a ConvCostState. It builds the smaller per-call ConvState inline, from the ConvolutionDimensionNumbers and the output Window, by exactly the convolution recipe: each dim field is Product(dim_chunk_sizes) divided by a Target geometry constant (SublaneCount() = 8, ChunksPerTile() = 16), paired with a divisibility remainder. The same struct, the same divisions — see convolution-cost-state for the full field map. The reduce-window leaves read only three of its fields, and the binary names them through CHECK strings and VLOG arguments.
The three fields the pooling cost reads
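The three leaf emitters index ConvState as a _QWORD array (a4[i] = ConvState + 8*i). Four offsets carry the pooling cost, because the reduced-axis window role is held in a pair (+0x68, +0x70) whose members trade places between leaves, and the binary identifies each: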
| ConvState offset | a4[i] | Named role | Evidence |
|---|---|---|---|
| +0x18 | a4[3] | output-spatial dim product | VLOG arg 3 (", %ld" in Lane/Sublane @cost_model.cc:4931/4978) |
| +0x20 | a4[4] | kernel-spatial-dims iteration count | CHECK string rw_state.kernel_spatial_dims_iteration_count == 1 (Major @cost_model.cc:5033) |
| +0x68 | a4[13] | output dim product / window extent (per-axis) | VLOG arg 2 (Lane/Sublane) + combiner (a4[13]−1) |
| +0x70 | a4[14] | window-extent / iteration count (the reduced axis) | VLOG arg 1 (Lane/Sublane) + combiner (a4[14]−1) |
The CHECK at cost_model.cc:5033 is the one hard name: in the Major path, ConvState+0x20 (a4[4]) must equal 1 and is explicitly labeled the kernel-spatial-dims iteration count. That is the conv "how many kernel-window positions iterate" field; a major-axis reduce-window has its window folded into a single major iteration, so it is pinned to 1. The other three are named from the order in which the Lane/Sublane VLOG prints them (win_extent, output, output_spatial) and from which one the combiner decrements by 1 (the window extent of the reduced axis).
QUIRK: the pair (+0x68, +0x70) is the reduced-axis pair. In the Lane leaf the VectorLoad volume is built from +0x70 · +0x18 · +0x20 and the combiner window is +0x68; in the Sublane leaf the VectorLoad volume includes both +0x68 and +0x70 and the combiner window is +0x70. The two leaves read the same struct but treat opposite members of the pair as "the window being reduced," because Lane and Sublane reduce over physically different axes. A reimplementation must not assume a single canonical "window-size" field; the window dimension is axis-dependent.
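The swap described in this quirk is easy to get wrong, so here is a small sketch of the two readings side by side. ConvState is modelled as a dict keyed by qword index (index i standing for ConvState + 8*i); the field values are made-up numbers, and the function names are illustrative.

```python
# Made-up field values, keyed by qword index: +0x18, +0x20, +0x68, +0x70.
st = {3: 6, 4: 1, 13: 3, 14: 5}


def lane_view(st):
    """Lane leaf: volume from a4[4]*a4[3]*a4[14], combiner window is a4[13]."""
    vol = st[4] * st[3] * st[14]
    window = st[13]
    return vol, window


def sublane_view(st):
    """Sublane leaf: base from a4[4]*a4[3]*a4[13], combiner window is a4[14]."""
    base = st[4] * st[3] * st[13]
    window = st[14]
    return base * window, window


print(lane_view(st))     # (30, 3)
print(sublane_view(st))  # (90, 5)
```

With the same struct, the two leaves disagree on both the load volume and the window, which is why a single "window size" accessor would mis-price one of them.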
Axis Dispatch — GetReduceWindowType
Purpose
The cost shape is set by which physical axis the window sweeps. GetReduceWindowType (@0x1454d4a0, namespace fusion_util) returns the axis class. It first asks IsConvLowerableReduceWindow; if the reduce-window cannot be lowered to the conv path it returns −1 (0xFFFFFFFF). Otherwise it maps the layout to physical order (LayoutUtil::MakeLogicalToPhysical) and inspects which window dimensions are non-trivial (window_util::IsTrivialWindowDimension).
Algorithm
function GetReduceWindowType(rw): // @0x1454d4a0
if !IsConvLowerableReduceWindow(rw): return -1 // not modellable via conv path
phys = MakeLogicalToPhysical(operand_layout)
for each non-most-minor physical dim d: // the "outer/major" dims (the loop)
if !IsTrivialWindowDimension(window[d]):
return 2 // Major: window on a slow axis
// window is confined to the two most-minor physical dims (minor0 = most-minor / lane axis)
if !IsTrivialWindowDimension(window[minor0]): return 0 // Lane: lane axis has a window
if !IsTrivialWindowDimension(window[minor1]): return 1 // Sublane: lane trivial, sublane non-trivial
return 2 // Major: both most-minor dims trivial
The master emitter dispatches on the result with if (type) { if (type != 1) { if (type == 2) Major; } else Sublane; } else Lane; so the mapping is 0 → Lane, 1 → Sublane, 2 → Major, confirmed at the three call sites in @0x130b5ec0. (In the decompile the most-minor pair shares one tail whose return value is IsTrivial(window[minor0]) itself. When window[minor0] is non-trivial that value is 0, which is Lane. When minor0 is trivial the code tests minor1: if minor1 is non-trivial it returns the same value, now 1, which is Sublane; otherwise it keeps result = 2, which is Major.)
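The classification above can be sketched as a pure function over "is this window dimension non-trivial" flags. The input convention here is an assumption made for the sketch: flags are given in physical order with index 0 as the most-minor (lane) dimension and index 1 as the sublane dimension, and the conv-lowerability test is passed in as a boolean rather than computed.

```python
LANE, SUBLANE, MAJOR, NOT_LOWERABLE = 0, 1, 2, -1


def get_reduce_window_type(nontrivial, conv_lowerable=True):
    """Classify a reduce-window by which physical axis carries a window.

    nontrivial: list of bools in physical order, index 0 = most-minor (lane),
                index 1 = sublane, remaining entries = outer/major dims.
    """
    if not conv_lowerable:
        return NOT_LOWERABLE
    minor0, minor1, *outer = nontrivial
    if any(outer):        # any window on a slow axis wins first
        return MAJOR
    if minor0:            # lane axis has a window
        return LANE
    if minor1:            # lane trivial, sublane non-trivial
        return SUBLANE
    return MAJOR          # both most-minor dims trivial


print(get_reduce_window_type([True, False, False, False]))   # 0
print(get_reduce_window_type([False, True, False, False]))   # 1
print(get_reduce_window_type([True, False, True, False]))    # 2
```

Note that the outer-dimension check runs first, so a window that touches both the lane axis and a major axis classifies as Major.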
GOTCHA: there are two distinct reduce-window "type −1/2" filters in the cost model and they are not the same gate. The output-fusion dispatch (in GetOutputFusionOrConvolutionCycles) uses a sentinel to skip reduce-windows of type 2 or −1 before any state is built. The emitter here is reached only as a priced sub-emitter, and at that point all three axis types deposit (Lane, Sublane, Major). Do not carry the upstream skip-logic into the leaf dispatch. A type −1 never reaches this function (it is filtered upstream); the case −1 branch is therefore not exercised here.
The Three Leaf Emitters
All three leaves share the signature Record<Axis>ReduceWindowCycles(rw, axis_minor:int, ConvState&, ResourceVector*) and the deposit grammar: throughput cycles into named ResourceVector slots via Acc. They differ in the volume formula and which extra pipes they touch. The per-op throughput is CycleTable::GetCyclesForThroughput(CT) — read through the CostModel's CycleTable vtable slot [+0x10] — where CT is a CycleTable::Instruction ordinal. The integers behind each CT are per-generation and live in vf-cycletable / jf-cycletable.
Lane — RecordLaneReduceWindowCycles @0x130c97e0
The lane-axis reduction reduces over the most-minor (lane) axis, so after the per-element combine it must drain the partial results across lanes on the cross-lane unit (Xlu). This is the only leaf with an Xlu term.
function RecordLaneReduceWindowCycles(rw, axis_minor, st, rv): // @0x130c97e0
vol = st[+0x20] * st[+0x18] * st[+0x70] // a4[4]·a4[3]·a4[14] — output volume read into VMEM
window = st[+0x68] // a4[13]
rv.Acc(VectorLoad=7, (double)vol) // R[7]: input read into VMEM
// F16 pre-packed-feed special case (operand is f16 AND not a kBitcast-class feed, opcode != 0x3d):
if element_type(operand0) == F16(16) && operand0.opcode != 0x3d:
cyc = GetCyclesForThroughput(CT 0x16) // vtable [CycleTable+0x10]
rv.Acc(VectorAluAny=5, cyc * vol) // R[5]: extra per-element unpack-combine
UpdateCostBasedOnReductionFunction(to_apply(rw), vol * (window - 1), rv) // combiner cost
drain = GetCyclesForThroughput(CT 0x1b) // CT 0x1b = Xlu-class throughput
rv.Acc(Xlu=2, drain / Target[+0x4b0]) // R[2]: cross-lane reduction drain
// F16 residual unpack (only when axis_minor == 0 and the *result* is f16):
if axis_minor == 0 && element_type(result) == F16(16):
rv.Acc(VectorAluAny=5, ...) // R[5]
- R[7] VectorLoad = vol = ConvState[+0x20]·[+0x18]·[+0x70]: the input volume streamed into the vector unit.
- R[2] Xlu = throughput(CT 0x1b) / Target[+0x4b0]: the bare CT-0x1b throughput divided by the divisor; there is no additional spatial multiplier in the decompiled deposit (@0x130c97e0: GetCyclesForThroughput(27) → numerator, [Target+0x4b0] → denominator, single vdivsd).
- Target+0x4b0 is the vector-ISA Xlu/cross-lane rate divisor, populated in Target::Init from TpuSequencerParts::vector_isa(), not a ConvCostState field (see correction below). It is the same divisor the conv emitter uses for its Xlu mxres deposit.
- The combiner count is vol · (window − 1): the reduction combines window − 1 times per output element (a tree reduce over the window touches each of the window inputs but performs window − 1 combines).
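A worked sketch of the Lane deposits listed above, under stated assumptions: the throughput numbers and the divisor are made-up placeholders (the real ones are per-generation CycleTable and Target values), the F16 branches are omitted, and the combiner is assumed to land on VectorAluAny without the float slot remap.

```python
XLU, VECTOR_ALU_ANY, VECTOR_LOAD = 2, 5, 7  # slot ordinals quoted on this page


def lane_deposits(vol, window, combiner_tput, xlu_tput, xlu_divisor):
    """Return {slot: cycles} for the non-F16 Lane path.

    combiner_tput, xlu_tput and xlu_divisor are placeholders for
    GetCyclesForThroughput(combiner CT), GetCyclesForThroughput(CT 0x1b)
    and Target[+0x4b0] respectively.
    """
    return {
        VECTOR_LOAD: float(vol),                         # input read
        VECTOR_ALU_ANY: combiner_tput * vol * (window - 1),  # window - 1 combines per output
        XLU: xlu_tput / xlu_divisor,                     # bare throughput / divisor, no volume factor
    }


# Made-up inputs: vol = 30, window = 3, combiner 1 cycle, drain 8 cycles, divisor 4.
print(lane_deposits(30, 3, 1.0, 8.0, 4.0))
```

The Xlu term does not scale with `vol`, which mirrors the observation above that the decompiled deposit has no spatial multiplier.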
Sublane — RecordSublaneReduceWindowCycles @0x130c9c60
The sublane reduction reduces across the 8 sublanes. Instead of an Xlu drain it charges a sublane-shuffle term on VectorAluAny (the shuffle/broadcast family lands on R[5], per resource-enum), and it invokes the combiner twice — once for the window reduction and once with a fixed 4× factor for the cross-sublane combine tree.
function RecordSublaneReduceWindowCycles(rw, axis_minor, st, rv): // @0x130c9c60
base = st[+0x20] * st[+0x18] * st[+0x68] // a4[4]·a4[3]·a4[13]
window = st[+0x70] // a4[14]
vol = base * window // full output volume
rv.Acc(VectorLoad=7, (double)vol) // R[7]: input read
if element_type(operand0) == F16(16) && operand0.opcode != 0x3d: // same F16 gate as Lane
cyc = GetCyclesForThroughput(CT 0x16)
rv.Acc(VectorAluAny=5, cyc * vol)
UpdateCostBasedOnReductionFunction(to_apply(rw), base * (window - 1), rv) // window combine
SublaneCount() // = 8 (consulted, result folded into the shuffle)
r = CycleTable::GetResource(CT 0x15) // CT 0x15 → R[5] VectorAluAny (shuffle family)
rv.Acc(r, <shuffle cycles>) // R[5]: sublane shuffle
UpdateCostBasedOnReductionFunction(to_apply(rw), 4 * base, rv) // cross-sublane combine tree (×4)
if axis_minor == 0 && element_type(result) == F16(16):
rv.Acc(VectorAluAny=5, ...) // R[5] F16 residual
QUIRK: the second combiner call uses a literal 4 × base factor (@0x130c9c60, 4 * v9), not base × (window−1). This 4 is the cost of the cross-sublane reduction tree, charged independently of the window size: sublane reductions fan in over a fixed shuffle network whose depth the model prices as a constant 4 combiner passes. A reimplementation that scales this term by the sublane count (8) or by the window will mis-price every sublane-axis pool.
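The two combiner counts the Sublane leaf passes can be sketched in isolation. The function name is illustrative; the formulas are the two counts from the pseudocode above, and the inputs are made-up numbers.

```python
CROSS_SUBLANE_PASSES = 4  # literal factor in the second combiner call


def sublane_combiner_counts(base, window):
    """Return (window_combine_count, cross_sublane_count) for the Sublane leaf."""
    window_combines = base * (window - 1)         # first call: scales with the window
    cross_sublane = CROSS_SUBLANE_PASSES * base   # second call: fixed x4, window-independent
    return window_combines, cross_sublane


# Same base, two different window extents: only the first count moves.
print(sublane_combiner_counts(18, 3))  # (36, 72)
print(sublane_combiner_counts(18, 9))  # (144, 72)
```

Each count is then multiplied by the combiner throughput inside the shared combiner helper, so the fixed term contributes the same cycles regardless of pool size along the window.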
Major — RecordMajorReduceWindowCycles @0x130c9f00
A major-axis reduce-window reduces over a slow/outer dimension, so it is dominated by the cost of reading the (large) window volume; there is no cross-lane or cross-sublane drain.
function RecordMajorReduceWindowCycles(rw, axis_minor, st, rv): // @0x130c9f00
CHECK(st[+0x20] == 1) // "rw_state.kernel_spatial_dims_iteration_count == 1" cost_model.cc:5033
props = GetConvLikeProperties(rw)
window_vol = Product(window_dim.size for each window dimension) // 64-bit multiply over WindowDimension list
vol = st[+0x68] * st[+0x70] * st[+0x18] * window_vol // a4[13]·a4[14]·a4[3]·window_vol
rv.Acc(VectorLoad=7, (double)vol) // R[7]: read the whole window volume
UpdateCostBasedOnReductionFunction(to_apply(rw), vol, rv) // combiner: once per element of vol
- R[7] VectorLoad carries the full output_volume · window_volume read. The combiner count is the entire vol (no −1 decrement), because a major reduction's combine count is folded directly into the volume product rather than separated into a per-output window-minus-one term.
- No Xlu, no shuffle, no F16 residual: a major reduce-window is read-plus-combine only.
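The Major volume can be sketched the same way. ConvState is again modelled as a dict keyed by qword index with made-up values, the window sizes are hypothetical, and the CHECK is represented by a Python exception rather than a process abort.

```python
from math import prod


def major_volume(st, window_sizes):
    """Volume used for both the VectorLoad deposit and the combiner count."""
    if st[4] != 1:  # mirrors CHECK(kernel_spatial_dims_iteration_count == 1)
        raise ValueError("kernel_spatial_dims_iteration_count must be 1")
    window_vol = prod(window_sizes)                  # product over all window dimensions
    return st[13] * st[14] * st[3] * window_vol      # a4[13]*a4[14]*a4[3]*window_vol


st = {3: 6, 4: 1, 13: 3, 14: 5}   # made-up values for +0x18, +0x20, +0x68, +0x70
vol = major_volume(st, [2, 2, 1, 1])
print(vol)  # 360
```

Here the same `vol` feeds VectorLoad and the combiner helper unchanged, in contrast to the Lane and Sublane leaves, which subtract one from the window before charging the combiner.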
The Combiner Cost — UpdateCostBasedOnReductionFunction
Purpose
Every reduce-window carries a to_apply subcomputation — maximum for max-pool, add for sum/avg-pool, or an arbitrary user reduction. UpdateCostBasedOnReductionFunction (@0x130c9a20, shared by all three leaves) prices the combiner itself: it walks the subcomputation's instruction list, and for each modeled combiner op (skipping parameter/get-tuple-element) it maps the opcode to a CycleTable::Instruction and charges throughput(CT) × count into VectorAluAny. The loop does not stop after the first op — every modeled instruction in to_apply contributes a deposit, so a single-op reduction (the common maximum/add) charges once while a multi-op user reduction charges once per modeled op. The count is supplied by the caller (the window-element-per-output count computed above).
Algorithm
function UpdateCostBasedOnReductionFunction(comp, count, rv): // @0x130c9a20
for inst in comp->instructions(): // walks ALL instructions; deposits per modeled op
switch (inst->opcode()): // opcode byte +12
case 0x29 ')' , 0x52 'R': continue // parameter / get-tuple — no cost, skip
case 0x49 'I', 0x4a 'J': // (min/max-class combiner)
Resource = VectorAluAny(5); CT = 0x20
case 0x4b 'K': // multiply-class
Resource = VectorAluAny(5); CT = 0x14 // CT 0x14 for both int and float
if ElementIsFloating(operand): Resource = GetResource(CT)
default:
if (opcode != 0x03 add): FATAL "Reduction Function not modeled: Please file an XLA bug." // :6870
Resource = VectorAluAny(5); CT = 0x12 // CT 0x12 (add)
if ElementIsFloating(operand): Resource = GetResource(CT) // remap to the float-op slot
cyc = GetCyclesForThroughput(CT) // vtable [CycleTable+0x10]
rv.Acc(Resource, cyc * count) // the combiner deposit, charged for this op
// loop continues to the next instruction — every modeled combiner op in `comp` is charged
The opcode→CT classification (@0x130c9a20):
| Combiner opcode | Class | CT (float) | CT (int) | Resource |
|---|---|---|---|---|
| 0x49/0x4a (min/max, the max-pool case) | compare-select | 0x20 | 0x20 | R[5] VectorAluAny (no float remap) |
| 0x4b (multiply) | mul | 0x14 | 0x14 | R[5]; float remaps slot via GetResource(0x14) |
| 0x03 add (the avg/sum-pool case) | add | 0x12 | 0x12 | R[5]; float remaps slot via GetResource(0x12) |
| 0x29 (parameter) / 0x52 (get-tuple-element) | structural | — | — | (no deposit) |
| anything else | — | FATAL (cost_model.cc:6870) | — | — |
NOTE: the count argument is the only thing that ties the combiner to the window area. Lane passes vol·(window−1), Sublane passes base·(window−1) plus a separate 4·base, Major passes the whole vol. So to_apply × window_volume × output, the flop the HLO-level HandleReduceWindow estimator charges, appears here as throughput(CT) × count: the combiner runs once per window-element per output, priced at the per-generation throughput of the specific combiner op (max vs add vs multiply), not as a fixed flop.
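The combiner walk and the opcode table can be sketched as below. Assumptions, stated plainly: the subcomputation is reduced to a list of opcode bytes, the throughput table holds made-up cycle values in place of the per-generation CycleTable, the float slot remap via GetResource is ignored (everything is summed into one total), and the fatal path is represented by a Python exception.

```python
PARAMETER, GET_TUPLE_ELEMENT = 0x29, 0x52

# Combiner opcode -> CycleTable::Instruction ordinal, from the table above.
CT_BY_OPCODE = {0x49: 0x20, 0x4A: 0x20, 0x4B: 0x14, 0x03: 0x12}


def combiner_cycles(opcodes, count, throughput):
    """Sum throughput(CT) * count over every modeled op in the subcomputation."""
    total = 0.0
    for op in opcodes:
        if op in (PARAMETER, GET_TUPLE_ELEMENT):
            continue                      # structural ops carry no cost
        if op not in CT_BY_OPCODE:
            raise NotImplementedError("Reduction Function not modeled")
        total += throughput[CT_BY_OPCODE[op]] * count
    return total


fake_throughput = {0x20: 1.0, 0x14: 2.0, 0x12: 1.0}   # placeholder cycle values

# Max-pool style combiner: two parameters and one 0x49-class op, count = 60.
print(combiner_cycles([0x29, 0x29, 0x49], 60, fake_throughput))        # 60.0
# Two-op user reduction: multiply then add, both charged.
print(combiner_cycles([0x29, 0x29, 0x4B, 0x03], 60, fake_throughput))  # 180.0
```

The second call shows the "loop does not stop after the first op" behaviour: both modeled instructions are charged against the same caller-supplied count.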
GOTCHA: The Xlu-rate divisor Target+0x4b0 (and the matmul-rate divisor Target+0x4ac) are Target fields, not ConvCostState fields: chip-wide vector-ISA rates populated in Target::Init from TpuSequencerParts::vector_isa() (vector_isa[+0xc] → Target+0x4ac and +0x4b0). The Lane leaf reaches Target+0x4b0 through the CostModel's CycleTable ([CostModel+0x8] = CycleTable, then its Target pointer, then [Target+0x4b0]) at @0x130c98f6. This is the same vector-ISA Xlu rate the conv emitter divides its mxres deposit by, not a per-reduce-window derate.
How a Pool Is Priced — End to End
Putting the pieces together, a reduce-window's compute cost lands entirely on the vector/cross-lane pipes:
max-pool / avg-pool cost (per the three leaves)
R[7] VectorLoad = output_volume (input read into VMEM)
R[5] VectorAluAny = throughput(combiner CT) · combiner_count (the to_apply op, once per window-elem)
+ throughput(CT 0x16) · vol (F16 unpack, only when operand is f16)
+ sublane shuffle (Sublane axis only, via GetResource(CT 0x15))
R[2] Xlu = throughput(CT 0x1b) / Target[+0x4b0] (Lane axis only — cross-lane drain)
R[9..12] MemXfer = RecordOperandCycles(input) (+ output DMA when top-level)
The bundle cost then comes from MaxResourceCycles (resource-enum): the three VectorAlu* slots blend at 50% port-balance, the four MemXfer slots serial-sum, and VectorLoad/Xlu join the plain-MAX group. So a pooling op is bound by whichever dominates — the input read (VectorLoad), the combiner throughput (VectorAluAny), the cross-lane drain (Xlu), or the input DMA (MemXfer).
The contrast with the conv emitter is exact and instructive: the conv leaf RecordConvKernelCycles deposits into R[0] Matpush, R[1] Matmul, and R[2] Xlu (priced through MxuLatencyTable at the matmul/matpush rate), because a convolution is a systolic matrix multiply. The reduce-window emitter reuses the same ConvState, the same Target+0x4b0 Xlu divisor, and the same four-ResourceVector protocol — but deposits onto VectorLoad/VectorAluAny/Xlu, never the MXU pipes, because a windowed reduction is a vector operation. Same scaffolding, different functional unit.
Considerations
- Dilation blocks Lane/Sublane. If HasBaseDilation is true, the Lane and Sublane leaves CHECK-fail (has_base_dilation == false, cost_model.cc:5142/5149). A base-dilated window is therefore only modellable as a Major reduce-window; a reimplementation must route dilated pools to the major path or it will trip the assertion.
- Layout pins the axis. GetReduceWindowType depends on the physical layout (MakeLogicalToPhysical), so the same logical reduce-window can classify Lane / Sublane / Major depending on which dimension the layout assignment made most-minor. The cost is layout-sensitive by construction.
- The two CHECKs on minor-ness (input_batch_is_most_minor, !output_feature_is_most_minor) gate the whole real path. They encode the assumption that a conv-lowerable reduce-window keeps batch most-minor and feature not-most-minor; an HLO violating either is not priced by this emitter.
- Per-gen throughput. Every CT 0x12/0x14/0x16/0x1b/0x15/0x20 resolves to a per-generation integer via GetCyclesForThroughput. The combiner-op CTs (0x12 add, 0x14 mul, 0x20 min/max) are generation-invariant in which slot they hit (VectorAluAny) but not in how many cycles; those live on the per-gen cycle-table pages.
Related Components
| Name | Relationship |
|---|---|
| convolution-cost-state | the ConvState/ConvCostState structs this emitter reuses; the conv sibling RecordConvKernelCycles |
| resource-enum | the 23-slot ResourceVector, the Acc deposit, and the MaxResourceCycles fold |
| mxu-latency-overview | the MXU occupancy model the conv path uses and the pooling path does not |
| vf-cycletable / jf-cycletable | the per-CT throughput integers GetCyclesForThroughput returns |
| window-description-cost | the conv/DMA byte+throughput primitive behind RecordOperandCycles |
| slot-vpu | the LLO VALU slot whose vadd/vmax/.xlane ops the combiner CTs price |
Cross-References
- ConvolutionCostState: the ConvState/ConvCostState field map and the conv emitter this is the pooling sibling of
- WindowDescription Byte-Cost: the conv/DMA byte+throughput primitive RecordOperandCycles builds on
- VfCycleTable: the per-generation CycleTable::Instruction → throughput integers the combiner and drain CTs resolve through
- JF CycleTable: the Jellyfish/Dragonfish throughput table
- Resource Enum (23-slot): VectorLoad/VectorAluAny/Xlu slots, Acc, and the MaxResourceCycles overlap reduction
- MXU Latency Overview: the MXU reservation model the conv path prices through and the pooling path bypasses
- Performance Overview: the base per-opcode latency grid behind GetCyclesForThroughput
- VPU (Vector-ALU) Slot: the LLO VALU/.xlane ops the combiner and reduction CTs charge against