GetHloResources Routing
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped). Every symbol below is a demangled C++ name. For .text and .rodata, VMA == file offset; for .data.rel.ro, VMA − 0x200000 == file offset. All addresses are VMA. Other versions differ.
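The address convention above can be written down as a small helper. This is a minimal sketch: the function name and the section-name argument are illustrative (they do not come from the binary), and only the two mappings stated above are encoded. Any other section is rejected rather than guessed.

```python
# Hypothetical helper: translate a VMA from this build into a file offset.
DATA_REL_RO_DELTA = 0x200000  # .data.rel.ro: VMA - 0x200000 == file offset


def vma_to_file_offset(vma: int, section: str) -> int:
    """Map a VMA to a file offset for the sections named on this page."""
    if section in (".text", ".rodata"):
        return vma  # identity mapping for code and read-only data
    if section == ".data.rel.ro":
        return vma - DATA_REL_RO_DELTA
    raise ValueError(f"no mapping documented for section {section!r}")


# The leaf jump table lives in .rodata, so its file offset equals its VMA.
print(hex(vma_to_file_offset(0xAE0EBBC, ".rodata")))
```

Raising on unknown sections keeps the helper from silently producing wrong offsets for a different build or an undocumented segment.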
Abstract
CostModel::GetHloResourcesImpl is the routing spine of the TPU bundle-occupancy cost model: given one HLO instruction, it decides which cost sub-emitter prices the op, and that sub-emitter deposits the op's throughput cycles into the 23-slot ResourceVector. It is the resource-routing counterpart to the flop/byte side. The flop side (TpuHloCostAnalysis) walks the same HLO and writes float properties (+0x50 flops / +0x54 transcendentals / +0x58 bytes) consumed by the fusion-priority model; this page is the other surface — the per-op cycle deposit into a VLIW-bundle resource vector. The two share the HLO and agree on which ops are expensive, but they are computed by different functions, store into different structures, and never share a number.
Routing happens in three nested stages, and the page is organized around them. Stage 1, GetHloResourcesImpl (@0x130aa580), is a five-way top-level dispatch: a collective goes to the network model; a conv-lowerable op (or a non-max-pool reduce-window) goes to the MXU output-fusion model; a collective-compute fusion goes to its own model; everything else falls to the loop/elementwise default. Stage 2, RecordCyclesIfFused (@0x130cc720), is the fusion-peel router reached only from the default arm: for a fused region it peels loop-fusion roots first, then (for conv-lowerable ops) output-fusion, convolution and reduce-window roots, each to its dedicated emitter, and falls through to the leaf emitter for every remaining op. The producer→consumer input-DMA edges are dropped in the leaf's parameter arm, not in the router. Stage 3, RecordHloCycles (@0x130bbfe0), is the leaf per-op deposit: a switch over the opcode, compiled to a 0x7f-entry self-relative jump table at .rodata 0xae0ebbc, routes each opcode to the CycleTable::Instruction it issues and the named Resource slot it lands in.
The familiar reference frame is an LLVM TargetTransformInfo cost path where each opcode reaches a getInstructionCost arm, except that here the "cost" is a resource occupancy deposited into a structured bundle vector that a later MaxResourceCycles reduction folds with explicit overlap rules. This page documents the three dispatch layers, the predicates and masks that gate each arm, the leaf opcode→slot routing, and how each routed sub-model writes into ResourceVector::Acc. The flop-override table the TpuHloCostAnalysis page owns is not duplicated here.
For reimplementation, the contract is:
- The five-way GetHloResourcesImpl dispatch: the collective predicate (with its opcode-61 collective-in-fusion peel), the numeric element-type bitmask gate, the conv/reduce-window selection with the max-pool sentinel, and the collective-compute-fusion vs loop/elementwise default.
- The RecordCyclesIfFused peel order (loop-fusion first; then, under the IsConvLowerable gate, output-fusion → conv → reduce-window; then leaf) and the IsProducerUse edge-internalisation that runs in the leaf's parameter arm before the fused-input deposit.
- The RecordHloCycles opcode→slot jump table: the idx = opcode − 3 index, the ja > 0x7e → default bound, and for every named block the CycleTable::Instruction issued and the Resource slot it lands in via CycleTable::GetResource.
- The leaf deposit mechanic: slot = GetResource(CT), thru = cycletable.vtable[+0x10](CT), Acc(slot, element_count × thru × W).
| Item | Location |
|---|---|
| Class | xla::jellyfish::CostModel (cost_model.cc) |
| Stage 1 — top dispatch | CostModel::GetHloResourcesImpl @0x130aa580 (5-way) |
| Stage 2 — fusion peel | CostModel::RecordCyclesIfFused @0x130cc720 |
| Stage 3 — leaf deposit | CostModel::RecordHloCycles @0x130bbfe0 |
| Leaf jump table | .rodata 0xae0ebbc (0x7f × i32 self-relative; target = 0xae0ebbc + off) |
| Op→slot map | CycleTable::GetResource(Instruction) @0x1c89ce20 → dword_B438AEC[CT] |
| Deposit primitive | ResourceVector::Acc(Resource, double) @0x1c89adc0 (bound < 0x17) |
| Public entry | CostModel::GetHloResources @0x130aa560 (thin wrapper) |
| Source file | platforms/xla/service/jellyfish/cost_model/cost_model.cc |
Stage 1 — GetHloResourcesImpl, the Five-Way Top Dispatch
Purpose
GetHloResourcesImpl is the front door: it returns a StatusOr<ResourceVector> for one HLO op by selecting one of five pricing paths. It does not itself deposit cycles — each arm delegates to a sub-emitter that builds and fills the vector. The arms, in evaluation order: collective → GetCollectiveCycles; conv-lowerable / non-max-pool reduce-window → GetOutputFusionOrConvolutionCycles; collective-compute fusion → GetCollectiveComputeFusionCycles; default → GetLoopFusionOrUnfusedHloCycles (the only arm that reaches RecordCyclesIfFused / RecordHloCycles).
Entry Point
GetHloResources @0x130aa560 ── public wrapper
└─ GetHloResourcesImpl @0x130aa580 ── StatusOr<ResourceVector>, 5-way
├─ IsSupportedCollectiveHlo @0x130aeda0 ── arm 1 (+ opcode-61 fusion peel)
│ → GetCollectiveCycles @0x130abfc0 rv[+8] = scalar cycles
├─ IsFusionSupportedHlo @0x130abee0 ── numeric-type + fusion-support gate
├─ IsConvLowerable @0x14553620 ┐
├─ ExtractConvLikeHlo @0x1d6aa140 ├ conv / reduce-window selection
├─ GetReduceWindowType @0x1454d4a0 ┘ (type −1/2 max-pool → NOT conv-priced)
│ → GetOutputFusionOrConvolutionCycles @0x130aede0 ── arm 2 (MXU)
├─ IsCollectiveComputeFusion @0x13e028c0
│ → GetCollectiveComputeFusionCycles @0x130b13a0 ── arm 3
└─ (default)
→ GetLoopFusionOrUnfusedHloCycles @0x130b2bc0 ── arm 4 → Stage 2
Algorithm
StatusOr<ResourceVector> GetHloResourcesImpl(inst, opts, fs, isFused):  // sub_130AA580
  op = *(byte*)(inst + 12);                        // HloInstruction::opcode
  // (1) COLLECTIVE: all-reduce / all-gather / … priced by the network path.
  if IsSupportedCollectiveHlo(inst):               // sub_130AEDA0
    return GetCollectiveCycles(inst, &rv);         // sub_130ABFC0; rv[+8] = scalar cycles
  if op == 61 (fusion):                            // peel a collective wrapped in a fusion
    comp = inst.fused_instructions_computation();
    c = first fused instr where IsSupportedCollectiveHlo(c) holds;
    if c found and !IsCollectiveComputeFusion(inst):
      return GetCollectiveCycles(c, &rv);          // price the inner collective
  // (2) NUMERIC ELEMENT-TYPE + FUSION-SUPPORT GATE
  et = inst.shape.element_type;                    // *(u32*)*(inst+88)
  if (et <= 0x21 and bit(0x2FFF91FFE, et))         // numeric/packed-type mask (bittest64)
     or (et & 0xFFFFFFFE) == 0x20                  // (the F8 pair 0x20/0x21)
     or (et <= 0x22 and bit(0x400048000, et)):     // the loop/output-fusion peel mask
    if IsFusionSupportedHlo(inst, opts.target):    // sub_130ABEE0
      // (3) CONV / REDUCE-WINDOW MAIN-OP SELECTION
      convlowerable = IsConvLowerable(inst);       // sub_14553620
      conv = ExtractConvLikeHlo(inst);             // sub_1D6AA140
      priced = inst;
      if conv and conv.opcode == 94 (reduce-window):
        t = GetReduceWindowType(conv);             // sub_1454D4A0
        if t != -1 and t != 2:                     // 0/1 (lane/sublane) → conv-priced
          if convlowerable:
            return GetOutputFusionOrConvolutionCycles(inst, opts, fs);  // sub_130AEDE0
        priced = inst;                             // fall to collective-compute / default
      else if convlowerable:
        return GetOutputFusionOrConvolutionCycles(inst, opts, fs);
      // (4) COLLECTIVE-COMPUTE FUSION (collective fused with compute)
      if IsCollectiveComputeFusion(priced):        // sub_13E028C0
        return GetCollectiveComputeFusionCycles(priced, opts, fs);      // sub_130B13A0
      // (5) DEFAULT
      return GetLoopFusionOrUnfusedHloCycles(priced, opts, fs);         // sub_130B2BC0
  // non-numeric / unsupported types → empty ResourceVector (Ok, all-zero)
  return Ok(empty_rv);
The Three Gates That Shape the Dispatch
The numeric element-type gate (@0x130aa9dc). Before any conv/fusion work, the dispatch tests inst.shape.element_type against cmp et, 0x21 followed by bt 0x2FFF91FFE (the movabs at @0x130aa9e2). 0x2FFF91FFE is a 34-bit mask over the PrimitiveType ordinal selecting the numeric/packed types the bundle model can price; a tuple/token/opaque type (outside the mask) falls straight through to an empty (all-zero) ResourceVector. Two secondary tests widen the gate: (et & 0xFFFFFFFE) == 0x20 admits the F8 pair (ordinals 0x20/0x21), and bt 0x400048000 over et ≤ 0x22 is the loop-fusion / output-fusion peel mask that lets a fusion root through even when its own element type is structural.
GOTCHA: the same 0x2FFF91FFE numeric mask and the same 0x400048000 peel mask reappear verbatim inside RecordHloCycles (@0x130bc03d). They are not the same decision: the copy in GetHloResourcesImpl gates routing (does this op reach the conv/fusion arms at all), while the copy in RecordHloCycles gates the leaf deposit (whether the op is recorded or short-circuited). A reimplementation must apply the mask at both layers; applying it only at the top lets structural ops slip into the leaf switch, and applying it only at the leaf mis-routes numeric ops at the top.
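The three-part element-type test can be sketched as a single predicate. This is a minimal illustration built only from the constants and comparisons quoted above; the function and helper names are hypothetical, and the element type is treated as a bare ordinal without naming the PrimitiveType it stands for.

```python
NUMERIC_MASK = 0x2FFF91FFE  # bit-tested after `cmp et, 0x21`
PEEL_MASK = 0x400048000     # bit-tested over et <= 0x22


def _bit(mask: int, n: int) -> bool:
    """Equivalent of the x86 `bt` test: is bit n of mask set?"""
    return (mask >> n) & 1 == 1


def passes_type_gate(et: int) -> bool:
    """True when the element-type ordinal clears any of the three tests."""
    if et <= 0x21 and _bit(NUMERIC_MASK, et):
        return True
    if (et & 0xFFFFFFFE) == 0x20:  # the F8 pair 0x20 / 0x21
        return True
    return et <= 0x22 and _bit(PEEL_MASK, et)


# Ordinals in 0..0x22 that the gate rejects (these get the all-zero vector).
rejected = [et for et in range(0x23) if not passes_type_gate(et)]
print(rejected)
```

Because the same two masks are tested again in the leaf, a reimplementation would presumably call one shared predicate like this from both layers rather than duplicating the constants.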
The conv/reduce-window selection (@0x130aaa79). ExtractConvLikeHlo pulls the conv-like root out of the (possibly fused) instruction. If that root is a reduce-window (opcode 94 = 0x5e), GetReduceWindowType classifies its axis. The decompile branches on t != -1 && t != 2: a reduce-window whose type is -1 (not conv-lowerable) or 2 (major-axis / max-pool) is not priced as a convolution — it falls through to the default loop path. Only a lane-axis (0) or sublane-axis (1) reduce-window that is also IsConvLowerable reaches GetOutputFusionOrConvolutionCycles.
NOTE: the type -1/2 sentinel here is the dispatch filter that skips MXU pricing; it is not the leaf reduce-window emitter. When a reduce-window is conv-priced it later reaches RecordReduceWindowCycles, where all three axis types (lane/sublane/major) deposit. The upstream skip-logic must not be carried into the leaf. The full pooling cost is on Reduce-Window / Pooling Cost; the conv state it shares is on ConvolutionCostState.
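The conv/reduce-window selection reduces to a small decision over three inputs. The sketch below restates that branch under stated assumptions: the function name is hypothetical, and the three predicates from the binary are passed in as plain values (the opcode of the ExtractConvLikeHlo result or None, the GetReduceWindowType result, and the IsConvLowerable result).

```python
from typing import Optional

REDUCE_WINDOW = 0x5E  # opcode 94


def routes_to_conv_arm(conv_opcode: Optional[int],
                       rw_type: Optional[int],
                       conv_lowerable: bool) -> bool:
    """True when the op is priced by GetOutputFusionOrConvolutionCycles."""
    if conv_opcode == REDUCE_WINDOW:
        # Types -1 and 2 are the sentinels that skip MXU pricing.
        if rw_type in (-1, 2):
            return False
        return conv_lowerable  # lane (0) or sublane (1) axis
    # Any other conv-like root (or none at all): the lowerable test decides.
    return conv_lowerable


print(routes_to_conv_arm(REDUCE_WINDOW, 0, True))
print(routes_to_conv_arm(REDUCE_WINDOW, 2, True))
```

A False result does not mean the op is unpriced: it continues to the collective-compute-fusion test and then to the default arm.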
The collective predicate and its fusion peel (@0x130aa580, lines following the first IsSupportedCollectiveHlo). A bare collective is routed to GetCollectiveCycles immediately. When the op is a fusion (opcode 61 = 0x3d), the dispatch walks the fused computation's instruction list looking for a collective HLO inside (IsSupportedCollectiveHlo over each fused instr) and, if found and the op is not a collective-compute fusion, prices that inner collective through GetCollectiveCycles. This is the one arm that re-points the priced op away from the top-level instruction.
Function Map
| Function | Address | Role |
|---|---|---|
| CostModel::GetHloResources | 0x130aa560 | public wrapper |
| CostModel::GetHloResourcesImpl | 0x130aa580 | 5-way routing dispatch |
| IsSupportedCollectiveHlo | 0x130aeda0 | collective predicate (arm 1) |
| GetCollectiveCycles | 0x130abfc0 | network-model deposit (rv[+8]) |
| IsFusionSupportedHlo | 0x130abee0 | numeric + fusion-support gate |
| IsConvLowerable | 0x14553620 | conv-lowerable predicate |
| ExtractConvLikeHlo | 0x1d6aa140 | pull the conv/reduce-window root |
| GetReduceWindowType | 0x1454d4a0 | −1/2 max-pool sentinel (axis class) |
| GetOutputFusionOrConvolutionCycles | 0x130aede0 | arm 2: MXU output-fusion model |
| IsCollectiveComputeFusion | 0x13e028c0 | arm-3 predicate |
| GetCollectiveComputeFusionCycles | 0x130b13a0 | arm 3 |
| GetLoopFusionOrUnfusedHloCycles | 0x130b2bc0 | default → Stage 2 |
Stage 2 — RecordCyclesIfFused, the Fusion-Peel Router
Purpose
GetLoopFusionOrUnfusedHloCycles composes the software-pipelined prologue/steady/tail bundle cost (see Bundle-Aware Cost) and calls into the per-op deposit machinery. For a fused region that machinery is RecordCyclesIfFused: a router that recognizes the fusion root's kind and dispatches to the dedicated heavy emitter, peeling the structural fusions before any leaf op is priced. An unfused op reaches the leaf RecordHloCycles directly.
Entry Point
RecordCyclesIfFused (@0x130cc720) ── fusion-root kind router
├─ IsLoopFusion? → RecordLoopFusionCycles @0x130b89a0 (recurse over fused leaves)
├─ IsConvLowerable? ── builds shared ConvCostState, then:
│ ├─ IsOutputFusion? → RecordOutputFusionCycles @0x130b86c0 (MXU output-fusion)
│ ├─ opcode == 0x2b? → RecordConvolutionCycles @0x130ca6c0 (conv → Matmul/Matpush/Xlu)
│ └─ (else, CHECK 0x5e) → RecordReduceWindowCycles @0x130c94e0 (pooling → VectorLoad/AluAny/Xlu)
└─ (not conv-lowerable) → RecordHloCycles @0x130bbfe0 (leaf per-op deposit)
Algorithm
Status RecordCyclesIfFused(inst, fs, window, rv, isFused, nesting):  // sub_130CC720
  if IsLoopFusion(inst):                              // line 188
    return RecordLoopFusionCycles(inst, window, rv, …, nesting+1);        // sub_130B89A0
  if IsConvLowerable(inst):                           // line 256; gates the window arms
    convCostState = GetConvolutionCostState(inst);    // + CalculateNestedConvolutionWindows
    if IsOutputFusion(inst):                          // line 434
      return RecordOutputFusionCycles(inst, &convCostState, …, nesting+1); // sub_130B86C0
    if inst.opcode == 0x2b (convolution):             // line 513
      return RecordConvolutionCycles(inst, &convCostState, rv, nesting+1); // sub_130CA6C0
    // else: CHECK(opcode == kReduceWindow) @line 526 (cost_model.cc:6236)
    return RecordReduceWindowCycles(inst, …);         // sub_130C94E0, line 536
  // not conv-lowerable: leaf deposit (internal producer edges are dropped inside its parameter arm)
  return RecordHloCycles(inst, window, rv, fs, nesting+1);                // sub_130BBFE0, line 706
The peel order is fixed and confirmed at the five call sites. IsLoopFusion is tested first (it recurses over its constituent leaves). The remaining heavy arms — output-fusion, then conv (0x2b), then reduce-window (0x5e) — are nested inside a single IsConvLowerable(inst) gate (line 256): when that gate holds, the router materializes a ConvCostState (via GetConvolutionCostState / CalculateNestedConvolutionWindows) shared by all three window arms, dispatched in that order. Only a non-conv-lowerable, non-loop-fusion op reaches the leaf fallthrough. The reduce-window arm is the else of the conv test and guards itself with CHECK(hlo_to_fuse->opcode() == kReduceWindow) (MakeCheckOpString @line 526, cost_model.cc:6236), so a mis-classified peel aborts rather than mis-prices.
NOTE: the leaf RecordHloCycles itself CHECKs !hlo->IsLoopFusion() && !hlo->IsOutputFusion() (cost_model.cc:6609, @0x130bbfe0 line 128). A fusion root must never reach the leaf; it is RecordCyclesIfFused's job to peel it first. A reimplementation that calls the leaf on a fusion root trips this assertion. The peel router is the only legal path from a fusion to a deposit.
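The peel order and its two guards can be modelled as a function that names the emitter an op is routed to. This is a sketch under assumptions: the Op record and peel_target are hypothetical stand-ins, the predicates are precomputed booleans rather than calls into the binary, and the two CHECKs are represented as raised errors.

```python
from dataclasses import dataclass

CONVOLUTION = 0x2B
REDUCE_WINDOW = 0x5E


@dataclass
class Op:
    opcode: int
    is_loop_fusion: bool = False
    is_output_fusion: bool = False
    is_conv_lowerable: bool = False


def peel_target(op: Op) -> str:
    """Name the emitter chosen by the peel order described above."""
    if op.is_loop_fusion:  # tested first
        return "RecordLoopFusionCycles"
    if op.is_conv_lowerable:  # single gate around the three window arms
        if op.is_output_fusion:
            return "RecordOutputFusionCycles"
        if op.opcode == CONVOLUTION:
            return "RecordConvolutionCycles"
        if op.opcode != REDUCE_WINDOW:  # mirrors the kReduceWindow CHECK
            raise ValueError("expected a reduce-window in the else arm")
        return "RecordReduceWindowCycles"
    if op.is_output_fusion:  # mirrors the leaf's own CHECK
        raise ValueError("fusion root must not reach the leaf")
    return "RecordHloCycles"


print(peel_target(Op(CONVOLUTION, is_conv_lowerable=True)))
```

Raising instead of returning a fallback keeps the sketch faithful to the abort-rather-than-mis-price behaviour of the two CHECKs.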
Edge Internalisation — IsProducerUse
Before the leaf deposit of a fused parameter, RecordHloCycles walks the fusion's operands and consults IsProducerUse (@0x130ab0c0) per operand edge. A producer→consumer edge that is internal to the fusion has its input-DMA dropped — the bytes never leave the fusion, so they cost no MemXfer cycles. This is the bundle-side mirror of the fusion-priority model's edge internalisation: only the fusion's external inputs are charged as input transfers. The walk runs in the parameter arm of the leaf switch (below), not in RecordCyclesIfFused.
Function Map
| Function | Address | Role |
|---|---|---|
| CostModel::RecordCyclesIfFused | 0x130cc720 | fusion-root kind router |
| CostModel::RecordLoopFusionCycles | 0x130b89a0 | loop-fusion recursion |
| CostModel::RecordOutputFusionCycles | 0x130b86c0 | output-fusion (MXU) |
| CostModel::RecordConvolutionCycles | 0x130ca6c0 | conv peel → MXU deposit |
| CostModel::RecordReduceWindowCycles | 0x130c94e0 | reduce-window peel → vector deposit |
| CostModel::RecordHloCycles | 0x130bbfe0 | leaf per-op deposit |
| CostModel::IsProducerUse | 0x130ab0c0 | producer→consumer edge-drop |
| CostModel::RecordFusionInputCycles | 0x130ce940 | external-input DMA → MemXfer |
Stage 3 — RecordHloCycles, the Leaf Opcode→Slot Routing
Purpose
RecordHloCycles is the leaf per-op deposit: given one HLO instruction, the bundle Window, the output ResourceVector, and a nesting depth, it deposits that op's throughput cycles into the named Resource slot(s) the op occupies. The op→block decision is a switch over the opcode byte (*(byte*)(a2+12)), compiled to a 0x7f-entry self-relative jump table at .rodata 0xae0ebbc. The dispatch index is idx = opcode − 3 (add eax, 0xFFFFFFFD at @0x130bc0db), and ja > 0x7e routes to the default block.
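The index arithmetic and the self-relative target computation can be sketched as follows. The table bytes here are synthetic: only the add entry is set to the block address quoted on this page, and every other entry is a placeholder pointing at the default block, so this is not a dump of the real table. Little-endian signed 32-bit entries are assumed, as is usual for x86-64 jump tables; leaf_block and its default_block parameter are illustrative names.

```python
import struct

TABLE_VMA = 0xAE0EBBC  # .rodata; VMA == file offset for this build
TABLE_ENTRIES = 0x7F


def leaf_block(opcode: int, table: bytes, default_block: int) -> int:
    """Resolve the block an opcode dispatches to via the self-relative table."""
    idx = (opcode - 3) & 0xFFFFFFFF  # add eax, 0xFFFFFFFD (wraps when opcode < 3)
    if idx > 0x7E:  # the unsigned `ja` bound
        return default_block
    (off,) = struct.unpack_from("<i", table, idx * 4)  # signed i32 offset
    return TABLE_VMA + off  # target = table base + offset


# Synthetic table: entry 0 (add) is real, the rest are placeholders.
ADD_BLOCK, DEFAULT_BLOCK = 0x130BC0F7, 0x130BC3C0
table = b"".join(
    struct.pack("<i", (ADD_BLOCK if i == 0 else DEFAULT_BLOCK) - TABLE_VMA)
    for i in range(TABLE_ENTRIES)
)
print(hex(leaf_block(0x03, table, DEFAULT_BLOCK)))
```

Masking to 32 bits before the comparison reproduces the unsigned wrap, so opcodes below 3 fall to the default block just like opcodes above 0x81.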
Entry Point
RecordHloCycles (@0x130bbfe0)
├─ CHECK(!IsLoopFusion && !IsOutputFusion) cost_model.cc:6609
├─ numeric-mask + IsLoopFusion/IsOutputFusion peel pre-check @0x130bc03d
├─ ElementIsFloating(shape) ── the add/subtract lane gate
├─ element_count = Product(output dims) ── xla::Product → [rbp-0x30]
├─ RecordHloCyclesIfTopLevel @0x130cdd80 ── == 1 guard, else skip the switch
└─ switch(opcode) → jump table @ .rodata 0xae0ebbc
The Deposit Mechanic
Every named block resolves a slot and a per-gen throughput, then accumulates:
// per CT::Instruction k issued by the block:
slot = CycleTable::GetResource(k); // sub_1C89CE20 = dword_B438AEC[k]
thru = cycletable.vtable[+0x10](k); // GetCyclesForThroughput(k), per-gen
ResourceVector::Acc(slot, element_count × thru × W);// sub_1C89ADC0, W = fixed FP multiplier
element_count is Product(output dims). GetResource is a single gen-invariant flat lookup — dword_B438AEC[k] (4 B/entry, .rodata 0xb438aec) — re-decoded below and identical to the table on Resource Enum. GetCyclesForThroughput is the per-gen integer (Per-Opcode Cycle Constants). Acc (@0x1c89adc0) bounds-checks slot < 0x17 (23) with a trapping ud1 and does vector[slot] += cycles.
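A minimal model of the deposit primitive, assuming a plain list of 23 floats for the vector and a dictionary holding only the CT→slot entries this page lists. The class and function names are illustrative, the throughput is passed in as a number because the per-generation values live on another page, and a Python exception stands in for the ud1 trap.

```python
NUM_SLOTS = 0x17  # 23 slots

# Subset of dword_B438AEC quoted on this page (CT::Instruction -> slot).
GET_RESOURCE = {0x11: 6, 0x12: 4, 0x13: 4, 0x14: 3, 0x18: 6, 0x1A: 6}


class ResourceVector:
    def __init__(self) -> None:
        self.slots = [0.0] * NUM_SLOTS

    def acc(self, slot: int, cycles: float) -> None:
        """vector[slot] += cycles, with the slot < 0x17 bound check."""
        if not 0 <= slot < NUM_SLOTS:  # the binary traps with ud1 here
            raise IndexError(f"slot {slot} out of range")
        self.slots[slot] += cycles


def deposit(rv: ResourceVector, ct: int, element_count: int,
            thru: float, w: float = 1.0) -> None:
    """slot = GetResource(ct); Acc(slot, element_count * thru * w)."""
    rv.acc(GET_RESOURCE[ct], element_count * thru * w)


rv = ResourceVector()
deposit(rv, 0x14, 32768, thru=1.0)  # thru=1.0 is a placeholder value
print(rv.slots[3])
```

Keeping the lookup and the accumulation separate mirrors the binary's split between GetResource and Acc, which matters for the blocks that bypass the lookup and pass a hard-coded slot.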
QUIRK: for a float add/subtract the block resolves the slot via GetResource(CT) (→ VectorAlu1), but for an integer add/subtract the slot is hard-coded to 5 (VectorAluAny) while the throughput is still thru(CT 0x12) / thru(CT 0x13). The slot and the throughput are decoupled: the float gate redirects only the destination lane, not the cycle count. A reimplementation that gates the throughput on the type (rather than the lane) mis-prices integer arithmetic.
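The decoupling is easiest to see when the slot and the cycle count are computed on separate lines. In this sketch the function name and the thru_by_ct mapping are hypothetical, and the throughput values used in the demonstration are placeholders rather than real per-generation constants.

```python
VECTOR_ALU_ANY = 5
GET_RESOURCE = {0x12: 4, 0x13: 4}  # float lane: R[4] VectorAlu1


def add_sub_deposit(ct: int, is_float: bool, elems: int, thru_by_ct: dict):
    """Return (slot, cycles) for add (CT 0x12) or subtract (CT 0x13)."""
    # The float gate picks only the destination lane.
    slot = GET_RESOURCE[ct] if is_float else VECTOR_ALU_ANY
    # The cycle count uses the CT throughput regardless of element type.
    cycles = elems * thru_by_ct[ct]
    return slot, cycles


placeholder_thru = {0x12: 2, 0x13: 3}  # illustrative values only
print(add_sub_deposit(0x12, True, 10, placeholder_thru))
print(add_sub_deposit(0x12, False, 10, placeholder_thru))
```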
Algorithm
RecordHloCycles(inst, window, rv, fs, nesting): // sub_130BBFE0
CHECK(!inst.IsLoopFusion() && !inst.IsOutputFusion()); // line 128
isFloat = ShapeUtil::ElementIsFloating(inst.shape); // line 139
elems = Product(output_dims); // line 145 → [rbp-0x30]
if RecordHloCyclesIfTopLevel(...) != 1: return status; // line 147 (fusion-root guard)
switch (*(byte*)(inst + 12)): // jump table @ .rodata 0xae0ebbc
case 0x03 add: slot = isFloat ? GetResource(CT 0x12) : 5; // R4 / R5
Acc(slot, elems × thru(CT 0x12));
case 0x7b subtract: slot = isFloat ? GetResource(CT 0x13) : 5; // R4 / R5
Acc(slot, elems × thru(CT 0x13));
case 0x4b multiply: Acc(GetResource(CT 0x14) /*R3*/, elems × thru(CT 0x14));
case 0x32 divide: /* 4-deposit reciprocal+mul micro-sequence */
Acc(6 /*R6 VectorEup*/, elems × thru(CT 0x18));
Acc(GetResource(CT 0x14) /*R3*/, (elems × thru(CT 0x14)) × 3.0); // 0xa2df930
Acc(GetResource(CT 0x12) /*R4*/, (elems×2) × thru(CT 0x12));
Acc(5 /*R5*/, elems × 9.0); // 0xa2deb40
case 0x47 logistic: /* 5-deposit sigmoid micro-sequence */
Acc(GetResource(CT 0x12) /*R4*/, elems × thru(CT 0x12));
Acc(GetResource(CT 0x14) /*…*/, (elems×2) × thru(CT 0x14));
Acc(5 /*R5*/, elems);
Acc(6 /*R6 VectorEup*/, … via CT 0x1a);
case 0x38 erf: if (inst.vtable[+312](11)): // ext-precision predicate
Acc(6 /*R6 VectorEup*/, elems × thru(CT 0x11)); // single deposit
else: /* 4-deposit polynomial */
Acc(6 /*R6*/, elems × thru(CT 0x18));
Acc(GetResource(CT 0x14) /*R3*/, (elems × thru(CT 0x14)) × 16.0); // 0xa2df040
Acc(GetResource(CT 0x12) /*R4*/, (elems×2) × thru(CT 0x12));
Acc(5 /*R5*/, elems × 4.0); // 0xa2de830
case 0x2a convert: if ElementHasBitWidth(inst.shape, 1): // 1-bit PRED
Acc(5, elems × 2.0); // packed-bool repack
else: /* fall to ZERO — numeric width-conversion free */
case 0x6c select: Acc(5, elems × 2.0); // read both branches + pred
case 0x52 parameter: if !IsFused(inst): /* not deposited here */
else: walk operands, IsProducerUse edge-drop,
RecordCyclesIfFused per producer, RecordFusionInputCycles → MemXfer;
case 0x5b reduce: if IsFused(inst): Acc(5, elems);
else: w = GetInputWindow(inst); Acc(5, Product(w)); // operand window
case 0x18,0x1a,0x27,0x29,0x43,0x61,0x81: /* ZERO — no deposit */
default: Acc(5 /*R5 VectorAluAny*/, elems × 1.0);
The Decoded Jump Table
Block addresses and the jump-table target offsets are byte-verified from .rodata 0xae0ebbc (target = 0xae0ebbc + offset[opcode−3]). The full µop-sequence detail for divide/logistic/erf (the FP-multiplier scaling, the slow-vs-fast erf gate) is documented on TpuHloCostAnalysis; this table is the routing-spine view — which block, which CT, which slot.
| Opcode (name) | Block @ | CT issued → slot | Cycle quantity |
|---|---|---|---|
| 0x03 add | 0x130bc0f7 | float: CT 0x12→R4; int: R5 | elems × thru(0x12) |
| 0x7b subtract | 0x130bc4af | float: CT 0x13→R4; int: R5 | elems × thru(0x13) |
| 0x4b multiply | 0x130bc4dd | CT 0x14→R3 | elems × thru(0x14) |
| 0x32 divide | 0x130bc3cf | R6 + CT 0x14→R3 + CT 0x12→R4 + R5 | 4-deposit; ×3.0 (R3), ×2 (R4 elems), ×9.0 (R5) |
| 0x47 logistic | 0x130bc17b | CT 0x12→R4, CT 0x14→R3, R5, CT 0x1a→R6 | 5-deposit sigmoid; ×2 (R3 elems) |
| 0x38 erf | 0x130bc245 | fast: CT 0x11→R6; slow: R6+R3+R4+R5 | gated; slow ×16.0 (R3), ×2 (R4), ×4.0 (R5) |
| 0x2a convert | 0x130bc293 | 1-bit: R5; wider: none | 1-bit: elems × 2.0; wider: 0 |
| 0x6c select | 0x130bc2a9 | R5 (×2 path) | elems × 2.0 |
| 0x52 parameter | 0x130bc2b8 | fused → RecordFusionInputCycles; else none | fused: input-DMA into MemXfer R9..12 |
| 0x5b reduce | 0x130bc2f9 | R5 over operand window | priced over the reduced-OVER input window |
| DEFAULT (most ops) | 0x130bc3c0 | R5 VectorAluAny | elems × 1.0 |
| ZERO-cost | 0x130bc8c2 | (no deposit) | 0 |
The DEFAULT arm covers every numeric elementwise / structural op not named above (the decompiler annotates it cases 4-23, 25, 27-38, 40, 43-49, 51-55, 57-66, 68-70, 72-74, 76-81, 83-90, 92-96, 98-107, 109-122, 124-128). The ZERO arm covers the data-layout ops: bitcast (0x18), broadcast (0x1a), concatenate (0x27), constant (0x29), iota (0x43), reshape (0x61), tuple (0x81) — all at the single block 0x130bc8c2.
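The three groups (named blocks, ZERO, DEFAULT) should partition the 0x7f table entries exactly, and that is easy to check mechanically. The sketch below transcribes the opcode lists from this page into sets; the classify function and the string labels are illustrative.

```python
NAMED = {0x03: "add", 0x7B: "subtract", 0x4B: "multiply", 0x32: "divide",
         0x47: "logistic", 0x38: "erf", 0x2A: "convert", 0x6C: "select",
         0x52: "parameter", 0x5B: "reduce"}
ZERO = {0x18, 0x1A, 0x27, 0x29, 0x43, 0x61, 0x81}
# Decimal case ranges as annotated by the decompiler for the DEFAULT arm.
DEFAULT_RANGES = [(4, 23), (25, 25), (27, 38), (40, 40), (43, 49), (51, 55),
                  (57, 66), (68, 70), (72, 74), (76, 81), (83, 90), (92, 96),
                  (98, 107), (109, 122), (124, 128)]
DEFAULT = {op for lo, hi in DEFAULT_RANGES for op in range(lo, hi + 1)}


def classify(opcode: int) -> str:
    """Return the named block, 'zero', or 'default' for an opcode."""
    if opcode in NAMED:
        return NAMED[opcode]
    if opcode in ZERO:
        return "zero"
    return "default"  # also covers opcodes outside the table (ja > 0x7e)


# Every table index (opcodes 3..0x81) belongs to exactly one group.
covered = set(NAMED) | ZERO | DEFAULT
print(covered == set(range(3, 0x82)), len(NAMED) + len(ZERO) + len(DEFAULT))
```

The sum of the three group sizes equalling 0x7f confirms that the lists on this page neither overlap nor leave a table entry unassigned.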
NOTE: the erf fast path carries no Matpush traffic: the ext-precision predicate inst.vtable[+312](11) (@0x130bc245) selects a single VectorEup deposit at CT 0x11, which GetResource maps to R[6] (dword_B438AEC[0x11] == 6, byte-verified). It stays entirely on the vector/EUP pipes; see TpuHloCostAnalysis.
The Op→Slot Table — dword_B438AEC
CycleTable::GetResource(k) (@0x1c89ce20) is literally return dword_B438AEC[k]. The entries the leaf blocks issue, re-decoded from the binary:
| CT::Instruction | dword_B438AEC[CT] → slot | Issued by |
|---|---|---|
| 0x11 | R[6] VectorEup | erf fast path |
| 0x12 | R[4] VectorAlu1 | float add, divide, logistic |
| 0x13 | R[4] VectorAlu1 | float subtract |
| 0x14 | R[3] VectorAlu0 | multiply, divide, logistic, erf-slow |
| 0x18 | R[6] VectorEup | divide / erf-slow transpose stage |
| 0x1a | R[6] VectorEup | logistic lane-compare |
QUIRK: CT 0x12 and CT 0x13 both map to the same slot R[4] VectorAlu1, even though they are distinct CycleTable::Instruction ordinals (input-rotate vs output-rotate) with distinct per-gen throughputs. So float add and float subtract share a dedicated lane but may not share a cycle count. Multiply pins the other dedicated lane (R[3] via CT 0x14). This is the deliberate port-balancing input to MaxResourceCycles; see Considerations.
Function Map
| Function | Address | Role |
|---|---|---|
| CostModel::RecordHloCycles | 0x130bbfe0 | leaf per-op deposit + opcode switch |
| CostModel::RecordHloCyclesIfTopLevel | 0x130cdd80 | == 1 fusion-root guard |
| CycleTable::GetResource | 0x1c89ce20 | dword_B438AEC[CT] (gen-invariant) |
| ResourceVector::Acc | 0x1c89adc0 | vector[slot] += cycles (bound < 23) |
| ShapeUtil::ElementIsFloating | (call @line 139) | add/subtract lane gate |
| ShapeUtil::ElementHasBitWidth | 0x20ce1580 | 1-bit PRED convert gate |
| xla::Product | 0x20cf5200 | element_count = Product(output dims) |
| GetInputWindow | (anon-ns) | non-fused reduce operand window |
Fixed FP Multipliers (.rodata, byte-verified)
| Address | Value | Used by |
|---|---|---|
| 0xa2df230 | 1.0 | default per-op multiplier |
| 0xa2df930 | 3.0 | divide VectorAlu0 (R3) scale |
| 0xa2deb40 | 9.0 | divide VectorAluAny (R5) scale |
| 0xa2df040 | 16.0 | erf slow-path VectorAlu0 (R3) scale |
| 0xa2de830 | 4.0 | erf slow-path VectorAluAny (R5) scale |
| 0xa2df5c8 | 0.5 | MaxResourceCycles VectorAlu port-balance blend |
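How the fixed multipliers enter the divide and erf slow-path deposits can be sketched directly from the Stage 3 algorithm. The function names and the thru mapping are hypothetical, and the throughput values in the demonstration are placeholders; the real per-generation integers are not reproduced here.

```python
def divide_deposits(elems: int, thru: dict) -> dict:
    """Return {slot: cycles} for the four divide deposits."""
    return {
        6: elems * thru[0x18],        # VectorEup
        3: elems * thru[0x14] * 3.0,  # VectorAlu0, scaled by 3.0
        4: (elems * 2) * thru[0x12],  # VectorAlu1, doubled element count
        5: elems * 9.0,               # VectorAluAny, scaled by 9.0
    }


def erf_slow_deposits(elems: int, thru: dict) -> dict:
    """Return {slot: cycles} for the four erf slow-path deposits."""
    return {
        6: elems * thru[0x18],         # VectorEup
        3: elems * thru[0x14] * 16.0,  # VectorAlu0, scaled by 16.0
        4: (elems * 2) * thru[0x12],   # VectorAlu1, doubled element count
        5: elems * 4.0,                # VectorAluAny, scaled by 4.0
    }


placeholder_thru = {0x12: 1, 0x14: 1, 0x18: 1}  # illustrative values only
print(divide_deposits(10, placeholder_thru))
print(erf_slow_deposits(10, placeholder_thru))
```

The two sequences share their shape and differ only in the R3 and R5 constants, which is why a single parameterized helper would also be a reasonable design.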
Considerations — How the Routing Feeds the Reduction
The leaf op→slot routing across the three VectorAlu slots is the port-balancing input to the MaxResourceCycles 0.5-blend group {R3, R4, R5} (see Resource Enum). The routing is deliberate: multiply pins VectorAlu0 (R3), floating add/subtract pin VectorAlu1 (R4), and the catch-all DEFAULT plus the ×2 ops (select, 1-bit convert) plus integer add/subtract land on VectorAluAny (R5), the lane the blend redistributes. A fusion mixing multiplies, floating adds, and "any" ops therefore fills both dedicated lanes and the bundle cost is the balanced max, not the serial sum.
The heavy emitters reached by Stages 1 and 2 deposit into other slot groups. RecordConvolutionCycles (conv peel) deposits into Matmul / Matpush / Xlu (R0/R1/R2, the plain-MAX MXU group — see ConvolutionCostState); RecordReduceWindowCycles (pooling) deposits into VectorLoad / VectorAluAny / Xlu (R7/R5/R2 — see Reduce-Window / Pooling Cost); GetCollectiveCycles writes a scalar cycle count into rv[+8] for the ICI links. Each routing arm therefore lands on the slot group whose reduction rule matches the op's hardware behavior — MXU pipes overlap (plain MAX), vector ALUs port-balance (0.5 blend), memory transfers serialize (sum).
parameter and reduce route out of the ALU lanes. A fused parameter is an external-input DMA priced by RecordFusionInputCycles (@0x130ce940) into the MemXfer slots R9..12, after IsProducerUse has dropped the internal producer edges; whether VMEM-resident params instead route to R7 VectorLoad per memory space is not byte-confirmed (MEDIUM). A non-fused reduce is priced over its operand window via GetInputWindow — cost scales with the large reduced-over tensor, not the small output — the bundle-side mirror of HandleReduce's ExtentProduct(operand) flop formula.
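The difference between pricing a reduce over its output and over its operand window is a single product. In this sketch the function name is hypothetical, and the input window is assumed to equal the operand dimensions for the purpose of the example; the actual GetInputWindow result may differ.

```python
from math import prod


def reduce_r5_cycles(output_dims, input_window_dims, is_fused: bool) -> int:
    """Cycles deposited into R5 for a reduce, per the leaf's two branches."""
    if is_fused:
        return prod(output_dims)    # fused: element_count of the output
    return prod(input_window_dims)  # non-fused: the reduced-over window


# Reducing a [1024, 512] operand down to [1024].
print(reduce_r5_cycles([1024], [1024, 512], is_fused=False))
print(reduce_r5_cycles([1024], [1024, 512], is_fused=True))
```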
QUIRK: the bundle model prices a broadcast (0x1a) at zero in the leaf (the ZERO-cost block), even though the fusion-priority model may charge a cross-lane-movement weight for the same op. The two surfaces disagree on layout ops by design: the bundle path treats a broadcast as a free register splat (no functional-unit occupancy), while the priority model accounts for the data movement. Do not unify them. See NormalizedComputationCost.
Worked Example — Routing One Fusion Through All Three Stages
A loop-fusion body { multiply [256,128] (f32), add [256,128] (f32), tanh [256,128] } is priced top to bottom:
GetHloResourcesImpl(fusion) Stage 1
IsSupportedCollectiveHlo → no
et = F32 (in 0x2FFF91FFE mask) → numeric gate OK
IsFusionSupportedHlo → yes; ExtractConvLikeHlo → null (no conv)
IsConvLowerable → no; IsCollectiveComputeFusion → no
→ GetLoopFusionOrUnfusedHloCycles (default arm)
GetLoopFusionOrUnfusedHloCycles → RecordCyclesIfFused(fusion) Stage 2
IsLoopFusion → yes → RecordLoopFusionCycles → recurse over leaves:
each leaf → RecordCyclesIfFused → (not a fusion/conv/rw) → RecordHloCycles
RecordHloCycles per leaf, element_count = 256·128 = 32768 Stage 3
multiply (0x4b) → CT 0x14 → GetResource → R3 VectorAlu0: Acc(R3, 32768 × thru(0x14))
add (0x03), f32 → CT 0x12 → GetResource → R4 VectorAlu1: Acc(R4, 32768 × thru(0x12))
tanh (0x7d) → DEFAULT block → R5 VectorAluAny: Acc(R5, 32768 × 1.0)
MaxResourceCycles' 0.5-blend on {R3, R4, R5} runs the multiply (VectorAlu0) and add (VectorAlu1) lanes in parallel, load-balances the tanh "any" work into the less-busy lane, and overlaps the residual at 50% — so the bundle cost is ≈ max(R3, R4) + half the tanh residual, not the serial sum. Had add been integer-typed, it would have joined tanh on R5 (the slot hard-coded to 5 when ElementIsFloating is false) rather than pinning R4. The flop side (run by a separate pass) classifies tanh as transcendental (Properties+0x54) and multiply/add as flops (Properties+0x50); the two surfaces agree on which ops are expensive but never share a number.
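The Stage 3 deposits of this example can be reproduced in a few lines. This sketch covers only the per-slot accumulation; it does not attempt to model the MaxResourceCycles blend, whose exact formula is documented elsewhere. The function name is illustrative and the throughput values are placeholders.

```python
from math import prod

NUM_SLOTS = 23
R3_VECTOR_ALU0, R4_VECTOR_ALU1, R5_VECTOR_ALU_ANY = 3, 4, 5


def price_example_fusion(add_is_float: bool, thru: dict) -> list:
    """Deposit multiply, add and tanh for a [256, 128] fusion body."""
    elems = prod([256, 128])
    rv = [0.0] * NUM_SLOTS
    rv[R3_VECTOR_ALU0] += elems * thru[0x14]  # multiply: CT 0x14 -> R3
    add_slot = R4_VECTOR_ALU1 if add_is_float else R5_VECTOR_ALU_ANY
    rv[add_slot] += elems * thru[0x12]        # add: CT 0x12 throughput either way
    rv[R5_VECTOR_ALU_ANY] += elems * 1.0      # tanh: DEFAULT block
    return rv


placeholder_thru = {0x12: 1, 0x14: 1}  # illustrative values only
rv = price_example_fusion(True, placeholder_thru)
print(rv[3], rv[4], rv[5])
```

Calling the same function with add_is_float=False shows the integer case: R4 stays empty and the add's cycles land on R5 next to tanh.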
Related Components
| Component | Relationship |
|---|---|
| TpuHloCostAnalysis | the flop/byte override surface that runs alongside this routing; owns the full RecordHloCycles µop-sequence detail and the flop-override table |
| Resource Enum (23-slot) | the ResourceVector slots, Acc, GetResource, and the MaxResourceCycles reduction the deposits feed |
| ConvolutionCostState | the conv peel emitter (RecordConvolutionCycles) reached from Stages 1/2 |
| Reduce-Window / Pooling Cost | the reduce-window peel emitter and the GetReduceWindowType axis classifier |
| Bundle-Aware Cost | the software-pipelined loop composition over the per-op deposits |
| NormalizedComputationCost | the fusion-priority surface that prices the same HLO differently |
Cross-References
- TpuHloCostAnalysis: the override surface that runs alongside this routing spine; the flop/byte Properties model and the per-op µop-sequence breakdown
- Resource Enum (23-slot): the Resource slot names, Acc, GetResource op→slot table, and the MaxResourceCycles overlap model
- ConvolutionCostState: RecordConvolutionCycles / RecordConvKernelCycles, the MXU deposit reached from the conv arm
- Reduce-Window / Pooling Cost: RecordReduceWindowCycles and the lane/sublane/major axis dispatch
- WindowDescription Byte-Cost: the conv/DMA byte+throughput primitive behind the heavy emitters
- NormalizedComputationCost: the opcode→weight fusion-priority model that prices layout ops differently from the bundle path
- Bundle-Aware Cost: GetLoopFusionOrUnfusedHloCycles and the prologue/steady/tail loop cost over the per-op deposits
- Per-Opcode Cycle Constants: the per-gen GetCyclesForThroughput(CT::Instruction) integers each deposit multiplies by
- Cost Model Overview: where this routing sits among the cost-model class families