TpuHloCostAnalysis
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped, so every symbol below is a demangled C++ name. .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset. Other versions differ.
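The two address rules above can be captured in a small helper when cross-checking addresses against the file on disk. This is a minimal sketch: the function and table names are hypothetical, and the deltas hold only for the build named above.

```python
# Section name -> amount subtracted from a VMA to obtain the file offset.
# Both rules come from the note above; they are specific to this build.
SECTION_DELTA = {
    ".text": 0x0,
    ".rodata": 0x0,
    ".data.rel.ro": 0x200000,
}

def vma_to_file_offset(vma: int, section: str) -> int:
    """Translate a virtual address to a file offset for a known section."""
    if section not in SECTION_DELTA:
        raise KeyError(f"no mapping recorded for section {section!r}")
    return vma - SECTION_DELTA[section]

# The RecordHloCycles jump table lives in .rodata, so its offset is unchanged.
print(hex(vma_to_file_offset(0xAE0EBBC, ".rodata")))
```

Sections not listed in the note are rejected rather than guessed.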
Abstract
xla::jellyfish::TpuHloCostAnalysis is the HLO-level flop/byte/transcendental cost model. It is a thin subclass of the generic XLA xla::HloCostAnalysis (vtable @ 0x218fb618, ctor @ 0x130a1620): it inherits every per-opcode emitter unchanged and overrides exactly three: HandleGather, HandleScatter, and HandleCustomCall. Each Handle* walks one HLO instruction and writes three float cost properties into a per-instruction Properties flat-hash-map: flop count (+0x50), transcendental-op count (+0x54), and bytes accessed (+0x58). The reference frame is upstream XLA's HloCostAnalysis itself, with the same kFlopsKey / kTranscendentalsKey / kBytesAccessedKey properties and the same 2·M·N·K dot model, plus a TPU-specific patch on the three ops whose silicon cost the generic model gets wrong: gather and scatter (priced as chunk-granule DMA rather than flops, with scatter also carrying its combiner's compute) and custom-call (priced by a target-keyed registry).
These flop properties are one of two cost surfaces the TPU compiler builds over the same HLO, and the page must not conflate them. The Properties flops here are the fusion-priority input: flop_count(inst) (@ 0x1e4841e0) reads Properties[inst].float[+0x50] and feeds the convolution-cycle estimate that NormalizedComputationCost caches per unique_id. The other surface is the bundle-occupancy cost — CostModel::RecordHloCycles (@ 0x130bbfe0) deposits per-op throughput cycles into the 23-slot ResourceVector routed by an opcode jump table — reached through GetHloResourcesImpl (@ 0x130aa580, see GetHloResources Routing). The two share the same HLO and agree on which ops are expensive, but they are computed by different functions, store into different structures, and never share a number. This page documents the flop/byte Properties model and the RecordHloCycles opcode→slot routing that runs alongside it.
The third strand is the routing decision that selects which cost sub-emitter prices an op. GetHloResourcesImpl is a five-way dispatch: collective → network model; conv-lowerable / reduce-window → MXU output-fusion model; collective-compute fusion → its own model; else → the loop/elementwise path that ultimately calls RecordHloCycles. The page covers the override table, the RecordHloCycles opcode jump table, and the base flop formulas the overrides leave untouched.
For reimplementation, the contract is:
- The subclass shape: which three Handle* are overridden, the +0x50/+0x54/+0x58 Properties layout, and the model_tpu_specific_overheads gate (this+456) that turns the overrides on.
- The base HloCostAnalysis flop formulas the TPU model inherits verbatim: dot, convolution (divides by both feature- and batch-group counts), reduce, reduce-window, and the elementwise flop-vs-transcendental classification.
- The three TPU overrides: gather/scatter chunk-ratio bytes model, scatter's added combiner flops, and the custom-call FunctionRegistry keyed on custom_call_target.
- The RecordHloCycles opcode→ResourceVector-slot jump table: the per-op CT::Instruction issued, the named slot it lands in, the cycle quantity, and the float-type gate that scatters add/subtract across the dedicated vector-ALU lanes.
| Field | Value |
|---|---|
| Class | xla::jellyfish::TpuHloCostAnalysis : xla::HloCostAnalysis |
| Vtable / ctor | 0x218fb618 / 0x130a1620 |
| Overridden methods | HandleGather 0x130a2de0, HandleScatter 0x130a3160, HandleCustomCall 0x130a35c0 |
| Properties fields | +0x50 flops, +0x54 transcendentals, +0x58 bytes-accessed (all float) |
| Flop reader | HloCostAnalysis::flop_count 0x1e4841e0 → Properties[inst].float[+0x50] |
| Overrides gate | *(byte*)(this+456) = model_tpu_specific_overheads; off → tail-call base |
| Routing dispatch | CostModel::GetHloResourcesImpl 0x130aa580 (5-way) |
| Bundle deposit | CostModel::RecordHloCycles 0x130bbfe0 + jump table @ .rodata 0xae0ebbc |
| Source file | platforms/xla/service/jellyfish/tpu_hlo_cost_analysis.cc |
GetHloResourcesImpl — the Five-Way Routing Dispatch
Purpose
GetHloResourcesImpl is the front door that decides which cost sub-model prices an HLO op before it is deposited into the bundle's ResourceVector. It is not the flop path (that is the Properties/Handle* side); it is the resource-routing path that picks between the collective-network model, the MXU conv/output-fusion model, the collective-compute-fusion model, and the loop/elementwise default that reaches RecordHloCycles.
Entry Point
GetHloResourcesImpl (0x130aa580) ── StatusOr<ResourceVector>
├─ IsSupportedCollectiveHlo (0x130aeda0) ── collective? → GetCollectiveCycles (0x130abfc0)
├─ IsFusionSupportedHlo (0x130abee0) ── numeric-type + fusion gate
├─ IsConvLowerable (0x14553620) ┐
├─ ExtractConvLikeHlo (0x1d6aa140) ├ conv / reduce-window selection
├─ GetReduceWindowType (0x1454d4a0) ┘ (type −1/2 max-pool → not conv)
│ → GetOutputFusionOrConvolutionCycles (0x130aede0)
├─ IsCollectiveComputeFusion(0x13e028c0) → GetCollectiveComputeFusionCycles (0x130b13a0)
└─ (default) → GetLoopFusionOrUnfusedHloCycles (0x130b2bc0)
Algorithm
StatusOr<ResourceVector> GetHloResourcesImpl(inst, opts, fs, isFused): // sub_130AA580
op = inst.opcode; // byte [inst+0xc]
// (1) COLLECTIVE — all-reduce / all-gather / … priced by the network path.
if IsSupportedCollectiveHlo(inst): // sub_130AEDA0
// (a fusion of opcode 0x3d that wraps a collective is also peeled here)
return GetCollectiveCycles(inst, &rv); // sub_130ABFC0 — rv[+8] = scalar cycles
// (2) NUMERIC-TYPE + FUSION-SUPPORT GATE
et = inst.shape.element_type; // [inst+0x58]->[+0]
if et <= 0x21 and (0x2FFF91FFE >> et)&1 ... // numeric/packed-type mask (bittest64)
and IsFusionSupportedHlo(inst, opts.target): // sub_130ABEE0
// (3) CONV / REDUCE-WINDOW MAIN-OP SELECTION
convlowerable = IsConvLowerable(inst); // sub_14553620
conv = ExtractConvLikeHlo(inst); // sub_1D6AA140
if conv and conv.opcode == 0x5e: // reduce-window (0x5e == 94)
t = GetReduceWindowType(conv); // sub_1454D4A0
if t == -1 or t == 2: // max-pool / unknown — NOT conv-priced
goto default_path;
if convlowerable:
return GetOutputFusionOrConvolutionCycles(inst, opts, fs); // sub_130AEDE0
// (4) COLLECTIVE-COMPUTE FUSION (collective fused with compute)
if IsCollectiveComputeFusion(inst): // sub_13E028C0
return GetCollectiveComputeFusionCycles(inst, opts, fs); // sub_130B13A0
default_path:
// (5) DEFAULT: loop / elementwise / unfused op — the per-op deposit path
return GetLoopFusionOrUnfusedHloCycles(inst, opts, fs); // sub_130B2BC0
The numeric gate at 0x130aa9dc is cmp et, 0x21 followed by bt 0x2FFF91FFE against the element-type ordinal — a bitmask of the numeric/packed element types that the bundle model can price (non-numeric tuple/token/opaque types fall straight through). The reduce-window sentinel is the one non-obvious arm: a reduce-window whose GetReduceWindowType is -1 (unknown) or 2 (max-pool) is not a lowerable convolution, so it is priced as a plain loop op rather than through the MXU output-fusion estimator.
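The bit test is easy to replay outside the binary. The sketch below uses only the two constants quoted above (0x21 and 0x2FFF91FFE); the function name is hypothetical, and it deliberately does not name the element types behind each ordinal.

```python
# Constants copied from the gate described above.
NUMERIC_TYPE_MASK = 0x2FFF91FFE
MAX_GATED_ORDINAL = 0x21

def passes_numeric_gate(element_type: int) -> bool:
    """Mirror `cmp et, 0x21` followed by a bit test against the mask."""
    if not 0 <= element_type <= MAX_GATED_ORDINAL:
        return False
    return bool((NUMERIC_TYPE_MASK >> element_type) & 1)

# Ordinals in 0..0x21 whose mask bit is clear and which therefore skip the
# conv / output-fusion arm.
rejected = [et for et in range(MAX_GATED_ORDINAL + 1) if not passes_numeric_gate(et)]
print(rejected)
```

Mapping the rejected ordinals back to element-type names requires the primitive-type enum of the matching XLA version, which this page does not reproduce.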
NOTE: GetLoopFusionOrUnfusedHloCycles is the only arm that reaches RecordHloCycles. For a fused region it goes through RecordCyclesIfFused (@ 0x130cc720), which first peels convolution (opcode 0x2b → RecordConvolutionCycles @ 0x130ca6c0), reduce-window (0x5e → RecordReduceWindowCycles @ 0x130c94e0), loop-fusion and output-fusion, and only then falls through to RecordHloCycles per leaf op. IsProducerUse (@ 0x130ab0c0) drops a producer→consumer edge's input DMA before deposit. The window-iteration bodies of the conv / output-fusion emitters are out of scope here (see Bundle-Aware Cost); this page covers the leaf opcode→slot routing.
Function Map
| Function | Address | Role |
|---|---|---|
| CostModel::GetHloResourcesImpl | 0x130aa580 | 5-way routing dispatch |
| IsSupportedCollectiveHlo | 0x130aeda0 | Collective predicate (arm 1) |
| IsFusionSupportedHlo | 0x130abee0 | Numeric + fusion-support gate |
| IsConvLowerable | 0x14553620 | Conv-lowerable predicate |
| ExtractConvLikeHlo | 0x1d6aa140 | Pull the conv/reduce-window root |
| GetReduceWindowType | 0x1454d4a0 | Reduce-window type; −1 (unknown) or 2 (max-pool) blocks the conv path |
| GetCollectiveComputeFusionCycles | 0x130b13a0 | Arm 4 |
| GetLoopFusionOrUnfusedHloCycles | 0x130b2bc0 | Default → RecordHloCycles |
RecordHloCycles — the Opcode→Slot Jump Table
Purpose
RecordHloCycles is the leaf per-op deposit: given one HLO instruction, an output element count, and the bundle's ResourceVector, it deposits that op's throughput cycles into the named Resource slot(s) the op occupies. The op→slot decision is a switch over the opcode compiled to a 0x7f-entry self-relative jump table at .rodata 0xae0ebbc (index = opcode − 3, ja > 0x7e → default).
Each deposit issues one or more CT::Instruction (the ~33-bucket LLO collapse), resolves the slot via CycleTable::GetResource(k) (@ 0x1c89ce20, gen-invariant table @ 0xb438aec; see Resource Enum), reads the per-gen throughput via the cycle-table vtable+0x10 (GetCyclesForThroughput(k), per-gen, see Per-Opcode Cycle Constants), and accumulates Acc(slot, element_count × throughput × W) (ResourceVector::Acc @ 0x1c89adc0) where W is a fixed FP multiplier read from .rodata.
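To decode the table from a dump, the usual approach for a self-relative jump table is to add each signed entry to the table base. The sketch below assumes little-endian signed 32-bit entries relative to the table's own address, which is the common compiler layout but is an assumption here; the function names are hypothetical.

```python
import struct

def decode_jump_table(blob: bytes, table_vma: int, entries: int = 0x7F) -> list:
    """Resolve each signed 32-bit entry to table_vma + entry (assumed layout)."""
    offsets = struct.unpack_from(f"<{entries}i", blob)
    return [table_vma + off for off in offsets]

def block_for_opcode(targets: list, opcode: int, default_target: int) -> int:
    """Mirror the bounds check: index = opcode - 3, out of 0..0x7e -> default."""
    index = opcode - 3
    if index < 0 or index > 0x7E:
        return default_target
    return targets[index]

# Synthetic three-entry table, for illustration only.
demo_blob = struct.pack("<3i", -0x100, -0x80, 0x10)
demo_targets = decode_jump_table(demo_blob, 0x1000, entries=3)
print([hex(t) for t in demo_targets])
```

A negative index models the unsigned compare in the binary: an opcode below 3 wraps to a large unsigned value and takes the default branch.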
Algorithm
RecordHloCycles(inst, window, rv, fs, nesting): // sub_130BBFE0
elems = Product(output_dims); // xla::Product → [rbp-0x30]
isFloat = ShapeUtil::ElementIsFloating(inst.shape); // sub @0x130bc... line 139
if RecordHloCyclesIfTopLevel(...) != 1: return; // sub_130CDD80 (fusion-root guard)
switch (inst.opcode): // jump table @ .rodata 0xae0ebbc
case 0x03 add: slot = isFloat ? GetResource(CT 0x12) : R5; // R4 / R5
Acc(slot, elems × thru(CT 0x12));
case 0x4b multiply: Acc(GetResource(CT 0x14), elems × thru(CT 0x14)); // R3
case 0x7b subtract: slot = isFloat ? GetResource(CT 0x13) : R5; // R4 / R5
Acc(slot, elems × thru(CT 0x13));
case 0x32 divide: /* 4-deposit reciprocal+mul micro-sequence */
Acc(R6, elems × thru(CT 0x18)); // VectorEup
Acc(GetResource(CT 0x14)=R3, elems × thru(CT 0x14) × 3.0);
Acc(GetResource(CT 0x12)=R4, (elems×2) × thru(CT 0x12));
Acc(R5, elems × 9.0); // VectorAluAny
case 0x47 logistic: /* sigmoid micro-sequence */
Acc(GetResource(CT 0x12)=R4, elems × thru(CT 0x12));
Acc(GetResource(CT 0x14)=R3, (elems×2) × thru(CT 0x14)); // VectorAlu0, same slot as multiply
Acc(R6, elems × thru(CT 0x1a)); // VectorEup, lane-cmp
case 0x38 erf: if isExtPrecPath(inst): // vtable+312(11)
Acc(R6, elems × thru(CT 0x11)); // single VectorEup deposit
else: // 4-deposit polynomial
Acc(R6, elems × thru(CT 0x18));
Acc(GetResource(CT 0x14)=R3, elems × thru(CT 0x14) × 16.0);
Acc(GetResource(CT 0x12)=R4, (elems×2) × thru(CT 0x12));
Acc(R5, elems × 4.0);
case 0x2a convert: if ElementHasBitWidth(inst.shape, 1): // 1-bit PRED
Acc(R5, elems × 2.0); // packed-bool repack
else: /* no deposit — free */
case 0x6c select: Acc(R5, elems × 2.0); // read both branches + pred
case 0x52 parameter: if IsFused(inst): /* RecordFusionInputCycles → MemXfer */
else: /* no deposit — non-fused param is free */
case 0x5b reduce: if IsFused(inst): Acc(R5, elems);
else: w = GetInputWindow(inst); Acc(R5, Product(w)); // operand window
case 0x18,0x1a,0x27,0x29,0x43,0x61,0x81: /* ZERO — no deposit */
default: Acc(R5, elems × 1.0); // VectorAluAny, 1 cy/elem
GOTCHA: Two slot-routing traps in this table. (1) add and subtract are float-type-gated: slot = ElementIsFloating(shape) ? GetResource(CT) : R5. Integer add/subtract deposit into VectorAluAny (R5) like the default; only floating-point add/subtract reach the dedicated lanes (R4 via CT 0x12/0x13). Ignoring the gate over-occupies the dedicated lanes for integer fusions. (2) convert deposits only for 1-bit PRED (R5, elems × 2.0, a packed-bool repack); every wider numeric width-conversion falls through to the zero path and is free, the opposite of the intuitive "narrow free, wide costed" assumption.
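Both traps reduce to a one-line predicate each. The following is a minimal sketch of the two gates, using the slot names from this page; the function names are hypothetical and per-generation throughput is left out.

```python
# Slot names as used on this page.
R4 = "VectorAlu1"
R5 = "VectorAluAny"

def add_sub_slot(is_float: bool) -> str:
    """Floating add/subtract pin R4; integer add/subtract land on R5."""
    return R4 if is_float else R5

def convert_deposit(elems: int, output_bit_width: int) -> float:
    """Only a 1-bit (PRED) convert deposits cycles; wider converts are free."""
    return elems * 2.0 if output_bit_width == 1 else 0.0

print(add_sub_slot(True), add_sub_slot(False))
print(convert_deposit(1024, 1), convert_deposit(1024, 32))
```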
The Decoded Jump Table
element_count is Product(output dims); thru(k) is the per-gen GetCyclesForThroughput(CT::Instruction k); W is the fixed FP multiplier. CT-bucket→slot via CycleTable::GetResource.
| Opcode (name) | Block @ | CT issued → slot | Cycle quantity |
|---|---|---|---|
| 0x03 add | 0x130bc0f7 | float: CT 0x12→R4; int: R5 | elems × thru(0x12) |
| 0x4b multiply | 0x130bc4dd | CT 0x14→R3 | elems × thru(0x14) |
| 0x7b subtract | 0x130bc4af | float: CT 0x13→R4; int: R5 | elems × thru(0x13) |
| 0x32 divide | 0x130bc3cf | R6 + CT 0x14→R3 + CT 0x12→R4 + R5 | 4-deposit; ×3.0(R3), ×2(R4 elems), ×9.0(R5) |
| 0x47 logistic | 0x130bc17b | CT 0x12→R4, CT 0x14→R3, CT 0x1a→R6 | sigmoid seq; ×2(R3 elems) |
| 0x38 erf | 0x130bc245 | fast: CT 0x11→R6; slow: R6+R3+R4+R5 | gated; slow ×16.0(R3), ×2(R4), ×4.0(R5) |
| 0x2a convert | 0x130bc293 | 1-bit: R5; else none | 1-bit: elems × 2.0; wider: 0 |
| 0x6c select | 0x130bc2a9 | R5 (LABEL_26 ×2 path) | elems × 2.0 |
| 0x52 parameter | 0x130bc2b8 | fused→RecordFusionInputCycles; else none | fused: input-DMA into MemXfer; non-fused: 0 |
| 0x5b reduce | 0x130bc2f9 | R5 over operand window | priced over reduced-OVER input window |
| DEFAULT (most ops) | 0x130bc3c0 | R5 VectorAluAny | elems × 1.0 |
| ZERO-cost | 0x130bc8c2 | (no deposit) | 0 |
The DEFAULT arm covers every numeric elementwise / structural op not named above (the switch's cases 4-23, 25, 27-38, 40, 43-49, 51-55, 57-66, 68-70, 72-74, 76-81, 83-90, 92-96, 98-107, 109-122, 124-128 per the decompiler's jumptable annotation). The ZERO arm covers the data-layout ops: bitcast (0x18), broadcast (0x1a), concatenate (0x27), constant (0x29), iota (0x43), reshape (0x61), tuple (0x81).
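The decimal case list and the hex opcodes named in the table can be cross-checked mechanically: the DEFAULT ranges should be exactly the opcodes 3..0x81 that are neither a named case nor a ZERO case. A short sketch of that check, with hypothetical names:

```python
# Named cases and ZERO cases, as listed in the jump-table section.
NAMED = {0x03, 0x4B, 0x7B, 0x32, 0x47, 0x38, 0x2A, 0x6C, 0x52, 0x5B}
ZERO = {0x18, 0x1A, 0x27, 0x29, 0x43, 0x61, 0x81}

# Inclusive decimal ranges from the decompiler's jumptable annotation.
DEFAULT_RANGES = [
    (4, 23), (25, 25), (27, 38), (40, 40), (43, 49), (51, 55), (57, 66),
    (68, 70), (72, 74), (76, 81), (83, 90), (92, 96), (98, 107),
    (109, 122), (124, 128),
]
DEFAULT = {op for lo, hi in DEFAULT_RANGES for op in range(lo, hi + 1)}

def arm_for_opcode(opcode: int) -> str:
    """Classify an opcode as a named case, a ZERO case, or the default arm."""
    if opcode in NAMED:
        return "named"
    if opcode in ZERO:
        return "zero"
    return "default"  # also covers opcodes outside 3..0x81

print(len(DEFAULT), DEFAULT == set(range(3, 0x82)) - NAMED - ZERO)
```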
Fixed FP Multipliers (.rodata, byte-verified)
| Address | Value | Used by |
|---|---|---|
| 0xa2df230 | 1.0 | default per-op multiplier |
| 0xa2df930 | 3.0 | divide VectorAlu0 (R3) scale |
| 0xa2deb40 | 9.0 | divide VectorAluAny (R5) scale |
| 0xa2df040 | 16.0 | erf slow-path VectorAlu0 (R3) scale |
| 0xa2de830 | 4.0 | erf slow-path VectorAluAny (R5) scale |
| 0xa2df5c8 | 0.5 | MaxResourceCycles VectorAlu port-balance blend |
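As an illustration of how these multipliers enter a deposit, the sketch below replays the four-deposit divide sequence from the algorithm listing. The throughput values are placeholders (the real ones are per-generation and not listed on this page), and the function name is hypothetical.

```python
from collections import defaultdict

def deposit_divide(rv, elems, thru):
    """Accumulate the divide micro-sequence into a slot -> cycles mapping."""
    rv["R6"] += elems * thru[0x18]          # VectorEup
    rv["R3"] += elems * thru[0x14] * 3.0    # VectorAlu0, x3.0 multiplier
    rv["R4"] += (elems * 2) * thru[0x12]    # VectorAlu1, doubled element count
    rv["R5"] += elems * 9.0                 # VectorAluAny, x9.0 multiplier
    return rv

# Placeholder throughputs of 1.0 cycle/element, for illustration only.
placeholder_thru = {0x12: 1.0, 0x14: 1.0, 0x18: 1.0}
print(dict(deposit_divide(defaultdict(float), 100, placeholder_thru)))
```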
Considerations
The routing across the three VectorAlu slots is the port-balancing input to the MaxResourceCycles 0.5-blend group {R3, R4, R5} (see Resource Enum). Multiply pins VectorAlu0 (R3), floating add/subtract pin VectorAlu1 (R4), and the catch-all DEFAULT plus the ×2 ops (select, 1-bit convert) and integer add/subtract land on VectorAluAny (R5), the lane the blend redistributes. A fusion mixing multiplies, floating adds, and "any" ops therefore fills both dedicated lanes and the bundle cost is the balanced max, not the serial sum.
QUIRK: the bundle model prices a broadcast (0x1a) at zero here, even though NormalizedComputationCost may charge a cross-lane-movement weight for the same op on the fusion-priority side. The two models disagree on layout ops by design: the bundle path treats a broadcast as a free register splat (no functional-unit occupancy), while the priority model accounts for the data movement. Do not unify them.
parameter and reduce route out of the ALU lanes. A fused parameter is an input-DMA priced by RecordFusionInputCycles (@ 0x130ce940) into the MemXfer slots R9..12; whether VMEM-resident params instead route to R7 VectorLoad per memory space is not byte-confirmed (MEDIUM). A non-fused parameter deposits nothing — case 0x52 falls straight to the finish path (if (!IsFused) goto LABEL_77), so a top-level argument carries zero functional-unit cost. A non-fused reduce is priced over its operand window via GetInputWindow (so cost scales with the large reduced-over tensor, not the small output) — the bundle-side mirror of HandleReduce's ExtentProduct(operand) flop formula below.
Base Flop Formulas (inherited from HloCostAnalysis)
TpuHloCostAnalysis overrides only gather/scatter/custom-call; every other op uses the base xla::HloCostAnalysis emitter unchanged. These are the flop properties cached in Properties[inst].float[+0x50] (flops) / [+0x54] (transcendentals), read back by flop_count (@ 0x1e4841e0).
Dot — 2·M·N·K
HandleDot (@ 0x1e47c9c0) → GetDotFlops (@ 0x1e47c7a0):
GetDotFlops(out_shape, lhs_shape, dnums):                    // sub_1E47C7A0
    contract_batch = 1;
    for d in dnums.contracting_and_batch_dim_indices:
        contract_batch *= lhs_shape.dims[d];                 // imul over the index list
    return 2 * contract_batch * ExtentProduct(out_shape);    // ×2 for multiply-add
The decompile is an unrolled imul chain over the DotDimensionNumbers index lists, then imul ExtentProduct(out), then add rax, rax (the ×2). This is the classic 2·M·N·K.
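A small numeric check of the 2·M·N·K reading, as a sketch. It assumes the multiplied index list is the LHS contracting dimensions, as in upstream XLA, since batch dimensions already appear in the output extent; the function name is hypothetical.

```python
from math import prod

def dot_flops(out_dims, lhs_dims, lhs_contracting):
    """2 x (product of contracted LHS extents) x (output element count)."""
    contracted = prod(lhs_dims[d] for d in lhs_contracting)
    return 2 * contracted * prod(out_dims)

# [M=4, K=8] x [K=8, N=16] -> [4, 16]
print(dot_flops(out_dims=[4, 16], lhs_dims=[4, 8], lhs_contracting=[1]))
```

For this shape the result equals 2·4·16·8.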
GOTCHA: raw dot is lowered to convolution before the TPU cost path runs (dot/conv MXU lowering); reaching GetDotFlops on an un-lowered dot hits a CHECK-fatal (buffer != nullptr, shape.h:843). In practice the convolution formula below is what feeds the fusion priority.
Convolution — divides by both group counts
HandleConvolution (@ 0x1e480be0) calls GetConvolutionFlops (vtable+1128) and stores the result to +0x50. GetConvolutionFlops (@ 0x1e480060):
GetConvolutionFlops(inst, out, lhs, rhs):                    // sub_1E480060
    out_spatial = ExtentProduct(window-iterated output spatial+batch);   // v114
    window_vol  = Product(kernel spatial window sizes);                  // v108
    in_feature  = GetDimension(lhs, input_feature_dim);                  // Dimension
    out_feature = GetDimension(out, output_feature_dim);                 // v115
    fgc = inst.feature_group_count();  bgc = inst.batch_group_count();
    return 2 * (out_feature / bgc) * (in_feature / fgc) * window_vol * out_spatial;
GOTCHA: The convolution flop divides by both feature_group_count and batch_group_count, not feature-group alone: two separate idiv/div sequences (v109 = Dimension/fgc, v110 = v115/bgc). Grouped convolutions (depthwise, batch-grouped) are mis-costed by any model that drops the batch-group divide.
This is the flop the NormalizedComputationCost conv path caches per unique_id and divides by the per-LHS-format peak to get MXU cycles.
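The effect of the two divides is easiest to see with numbers. The sketch below transcribes the formula exactly as written in the pseudocode above (it is not checked against any other source); the function name and the example shapes are hypothetical.

```python
from math import prod

def conv_flops(out_feature, in_feature, window_dims, out_spatial, fgc=1, bgc=1):
    """Formula from the pseudocode above, using integer division as idiv does."""
    return 2 * (out_feature // bgc) * (in_feature // fgc) * prod(window_dims) * out_spatial

# 3x3 window, 64 -> 64 features, 56x56 output positions, batch 1.
dense = conv_flops(64, 64, [3, 3], 56 * 56)
depthwise = conv_flops(64, 64, [3, 3], 56 * 56, fgc=64)
print(dense, depthwise, dense // depthwise)
```

With feature_group_count = 64 the depthwise case costs 1/64 of the dense case, which is the reduction a model that ignores the group counts would miss.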
Reduce / Reduce-Window — combiner cost × extent
HandleReduce (@ 0x1e47d6a0): flops = to_apply().per_element_cost × ExtentProduct(operand_being_reduced). The cost is the combiner subcomputation's per-element cost (flops, transcendentals, bytes, each scaled by vmulss) times the number of input elements reduced over, so it scales with the large reduced-over tensor, not the small output.
HandleReduceWindow (@ 0x1e47e1c0): flops = to_apply().per_element_cost × Product(window dim sizes) × ExtentProduct(output_shape) — the combiner runs once per window element per output element.
Both feed the combiner cost through the base ProcessSubcomputation recursion (not re-walked here).
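The two formulas differ only in which extent multiplies the combiner cost. A minimal sketch with a one-flop combiner (for example a scalar add); the function names and shapes are hypothetical.

```python
from math import prod

def reduce_flops(per_elem_flops, operand_dims):
    """Combiner cost x number of input elements reduced over."""
    return per_elem_flops * prod(operand_dims)

def reduce_window_flops(per_elem_flops, window_dims, output_dims):
    """Combiner cost x window volume x number of output elements."""
    return per_elem_flops * prod(window_dims) * prod(output_dims)

print(reduce_flops(1.0, [1024, 512]))
print(reduce_window_flops(1.0, [3, 3], [1, 112, 112, 64]))
```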
Elementwise — the flop/transcendental classifier
HandleElementwiseOp (@ 0x1e47b320) computes cost = ExtentProduct(output_shape) (1 op/element) and routes it by a switch on the opcode that selects the store offset:
HandleElementwiseOp(inst):                                   // sub_1E47B320
    cost = ExtentProduct(inst.shape);
    off  = is_transcendental(opcode) ? 0x54 : 0x50;          // ecx = 84 : 80
    Properties[inst].float[off] = (float)cost;
The 22-opcode transcendental set (stored to +0x54), byte-exact from the switch cases:
| Opcode | Name | Opcode | Name | Opcode | Name |
|---|---|---|---|---|---|
| 0x01 | acos | 0x2f | cosine | 0x55 | power |
| 0x02 | acosh | 0x30 | cosh | 0x68 | rsqrt |
| 0x0e | asin | 0x38 | erf | 0x75 | sine |
| 0x0f | asinh | 0x39 | exponential | 0x76 | sinh |
| 0x13 | atan2 | 0x3a | exponential-minus-one | 0x79 | sqrt |
| 0x14 | atanh | 0x45 | log | 0x7c | tan |
| 0x1c | cbrt | 0x46 | log-plus-one | 0x7d | tanh |
| | | 0x47 | logistic | | |
Every other elementwise op (add, multiply, subtract, and/or/xor, compare, clamp, negate, abs, floor, ceil, sign, round, shift, min, max, remainder, …) stores to +0x50 (flops), 1/element. The two surfaces agree only partly on which ops are expensive: logistic and erf are in the +0x54 transcendental set and are also priced as multi-µop sequences by the bundle model above, but divide is a multi-µop sequence on the bundle side while counting as a plain flop here, and the remaining transcendentals (tanh, for example) take the bundle model's DEFAULT arm.
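The classifier is a set-membership test that picks the store offset. A sketch using the 22 opcodes from the table above; the function name is hypothetical.

```python
# The 22 opcodes stored to +0x54, copied from the table above.
TRANSCENDENTAL_OPCODES = {
    0x01, 0x02, 0x0E, 0x0F, 0x13, 0x14, 0x1C,
    0x2F, 0x30, 0x38, 0x39, 0x3A, 0x45, 0x46, 0x47,
    0x55, 0x68, 0x75, 0x76, 0x79, 0x7C, 0x7D,
}

def store_offset(opcode: int) -> int:
    """Return the Properties float offset an elementwise op's cost is stored at."""
    return 0x54 if opcode in TRANSCENDENTAL_OPCODES else 0x50

print(hex(store_offset(0x7D)), hex(store_offset(0x03)), hex(store_offset(0x32)))
```

Divide (0x32) lands at +0x50: it is not in the transcendental set even though the bundle model prices it as a multi-deposit sequence.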
TPU-Specific Overrides
All three overrides are gated on *(byte*)(this + 456) — the model_tpu_specific_overheads option. When it is clear, each method tail-calls its base HloCostAnalysis counterpart and the TPU model contributes nothing. When set but the Target pointer (this + 56) is null, gather/scatter raise FailedPrecondition("target_ not specified but model_tpu_specific_overheads ... enabled") (tpu_hlo_cost_analysis.cc).
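The gate for gather/scatter is a three-way decision. The sketch below models it with plain callables; the exception class stands in for the status the real code returns, all names are hypothetical, and the elided part of the message is kept elided.

```python
class FailedPrecondition(RuntimeError):
    """Stand-in for the FailedPrecondition status described above."""

def run_override(gate_enabled, target, inst, base_handler, tpu_handler):
    """Gate off -> base emitter; gate on without a target -> error; else TPU path."""
    if not gate_enabled:
        return base_handler(inst)
    if target is None:
        raise FailedPrecondition(
            "target_ not specified but model_tpu_specific_overheads ... enabled"
        )
    return tpu_handler(target, inst)

print(run_override(False, None, "gather", lambda i: "base", lambda t, i: "tpu"))
print(run_override(True, "target", "gather", lambda i: "base", lambda t, i: "tpu"))
```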
HandleGather — chunk-ratio bytes, no flops
HandleGather (@ 0x130a2de0):
HandleGather(inst):                                          // sub_130A2DE0
    if !this[456]: return base::HandleGather(inst);          // overrides off
    if !this.target: FailedPrecondition(...);
    out_sz = GetShapeSize(inst.output_shape);                // bytes
    ratio  = GetGatherSizeInChunkRatio(target, inst);        // sub_14A8E420
    op1_sz = GetShapeSize(inst.operand(1).shape);            // index operand
    // four float fields stored at this+0x58/+0x6c/+0x70/+0x74 (bytes-accessed family)
    store {out_sz, ratio-amplified bytes, op1_sz, ...} to Properties bytes fields;
Gather is modelled as a pure memory/DMA op — no flops are added. The cost is the granule "chunk ratio" (GetGatherSizeInChunkRatio @ 0x14a8e420): a chunk-aligned read-amplification factor over the output and index sizes. The decompile stores four float results (+0x58, +0x6c, +0x70, +0x74) via vcvtsi2ss/vmovss, i.e. into the bytes-accessed property family rather than a single field. The internal granule math of GetGatherSizeInChunkRatio is not enumerated (the amplification factor is confirmed; the formula is LOW).
HandleScatter — chunk-ratio bytes plus combiner flops
HandleScatter (@ 0x130a3160):
HandleScatter(inst):                                         // sub_130A3160
    if !this[456]: return base::HandleScatter(inst);
    if !this.target: FailedPrecondition(...);
    op2_sz = GetShapeSize(inst.operand(2).shape);            // updates
    ratio  = GetScatterSizeInChunkRatio(target, inst);       // sub_14A90CE0
    op1_sz = GetShapeSize(inst.operand(1).shape);            // indices
    store ratio-amplified bytes into Properties bytes fields (this+0x58/0x5c/0x60/...);
    // PLUS the scatter-combiner compute:
    upd_extent = ExtentProduct(inst.operand(2).shape);       // updates element count
    per_elem   = ProcessSubcomputation(inst.to_apply());     // vtable slot 150
    for each non-zero combiner property p:
        Properties[inst].float[p_offset] += per_elem[p] * upd_extent;   // incl. +0x54 transcend.
Scatter = the gather-style chunk-ratio bytes term plus the update-combine compute. The decompile multiplies each non-zero combiner per-element property (flops +0x50, transcendentals +0x54, and the bytes fields) by ExtentProduct(updates) and folds it into Properties, iterating the combiner's property map via Properties::operator[] keyed by the standard cost-analysis property names. So a scatter with a transcendental combiner correctly bumps the transcendental count, not just flops. GetScatterSizeInChunkRatio is @ 0x14a90ce0 (granule formula LOW).
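The combiner fold can be sketched as a scaled merge of two property maps. The key strings below follow the upstream cost-analysis property names; the function name is hypothetical and the chunk-ratio bytes term is taken as already stored.

```python
def fold_combiner_cost(props, per_elem, update_extent):
    """Add each non-zero per-element combiner property, scaled by the update count."""
    for key, value in per_elem.items():
        if value != 0.0:
            props[key] = props.get(key, 0.0) + value * update_extent
    return props

# Bytes term already present; combiner is one flop plus one transcendental.
props = {"bytes accessed": 4096.0}
per_elem = {"flops": 1.0, "transcendentals": 1.0, "bytes accessed": 0.0}
print(fold_combiner_cost(props, per_elem, update_extent=256))
```

Zero-valued combiner properties are skipped, so the pre-stored bytes term is left untouched in this example.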
HandleCustomCall — target-keyed registry, else base
HandleCustomCall (@ 0x130a35c0):
HandleCustomCall(inst):                                      // sub_130A35C0
    reg = CustomCallRegistration::GetGlobalRegistry();       // lazy __cxa_guard; FunctionRegistry
    cb  = reg.Get(inst.custom_call_target());                // string-keyed lookup
    if cb found:
        return cb(inst, &this.shape_size_fn /*this+240*/, &this.Properties /*this+80*/);
    VLOG(1) "Custom hlo cost analysis not found: " << target;
    return base::HandleCustomCall(inst);                     // sub_1E482A20
Custom-call is the open extensibility hook: a util_registration::FunctionRegistry<string, Status(HloInstruction*, function<long(Shape const&)> const&, Properties&)> keyed on the custom-call target string. A registered callback computes a bespoke flop/byte model (handed the instruction, the GetShapeSize functor at this+240, and the Properties map at this+80); an unregistered target logs a VLOG(1) miss and falls back to the base emitter. This mirrors the custom-fusion registry on the fusion side (fusion cost model).
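The lookup-then-fallback shape can be sketched with a plain dictionary. Everything here is illustrative: the registry, the registration helper, the target string and the instruction representation are hypothetical stand-ins, not the library's API.

```python
from typing import Callable, Dict

Properties = Dict[str, float]
REGISTRY: Dict[str, Callable] = {}

def register(target: str, fn: Callable) -> None:
    """Associate a cost callback with a custom-call target string."""
    REGISTRY[target] = fn

def handle_custom_call(inst: dict, shape_size_fn: Callable, props: Properties,
                       base_handler: Callable) -> str:
    """Use the registered callback if one exists, else the base emitter."""
    fn = REGISTRY.get(inst["custom_call_target"])
    if fn is not None:
        return fn(inst, shape_size_fn, props)
    return base_handler(inst, props)

def example_cost(inst, shape_size_fn, props):
    # Hypothetical bespoke model: charge only the output bytes.
    props["bytes accessed"] = float(shape_size_fn(inst["shape"]))
    return "ok"

register("example_target", example_cost)
props: Properties = {}
status = handle_custom_call({"custom_call_target": "example_target", "shape": (8, 128)},
                            lambda shape: 4 * shape[0] * shape[1], props,
                            lambda inst, p: "base")
print(status, props)
```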
Function Map
| Function | Address | Role |
|---|---|---|
| TpuHloCostAnalysis::HandleGather | 0x130a2de0 | chunk-ratio bytes; gated on this+456 |
| TpuHloCostAnalysis::HandleScatter | 0x130a3160 | chunk-ratio bytes + combiner flops/transcend. |
| TpuHloCostAnalysis::HandleCustomCall | 0x130a35c0 | FunctionRegistry on target; else base |
| GetGatherSizeInChunkRatio | 0x14a8e420 | gather read-amplification factor |
| GetScatterSizeInChunkRatio | 0x14a90ce0 | scatter write-amplification factor |
| HloCostAnalysis::GetShapeSize | 0x1e47a6e0 | byte size of a shape |
| base HloCostAnalysis::HandleCustomCall | 0x1e482a20 | fallback emitter |
Worked Example — elementwise fusion bundle deposit
A loop-fusion body { multiply [256,128] (f32), add [256,128] (f32), tanh [256,128] }. element_count = 256·128 = 32768. Each op routes through the RecordHloCycles jump table:
multiply (0x4b) → CT 0x14 Shuffle → GetResource → R3 VectorAlu0: Acc(R3, 32768 × thru(0x14))
add (0x03), f32 → CT 0x12 RotIn → GetResource → R4 VectorAlu1: Acc(R4, 32768 × thru(0x12))
tanh (0x7d) → DEFAULT block → R5 VectorAluAny: Acc(R5, 32768 × 1.0)
The flop side classifies tanh as transcendental (Properties+0x54) and multiply/add as flops (Properties+0x50); the bundle side prices all three via the table above. MaxResourceCycles' 0.5-blend on {R3, R4, R5} runs the multiply (VectorAlu0) and add (VectorAlu1) lanes in parallel, load-balances the tanh "any" work into the less-busy lane, and overlaps the residual at 50%, so the bundle cost is ≈ max(R3, R4) + half the tanh residual, not the serial sum. Had add been integer-typed, it would have joined tanh on R5 (per the float-type gate on add/subtract) rather than pinning R4.
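The per-slot deposits for this fusion can be tabulated directly. The sketch below covers only the deposit step, not the MaxResourceCycles blend; the throughput values are placeholders and the function name is hypothetical.

```python
from collections import defaultdict

ELEMS = 256 * 128
# Placeholder per-generation throughputs (cycles/element), for illustration only.
THRU = {0x12: 1.0, 0x14: 1.0}

def deposit_fusion(add_is_float: bool) -> dict:
    """Deposit multiply, add and tanh for the example fusion body."""
    rv = defaultdict(float)
    rv["R3"] += ELEMS * THRU[0x14]              # multiply -> VectorAlu0
    add_slot = "R4" if add_is_float else "R5"   # float-type gate on add
    rv[add_slot] += ELEMS * THRU[0x12]
    rv["R5"] += ELEMS * 1.0                     # tanh -> DEFAULT arm
    return dict(rv)

print(deposit_fusion(add_is_float=True))
print(deposit_fusion(add_is_float=False))
```

In the integer-add variant R4 receives nothing and R5 carries both the add and the tanh work.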
Related Components
| Component | Relationship |
|---|---|
| GetHloResources Routing | The full GetHloResourcesImpl → sub-emitter routing this page summarizes |
| NormalizedComputationCost | The fusion-priority consumer of flop_count / the conv flop |
| Resource Enum (23-slot) | The ResourceVector slots and MaxResourceCycles reduction the deposits feed |
| Bundle-Aware Cost | The loop composition over the per-op deposits |
| Per-Opcode Cycle Constants | The per-gen GetCyclesForThroughput integers multiplied into each deposit |
Cross-References
- Cost Model Overview: the three class families that build the per-gen CycleTable/Performance the deposits query
- GetHloResources Routing: the five-way dispatch and the fusion-peel router into RecordHloCycles
- NormalizedComputationCost: the opcode→weight fusion-priority model that consumes flop_count and the conv flop
- Resource Enum (23-slot): the Resource slot names, Acc, and the MaxResourceCycles overlap model
- Per-Opcode Cycle Constants: the per-gen GetCyclesForThroughput(CT::Instruction) integers
- CycleTable Family: GetResource (op→slot) and the Instruction bucket enum
- Bundle-Aware Cost: the software-pipelined loop cost over the per-op deposits
- dot/conv MXU Lowering: why raw dot is lowered to convolution before the cost path
- Fusion Cost Model: the custom-fusion registry that mirrors the custom-call hook
- Compiler Overview — where HLO cost analysis sits in the compile pipeline