ConvolutionCostState

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped — every symbol below is a demangled C++ name. Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset. All addresses are VMA.
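The section map above can be turned into a small address helper. This is a minimal sketch that encodes only the two deltas stated here; the function name `vma_to_file_offset` and the dictionary are illustrative, not part of any tool.

```python
# Section deltas as stated on this page: .text and .rodata are identity-mapped,
# .data.rel.ro sits 0x200000 above its file offset.
SECTION_DELTA = {".text": 0x0, ".rodata": 0x0, ".data.rel.ro": 0x200000}


def vma_to_file_offset(vma: int, section: str) -> int:
    """Convert a VMA quoted on this page to a file offset in libtpu.so."""
    if section not in SECTION_DELTA:
        raise KeyError(f"no delta recorded for section {section!r}")
    return vma - SECTION_DELTA[section]


# A .text address maps to itself; the .data.rel.ro address below is a
# made-up value used only to show the subtraction.
text_offset = vma_to_file_offset(0x130A4B20, ".text")
relro_offset = vma_to_file_offset(0x21000000, ".data.rel.ro")
```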

Abstract

ConvCostState is the running state the cost model builds once per convolution before it prices the op. It is the conv-shaped analogue of an LLVM MachineFunction's analysis cache: extract the conv-like HLO once, decode its ConvolutionDimensionNumbers, kernel Shape, and Window, map each logical dimension to its physical (layout-mapped) position, run the Target's per-axis chunk grid over the three operand shapes, and stash the result so the per-kernel pricing leaf can read it directly. The struct is built by CostModel::GetConvolutionCostState @0x130a4b20 and is the const ConvCostState& argument carried through the whole conv-cost emitter chain.

There are in fact two distinct structs, and conflating them is the central trap. ConvCostState is the large persistent state (1180+ bytes — protos at +0x158/+0x1e8/+0x328, six layout-mapped dim sizes at +0x350..+0x378, a set of per-dim vector masks at +0x380..+0x458, the 21-byte ConvolutionLoweringStrategy dtype/pack flags at +0x470..+0x484, and three inlined_vector<long,6> chunk-count grids at +0x8/+0x78/+0xe8). ConvState is the small per-call dim-product struct (+0x0..+0x78) that CostModel::RecordConvolutionCycles @0x130b6ce0 builds inside the deposit, on the stack, from ConvCostState's chunk-count vectors divided by the Target geometry. The pricing leaf CostModel::RecordConvKernelCycles @0x130caf20 reads dtype/pack flags from ConvCostState and dim products from ConvState, multiplies them into an op count, and deposits cycles into the Matmul / Matpush / Xlu resource slots.

This page documents the field offsets of both structs and the throughput bridge that turns the conv's op count into cycles. The bridge is the central detail: the matmul-rate multiplier the conv deposit needs is VfCycleTable::GetCyclesForThroughput(CT 0), which is not a constant — it is a call into MxuLatencyTable::GetResourceUsage(instr 212, res 3), and the res argument is a resource-index remap (res 3 → array index 15 = MatmulAccA; res 11 → array index 0 = MatpushPushPort), not a cycle seed. So the VfCycleTable throughput integer and the MxuLatencyTable reservation cycle of mxu-latency-overview are the same number viewed two ways.

For reimplementation, the contract is:

  • The ConvCostState field map: the three copied protos, the six logical→physical dim sizes, the per-dim vector masks, the ConvolutionLoweringStrategy flag bytes (read only for opcode 0x2b=convolution), and the three chunk-count vectors.
  • The ConvState field map: the dim products at +0x8..+0x78, the four shape-case flag bytes at +0x0..+0x6, and the Product(chunk_vector)/SublaneCount/ChunksPerTile formula that fills each.
  • The Matmul/Matpush/Xlu deposit: op count × thru(CT) × format-rate (0.5/0.25) ÷ Target+0x4ac (matmul rate) or Target+0x4b0 (Xlu rate) ÷ ElementPackingFactor.
  • The VfCycleTable → MxuLatencyTable bridge and its res→index remap.
  • That Target+0x4ac/+0x4b0 are Target vector-ISA rate fields, not ConvCostState-local derate fields.

| Item | Value |
|---|---|
| Structs | xla::jellyfish::CostModel::ConvCostState (persistent) · xla::jellyfish::CostModel::ConvState (per-call) |
| Builder | CostModel::GetConvolutionCostState @0x130a4b20 (3520 B) |
| ConvState build | CostModel::RecordConvolutionCycles (4-RV) @0x130b6ce0 (built at rbp-0x1d0) |
| Pricing leaf | CostModel::RecordConvKernelCycles @0x130caf20 |
| Throughput bridge | VfCycleTable::GetCyclesForThroughput @0x1c89e2c0 → MxuLatencyTable::GetResourceUsage @0x1c8ae5c0 |
| thru(CT 0) | GetResourceUsage(instr 212, res 3, false) = array[15] = MatmulAccA = 8 (bf16) |
| thru(CT matpush) | GetResourceUsage(instr 267, res 11, false) = array[0] = MatpushPushPort = 2 (bf16) |
| Geometry divisors | Target::SublaneCount(), LaneCount(), ChunksPerTile() |
| Rate divisors | Target+0x4ac (matmul), Target+0x4b0 (Xlu): vector-ISA fields |

ConvCostState — The Persistent State

Purpose

GetConvolutionCostState @0x130a4b20 has the signature bool GetConvolutionCostState(const HloInstruction* inst, ConvCostState* out). It extracts the conv-like HLO (fusion_util::ExtractConvLikeHlo @0x1d6aa140; a null result is a LogMessageFatal on conv_like_hlo != nullptr at cost_model.cc:3374), then fills the struct top-to-bottom: copy the three shape/window protos, run ChunkCountsWithTmp over the three operand shapes, map each logical conv dimension to its physical position via LayoutUtil::MakeLogicalToPhysical, initialise the per-dim vector masks, classify each dim (batch / outer / inner) into the three mask vectors, and finally — only for a real convolution opcode — copy the ConvolutionLoweringStrategy flag block.

Field Map

a3 is the ConvCostState*. Offsets are byte offsets from the struct base; the decompiler renders them as (char*)a3 + N (proto copies) or *((_QWORD*)a3 + N/8) (scalar/pointer stores).

| Offset | Content | Store site / source |
|---|---|---|
| +0x0 | const HloInstruction* conv | *(_QWORD*)a3 = ExtractConvLikeHlo(inst) @0x130a4b4f |
| +0x8 | inlined_vector<long,6>: operand-shape chunk counts | ChunkCountsWithTmp(operand_shape), then Assign(a3+1) @0x130a4c25 |
| +0x78 | inlined_vector<long,6>: output-shape chunk counts | ChunkCountsWithTmp(output_shape), then Assign(a3+15) @0x130a4c38 |
| +0xe8 | inlined_vector<long,6>: kernel-shape chunk counts | ChunkCountsWithTmp(kernel_shape), then Assign(a3+29) @0x130a4c58 |
| +0x158 | ConvolutionDimensionNumbers | CopyFrom((char*)a3 + 344, …) @0x130a4b79 |
| +0x1e8 | Shape (operand/kernel shape) | Shape::operator=((char*)a3 + 488, …) @0x130a4b90 |
| +0x328 | Window | Window::CopyFrom((char*)a3 + 808, …) @0x130a4baa |
| +0x350 | output-feature dim SIZE (physical) | logical→physical [+106] @0x130a51f7 |
| +0x358 | input-feature dim SIZE (physical) | logical→physical [+107] @0x130a5212 |
| +0x360 | output-spatial dim SIZE (physical) | logical→physical [+108] @0x130a5235 |
| +0x368 | input-spatial dim SIZE (physical) | logical→physical [+109] @0x130a5250 |
| +0x370 | kernel-feature dim SIZE (physical) | logical→physical [+110] @0x130a5279 |
| +0x378 | kernel-spatial dim SIZE (physical) | logical→physical [+111] @0x130a529b |
| +0x380 | vector<bool>: per-dim mask A (batch-dim) | assign(N, 0) @0x130a52bb |
| +0x398 | vector<bool>: per-dim mask B (outer-dim) | assign(N, 0) @0x130a52dd |
| +0x3b0 | vector<bool>: per-dim mask C (inner-dim) | assign(N, 0) @0x130a52ff |
| +0x3c8 / +0x3e0 / +0x3f8 | vector<long> (init = 1) | assign(N, 1) @0x130a531e/533d/535c |
| +0x410 | vector<bool> | assign(N, 0) @0x130a537e |
| +0x428 | vector<long> (init = 1) | assign(N, 1) @0x130a539d |
| +0x440 / +0x458 | vector<long> (init = 0) | assign(N, 0) @0x130a53bc/53e2 |
| +0x470..+0x484 | ConvolutionLoweringStrategy (21 B) | GetConvolutionLoweringStrategy, copied only when opcode == 0x2b @0x130a5705..577a |

NOTE — The three chunk-count inlined_vector<long,6> members sit at +0x8/+0x78/+0xe8 on a 0x70 stride (112 B per member, including its header): the three Storage<long,6>::Assign calls target (__int64*)a3 + 1, +15, +29 immediately after the three ChunkCountsWithTmp calls. The first vector is fed from the operand shape (v128), the second from the output shape, the third from the kernel shape — easy to mis-bind if the operand vector is overlooked.
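The qword-index and byte-offset views of the same fields are easy to cross-check mechanically. The sketch below only re-derives the offsets quoted in the field map from the decompiler indices (a3 + 1/15/29 and [+106]..[+111]); the dictionary and helper names are illustrative.

```python
QWORD = 8  # the decompiler indexes a3 as a _QWORD*, so index k is byte offset 8*k


def qword_index_to_offset(index: int) -> int:
    """Translate a decompiler qword index into a struct byte offset."""
    return index * QWORD


# Assign() targets for the three chunk-count vectors, in call order.
CHUNK_VECTOR_QWORD_INDEX = {"operand": 1, "output": 15, "kernel": 29}

chunk_offsets = {
    name: qword_index_to_offset(i) for name, i in CHUNK_VECTOR_QWORD_INDEX.items()
}

# The six logical-to-physical dim sizes are stored at qword indices 106..111.
dim_size_offsets = [qword_index_to_offset(i) for i in range(106, 112)]

# Distance between consecutive inlined_vector<long,6> members.
stride = chunk_offsets["output"] - chunk_offsets["operand"]
```

Binding the vectors by name rather than by position makes the operand / output / kernel order explicit, which is the mis-binding the note warns about.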

The ConvolutionLoweringStrategy Flag Block

The +0x470..+0x484 region is the 21-byte ConvolutionLoweringStrategy of mxu-latency-overview's sibling. It is copied as one vmovups [r15+0x470] of the 16-byte head plus a qword tail at +0x47d, only when the conv-like HLO's opcode byte ([conv+12]) is 0x2b (convolution). RecordConvKernelCycles reads it byte-by-byte as the dtype / packing decision. The byte offsets within the struct (confirmed by the byte reads in @0x130caf20):

| CCS offset | Decimal | Flag (inferred name) | Read in RecordConvKernelCycles |
|---|---|---|---|
| +0x470 | 1136 | perform_batch_packed_activation_reorg | == 1 gate @0x130cb040 |
| +0x476 | 1142 | perform_activation_if_bf16_packing | != 0 / == 1 gate @0x130cb1b8/cb2bb |
| +0x477 | 1143 | generate_combined_x8_packed_matmuls_and_latches | OR-combine @0x130caf38 |
| +0x478 | 1144 | generate_combined_x4_packed_matmuls_and_latches | OR-combine @0x130caf38 |
| +0x479 | 1145 | generate_post_xpose_x8_packed_matmuls | OR-combine @0x130caf38 |
| +0x47a | 1146 | generate_bf16_packed_vmatmuls | ^1 index-select @0x130cb37e |
| +0x47e | 1150 | perform_activation_bf16_packing | OR-combine @0x130caf38 |
| +0x480 | 1152 | generate_x8_packed_vlatches | == 1 (0.25 quarter-rate gate) @0x130cb468 |
| +0x481 | 1153 | generate_x8_packed_vlatches_for_activations | OR-combine @0x130caf38 |
| +0x482 | 1154 | generate_x8_packed_vmatmuls | OR-combine @0x130caf38 |

These bytes OR-combine (@0x130caf38..af86) to decide whether TransferSizeUtil::ElementPackingFactor @0x1d6b03e0 is consulted (called only if any flag is set) versus the rbx = 1 / 2 default. The flag names are inferred from the ConvolutionLoweringStrategy proto field order and the gate semantics; the offsets and read sites are byte-confirmed.
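The OR-combine gate can be sketched as a predicate over the 21-byte block. This assumes the combined set is exactly the six rows marked "OR-combine" in the table above; the function name and the use of a Python `bytes` object for the block are illustrative.

```python
CCS_FLAG_BASE = 0x470    # ConvCostState offset of the flag block
FLAG_BLOCK_SIZE = 21     # +0x470..+0x484 inclusive

# Rows tagged "OR-combine" in the table above (assumed to be the full set).
OR_COMBINED_OFFSETS = (0x477, 0x478, 0x479, 0x47E, 0x481, 0x482)


def consults_packing_factor(flag_block: bytes) -> bool:
    """True when any OR-combined flag byte is non-zero.

    In that case the leaf is described as calling ElementPackingFactor;
    otherwise it falls back to the register default.
    """
    if len(flag_block) != FLAG_BLOCK_SIZE:
        raise ValueError("expected the 21-byte ConvolutionLoweringStrategy block")
    return any(flag_block[off - CCS_FLAG_BASE] for off in OR_COMBINED_OFFSETS)


def block_with(offset: int) -> bytes:
    """Build a test block with a single flag byte set at a CCS offset."""
    raw = bytearray(FLAG_BLOCK_SIZE)
    raw[offset - CCS_FLAG_BASE] = 1
    return bytes(raw)
```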


ConvState — The Per-Call Dim-Product Struct

Purpose

ConvState is built on the stack (at rbp-0x1d0) inside RecordConvolutionCycles (4-RV) @0x130b6ce0 and passed by const& to RecordConvKernelCycles. It is not persisted. Each dim field is a Product(chunk_dim_sizes) over one of ConvCostState's chunk-count vectors, divided by a Target geometry constant (SublaneCount() = 8, LaneCount() = 128, ChunksPerTile() = 16), with paired remainder / divisibility flags. The three xla::Product calls (@0x130b6f1d/6f43/6f69) reduce the operand / kernel / output chunk vectors; the four leading bytes are the conv-shape-case flags the deposit branches on.
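The fill pattern for one dim field can be sketched as a product followed by a division with a remainder flag. The geometry constants are the ones quoted above; which divisor pairs with which field, and whether the real code rounds down or up, is not established here, so the floor division below is an assumption.

```python
from math import prod

# Target geometry constants as quoted on this page.
SUBLANE_COUNT = 8
LANE_COUNT = 128
CHUNKS_PER_TILE = 16


def dim_product(chunk_counts, divisor):
    """Reduce one chunk-count vector and divide by a geometry constant.

    Returns (quotient, remainder, divides_evenly). Floor division is an
    assumption of this sketch, not a confirmed detail of the binary.
    """
    total = prod(chunk_counts)
    quotient, remainder = divmod(total, divisor)
    return quotient, remainder, remainder == 0


# Hypothetical chunk-count vectors, chosen only to exercise both outcomes.
even_case = dim_product([4, 6, 2], SUBLANE_COUNT)
ragged_case = dim_product([3, 5], CHUNKS_PER_TILE)
```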

Field Map

a3 (in RecordConvKernelCycles) is the ConvState*. The decompiler renders fields as a3[k] (8-byte stride) or a3[byte].

| Offset | a3[k] | Content (inferred role) |
|---|---|---|
| +0x0 | a3[0] byte | shape-case flag ("input == kernel" class) |
| +0x2 | a3[2] byte | shape-case flag ("dim ≥ 2" class) |
| +0x3 | a3[3] byte | feature-dim selector (used as [+0x3]^1 index pick) |
| +0x5 | a3[5] byte | HasNonBatchDotBaseDilation result @0x130b79fc |
| +0x6 | a3[6] byte | grouped-conv gate (== 1) |
| +0x8 | a3[1] | spatial / output dim product |
| +0x10 | a3[2] | spatial / output dim product (minor gate) |
| +0x18 | a3[3] | spatial dim product |
| +0x20 | a3[4] | output dim product |
| +0x40 | a3[8] | feature-group / kernel-feature product (read in ShouldPackInputFeature) |
| +0x48 | a3[9] | window-volume / spatial product |
| +0x50 | a3[10] | the M dim |
| +0x60 | a3[12] | feature dim product ([+0x3]^1-indexed pair with +0x68) |
| +0x68 | a3[13] | feature dim product (sibling of +0x60) |
| +0x70 | a3[14] | output-spatial iteration count |
| +0x78 | a3[15] | window-iteration count |

The pair (+0x60, +0x68) plus the [+0x3]^1 XOR-select (@0x130cafc3) is the "which-feature-dim-first" choice: the deposit reads a3[16*a3[3] + 104] (=[+0x68]-region) and a3[16*(a3[3]^1) + 96] (=[+0x60]-region). The op count is a3[3] · a3[2] · a3[4] · a3[1] (the [+0x18]·[+0x10]·[+0x20]·[+0x8] block) × [+0x50] × feat0 × feat1, then a branch on the flag bytes [+0x0], [+0x2], [+0x3] folds in [+0x48] / [+0x70] / [+0x78] (the four conv-shape cases at @0x130cb011/0c0/0ee/113), finally divided by ElementPackingFactor and ChunksPerTile.
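The base of the op count, before the shape-case branch, is a plain product of ConvState fields. The sketch below models ConvState as a dictionary keyed by byte offset and takes the two selected feature products as arguments, because the selector arithmetic is not reproduced here; the shape-case folding and the final divisions are also left out.

```python
def base_op_count(conv_state: dict, feat0: int, feat1: int) -> int:
    """Product of the four-field block, the M dim, and both feature products.

    conv_state maps ConvState byte offsets to their 8-byte values. feat0 and
    feat1 stand for the two values picked by the [+0x3] selector.
    """
    block = (
        conv_state[0x18]
        * conv_state[0x10]
        * conv_state[0x20]
        * conv_state[0x8]
    )
    return block * conv_state[0x50] * feat0 * feat1


# Hypothetical field values, for illustration only.
example_state = {0x8: 2, 0x10: 3, 0x18: 4, 0x20: 5, 0x50: 7}
example_count = base_op_count(example_state, feat0=2, feat1=3)
```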

NOTE — the semantic role of each ConvState dim offset (which is output-feature vs input-feature vs spatial vs batch vs kernel-window) follows the imul chain and the ChunkCountsWithTmp source order; this build carries no DWARF, so the dimension names are an interpretation while the offsets and the formula shape are taken straight from the disassembly. The authoritative source is GetConvIterationCounts @0x130b3c80.


The Throughput Bridge — VfCycleTable → MxuLatencyTable

Why a Bridge Exists

RecordConvKernelCycles needs a throughput (cycles-per-op) for the matmul and the matpush, not the per-op latency of the base Performance grid. That throughput is the per-resource occupancy from mxu-latency-overview. VfCycleTable::GetCyclesForThroughput(CycleTable::Instruction) @0x1c89e2c0 is the adapter: it maps each of the 32 CycleTable::Instruction (CT) ordinals to the right call into the MXU occupancy table. The conv deposit reaches it through the Target's CycleTable vtable, via the (*(…+16LL))(this, CT, …) indirect calls in @0x130caf20 (e.g. CT 0 at @0x130cb..., CT 1 at @0x130cb375).

The CT → (instr, res, transpose) Table

The decompile of @0x1c89e2c0 (a switch(a2) over CT, default → return 1) confirms three call shapes: matmul/matpush CTs call MxuLatencyTable::GetResourceUsage; mxres/Xlu CTs call ViperfishPerformance::GetResourceUsage(…, res 14) + 1; CT 10/16 are a LogMessageFatal("Unsupported PushGainsS4.") at cycle_table.cc:682.

| CT | Call | Role |
|---|---|---|
| 0 | MxuLat.GetResourceUsage(212 matmul', res 3, false) | the conv matmul-rate thru(CT 0) |
| 1 | MxuLat.GetResourceUsage(218 matmul'', res 3, false) | matmul-rate (alt) |
| 4 | MxuLat.GetResourceUsage(230 matmul, res 3, false) | matmul-rate (base) |
| 5 | MxuLat.GetResourceUsage(267 matpush, res 11, false) | matpush-rate |
| 6 | MxuLat.GetResourceUsage(271 matpush', res 11, false) | matpush'-rate |
| 9 | MxuLat.GetResourceUsage(277 matpush'', res 11, false) | matpush''-rate |
| 10, 16 | LogMessageFatal("Unsupported PushGainsS4.") | abort |
| 11 | MxuLat.GetResourceUsage(267, res 11, **true**) | matpush-rate (transposed) |
| 12 | MxuLat.GetResourceUsage(271, res 11, **true**) | matpush'-rate (transposed) |
| 15 | MxuLat.GetResourceUsage(277, res 11, **true**) | matpush''-rate (transposed) |
| 23 | VfPerf.GetResourceUsage(296, res 14) + 1 | mxres / Xlu (matres TC) |
| 27 | VfPerf.GetResourceUsage(302, res 14) + 1 | mxres |
| 28 | VfPerf.GetResourceUsage(287, res 14) + 1 | the conv MXRES (Xlu) |
| 29 | VfPerf.GetResourceUsage(291, res 14) + 1 | mxres |
| 31 | VfPerf.GetResourceUsage(298, res 14) + 1 | mxres |
| all others | return 1 | vector / matprep-bucket default |
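For a reimplementation, the switch is easiest to carry as data. The sketch below encodes the rows of the table above and a describer that reports which call shape a CT takes; it performs no lookup of actual cycle values, and the names `CT_TABLE` and `describe_ct` are illustrative.

```python
MXU = "MxuLatencyTable"
VFPERF = "ViperfishPerformance"

# CT ordinal -> (table, instr, res, transpose); transpose is None for VfPerf rows.
CT_TABLE = {
    0: (MXU, 212, 3, False),
    1: (MXU, 218, 3, False),
    4: (MXU, 230, 3, False),
    5: (MXU, 267, 11, False),
    6: (MXU, 271, 11, False),
    9: (MXU, 277, 11, False),
    11: (MXU, 267, 11, True),
    12: (MXU, 271, 11, True),
    15: (MXU, 277, 11, True),
    23: (VFPERF, 296, 14, None),
    27: (VFPERF, 302, 14, None),
    28: (VFPERF, 287, 14, None),
    29: (VFPERF, 291, 14, None),
    31: (VFPERF, 298, 14, None),
}
FATAL_CTS = {10, 16}  # LogMessageFatal("Unsupported PushGainsS4.")


def describe_ct(ct: int) -> str:
    """Report the call shape the switch takes for one CT ordinal."""
    if not 0 <= ct < 32:
        raise ValueError("CT ordinals run from 0 to 31")
    if ct in FATAL_CTS:
        return "fatal: Unsupported PushGainsS4."
    entry = CT_TABLE.get(ct)
    if entry is None:
        return "return 1"  # switch default
    table, instr, res, transpose = entry
    if table == MXU:
        return f"{table}.GetResourceUsage({instr}, res {res}, {str(transpose).lower()})"
    return f"{table}.GetResourceUsage({instr}, res {res}) + 1"
```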

The transpose flag is the 4th GetResourceUsage arg; the matpush key pre-transforms (267 direct, 271 → latch_mode ^ 0xB, 277 → latch_mode | 0x14) are applied inside GetResourceUsage (see mxu-latency-overview).

The Throughput Bridge — res Is a Resource-Index Remap

MxuLatencyTable::GetResourceUsage(instr, res, transpose) @0x1c8ae5c0 does not index its array<int,19> reservation vector with res directly. It first remaps:

function GetResourceUsage(instr, res, transpose):       // VF @0x1c8ae5c0
    if   res == 3:  idx = 15        // → MatmulAccA
    elif res == 11: idx = 0         //  → MatpushPushPort
    else: return InvalidArgument("Unsupported kind of resource")   // mxu_latency_table_vf.cc:547

    // build the modifier key from the opcode, then find:
    switch (instr):
        case 212: key = MatmulModifier{format=1}; map = this+0x20
        case 218: key = MatmulModifier{format=2}; map = this+0x20
        case 230: key = MatmulModifier{format=6}; map = this+0x20
        case 267: key = MatpushKey(latch_mode);          map = this+0x00
        case 271: key = MatpushKey(latch_mode ^ 0xB);    map = this+0x00
        case 277: key = MatpushKey(latch_mode | 0x14);   map = this+0x00
        default:  LogFatal("Unsupported opcode")          // vf.cc:578

    entry = map.find(key)                 // miss → out_of_range
    array<int,19> v = entry.value         // vmovups: matmul reads [rdx+8], matpush reads [rdx+4]
    CHECK(idx < 0x13)                      // == 19
    return (float) v[idx]                  // the hold-cycle count

So the res argument 3/11 is a named-sub-resource selector, remapped to the concrete array index (3 → 15 = MatmulAccA, 11 → 0 = MatpushPushPort), and that indexes the per-modifier reservation array. The matmul CTs (0, 1, 4) therefore read array[15]; the matpush CTs (5, 6, 9, 11, 12, 15) read array[0]. The VfCycleTable throughput integer is the MxuLatencyTable reservation cycle — the two cost tables are the same numbers.
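A runnable toy model of the remap makes the two-step lookup concrete. It mirrors the pseudocode above under loud assumptions: the reservation arrays hold only the two bf16 values quoted on this page (every other slot is a placeholder zero), the matpush map is keyed by a hypothetical latch_mode of 0, and the transpose argument is omitted because its effect is not described here.

```python
RES_TO_INDEX = {3: 15, 11: 0}  # res 3 -> MatmulAccA, res 11 -> MatpushPushPort
ARRAY_LEN = 19                 # array<int,19> reservation vector


def make_reservation(index: int, value: int) -> list:
    """Placeholder reservation array with one known slot filled in."""
    arr = [0] * ARRAY_LEN
    arr[index] = value
    return arr


# Only the bf16 entries quoted on this page; keys and zeros are illustrative.
MATMUL_MAP = {1: make_reservation(15, 8)}   # MatmulModifier{format=1}
MATPUSH_MAP = {0: make_reservation(0, 2)}   # hypothetical latch_mode 0

MATMUL_FORMAT = {212: 1, 218: 2, 230: 6}
MATPUSH_KEY = {
    267: lambda mode: mode,
    271: lambda mode: mode ^ 0xB,
    277: lambda mode: mode | 0x14,
}


def get_resource_usage(instr: int, res: int, latch_mode: int = 0) -> float:
    """Remap res to an array index, find the modifier entry, read the slot."""
    if res not in RES_TO_INDEX:
        raise ValueError("Unsupported kind of resource")
    idx = RES_TO_INDEX[res]
    if instr in MATMUL_FORMAT:
        entry = MATMUL_MAP[MATMUL_FORMAT[instr]]      # miss raises KeyError
    elif instr in MATPUSH_KEY:
        entry = MATPUSH_MAP[MATPUSH_KEY[instr](latch_mode)]
    else:
        raise ValueError("Unsupported opcode")
    assert idx < ARRAY_LEN
    return float(entry[idx])
```

Keeping the remap as a dictionary separate from the modifier maps reflects the point of this section: res selects a named sub-resource, and only the remapped index touches the array.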

| thru(CT 0): matmul rate (array[15] = MatmulAccA, per MatmulModifier format) | value |
|---|---|
| bf16 (format group 1) | 8 |
| format group 2 | 16 |
| int8 / wide (format group 4) | 32 |

| thru(CT 5): matpush rate (array[0] = MatpushPushPort, per MatpushModifier) | value |
|---|---|
| bf16 narrow | 2 |
| bf16 mid | 4 |
| int8 / x8 wide | 8 |

NOTE — the non-bf16 entries follow the per-format reservation groups; the live MatmulModifier/MatpushModifier maps per MatmulDataFormat give the exact values. The {2,1,1} / {4,3,2} / {8,7,6} reservation triplet's first element (the MatpushPushPort value) is exactly thru(CT matpush); the other two (matrix-staging-register holds) are read by separate GetResourceUsage calls in the MXU scheduler, not by this bridge.

Per-Generation Bridge

The bridge is per-generation. The Ghostlite helper GlcCycleTable::GetCyclesForThroughputHelper @0x1c89ed20 is structurally identical but uses the Ghostlite instruction set — matmul CTs (0, 1, 4) use instrs 292/298/310 (0x124/0x12a/0x136); matpush CTs (5, 6, 9, 10) use 347/349/356/358 (0x15b/0x15d/0x164/0x166); the GL MxuLatencyTable array<int,11> remaps res 4 → idx 3 and res 9 → idx 9. GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 is the TPU7x variant with the same shape (array<int,11>). A reimplementation must read the (instr, res) pair from the per-gen helper.
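The per-generation differences can be held in one table keyed by generation. The sketch below records only what this page states (array length, res remap, and CT to instr numbers); it does not claim which res value each Ghostlite CT passes, and the TPU7x variant is omitted because its instrs are not listed here. All names are illustrative.

```python
BRIDGES = {
    "viperfish": {
        "array_len": 19,
        "res_to_index": {3: 15, 11: 0},
        "matmul_instrs": {0: 212, 1: 218, 4: 230},
        "matpush_instrs": {5: 267, 6: 271, 9: 277},
    },
    "ghostlite": {
        "array_len": 11,
        "res_to_index": {4: 3, 9: 9},
        "matmul_instrs": {0: 0x124, 1: 0x12A, 4: 0x136},
        "matpush_instrs": {5: 0x15B, 6: 0x15D, 9: 0x164, 10: 0x166},
    },
}


def matmul_instr(generation: str, ct: int) -> int:
    """Instruction number a matmul CT resolves to on one generation."""
    return BRIDGES[generation]["matmul_instrs"][ct]


def remap(generation: str, res: int) -> int:
    """Array index selected by a res value, bounds-checked per generation."""
    bridge = BRIDGES[generation]
    idx = bridge["res_to_index"][res]
    assert idx < bridge["array_len"]
    return idx
```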


The Deposit — Putting It Together

RecordConvKernelCycles @0x130caf20 computes one op count from ConvState and deposits cycles into three slots (ResourceVector::Acc(slot, cycles)):

| Slot | Acc index | Deposit | Site |
|---|---|---|---|
| Matmul | R[1] | op_count × thru(CT 1) × 0.5 ÷ Target+0x4ac ÷ ElementPackingFactor | @0x130cb... (Acc @line 560) |
| Matpush | R[0] | matpush_count × thru(CT matpush) | Acc @0x130cb... (@line 609) |
| Xlu | R[2] | thru(CT mxres) × ChunksPerTile × remainder ÷ Target+0x4b0 | Acc @line 259/428 |

The 0.5 (.rodata 0xa2df5c8) is the half-rate the matmul multiplies by when the format flag v166 is clear; the 0.25 (.rodata 0xa2df6d0) replaces it only when ConvCostState+0x480 (generate_x8_packed_vlatches) == 1 and Target+0x398 == 2 (the topology gate at @0x130cb468/cb471). The matmul-rate divisor is read as vcvtsi2sd xmm1, [rax+0x4ac] (Target+0x4ac); the Xlu divisor as [rax+0x4b0] reached through the CycleTable's Target pointer.
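The deposit arithmetic can be sketched as three small functions. This follows the formulas in the table above and the 0.5 / 0.25 selection just described; it does not model the case where the format flag is set, and the evaluation order and any rounding in the binary are assumptions. Parameter names such as `target_0x4ac` simply label the field each value is read from.

```python
def matmul_rate_factor(ccs_flag_0x480: int, target_0x398: int) -> float:
    """0.25 only when the x8-vlatch flag is 1 and the topology gate is 2."""
    if ccs_flag_0x480 == 1 and target_0x398 == 2:
        return 0.25
    return 0.5


def matmul_cycles(op_count, thru, rate_factor, target_0x4ac, packing_factor):
    """R[1] Matmul deposit, as a float."""
    return op_count * thru * rate_factor / target_0x4ac / packing_factor


def matpush_cycles(matpush_count, thru_matpush):
    """R[0] Matpush deposit."""
    return matpush_count * thru_matpush


def xlu_cycles(thru_mxres, chunks_per_tile, remainder, target_0x4b0):
    """R[2] Xlu deposit, as a float."""
    return thru_mxres * chunks_per_tile * remainder / target_0x4b0
```

For example, with a hypothetical op count of 1024, thru = 8, the default 0.5 factor, a rate divisor of 2 and a packing factor of 1, `matmul_cycles` gives 2048.0.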

GOTCHA — Target+0x4ac (matmul rate) and Target+0x4b0 (Xlu rate) are Target fields, not ConvCostState fields. They are chip-wide vector-ISA throughput-rate parameters populated in Target::Init from TpuSequencerParts::vector_isa() @0x20b31840: vector_isa[+0xc] (8 bytes) → Target+0x4ac & +0x4b0; vector_isa[+0x14] → Target+0x4a8. The conv deposit reads them through *(_QWORD*)a1 (the CostModel's Target*) and through the CycleTable's Target*, never out of ConvCostState; they are not a conv-local derate.

Because Matpush, Matmul, and Xlu (R[0]/R[1]/R[2]) are in the plain-MAX group of MaxResourceCycles (the MXU pipes overlap; see resource-enum), a compute-bound conv's bundle cost is ≈ max(Matmul, Matpush, Xlu), not their sum.


Worked Example — bf16 Conv on Viperfish

thru(CT 0)  = VfCycleTable::GetCyclesForThroughput(0)
            = MxuLatencyTable::GetResourceUsage(instr=212, res=3, false)
            = MatmulModifier{format=1 bf16}, array[remap(res=3)=15]
            = array[15] = MatmulAccA = 8

R[1] Matmul  = op_count × 8 × 0.5 ÷ Target[+0x4ac] ÷ ElementPackingFactor[bf16=1]

thru(CT 5)   = GetResourceUsage(instr=267, res=11, false) = array[0] = MatpushPushPort = 2
R[0] Matpush = matpush_tile_count × 2

thru(CT 28)  = VfPerf.GetResourceUsage(287, res=14) + 1            (the Xlu mxres-read rate)
R[2] Xlu     = thru(CT 28) × ChunksPerTile × remainder ÷ Target[+0x4b0]

bundle cost ≈ max(R[0], R[1], R[2])                                (MXU pipes overlap)

For an int8-x8 conv the matmul format selects group 4 → array[15] = 32 (matmul), array[0] = 8 (matpush): 4× the bf16 cycle counts in both slots, which corresponds to the four-byte-plane x8 latch sequence.
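The 4× relationship and the max-based bundle cost can be checked with a few lines. The throughput values are the ones quoted in this example; the slot cycle counts passed to `bundle_cost` are hypothetical inputs.

```python
# Throughput integers quoted on this page for the two formats.
BF16 = {"matmul": 8, "matpush": 2}
INT8_X8 = {"matmul": 32, "matpush": 8}


def cycle_ratio(slot: str) -> float:
    """How many times more cycles per op int8-x8 reserves than bf16."""
    return INT8_X8[slot] / BF16[slot]


def bundle_cost(matpush: float, matmul: float, xlu: float) -> float:
    """The three MXU slots overlap, so the bundle costs their maximum."""
    return max(matpush, matmul, xlu)


ratios = {slot: cycle_ratio(slot) for slot in BF16}
```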


Sibling — Reduce-Window Cost Reuses ConvState

CostModel::RecordReduceWindowCycles @0x130b5ec0 is the pooling sibling. It takes three Windows directly (not a ConvCostState) and builds the same stack ConvState at rbp-0x1d0 by the same Product / SublaneCount / ChunksPerTile division. A trivial path: when CostModel+0x14 is false it deposits zero into R[7] VectorLoad, R[3] VectorAlu0, R[4] VectorAlu1, R[2] Xlu and returns (Acc(a9, 7/3/4/2) @line 782..788). Otherwise it dispatches on GetReduceWindowType @0x1454d4a0 (0 → RecordLaneReduceWindowCycles, 1 → RecordSublaneReduceWindowCycles, 2 → RecordMajorReduceWindowCycles), the Lane and Sublane paths gated on !HasBaseDilation. The lane leaf (@0x130c97e0) deposits R[7] VectorLoad (op_volume = ConvState[+0x20]·[+0x18]·[+0x70]), an R[5] VectorAluAny per-element combine (thru(CT 22)), the UpdateCostBasedOnReductionFunction combiner cost (count = op_volume × (ConvState[+0x68] − 1)), and an R[2] Xlu cross-lane drain (thru(CT 27) ÷ Target+0x4b0). The full pooling cost is on reduce-window-pooling-cost.
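The dispatch and the two lane-leaf counts can be sketched directly from the formulas in the paragraph above. ConvState is again modeled as a dictionary keyed by byte offset; the !HasBaseDilation gating and the actual cycle deposits are not modeled, and the helper names are illustrative.

```python
# GetReduceWindowType result -> leaf routine.
REDUCE_WINDOW_LEAF = {
    0: "RecordLaneReduceWindowCycles",
    1: "RecordSublaneReduceWindowCycles",
    2: "RecordMajorReduceWindowCycles",
}


def leaf_for(window_type: int) -> str:
    """Name of the leaf the dispatcher selects for a reduce-window type."""
    return REDUCE_WINDOW_LEAF[window_type]


def lane_counts(conv_state: dict):
    """Return (op_volume, combiner_count) for the lane leaf.

    op_volume = [+0x20] * [+0x18] * [+0x70]
    combiner_count = op_volume * ([+0x68] - 1)
    """
    op_volume = conv_state[0x20] * conv_state[0x18] * conv_state[0x70]
    combiner_count = op_volume * (conv_state[0x68] - 1)
    return op_volume, combiner_count


# Hypothetical field values, for illustration only.
example = lane_counts({0x18: 2, 0x20: 3, 0x70: 4, 0x68: 5})
```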


Function Map

| Function | Address | Role |
|---|---|---|
| CostModel::GetConvolutionCostState | 0x130a4b20 | builds ConvCostState (3520 B) |
| CostModel::RecordConvolutionCycles (4-RV) | 0x130b6ce0 | builds ConvState, calls the kernel leaf |
| CostModel::RecordConvKernelCycles | 0x130caf20 | dim → MXU deposit (R[0]/R[1]/R[2]) |
| VfCycleTable::GetCyclesForThroughput | 0x1c89e2c0 | CT → MXU-occupancy adapter |
| viperfish::MxuLatencyTable::GetResourceUsage | 0x1c8ae5c0 | res→index remap + array[idx] |
| GlcCycleTable::GetCyclesForThroughputHelper | 0x1c89ed20 | Ghostlite bridge (array<int,11>) |
| GfcCycleTable::GetCyclesForThroughputHelper | 0x1c89f400 | TPU7x bridge (array<int,11>) |
| fusion_util::ExtractConvLikeHlo | 0x1d6aa140 | conv-like HLO extractor |
| convolution_util::GetConvolutionLoweringStrategy | 0x13192820 | the 21-byte flag block |
| Target::ChunkCountsWithTmp | 0x10c8b6e0 | per-axis chunk grid → +0x8/+0x78/+0xe8 |
| TransferSizeUtil::ElementPackingFactor | 0x1d6b03e0 | per-dtype packing divisor |
| ResourceVector::Acc | 0x1c89adc0 | the slot deposit |

| Name | Relationship |
|---|---|
| mxu-latency-overview | the MxuLatencyTable model the bridge reads: the reservation array thru(CT) returns |
| vf-cycletable | the 32-entry CycleTable::Instruction dump and the full throughput bridge per gen |
| window-description-cost | the conv/DMA byte+throughput primitive feeding ChunkCountsWithTmp |
| reduce-window-pooling-cost | the pooling sibling reusing the same ConvState build |
| resource-enum | the 23-slot ResourceVector the deposit writes; the MaxResourceCycles overlap model |

Cross-References

  • MXU Latency Overview — the MxuLatencyTable reservation model, the MatmulModifier/MatpushModifier keys, and the array[idx] read this bridge surfaces
  • VfCycleTable — the 32-entry CT → (instr, res) dump and the full GetCyclesForThroughput per-gen bridge
  • WindowDescription Byte-Cost — the conv/DMA byte + throughput primitive
  • Reduce-Window / Pooling Cost: RecordReduceWindowCycles, the pooling sibling that reuses ConvState
  • Resource Enum (23-slot) — the ResourceVector slots (Matmul/Matpush/Xlu) the conv deposit writes and the MaxResourceCycles overlap rule
  • MXU Slot — the LLO MXU instructions (matmul / matpush / matres) the conv deposit prices