MXU Latency: GF (6acc60406)

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped. Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset; .data VMA − 0x400000 == file offset. All cell integers were read directly from the GF MxuLatencyTable constructor.
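The section map above can be turned into a small address helper. This is a minimal sketch that assumes only the three deltas stated in the paragraph; the function name and the dictionary are illustrative and are not symbols from the binary.

```python
# VMA -> file-offset deltas, copied from the section map stated above.
SECTION_DELTA = {
    ".text": 0x0,
    ".rodata": 0x0,
    ".data.rel.ro": 0x200000,
    ".data": 0x400000,
}


def vma_to_file_offset(section: str, vma: int) -> int:
    """Translate a virtual address into a byte offset inside libtpu.so."""
    if section not in SECTION_DELTA:
        raise KeyError(f"no delta recorded for section {section!r}")
    offset = vma - SECTION_DELTA[section]
    if offset < 0:
        raise ValueError("VMA lies below the section delta")
    return offset
```

For example, a `.rodata` constant such as the one at @0xa2df230 is read at the same file offset, while a `.data` address has 0x400000 subtracted first.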

Abstract

This page dumps the 6acc60406 (v7, "GF") MxuLatencyTable: the per-modifier × MxuResource reservation matrix that prices how long each MXU op holds each internal sub-unit. It is the 6acc60406 sibling of the Viperfish array<int,19> and Ghostlite array<int,11> matrices, read by the byte-identical model documented in MXU Latency Overview. 6acc60406 and Ghostlite share the shape (array<int,11>, the same modifier key types, the same GainLatchMode key helpers) but not the instance: the two generations are built by distinct constructors (GlcCycleTable vs GfcCycleTable), carry distinct base op-latency costs, and read the cells through distinct accessors. Everything below is the GF instance, allocated by GfcCycleTable and built by the constructor whose CHECK strings name mxu_latency_table_gf.cc.

Two facts make the 6acc60406 numbers worth pinning. First, on a 256×256 systolic array the per-matmul hold is half Viperfish's: bf16 matmul holds the accumulate port for 4 cycles (VF: 8), so a back-to-back bf16 matmul stream pipelines at twice the VF issue rate. Second, the GF lookup reads the reservation array directly by MxuResource index — array[3] for matmul throughput, array[8] for matpush throughput — with no Ghostlite-style default-seed remap. The lookup also carries an fp8-fnuz (F8E4M3Fn, F8E5M2) GLM transform path that routes the fp8 latch modes into their own reservation buckets.

The page closes with windowing_util::ComputeDmaLevels, decompiled in full, because the 6acc60406 matmul cost is multiplied by a DMA-efficiency factor whose input is the descriptor-fragment count this function computes. The fragment count buckets into one of {1.6, 1.3, 1.1, 1.05} (or 1.0 for a single contiguous transfer or a product > 31), and the bucketing is byte-exact.

For reimplementation, the contract is:

  • The GF MxuLatencyTable object layout (0xa0 bytes, maps at this+0x00/0x20/0x40/0x60/0x80) and the array[res & 0xf] = cycles insert mechanism with the < 11 kNumMxuResources bound.
  • The 16 matmul and 16 matpush reservation rows, cell-by-cell, plus the two Vlxmr rows and the MatmulDataFormat → base-latency cost map (kF32/kBf16 = 211, kF8E5M2/kF8E4M3Fn = 204).
  • The GetResourceUsage opcode dispatch (matmul 289/295/301/307; matpush 324–327 with the ^0xB/|0x30/|0x32 GLM pre-transforms), the direct array[3]/array[8] read, and the per-dtype throughput numbers.
  • ComputeDmaLevels: the DmaLevel struct, the contiguity-break level-build loop, the opcode_produced_register_type ∈ {2,4} merge gate, and the fragment-count → efficiency-multiplier table.

| Item | Value |
|---|---|
| Class | GF/6acc60406 MxuLatencyTable (mis-symbolized; allocated by GfcCycleTable) |
| Object | 0xa0 B, at GfcCycleTable this+0x18; array<int,11> value type |
| Ctor | @0x1c8bb1c0: fills Matpush (+0x00), Matmul (+0x20), Vlxmr (+0x40), MatmulDataFormat→cost (+0x80) |
| Lookup | @0x1c8bdb20: direct array[3] (matmul) / array[8] (matpush); no remap |
| Insert helpers | matpush @0x1c8bc2e0, matmul @0x1c8bc540, vlxmr @0x1c8bc760; array[res & 0xf] = cy |
| CT dispatch | GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 |
| Resource count | MxuResource::kNumMxuResources = 11, CHECK-anchored (> 0xA → fatal) |
| Throughput (bf16 / fp8) | matmul array[3] = 4 / 8 · matpush array[8] = 2 / 4 |
| Base op latency | kBf16/kF32 = 211 · kF8E5M2/kF8E4M3Fn = 204 |
| ComputeDmaLevels | @0x1c86a9e0 · efficiency buckets {1.6, 1.3, 1.1, 1.05}, 1.0 default (also for product > 31) |

Object Layout and Build

Purpose

The GF table is built once, when GfcCycleTable is constructed. The cycle table allocates two heap objects: a 0x30-byte GhostlitePerformance table at this+0x10 (the per-opcode grid of Performance: GF) and the 0xa0-byte MxuLatencyTable at this+0x18. Each is filled by its own constructor; this page covers the second.

// GfcCycleTable::GfcCycleTable @0x1c89eec0
v4 = operator new(0x30u); sub_1C8D3740(v4);   // GhostlitePerformance grid → this+0x10
v6 = operator new(0xA0u); sub_1C8BB1C0(v6);   // MxuLatencyTable          → this+0x18

Structure

The constructor @0x1c8bb1c0 zero-inits five Abseil flat-hash-maps and a count word, then inserts every row:

MxuLatencyTable (0xa0 bytes, at GfcCycleTable + 0x18)
  this + 0x00   flat_hash_map<MatpushModifier,     array<int,11>>   ── latch / matprep
  this + 0x20   flat_hash_map<MatmulModifier,      array<int,11>>   ── matmul
  this + 0x40   flat_hash_map<VlxmrModifier,       array<int,11>>   ── vector-latch-into-MRB
  this + 0x60   (Matres map slot — NOT populated by the GF ctor)
  this + 0x80   flat_hash_map<MatmulDataFormat, int>                ── base op latency; count word = 1

NOTE — unlike the Ghostlite constructor (@0x1c8b2920, data-driven SetReservations<Modifier> + insert_range), the GF constructor writes the reservation arrays inline through three direct-write helpers, and it builds no separate Matres map (zero calls to a Matres helper). The +0x60 slot is zeroed and left empty. A reimplementation that mirrors the GL build path will emit a Matres map GF never has.

How a row is built

Each map value is an array<int,11> filled by array[resource & 0xf] = cycles for each (resource, cycles) pair, with the resource hard-bounded to kNumMxuResources:

// matpush insert @0x1c8bc2e0 / matmul insert @0x1c8bc540 (identical mechanism)
function insert(key, pairs[]):
    array<int,11> res_vector = {0};                 // zero-init all 11 slots
    for (resource, cycles) in pairs:
        if (resource > 0xA)                          // > 10 → fatal
            LogFatal("resource_index < to_underlying(MxuResource::kNumMxuResources)");
        res_vector[resource & 0xF] = cycles;
    target_map.try_emplace(key, res_vector);

The > 0xA bound is the 11-wide CHECK; the same < 11 guard reappears on the read side (below) at mxu_latency_table_gf.cc:415. Any resource a row does not name holds it for 0 cycles.
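The insert mechanism can be sketched in a few lines. This is an illustrative model, not the decompiled helper: the names `build_row` and `insert` are hypothetical, a Python dict stands in for the Abseil map, and `ValueError` stands in for the fatal CHECK.

```python
K_NUM_MXU_RESOURCES = 11  # MxuResource::kNumMxuResources


def build_row(pairs):
    """Return an 11-slot reservation row from (resource, cycles) pairs."""
    row = [0] * K_NUM_MXU_RESOURCES       # unnamed resources stay at 0 cycles
    for resource, cycles in pairs:
        if not 0 <= resource <= 0xA:      # stands in for the fatal > 0xA CHECK
            raise ValueError("resource_index < kNumMxuResources violated")
        row[resource & 0xF] = cycles      # same masked store as the helper
    return row


def insert(target_map, key, pairs):
    """try_emplace semantics: an already-present key keeps its first row."""
    target_map.setdefault(key, build_row(pairs))
```

Building the bf16 matmul row from the pairs (3, 4), (9, 3), (2, 16) yields a row whose slot 3 is 4 and whose seven unnamed slots are 0.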


The Matmul Reservation Rows

Purpose

The matmul family is keyed by a 32-bit MatmulModifier whose byte[0] is the MatmulDataFormat code (1, 2, 9, 0xa), byte[1] is the transpose flag, and byte[2] is a high-variant bit. The constructor inserts 16 keys; GetResourceUsage reads cell array[3] (the matmul-accumulate throughput port).

The rows (all 16 keys, cell-by-cell)

The reservation triple is {res3, res9, res2} for formats 1/2, and {res3, res9} for formats 9/0xa, byte-confirmed in the constructor at @0x1c8bb7c9..0x1c8bbb31.

| MatmulModifier key | format | res2 (issue stage) | res3 (thru) | res9 (acc latch) |
|---|---|---|---|---|
| 0x00000001 0x00000101 0x00010001 0x00010101 | 1 (bf16) | 16 | 4 | 3 |
| 0x00000002 0x00010002 | 2 (bf16-alt) | 20 | 8 | 7 |
| 0x00000102 0x00010102 | 2, transpose | 16 | 4 | 3 |
| 0x00000009 0x00010009 | 9 (F8E5M2) | - | 8 | 7 |
| 0x00000109 0x00010109 | 9, transpose | - | 2 | 1 |
| 0x0000000a 0x0001000a | 0xa (F8E4M3Fn) | - | 8 | 7 |
| 0x0000010a 0x0001010a | 0xa, transpose | - | 2 | 1 |

res3 is the value GetResourceUsage returns for a matmul; res2 is the larger issue-stage hold (16/20 cy) that gates the issue cursor; res9 is the secondary accumulate latch.

QUIRK — the transpose variants of the fp8 formats (...0109, ...010a) drop the matmul hold to {res3:2, res9:1}, the cheapest matmul row in the table: a transposed fp8 matmul reuses an already-loaded staging path and prices nearly free. The non-transpose fp8 rows carry the highest res3 in the table (8), tied with the format-2 non-transpose row.
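A sketch of the 16-key matmul map as data, assuming the key bytes are packed least-significant first (byte[0] in the low byte), which is how the listed hex keys read. The names `HOLDS`, `decode_matmul_key`, and `build_matmul_map` are illustrative only.

```python
def decode_matmul_key(key):
    """Split a 32-bit MatmulModifier into (format, transpose, variant)."""
    return key & 0xFF, (key >> 8) & 0xFF, (key >> 16) & 0xFF


# (format, transpose) -> {resource: cycles}; values copied from the rows above.
HOLDS = {
    (1, 0): {2: 16, 3: 4, 9: 3},
    (1, 1): {2: 16, 3: 4, 9: 3},
    (2, 0): {2: 20, 3: 8, 9: 7},
    (2, 1): {2: 16, 3: 4, 9: 3},
    (9, 0): {3: 8, 9: 7},
    (9, 1): {3: 2, 9: 1},
    (0xA, 0): {3: 8, 9: 7},
    (0xA, 1): {3: 2, 9: 1},
}


def build_matmul_map():
    """Expand HOLDS over both byte[2] variants into 16 array<int,11> rows."""
    rows = {}
    for (fmt, transpose), holds in HOLDS.items():
        for variant in (0, 1):
            key = fmt | (transpose << 8) | (variant << 16)
            row = [0] * 11
            for resource, cycles in holds.items():
                row[resource] = cycles
            rows[key] = row
    return rows
```

Reading `build_matmul_map()[0x00000001][3]` returns the bf16 throughput cell; the byte[2] variant does not change any cell in the rows listed here.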


The Matpush Reservation Rows

Purpose

The matpush (latch / matprep) family is keyed by a MatpushModifier whose low three bytes are {[0]=GainLatchModeToMatmulDataFormat(mode), [1]=transpose, [2]=1} and whose high byte carries the MSR (matrix-staging-register) variant (0x01 vs 0x03). The constructor inserts 16 keys; GetResourceUsage reads cell array[8].

The rows (all 16 keys)

The reservation widens with the format, falling into three value-sets — narrow {res5/4:1, res7/6:1, res8:2, res10:7}, mid {:3, :2, res8:4, res10:9}, wide {:7, :6, res8:8} — byte-confirmed at @0x1c8bb28d..0x1c8bb737.

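| MatpushModifier key | variant | staging A · B | res8 (thru) | res10 (latch) | width |
|---|---|---|---|---|---|
| 0x..010001 | v1/v3 | 1 · 1 | 2 | 7 | narrow |
| 0x..010101 | v1/v3, xpose | 3 · 2 | 4 | - | mid |
| 0x..010002 0x..010009 0x..01000a | v1/v3 | 3 · 2 | 4 | 9 | mid |
| 0x..010102 0x..010109 0x..01010a | v1/v3, xpose | 7 · 6 | 8 | - | wide |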

GOTCHA — the staging-port indices depend on the MSR variant the row is built for. The constructor selects res4/res6 for one MSR and res5/res7 for the other (a per-variant Unexpected MSR CHECK at mxu_latency_table_gf.cc:94 guards the selector). The throughput cell read by the lookup is always res8; the staging holds are what shift. Treating the staging indices as fixed across variants will mis-place a hold.
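The matpush key layout described in this section can be sketched as a packing function. The sketch assumes byte[0] is the least-significant byte and that the two MSR variant bytes are 0x01 and 0x03 as stated above; `pack_matpush_key` and `all_matpush_keys` are hypothetical names.

```python
def pack_matpush_key(fmt, transpose, msr):
    """Pack {byte0=fmt, byte1=transpose, byte2=1, byte3=msr} into 32 bits."""
    for name, value in (("fmt", fmt), ("transpose", transpose), ("msr", msr)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} does not fit in one byte")
    return fmt | (transpose << 8) | (1 << 16) | (msr << 24)


def all_matpush_keys():
    """Enumerate 4 formats x 2 transpose flags x 2 MSR variants."""
    return sorted(
        pack_matpush_key(fmt, transpose, msr)
        for fmt in (1, 2, 9, 0xA)
        for transpose in (0, 1)
        for msr in (0x01, 0x03)
    )
```

The enumeration produces 16 keys, matching the number of rows the constructor inserts. Which staging indices (res4/res6 or res5/res7) belong to which MSR variant is not modeled here.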


Throughput per Format — the Conv R[0]/R[1] Inputs

GetResourceUsage forces transpose to 0 when building the lookup key, so the convolution cost model reads the non-transpose rows. The GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 binds each cost-table Instruction (CT ordinal) to an LLO opcode, a resource index, and a transpose flag:

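| CT | LLO opcode | family | key fmt | res read | cycles |
|---|---|---|---|---|---|
| 0 | 289 (0x121) | matmul | 1 (bf16) | array[3] | 4 |
| 1 | 295 (0x127) | matmul | 2 | array[3] | 8 |
| 2 / 4 | 301 (0x12d) | matmul | 9 (F8E5M2) | array[3] | 8 |
| 3 | 307 (0x133) | matmul | 0xa (F8E4M3Fn) | array[3] | 8 |
| 5 | 324 (0x144) | matpush | 1 (bf16, direct GLM) | array[8] | 2 |
| 6 | 326 (0x146) | matpush | GLM ^0xB (transpose flip) | array[8] | per fmt |
| 7 / 9 | 327 (0x147) | matpush | GLM \|0x30 → fmt 9 | array[8] | 4 |
| 8 | 325 (0x145) | matpush | GLM \|0x32 → fmt 0xa | array[8] | 4 |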

The matpush transpose CTs (11–15) re-enter sub_1C8BDB20 as sub_1C8BDB20(a1, op, opcode, 8, 1) — resource 8, latch_mode = 1 — selecting the transposed (wide) value-set (res8 = 8). The CT→opcode binding for the matmul CTs (0–4) is inferred from the resource index (3) and transpose (0) the helper passes plus the opcode the dispatch consumes; the matpush opcodes (324–327) are byte-confirmed in both the helper and sub_1C8BDB20.

6acc60406 vs VIPERFISH — 6acc60406 thru(matmul) bf16 = 4, fp8 = 8; thru(matpush) bf16 = 2, fp8 = 4. Viperfish (per VF) is bf16 = 8 / int8 = 32 for matmul and 2 / 8 for matpush. The 6acc60406 matmul hold is half VF's — the 256×256 systolic array doubles the per-cycle MAC rate, so the same op occupies the accumulate port for fewer cycles. The matpush narrow hold is unchanged (2), and the wide hold (8) equals the VF int8-wide cell.

The base op latency is a separate MatmulDataFormat → int map at this+0x80, named by the constructor's four try_emplace CHECK strings:

| MatmulDataFormat | base latency |
|---|---|
| kF32, kBf16 | 211 |
| kF8E5M2, kF8E4M3Fn | 204 |

These are the v7 per-format op latencies; the v6e (Ghostlite) constructor stores 192/182 for the same formats, the clearest single number proving the two tables are distinct instances.


The Lookup

Algorithm

GetResourceUsage @0x1c8bdb20 (mis-symbolized as raw_hash_map::at) is the read path. It bounds-checks the resource, dispatches by opcode to a family, builds the key, finds the row, and reads the cell:

function GF_MxuLatencyTable_GetResourceUsage(out, this, instr, res, latch_mode):  // @0x1c8bdb20
    if res > 0xA:                                      // mxu_latency_table_gf.cc:415
        LogFatal("mxu_resource_idx < to_underlying(MxuResource::kNumMxuResources)")
    switch (instr.opcode):
        case 289: key = MatmulModifier{fmt=1};  map = this+0x20      // matmul
        case 295: key = MatmulModifier{fmt=2};  map = this+0x20
        case 301: key = MatmulModifier{fmt=9};  map = this+0x20
        case 307: key = MatmulModifier{fmt=10}; map = this+0x20
        case 324: key = MatpushKey(latch_mode);          map = this+0x00  // matpush
        case 325: key = MatpushKey(latch_mode | 0x32);   map = this+0x00  // → fp8-fnuz (F8E4M3Fn)
        case 326: key = MatpushKey(latch_mode ^ 0x0B);   map = this+0x00  // transpose flip
        case 327: key = MatpushKey(latch_mode | 0x30);   map = this+0x00  // → fmt 9
        default:  return MakeErrorImpl("Unsupported opcode: ...")   // :455
    entry = map.find(key)
    if not entry: throw out_of_range                   // raw_hash_map::at
    array<int,11> res_vector = entry.value             // vmovups: [rdx+8] matmul / [rdx+4] matpush
    out = res_vector[res]                              // DIRECT — no remap

The matmul key is {fmt, 0, 0, 0} (transpose forced to 0). The matpush key is assembled from the three shared helpers GainLatchModeToMatmulDataFormat @0x1d629260 (byte[0]), LatchModeIsTranspose @0x1d628ea0 (byte[1]), and LatchOpcodeToMsr(0x91) @0x1c8a1300 (byte[3]); byte[2] is the constant 1. For opcode 0x91, LatchOpcodeToMsr returns 0.

CONTRAST WITH GHOSTLITE — the GF read is array[res] for the resource the CT helper passes (3 for matmul, 8 for matpush) with no transform. The Ghostlite lookup @0x1c8b7560 instead seeds defaults (res4 → idx3, res9 → idx9) before reading. The two generations share the constructor shape and the key helpers, but the read indices and the seed remap differ — bind them per-gen. This is the res-remap 3/8 distinction: GF is direct on indices 3 and 8.

The matpush vmovups [rdx+4] vs matmul vmovups [rdx+8] reflects the differing key widths (MatpushModifier precedes its value by 4 bytes, MatmulModifier by 8). The whole array<int,11> is copied to the stack before indexing.
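The read path can be modeled as follows. This is a sketch under stated assumptions: Python dicts stand in for the two maps, `KeyError` stands in for the out_of_range throw, and `matpush_key` is a caller-supplied stand-in for the three key helpers, whose internals are not reproduced on this page.

```python
# Matmul opcodes map straight to the format byte of the key.
MATMUL_FORMAT_BY_OPCODE = {289: 1, 295: 2, 301: 9, 307: 10}

# Matpush opcodes apply a GLM pre-transform before the key is built.
GLM_PRETRANSFORM = {
    324: lambda mode: mode,          # direct
    325: lambda mode: mode | 0x32,   # -> fmt 0xa
    326: lambda mode: mode ^ 0x0B,   # transpose flip
    327: lambda mode: mode | 0x30,   # -> fmt 9
}


def get_resource_usage(matmul_map, matpush_map, matpush_key,
                       opcode, res, latch_mode=0):
    """Bounds-check, dispatch by opcode, find the row, read row[res] directly."""
    if not 0 <= res <= 0xA:
        raise ValueError("mxu_resource_idx < kNumMxuResources violated")
    if opcode in MATMUL_FORMAT_BY_OPCODE:
        key = MATMUL_FORMAT_BY_OPCODE[opcode]   # {fmt, 0, 0, 0}: transpose forced to 0
        row = matmul_map[key]
    elif opcode in GLM_PRETRANSFORM:
        key = matpush_key(GLM_PRETRANSFORM[opcode](latch_mode))
        row = matpush_map[key]
    else:
        raise ValueError(f"Unsupported opcode: {opcode}")
    return list(row)[res]                       # copy the row, then index; no remap
```

The CT helper would call this with `res=3` for matmul opcodes and `res=8` for matpush opcodes; no index transform is applied between the argument and the cell that is read.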

Vlxmr Rows

The Vlxmr map at this+0x40 carries two rows, inserted by sub_1C8BC760(0, &unk_B43CD6C, …) and sub_1C8BC760(257, &unk_B43CD74, …) (keys 0x0 and 0x101, byte-confirmed in the ctor). The insert helper iterates an (int resource, int cycles) pair stream, applying the same resource < 11 bound (CHECK at mxu_latency_table_gf.cc:50). The first pair at @0xb43cd6c reads {res0:2} (byte-confirmed). The full per-row pair set for key 0x101 is UNVERIFIED here — the iterator bounds are not statically pinned from rodata alone; see MXU Latency: GL (Ghostlite) for the sibling Vlxmr shape.

Function Map

| Function | Address | Role |
|---|---|---|
| GfcCycleTable::GfcCycleTable | 0x1c89eec0 | allocs 0x30 perf + 0xa0 MxuLatency |
| GF MxuLatencyTable ctor | 0x1c8bb1c0 | fills 16 matpush + 16 matmul + 2 vlxmr + cost map |
| GF MxuLatencyTable::GetResourceUsage | 0x1c8bdb20 | opcode dispatch + find + direct array[res] |
| matpush insert helper | 0x1c8bc2e0 | array[res & 0xf] = cy, < 11 bound |
| matmul insert helper | 0x1c8bc540 | as above, MatmulModifier map |
| vlxmr insert helper | 0x1c8bc760 | VlxmrModifier rows from rodata pairs |
| GfcCycleTable::GetCyclesForThroughputHelper | 0x1c89f400 | CT → opcode/res/transpose dispatch |
| GainLatchModeToMatmulDataFormat | 0x1d629260 | matpush key byte[0] |
| LatchModeIsTranspose | 0x1d628ea0 | matpush key byte[1] |
| LatchOpcodeToMsr | 0x1c8a1300 | matpush key byte[3]; (0x91) → 0 |

ComputeDmaLevels — the Fragment Count Bound to the Matmul Cost

Purpose

The convolution / windowed-transfer cost path multiplies its bandwidth term by a DMA-efficiency factor whose input is the number of descriptor fragments a window decomposes into. windowing_util::ComputeDmaLevels @0x1c86a9e0 computes that decomposition. The result feeds WindowCyclesGenericTargetAgnostic @0x14552180, which turns the fragment count into the efficiency multiplier.

The DmaLevel struct and the level-build loop

ComputeDmaLevels returns a std::vector<DmaLevel>, each element 24 bytes:

struct DmaLevel {                  // 24 B; vector stride lea[rax+rax*2]<<3
    long axis;                     // +0x00  first window-axis index of this level
    long count;                    // +0x08  init 1, then count *= window_stride[axis] per merged axis
    long flags;                    // +0x10  zero-init; level marker, not read by the cost path
};
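The 24-byte layout and the lea/shift stride can be checked with a short sketch. It assumes `long` is 8 bytes and little-endian (the usual x86-64 LP64 case); the helper names are illustrative.

```python
import struct

# axis, count, flags: three 8-byte signed fields, no padding.
DMA_LEVEL = struct.Struct("<qqq")


def element_offset(index):
    """Byte offset of element `index`, computed as lea [rax+rax*2] then << 3."""
    return (index + index * 2) << 3   # (i * 3) * 8 == i * 24


def pack_levels(levels):
    """Serialize (axis, count) pairs as consecutive DmaLevel records."""
    return b"".join(DMA_LEVEL.pack(axis, count, 0) for axis, count in levels)


def read_level(buffer, index):
    """Read back (axis, count, flags) for the record at `index`."""
    return DMA_LEVEL.unpack_from(buffer, element_offset(index))
```

The struct size and the computed stride agree at 24 bytes, so indexing a packed vector with `element_offset` lands on record boundaries.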

The function first CHECKs that the window's per-axis inlined-vectors (window_bounds, pad_high, pad_low, and the dynamic-base second array) all have the same rank, then trims the contiguous minor dim when the window's +0x338 config scalar is ≥ 2 (the row-major innermost run is not counted as its own level):

rank = window_bounds.size()
CHECK(base_bounds.size() == rank && pad_high.size() == rank && pad_low.size() == rank)
N = rank - (WD[+0x338] >= 2 ? 1 : 0)                 // minor-dim trim

The level-build loop starts a new DmaLevel at each axis and merges following axes into it while all contiguity conditions hold:

for axis in 0 .. N:                                  // outer: start a new level
    level = {axis, count=1, flags=0}
    for r in axis+1 .. N:                            // inner: try to merge axis r
        if elemental_stride[r] != 1: break
        operand = stride_level_operand[r]            // WD + 0xe0 [r], an LloValue*
        if operand:
            if operand.opcode >= 0x1cd: fatal()
            rt = opcode_produced_register_type[operand.opcode]
            if rt != 2 && rt != 4: fatal()           // ProducesSreg() || ProducesVreg()
            contiguous = llo::KnownEq(window_stride[r], operand)
        else:
            contiguous = (window_stride[r] == base_bounds[r])
        if not contiguous: break
        if pad_low[r] != 0 || dilation[r] != 0: break
        level.count *= window_stride[r]              // "access_multiplier"
    emit(level)                                      // a finalized contiguity break

NOTE — the +0xe0 per-axis second array (stride_level_operand) is the DMA-level descriptor operand, distinct from window_stride (which drives the byte count). Its register class gates the merge: the LLO value must produce a Scalar (opcode_produced_register_type == 2, ProducesSreg()) or a Vector (== 4, ProducesVreg()) — a predicate (1), a mask (3), or a no-result op (0) fails the gate and forces a contiguity break. The CHECK at llo_util.h:35 (value->ProducesSreg() || value->ProducesVreg()) names this exactly. The canonical conv stride operands (kScalarAddressCalculation @0x86 etc.) are all Scalar, hence mergeable; a mask-typed stride operand fragments the DMA.

The number of emitted DmaLevels equals the number of contiguity breaks; the product of their count fields is the fragment count.

The fragment count → efficiency multiplier

WindowCyclesGenericTargetAgnostic @0x14552180 reads the level vector and buckets the fragment product v18 (the running access_multiplier from the merge loop), byte-exact:

| fragment product v18 | multiplier | rodata |
|---|---|---|
| ≤ 1 level (no fragmentation) | 1.0 | @0xa2df230 |
| == 1 | 1.6 | @0xa2d71a0 |
| ∈ [2, 3] | 1.3 | @0xa2d71a8 |
| ∈ [4, 7] | 1.1 | @0xa2df4c8 |
| ∈ [8, 31] | 1.05 | @0xa2df2a8 |
| > 31 | 1.0 (falls through to default) | @0xa2df230 |

All five doubles read here are byte-confirmed in .rodata: 1.0 (@0xa2df230), 1.6 (@0xa2d71a0), 1.3 (@0xa2d71a8), 1.1 (@0xa2df4c8), 1.05 (@0xa2df2a8). The {1.6, 1.3} pair is a 2-entry table at @0xa2d71a0 indexed by setge(v18 ≥ 2); 1.1 and 1.05 are standalone constants. The bandwidth deposit returned is (byte_count · elem_size / bytes_per_cycle) · multiplier + residual.

GOTCHA — A fragment product > 31 receives no penalty multiplier — that branch (if (v18 > 31)) jumps past the bucket table and keeps the 1.0 default. The 2003.0 constant at @0xa2df1b0 (byte-confirmed) is not a fragment-count bucket: it is selected at function entry by an element-type predicate (window descriptor operand present and (*(int16*)(op+11) & 0x7C) == 0x10), independent of ComputeDmaLevels, and it scales the bytes_per_cycle divisor, not the fragment multiplier. Do not bind 2003.0 to the fragment count.

QUIRK — a single contiguous transfer (≤ 1 level) is multiplier 1.0, but a fragment product of exactly 1 (one level whose merged element-count product is 1) is 1.6. The two are distinct paths: the ≤ 1-level default (v60 <= 1) short-circuits before the bucket table; the v18 == 1 bucket is reached only when the level vector is non-trivial. A reimplementation must keep both.
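The bucketing, including the two distinct 1.0 paths, can be sketched as below. The function name is hypothetical, and rejecting a product below 1 is an assumption of this sketch: the page does not describe that case.

```python
def dma_efficiency_multiplier(num_levels, fragment_product):
    """Map (level count, fragment product v18) to the efficiency multiplier."""
    if num_levels <= 1:
        return 1.0                    # single contiguous transfer: short-circuit
    if fragment_product < 1:
        raise ValueError("fragment product below 1 is not covered here")
    if fragment_product > 31:
        return 1.0                    # jumps past the bucket table
    if fragment_product >= 8:
        return 1.05
    if fragment_product >= 4:
        return 1.1
    # 2-entry table {1.6, 1.3}, indexed by (v18 >= 2)
    return (1.6, 1.3)[fragment_product >= 2]
```

Note that `dma_efficiency_multiplier(1, 1)` and `dma_efficiency_multiplier(2, 1)` differ (1.0 versus 1.6), which is the distinction the QUIRK above calls out.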


Worked Example — 6acc60406 bf16 Conv Deposit

For a 6acc60406 bf16 convolution, the per-format MXU reservation reads are now numeric:

  • R[0] matpush = matpush_count · thru(CT5)[bf16] = matpush_count · array[8] = matpush_count · 2.
  • R[1] matmul = op_count · thru(CT0)[bf16] · 0.5 / Target / EPF = op_count · array[3] · 0.5 / … = op_count · 4 · 0.5 / ….
  • R[2] Xlu = 4 · ChunksPerTile · rem (the GhostlitePerformance grid cell, Performance: GF).

For fp8 (F8E5M2/F8E4M3Fn), the throughput cells are 4 for matpush (the R[0] input) and 8 for matmul (the R[1] input), double the bf16 hold. If the conv input window fragments into 2 DMA levels with a per-level element product of 3 (a 3-tap kernel axis with non-unit stride), ComputeDmaLevels returns 2 levels, v18 = 3, and the bandwidth deposit is multiplied by 1.3. A fully contiguous load multiplies by 1.0, and a fragment product > 31 also falls back to 1.0: there is no fragment-count penalty in that case, and the steering away from heavily strided layouts comes from the larger byte count, not a multiplier.
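The arithmetic of this example can be sketched as follows. The throughput cells and the deposit formula come from the text above; the byte count, element size, and bytes-per-cycle values in the usage note are made-up inputs, and R[1] is left out because its divisor is elided on this page.

```python
# Throughput cells per dtype, as read from array[8] (matpush) and array[3] (matmul).
THRU = {
    "bf16": {"matpush": 2, "matmul": 4},
    "fp8": {"matpush": 4, "matmul": 8},
}


def r0_matpush(matpush_count, dtype):
    """R[0] = matpush_count * thru(matpush)[dtype]."""
    return matpush_count * THRU[dtype]["matpush"]


def bandwidth_deposit(byte_count, elem_size, bytes_per_cycle,
                      multiplier, residual=0.0):
    """(byte_count * elem_size / bytes_per_cycle) * multiplier + residual."""
    return byte_count * elem_size / bytes_per_cycle * multiplier + residual
```

With 10 matpushes, `r0_matpush(10, "bf16")` is 20 and the fp8 figure is 40. For a hypothetical 1024-unit transfer with `elem_size=2` and `bytes_per_cycle=64`, the 1.3 multiplier from the v18 = 3 case raises the deposit from 32 to about 41.6 cycles.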


Cross-References

| Name | Relationship |
|---|---|
| mxu-latency-overview | the shared model, key scheme, and GetResourceUsage read path |
| mxu-latency-gl | the Ghostlite array<int,11>: same shape, distinct ctor, 192/182 base latency |
| mxu-latency-vf | the Viperfish array<int,19>: double the matmul hold |
| matmul-mode-modifiers | the modifier ordinals, format codes, GLM transforms |
| performance-gf-ghperf | the GF grid carrying the conv R[2] Xlu cell and the same per-dtype magnitudes |
| window-description-cost | the windowed-transfer cost path that consumes ComputeDmaLevels |