MXU Latency: GF (6acc60406)
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped. Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset; .data VMA − 0x400000 == file offset. All cell integers were read directly from the GF MxuLatencyTable constructor.
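The section map above can be turned into a small address helper. This is a minimal sketch, assuming the caller already knows which section an address belongs to; the names `SECTION_DELTA` and `vma_to_file_offset` are illustrative and do not come from the binary.

```python
# Deltas (VMA minus file offset) copied from the section map on this page.
SECTION_DELTA = {
    ".text": 0x0,
    ".rodata": 0x0,
    ".data.rel.ro": 0x200000,
    ".data": 0x400000,
}


def vma_to_file_offset(section: str, vma: int) -> int:
    """Convert a virtual address to a file offset for one mapped section."""
    if section not in SECTION_DELTA:
        raise KeyError(f"no delta recorded for section {section!r}")
    offset = vma - SECTION_DELTA[section]
    if offset < 0:
        raise ValueError("address lies below the section's load delta")
    return offset


# The GF constructor lives in .text, so its VMA is also its file offset.
print(hex(vma_to_file_offset(".text", 0x1C8BB1C0)))
```

The helper deliberately does not guess the section from the address, because the page gives only the deltas and not the section bounds.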
Abstract
This page dumps the 6acc60406 (v7, "GF") MxuLatencyTable: the per-modifier × MxuResource reservation matrix that prices how long each MXU op holds each internal sub-unit. It is the 6acc60406 sibling of the Viperfish array<int,19> and Ghostlite array<int,11> matrices, read by the byte-identical model documented in MXU Latency Overview. 6acc60406 and Ghostlite share the shape (array<int,11>, the same modifier key types, the same GainLatchMode key helpers) but not the instance: the two generations are built by distinct constructors (GlcCycleTable vs GfcCycleTable), carry distinct base op-latency costs, and read the cells through distinct accessors. Everything below is the GF instance, allocated by GfcCycleTable and built by the constructor whose CHECK strings name mxu_latency_table_gf.cc.
Two facts make the 6acc60406 numbers worth pinning. First, on a 256×256 systolic array the per-matmul hold is half Viperfish's: bf16 matmul holds the accumulate port for 4 cycles (VF: 8), so a back-to-back bf16 matmul stream pipelines at twice the VF issue rate. Second, the GF lookup reads the reservation array directly by MxuResource index — array[3] for matmul throughput, array[8] for matpush throughput — with no Ghostlite-style default-seed remap. The lookup also carries an fp8-fnuz (F8E4M3Fn, F8E5M2) GLM transform path that routes the fp8 latch modes into their own reservation buckets.
The page closes with windowing_util::ComputeDmaLevels, decompiled in full, because the 6acc60406 matmul cost is multiplied by a DMA-efficiency factor whose input is the descriptor-fragment count this function computes. The fragment count buckets into one of {1.6, 1.3, 1.1, 1.05} (or 1.0 for a single contiguous transfer or a product > 31), and the bucketing is byte-exact.
For reimplementation, the contract is:
- The GF MxuLatencyTable object layout (0xa0 bytes, maps at this+0x00/0x20/0x40/0x60/0x80) and the array[res & 0xf] = cycles insert mechanism with the < 11 kNumMxuResources bound.
- The 16 matmul and 16 matpush reservation rows, cell-by-cell, plus the two Vlxmr rows and the MatmulDataFormat → base-latency cost map (kF32/kBf16 = 211, kF8E5M2/kF8E4M3Fn = 204).
- The GetResourceUsage opcode dispatch (matmul 289/295/301/307; matpush 324–327 with the ^0xB / |0x30 / |0x32 GLM pre-transforms), the direct array[3]/array[8] read, and the per-dtype throughput numbers.
- ComputeDmaLevels: the DmaLevel struct, the contiguity-break level-build loop, the opcode_produced_register_type ∈ {2,4} merge gate, and the fragment-count → efficiency-multiplier table.
| Item | Value |
|---|---|
| Class | GF/6acc60406 MxuLatencyTable (mis-symbolized; allocated by GfcCycleTable) |
| Object | 0xa0 B, at GfcCycleTable this+0x18; array<int,11> value type |
| Ctor | @0x1c8bb1c0, fills Matpush (+0x00), Matmul (+0x20), Vlxmr (+0x40), MatmulDataFormat→cost (+0x80) |
| Lookup | @0x1c8bdb20, direct array[3] (matmul) / array[8] (matpush); no remap |
| Insert helpers | matpush @0x1c8bc2e0, matmul @0x1c8bc540, vlxmr @0x1c8bc760; array[res & 0xf] = cy |
| CT dispatch | GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 |
| Resource count | MxuResource::kNumMxuResources = 11, CHECK-anchored (> 0xA → fatal) |
| Throughput (bf16 / fp8) | matmul array[3] = 4 / 8 · matpush array[8] = 2 / 4 |
| Base op latency | kBf16/kF32 = 211 · kF8E5M2/kF8E4M3Fn = 204 |
| ComputeDmaLevels | @0x1c86a9e0 · efficiency buckets {1.6, 1.3, 1.1, 1.05}, 1.0 default (also for product > 31) |
Object Layout and Build
Purpose
The GF table is built once, when GfcCycleTable is constructed. The cycle table allocates two heap objects: a 0x30-byte GhostlitePerformance table at this+0x10 (the per-opcode grid of Performance: GF) and the 0xa0-byte MxuLatencyTable at this+0x18. Both are filled by their own constructor; this page covers the second.
// GfcCycleTable::GfcCycleTable @0x1c89eec0
v4 = operator new(0x30u); sub_1C8D3740(v4); // GhostlitePerformance grid → this+0x10
v6 = operator new(0xA0u); sub_1C8BB1C0(v6); // MxuLatencyTable → this+0x18
Structure
The constructor @0x1c8bb1c0 zero-inits five Abseil flat-hash-maps and a count word, then inserts every row:
MxuLatencyTable (0xa0 bytes, at GfcCycleTable + 0x18)
  this + 0x00  flat_hash_map<MatpushModifier, array<int,11>>  ── latch / matprep
  this + 0x20  flat_hash_map<MatmulModifier, array<int,11>>   ── matmul
  this + 0x40  flat_hash_map<VlxmrModifier, array<int,11>>    ── vector-latch-into-MRB
  this + 0x60  (Matres map slot, NOT populated by the GF ctor)
  this + 0x80  flat_hash_map<MatmulDataFormat, int>           ── base op latency; count word = 1
NOTE: unlike the Ghostlite constructor (@0x1c8b2920, data-driven SetReservations<Modifier> + insert_range), the GF constructor writes the reservation arrays inline through three direct-write helpers, and it builds no separate Matres map (zero calls to a Matres helper). The +0x60 slot is zeroed and left empty. A reimplementation that mirrors the GL build path will emit a Matres map GF never has.
How a row is built
Each map value is an array<int,11> filled by array[resource & 0xf] = cycles for each (resource, cycles) pair, with the resource hard-bounded to kNumMxuResources:
// matpush insert @0x1c8bc2e0 / matmul insert @0x1c8bc540 (identical mechanism)
function insert(key, pairs[]):
    array<int,11> res_vector = {0};        // zero-init all 11 slots
    for (resource, cycles) in pairs:
        if (resource > 0xA)                // > 10 → fatal
            LogFatal("resource_index < to_underlying(MxuResource::kNumMxuResources)");
        res_vector[resource & 0xF] = cycles;
    target_map.try_emplace(key, res_vector);
The > 0xA bound is the 11-wide CHECK; the same < 11 guard reappears on the read side (below) at mxu_latency_table_gf.cc:415. Any resource a row does not name holds it for 0 cycles.
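The insert mechanism can be mirrored in a few lines of Python. This is a sketch under stated assumptions: a plain `dict` stands in for the Abseil map, `ValueError` stands in for the fatal CHECK, and the names `build_row` and `insert` are illustrative.

```python
K_NUM_MXU_RESOURCES = 11  # MxuResource::kNumMxuResources


def build_row(pairs):
    """Return an 11-slot reservation row from (resource, cycles) pairs."""
    row = [0] * K_NUM_MXU_RESOURCES  # unnamed resources stay at 0 cycles
    for resource, cycles in pairs:
        # The binary checks resource > 0xA; negatives are rejected here too
        # because Python integers are signed (an assumption of this sketch).
        if not 0 <= resource <= 0xA:
            raise ValueError(
                "resource_index < to_underlying(MxuResource::kNumMxuResources)"
            )
        row[resource & 0xF] = cycles
    return row


def insert(target_map, key, pairs):
    """try_emplace semantics: an existing key keeps its current row."""
    target_map.setdefault(key, build_row(pairs))


matmul_map = {}
insert(matmul_map, 0x00000001, [(3, 4), (9, 3), (2, 16)])
print(matmul_map[0x00000001])  # [0, 0, 16, 4, 0, 0, 0, 0, 0, 3, 0]
```

The `& 0xF` mask is kept only to match the decompiled store; after the bound check it never changes the index.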
The Matmul Reservation Rows
Purpose
The matmul family is keyed by a 32-bit MatmulModifier whose byte[0] is the MatmulDataFormat code (1, 2, 9, 0xa), byte[1] is the transpose flag, and byte[2] is a high-variant bit. The constructor inserts 16 keys; GetResourceUsage reads cell array[3] (the matmul-accumulate throughput port).
The rows (all 16 keys, cell-by-cell)
The reservation triple is {res3, res9, res2} for formats 1/2, and {res3, res9} for formats 9/0xa, byte-confirmed in the constructor at @0x1c8bb7c9..0x1c8bbb31.
| MatmulModifier key | format | res2 (issue stage) | res3 (thru) | res9 (acc latch) |
|---|---|---|---|---|
| 0x00000001 0x00000101 0x00010001 0x00010101 | 1 (bf16) | 16 | 4 | 3 |
| 0x00000002 0x00010002 | 2 (bf16-alt) | 20 | 8 | 7 |
| 0x00000102 0x00010102 | 2, transpose | 16 | 4 | 3 |
| 0x00000009 0x00010009 | 9 (F8E5M2) | — | 8 | 7 |
| 0x00000109 0x00010109 | 9, transpose | — | 2 | 1 |
| 0x0000000a 0x0001000a | 0xa (F8E4M3Fn) | — | 8 | 7 |
| 0x0000010a 0x0001010a | 0xa, transpose | — | 2 | 1 |
res3 is the value GetResourceUsage returns for a matmul; res2 is the larger issue-stage hold (16/20 cy) that gates the issue cursor; res9 is the secondary accumulate latch.
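The key layout and the table can be cross-checked by regenerating all 16 keys from their three bytes. The sketch below is illustrative: `matmul_key`, `MATMUL_CELLS` and `build_matmul_rows` are names invented here, and the cell values are transcribed from the table above.

```python
def matmul_key(fmt, transpose=0, high_variant=0):
    """Pack byte[0]=format, byte[1]=transpose, byte[2]=high-variant bit."""
    return (fmt & 0xFF) | ((transpose & 0xFF) << 8) | ((high_variant & 0xFF) << 16)


# (res2, res3, res9) per (format, transpose); None means the row does not name res2.
MATMUL_CELLS = {
    (1, 0): (16, 4, 3), (1, 1): (16, 4, 3),
    (2, 0): (20, 8, 7), (2, 1): (16, 4, 3),
    (9, 0): (None, 8, 7), (9, 1): (None, 2, 1),
    (0xA, 0): (None, 8, 7), (0xA, 1): (None, 2, 1),
}


def build_matmul_rows():
    """Expand the 8 (format, transpose) rows over both high-variant values."""
    rows = {}
    for (fmt, transpose), (res2, res3, res9) in MATMUL_CELLS.items():
        for high_variant in (0, 1):
            row = [0] * 11
            if res2 is not None:
                row[2] = res2
            row[3] = res3
            row[9] = res9
            rows[matmul_key(fmt, transpose, high_variant)] = row
    return rows


rows = build_matmul_rows()
print(len(rows), rows[0x00000001][3], rows[0x0000010A][3])  # 16 4 2
```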
QUIRK: the transpose variants of the fp8 formats (...0109, ...010a) drop the matmul hold to {res3:2, res9:1}, the cheapest matmul row in the table: a transposed fp8 matmul reuses an already-loaded staging path and prices nearly free. The non-transpose fp8 rows carry the largest res3 in the table (8), tied with the non-transpose format-2 rows.
The Matpush Reservation Rows
Purpose
The matpush (latch / matprep) family is keyed by a MatpushModifier whose low three bytes are {[0]=GainLatchModeToMatmulDataFormat(mode), [1]=transpose, [2]=1} and whose high byte carries the MSR (matrix-staging-register) variant (0x01 vs 0x03). The constructor inserts 16 keys; GetResourceUsage reads cell array[8].
The rows (all 16 keys)
The reservation widens with the format, falling into three value-sets — narrow {res5/4:1, res7/6:1, res8:2, res10:7}, mid {:3, :2, res8:4, res10:9}, wide {:7, :6, res8:8} — byte-confirmed at @0x1c8bb28d..0x1c8bb737.
| MatpushModifier key | variant | staging A · B | res8 (thru) | res10 (latch) | width |
|---|---|---|---|---|---|
| 0x..010001 | v1/v3 | 1 · 1 | 2 | 7 | narrow |
| 0x..010101 | v1/v3, xpose | 3 · 2 | 4 | — | mid |
| 0x..010002 0x..010009 0x..01000a | v1/v3 | 3 · 2 | 4 | 9 | mid |
| 0x..010102 0x..010109 0x..01010a | v1/v3, xpose | 7 · 6 | 8 | — | wide |
GOTCHA: the staging-port indices depend on the MSR variant the row is built for. The constructor selects res4/res6 for one MSR and res5/res7 for the other (a per-variant Unexpected MSR CHECK at mxu_latency_table_gf.cc:94 guards the selector). The throughput cell read by the lookup is always res8; the staging holds are what shift. Treating the staging indices as fixed across variants will mis-place a hold.
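A short sketch of the key packing and the res8 cell may help when checking keys against the table. The function names are hypothetical. The staging cells are left out on purpose, since the text does not pin which MSR variant takes res4/res6 and which takes res5/res7.

```python
def matpush_key(fmt, transpose, msr):
    """Pack byte[0]=format, byte[1]=transpose, byte[2]=1, byte[3]=MSR variant."""
    return (fmt & 0xFF) | ((transpose & 0xFF) << 8) | (1 << 16) | ((msr & 0xFF) << 24)


def decode_matpush_key(key):
    """Split a 32-bit key back into its four bytes."""
    return {
        "fmt": key & 0xFF,
        "transpose": (key >> 8) & 0xFF,
        "const": (key >> 16) & 0xFF,
        "msr": (key >> 24) & 0xFF,
    }


def res8_throughput(fmt, transpose):
    """res8 cell transcribed from the table: narrow 2, mid 4, wide 8."""
    if fmt == 1:
        return 4 if transpose else 2
    if fmt in (2, 9, 0xA):
        return 8 if transpose else 4
    raise KeyError(f"format {fmt:#x} has no matpush row")


key = matpush_key(0xA, 1, 0x03)
print(hex(key), decode_matpush_key(key), res8_throughput(0xA, 1))
```

Enumerating four formats, two transpose values and the two MSR variants (0x01, 0x03) gives the 16 keys the constructor inserts.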
Throughput per Format — the Conv R[0]/R[1] Inputs
GetResourceUsage forces transpose to 0 when building the lookup key, so the convolution cost model reads the non-transpose rows. The GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 binds each cost-table Instruction (CT ordinal) to an LLO opcode, a resource index, and a transpose flag:
| CT | LLO opcode | family | key fmt | res read | cycles |
|---|---|---|---|---|---|
| 0 | 289 (0x121) | matmul | 1 (bf16) | array[3] | 4 |
| 1 | 295 (0x127) | matmul | 2 | array[3] | 8 |
| 2 / 4 | 301 (0x12d) | matmul | 9 (F8E5M2) | array[3] | 8 |
| 3 | 307 (0x133) | matmul | 0xa (F8E4M3Fn) | array[3] | 8 |
| 5 | 324 (0x144) | matpush | 1 (bf16, direct GLM) | array[8] | 2 |
| 6 | 326 (0x146) | matpush | GLM ^0xB (transpose flip) | array[8] | per fmt |
| 7 / 9 | 327 (0x147) | matpush | GLM \|0x30 → fmt 9 | array[8] | 4 |
| 8 | 325 (0x145) | matpush | GLM \|0x32 → fmt 0xa | array[8] | 4 |
The matpush transpose CTs (11–15) re-enter sub_1C8BDB20 as sub_1C8BDB20(a1, op, opcode, 8, 1) — resource 8, latch_mode = 1 — selecting the transposed (wide) value-set (res8 = 8). The CT→opcode binding for the matmul CTs (0–4) is inferred from the resource index (3) and transpose (0) the helper passes plus the opcode the dispatch consumes; the matpush opcodes (324–327) are byte-confirmed in both the helper and sub_1C8BDB20.
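The non-transpose bindings in the table above fit in a small lookup. This is an illustrative transcription, not the helper's actual structure: the tuple layout and the name `CT_BINDINGS` are assumptions, and CT 6 is left without a number because the table lists it as per-format.

```python
# CT ordinal -> (LLO opcode, family, resource index, cycles or None)
CT_BINDINGS = {
    0: (289, "matmul", 3, 4),
    1: (295, "matmul", 3, 8),
    2: (301, "matmul", 3, 8),
    3: (307, "matmul", 3, 8),
    4: (301, "matmul", 3, 8),
    5: (324, "matpush", 8, 2),
    6: (326, "matpush", 8, None),  # depends on the format after GLM ^ 0xB
    7: (327, "matpush", 8, 4),
    8: (325, "matpush", 8, 4),
    9: (327, "matpush", 8, 4),
}


def throughput_cycles(ct):
    """Return the tabulated cycles for a CT ordinal, if the table fixes them."""
    opcode, family, res, cycles = CT_BINDINGS[ct]
    if cycles is None:
        raise ValueError(f"CT {ct} (opcode {opcode}) is format-dependent")
    return cycles


print([throughput_cycles(ct) for ct in (0, 1, 5, 8)])  # [4, 8, 2, 4]
```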
6acc60406 vs VIPERFISH — 6acc60406 thru(matmul) bf16 = 4, fp8 = 8; thru(matpush) bf16 = 2, fp8 = 4. Viperfish (per VF) is bf16 = 8 / int8 = 32 for matmul and 2 / 8 for matpush. The 6acc60406 matmul hold is half VF's — the 256×256 systolic array doubles the per-cycle MAC rate, so the same op occupies the accumulate port for fewer cycles. The matpush narrow hold is unchanged (2), and the wide hold (8) equals the VF int8-wide cell.
The base op latency is a separate MatmulDataFormat → int map at this+0x80, named by the constructor's four try_emplace CHECK strings:
| MatmulDataFormat | base latency |
|---|---|
| kF32, kBf16 | 211 |
| kF8E5M2, kF8E4M3Fn | 204 |
These are the v7 per-format op latencies; the v6e (Ghostlite) constructor stores 192/182 for the same formats, the clearest single number proving the two tables are distinct instances.
The Lookup
Algorithm
GetResourceUsage @0x1c8bdb20 (mis-symbolized as raw_hash_map::at) is the read path. It bounds-checks the resource, dispatches by opcode to a family, builds the key, finds the row, and reads the cell:
function GF_MxuLatencyTable_GetResourceUsage(out, this, instr, res, latch_mode):   // @0x1c8bdb20
    if res > 0xA:                                   // mxu_latency_table_gf.cc:415
        LogFatal("mxu_resource_idx < to_underlying(MxuResource::kNumMxuResources)")
    switch (instr.opcode):
        case 289: key = MatmulModifier{fmt=1};          map = this+0x20   // matmul
        case 295: key = MatmulModifier{fmt=2};          map = this+0x20
        case 301: key = MatmulModifier{fmt=9};          map = this+0x20
        case 307: key = MatmulModifier{fmt=10};         map = this+0x20
        case 324: key = MatpushKey(latch_mode);         map = this+0x00   // matpush
        case 325: key = MatpushKey(latch_mode | 0x32);  map = this+0x00   // → fp8-fnuz (F8E4M3Fn)
        case 326: key = MatpushKey(latch_mode ^ 0x0B);  map = this+0x00   // transpose flip
        case 327: key = MatpushKey(latch_mode | 0x30);  map = this+0x00   // → fmt 9
        default:  return MakeErrorImpl("Unsupported opcode: ...")        // :455
    entry = map.find(key)
    if not entry: throw out_of_range                // raw_hash_map::at
    array<int,11> res_vector = entry.value          // vmovups: [rdx+8] matmul / [rdx+4] matpush
    out = res_vector[res]                           // DIRECT: no remap
The matmul key is {fmt, 0, 0, 0} (transpose forced to 0). The matpush key is assembled from the three shared helpers GainLatchModeToMatmulDataFormat @0x1d629260 (byte[0]), LatchModeIsTranspose @0x1d628ea0 (byte[1]), and LatchOpcodeToMsr(0x91) @0x1c8a1300 (byte[3]); byte[2] is the constant 1. For opcode 0x91, LatchOpcodeToMsr returns 0.
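The read path can be sketched in Python to make the dispatch and the direct index explicit. Several pieces are assumptions: the maps are plain dicts, `KeyError` stands in for `out_of_range`, and the matpush key builder is passed in as a hypothetical `matpush_key_of` callable because the `GainLatchMode` helper tables are not reproduced on this page.

```python
MATMUL_FMT = {289: 1, 295: 2, 301: 9, 307: 10}
GLM_TRANSFORM = {
    324: lambda mode: mode,          # direct GLM
    325: lambda mode: mode | 0x32,   # routes to F8E4M3Fn
    326: lambda mode: mode ^ 0x0B,   # transpose flip
    327: lambda mode: mode | 0x30,   # routes to fmt 9
}


def get_resource_usage(matmul_map, matpush_map, opcode, res,
                       latch_mode=0, matpush_key_of=None):
    """Dispatch by opcode, find the row, return row[res] with no remap."""
    if not 0 <= res <= 0xA:
        raise ValueError("mxu_resource_idx out of range")
    if opcode in MATMUL_FMT:
        key, table = MATMUL_FMT[opcode], matmul_map  # key is {fmt, 0, 0, 0}
    elif opcode in GLM_TRANSFORM:
        if matpush_key_of is None:
            raise ValueError("matpush lookups need a key builder")
        key = matpush_key_of(GLM_TRANSFORM[opcode](latch_mode))
        table = matpush_map
    else:
        raise ValueError(f"Unsupported opcode: {opcode}")
    row = table[key]   # KeyError plays the role of out_of_range
    return row[res]    # direct index


matmul_map = {
    1: [0, 0, 16, 4, 0, 0, 0, 0, 0, 3, 0],
    9: [0, 0, 0, 8, 0, 0, 0, 0, 0, 7, 0],
}
print(get_resource_usage(matmul_map, {}, 289, 3))  # 4
print(get_resource_usage(matmul_map, {}, 301, 3))  # 8
```

Passing the key builder in keeps the sketch honest about what is and is not pinned here: the opcode dispatch and the pre-transforms are from the pseudocode, the latch-mode decoding is not.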
CONTRAST WITH GHOSTLITE: the GF read is array[res] for the resource the CT helper passes (3 for matmul, 8 for matpush) with no transform. The Ghostlite lookup @0x1c8b7560 instead seeds defaults (res4 → idx3, res9 → idx9) before reading. The two generations share the constructor shape and the key helpers, but the read indices and the seed remap differ, so bind them per generation. This is the res-remap 3/8 distinction: GF is direct on indices 3 and 8.
The matpush vmovups [rdx+4] vs matmul vmovups [rdx+8] reflects the differing key widths (MatpushModifier precedes its value by 4 bytes, MatmulModifier by 8). The whole array<int,11> is copied to the stack before indexing.
Vlxmr Rows
The Vlxmr map at this+0x40 carries two rows, inserted by sub_1C8BC760(0, &unk_B43CD6C, …) and sub_1C8BC760(257, &unk_B43CD74, …) (keys 0x0 and 0x101, byte-confirmed in the ctor). The insert helper iterates an (int resource, int cycles) pair stream, applying the same resource < 11 bound (CHECK at mxu_latency_table_gf.cc:50). The first pair at @0xb43cd6c reads {res0:2} (byte-confirmed). The full per-row pair set for key 0x101 is UNVERIFIED here — the iterator bounds are not statically pinned from rodata alone; see MXU Latency: GL (Ghostlite) for the sibling Vlxmr shape.
Function Map
| Function | Address | Role |
|---|---|---|
| GfcCycleTable::GfcCycleTable | 0x1c89eec0 | allocs 0x30 perf + 0xa0 MxuLatency |
| GF MxuLatencyTable ctor | 0x1c8bb1c0 | fills 16 matpush + 16 matmul + 2 vlxmr + cost map |
| GF MxuLatencyTable::GetResourceUsage | 0x1c8bdb20 | opcode dispatch + find + direct array[res] |
| matpush insert helper | 0x1c8bc2e0 | array[res & 0xf] = cy, < 11 bound |
| matmul insert helper | 0x1c8bc540 | as above, MatmulModifier map |
| vlxmr insert helper | 0x1c8bc760 | VlxmrModifier rows from rodata pairs |
| GfcCycleTable::GetCyclesForThroughputHelper | 0x1c89f400 | CT → opcode/res/transpose dispatch |
| GainLatchModeToMatmulDataFormat | 0x1d629260 | matpush key byte[0] |
| LatchModeIsTranspose | 0x1d628ea0 | matpush key byte[1] |
| LatchOpcodeToMsr | 0x1c8a1300 | matpush key byte[3]; (0x91) → 0 |
ComputeDmaLevels — the Fragment Count Bound to the Matmul Cost
Purpose
The convolution / windowed-transfer cost path multiplies its bandwidth term by a DMA-efficiency factor whose input is the number of descriptor fragments a window decomposes into. windowing_util::ComputeDmaLevels @0x1c86a9e0 computes that decomposition. The result feeds WindowCyclesGenericTargetAgnostic @0x14552180, which turns the fragment count into the efficiency multiplier.
The DmaLevel struct and the level-build loop
ComputeDmaLevels returns a std::vector<DmaLevel>, each element 24 bytes:
struct DmaLevel {        // 24 B; vector stride lea[rax+rax*2]<<3
    long axis;           // +0x00 first window-axis index of this level
    long count;          // +0x08 init 1, then count *= window_stride[axis] per merged axis
    long flags;          // +0x10 zero-init; level marker, not read by the cost path
};
The function first CHECKs that the window's per-axis inlined-vectors (window_bounds, pad_high, pad_low, and the dynamic-base second array) all have the same rank, then trims the contiguous minor dim when the window's +0x338 config scalar is ≥ 2 (the row-major innermost run is not counted as its own level):
rank = window_bounds.size()
CHECK(base_bounds.size() == rank && pad_high.size() == rank && pad_low.size() == rank)
N = rank - (WD[+0x338] >= 2 ? 1 : 0) // minor-dim trim
The level-build loop starts a new DmaLevel at each axis and merges following axes into it while all contiguity conditions hold:
for axis in 0 .. N:                                   // outer: start a new level
    level = {axis, count=1, flags=0}
    for r in axis+1 .. N:                             // inner: try to merge axis r
        if elemental_stride[r] != 1: break
        operand = stride_level_operand[r]             // WD + 0xe0 [r], an LloValue*
        if operand:
            if operand.opcode >= 0x1cd: fatal()
            rt = opcode_produced_register_type[operand.opcode]
            if rt != 2 && rt != 4: fatal()            // ProducesSreg() || ProducesVreg()
            contiguous = llo::KnownEq(window_stride[r], operand)
        else:
            contiguous = (window_stride[r] == base_bounds[r])
        if not contiguous: break
        if pad_low[r] != 0 || dilation[r] != 0: break
        level.count *= window_stride[r]               // "access_multiplier"
    emit(level)                                       // a finalized contiguity break
NOTE: the +0xe0 per-axis second array (stride_level_operand) is the DMA-level descriptor operand, distinct from window_stride (which drives the byte count). Its register class gates the merge: the LLO value must produce a Scalar (opcode_produced_register_type == 2, ProducesSreg()) or a Vector (== 4, ProducesVreg()); a predicate (1), a mask (3), or a no-result op (0) fails the gate and forces a contiguity break. The CHECK at llo_util.h:35 (value->ProducesSreg() || value->ProducesVreg()) names this exactly. The canonical conv stride operands (kScalarAddressCalculation @0x86 etc.) are all Scalar, hence mergeable; a mask-typed stride operand fragments the DMA.
The number of emitted DmaLevels equals the number of contiguity breaks; the product of their count fields is the fragment count.
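The level-build loop can be exercised with a small Python model. This is a sketch, not a transcription. It covers only the branch where no stride-level operand is present (so the register-type gate and `llo::KnownEq` are not modelled), it takes the minor-dim trim as a boolean `minor_trim`, and it assumes the outer loop resumes at the axis that stopped the merge, which the pseudocode leaves implicit.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class DmaLevel:
    axis: int        # first window-axis index of this level
    count: int = 1   # product of window_stride over merged axes
    flags: int = 0   # not read by the cost path


def compute_dma_levels(window_stride, base_bounds, elemental_stride,
                       pad_low, dilation, minor_trim=False):
    """Build levels by merging following axes while they stay contiguous."""
    rank = len(window_stride)
    vectors = (base_bounds, elemental_stride, pad_low, dilation)
    if any(len(v) != rank for v in vectors):
        raise ValueError("per-axis vectors must share one rank")
    n = rank - (1 if minor_trim else 0)
    levels = []
    axis = 0
    while axis < n:
        level = DmaLevel(axis)
        r = axis + 1
        while r < n:
            if elemental_stride[r] != 1:
                break
            if window_stride[r] != base_bounds[r]:  # operand-free contiguity test
                break
            if pad_low[r] != 0 or dilation[r] != 0:
                break
            level.count *= window_stride[r]
            r += 1
        levels.append(level)
        axis = r  # ASSUMPTION: resume where the merge stopped
    return levels


levels = compute_dma_levels([2, 3, 4], [2, 3, 5], [1, 1, 1], [0, 0, 0], [0, 0, 0])
print(len(levels), prod(level.count for level in levels))  # 2 3
```

In the example the third axis breaks contiguity (stride 4 against bound 5), so the window splits into two levels with a fragment product of 3.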
The fragment count → efficiency multiplier
WindowCyclesGenericTargetAgnostic @0x14552180 reads the level vector and buckets the fragment product v18 (the running access_multiplier from the merge loop), byte-exact:
| fragment product v18 | multiplier | rodata |
|---|---|---|
| ≤ 1 level (no fragmentation) | 1.0 | @0xa2df230 |
| == 1 | 1.6 | @0xa2d71a0 |
| ∈ [2, 3] | 1.3 | @0xa2d71a8 |
| ∈ [4, 7] | 1.1 | @0xa2df4c8 |
| ∈ [8, 31] | 1.05 | @0xa2df2a8 |
| > 31 | 1.0 (falls through to default) | @0xa2df230 |
All five doubles read here are byte-confirmed in .rodata: 1.0 (@0xa2df230), 1.6 (@0xa2d71a0), 1.3 (@0xa2d71a8), 1.1 (@0xa2df4c8), 1.05 (@0xa2df2a8). The {1.6, 1.3} pair is a 2-entry table at @0xa2d71a0 indexed by setge(v18 ≥ 2); 1.1 and 1.05 are standalone constants. The bandwidth deposit returned is (byte_count · elem_size / bytes_per_cycle) · multiplier + residual.
GOTCHA: a fragment product > 31 receives no penalty multiplier: that branch (if (v18 > 31)) jumps past the bucket table and keeps the 1.0 default. The 2003.0 constant at @0xa2df1b0 (byte-confirmed) is not a fragment-count bucket: it is selected at function entry by an element-type predicate (window descriptor operand present and (*(int16*)(op+11) & 0x7C) == 0x10), independent of ComputeDmaLevels, and it scales the bytes_per_cycle divisor, not the fragment multiplier. Do not bind 2003.0 to the fragment count.
QUIRK: a single contiguous transfer (≤ 1 level) is multiplier 1.0, but a fragment product of exactly 1 (one level whose merged element-count product is 1) is 1.6. The two are distinct paths: the ≤ 1-level default (v60 <= 1) short-circuits before the bucket table; the v18 == 1 bucket is reached only when the level vector is non-trivial. A reimplementation must keep both.
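The bucketing and both special paths fit in one small function. This is a sketch with hypothetical names. The treatment of products below 1 follows the `setge(v18 ≥ 2)` table index described above and is otherwise an assumption.

```python
def efficiency_multiplier(num_levels, fragment_product):
    """Map (level count, fragment product) to the DMA-efficiency multiplier."""
    if num_levels <= 1:
        return 1.0                       # single contiguous transfer
    if fragment_product > 31:
        return 1.0                       # falls through to the default
    if fragment_product <= 3:
        # 2-entry table {1.6, 1.3} indexed by (product >= 2)
        return (1.6, 1.3)[fragment_product >= 2]
    if fragment_product <= 7:
        return 1.1
    return 1.05                          # products 8..31


def bandwidth_deposit(byte_count, elem_size, bytes_per_cycle,
                      multiplier, residual=0.0):
    """(byte_count * elem_size / bytes_per_cycle) * multiplier + residual."""
    return (byte_count * elem_size / bytes_per_cycle) * multiplier + residual


for product in (1, 3, 4, 8, 32):
    print(product, efficiency_multiplier(2, product))
```

Keeping the level-count check ahead of the product check is what preserves the difference between a single contiguous transfer (1.0) and a multi-level vector whose product is 1 (1.6).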
Worked Example — 6acc60406 bf16 Conv Deposit
For a 6acc60406 bf16 convolution, the per-format MXU reservation reads are now numeric:
- R[0] matpush = matpush_count · thru(CT5)[bf16] = matpush_count · array[8] = matpush_count · 2.
- R[1] matmul = op_count · thru(CT0)[bf16] · 0.5 / Target / EPF = op_count · array[3] · 0.5 / … = op_count · 4 · 0.5 / ….
- R[2] Xlu = 4 · ChunksPerTile · rem (the GhostlitePerformance grid cell, Performance: GF).
For fp8 (F8E5M2/F8E4M3Fn), the per-op factors are 4 for R[0] and 8 for R[1], double the bf16 hold. If the conv input window fragments into 2 DMA levels with a per-level element product of 3 (a 3-tap kernel axis with non-unit stride), ComputeDmaLevels returns 2 levels, v18 = 3, and the bandwidth deposit is multiplied by 1.3; a fully contiguous load multiplies by 1.0; a fragment product > 31 also falls back to 1.0 (no fragment-count penalty; the steering away from heavily strided layouts comes from the larger byte count, not a multiplier).
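The R[0] and R[1] arithmetic can be checked numerically. This is a minimal sketch: `Target` and `EPF` are not defined on this page, so they are taken as opaque positive divisors supplied by the caller, and the function and table names are illustrative.

```python
# Per-op throughput cells from the GF table: array[8] for matpush, array[3] for matmul.
THRU = {
    "bf16": {"matpush": 2, "matmul": 4},
    "fp8": {"matpush": 4, "matmul": 8},
}


def conv_mxu_terms(dtype, matpush_count, op_count, target, epf):
    """Return (R[0], R[1]) for one convolution under the formulas above."""
    r0 = matpush_count * THRU[dtype]["matpush"]
    r1 = op_count * THRU[dtype]["matmul"] * 0.5 / target / epf
    return r0, r1


# Placeholder counts and unit divisors, chosen only to show the 2x fp8 ratio.
print(conv_mxu_terms("bf16", 10, 100, 1, 1))  # (20, 200.0)
print(conv_mxu_terms("fp8", 10, 100, 1, 1))   # (40, 400.0)
```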
Related Components
| Name | Relationship |
|---|---|
| mxu-latency-overview | the shared model, key scheme, and GetResourceUsage read path |
| mxu-latency-gl | the Ghostlite array<int,11>: same shape, distinct ctor, 192/182 base latency |
| mxu-latency-vf | the Viperfish array<int,19>: double the matmul hold |
| matmul-mode-modifiers | the modifier ordinals, format codes, GLM transforms |
| performance-gf-ghperf | the GF grid carrying the conv R[2] Xlu cell and the same per-dtype magnitudes |
| window-description-cost | the windowed-transfer cost path that consumes ComputeDmaLevels |
Cross-References
- MXU Latency Overview - the MxuResource model, key construction, and the byte-shared GetResourceUsage
- MXU Latency: GL (Ghostlite) - the v6e array<int,11>; the res4→3/res9→9 seed remap GF lacks
- MXU Latency: VF - the Viperfish array<int,19>; the half-VF matmul hold contrast
- MatmulMode & Modifiers - the MatpushModifier/MatmulModifier ordinals and the GainLatchMode → format jump table
- Performance: GF (GhPerf 465×31) - the 6acc60406 occupancy grid and the conv R[2] Xlu = 4 cell
- MxuOpHoldIssues Stall Recurrence - how the reservation arrays become issue stalls
- Resource Enum (23-slot) - the higher-level ResourceVector, distinct from the MXU-internal MxuResource
- Window Description Cost - the windowed-transfer cost path that calls ComputeDmaLevels
- MXU Slot - the LLO MXU instruction slot these ops price
- Matprep / IAR / Latch - the matprep / latch ops behind the matpush family