Performance: PF
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.
Abstract
PufferfishPerformance is the first of the heap-grid Performance objects — the architecture that Pufferfish (v4) introduced and that Viperfish/Ghostlite/6acc60406 inherit. Where the older Jellyfish/Dragonfish family stores its costs in a single inline POD with an offset LUT, Pufferfish heap-allocates two objects: a flat per-instruction latency array (336 int32, indexed by Performance::Instruction) and a two-dimensional Instruction × Resource occupancy grid (336 rows × 20-wide std::vector<int>). GetLatency(instr) reads the array; GetResourceUsage(instr, res) reads grid[instr][res] — the cycle count for which that instruction holds micro-pipeline port res. This is the libtpu analog of an LLVM SchedMachineModel, recovered by decoding the constructor that fills it (PufferfishPerformanceC1 @0x1c8be080) rather than read from a .td file.
Two facts make Pufferfish the pivot of the family. First, its read path — outer bound, 24-byte row stride, inner bound, row.data[res] — is byte-identical to the Viperfish/Ghostlite/6acc60406 GetResourceUsage; one description (performance-overview) covers all four grid generations, and this page supplies only the PF-specific widths, cells, and the pre-Ghostlite column layout. Second, Pufferfish is the last generation with a BarnaCore embedding engine: LatencyTablePufferfish prices a variant<PufferfishPerformance, PufferfishBarnaCorePerformance>, so an instruction is dispatched to one of two distinct grids by its variant tag. The TensorCore grid (variant 0) is what this page dumps; the BarnaCore grid (variant 1, PufferfishBarnaCorePerformance) is a separate 0x30-byte object reached through the same ResourceUsageFromInstruction visitor.
The PF grid has 20 resource columns — the narrowest of the grid family (the progression is 20 → 28 → 31 → 31 across PF → VF → GL → GF). This page documents the 20-column layout named by occupant LLO class, the latency-array value distribution, the EUP push/pop occupancy, and the one structural divergence from every later gen: Pufferfish has no direct Xlu grid column and no separate MxuLatencyTable — its matmul/transpose occupancy is priced by the grid's matmul-throughput column (res 9) plus a separate XluConflictPenaltyTable. See mxu-latency-pf for that divergence in full.
For reimplementation, the contract is:
- The two-object layout: int32 latency[336] heap array + a 336-row grid of 20-wide std::vector<int> rows, with exact new sizes and field offsets.
- The GetResourceUsage(instr, res) / GetLatency(instr) read paths and their two bounds checks.
- The 20-column Performance::Resource layout, named by the LLO instruction class that deposits into each, in particular res 9 (matmul throughput) and res 6 (matrix-result/Xlu deposit).
- How an LLO opcode reaches a grid row through GetPufferfishInstruction, and how the MXU band fans a single matmul/matprep opcode out to many ordinals.
- The BarnaCore variant: the second grid (PufferfishBarnaCorePerformance) the same LatencyTablePufferfish prices.
| Field | Value |
|---|---|
| Class | xla::pufferfish::PufferfishPerformance (non-polymorphic value type) |
| Ctor | PufferfishPerformanceC1 @0x1c8be080 (object 0x30 B; ~2376-line ctor) |
| Latency array | new 0x540 = 1344 B = 336 int32, memset(_, 0xff, 1344) then every slot overwritten |
| Grid | new 0x1f80 = 8064 B = 336 rows × 24 B (std::vector<int>), each row new 0x50 = 20 int32 |
| Read path | GetResourceUsage @0x1c8c3880 · GetLatency @0x1c8c3860 · GetResources @0x1c8c3840 |
| Resource count | 20 (kResources @0xb43cd94, 20-byte permutation of {0..19}) |
| Populated cells | 265 grid cells across 180 rows; all 336 latency slots written |
| Classifier | GetPufferfishInstruction @0x1c8a1fe0 (LloValue → Instruction, variant-tagged) |
| Xlu pricing | res 6 (kVectorMatres) + XluConflictPenaltyTable via XposeXLUReservationLatency @0x1c8a13e0 — no direct Xlu grid column |
| Singleton | GetSharedPufferfishPerformance::pf_shared @0x22579a10 |
Object Layout
Purpose
PufferfishPerformance answers the scheduler's two per-instruction questions — pipeline depth (GetLatency, raised on a true-dependency edge) and per-port occupancy (GetResourceUsage, accumulated into per-bundle throughput pressure). It stores both in heap objects so the same class shape scales to wider resource sets in later gens.
Structure
The constructor @0x1c8be080 allocates and fills two heap regions. The object itself is small — operator new(0x30) in LatencyTablePufferfish — holding only the array/grid pointers and their counts:
struct PufferfishPerformance {      // object 0x30 B; built by PufferfishPerformanceC1 @0x1c8be080
    int32*       latency;           // +0x00 ; new 0x540 = 336 int32, memset 0xff
    u64          latency_size;      // +0x08 ; = 336
    u64          latency_cap;       // +0x10 ; = 336
    vector<int>* grid;              // +0x18 ; new 0x1f80 = 336 × 24-B vector<int>
    u64          grid_outer_count;  // +0x20 ; = 336 (the GetResourceUsage outer bound)
    u64          grid_outer_cap;    // +0x28 ; = 336
    // each row is a 24-B std::vector<int>: { int* data (new 0x50 = 20 int32, zero-init), size=20, cap=20 }
};
The row-allocation loop is byte-exact in the ctor: for (i = 16; i != 8080; i += 24) runs 336 times, and each iteration allocates one row with operator new(0x50), zero-fills it (vmovups of ymm0 over the 80 bytes), and sets [row+8] = [row+16] = 20 (the std::vector<int> size and capacity). The latency array is memset with the byte 0xff (255), so each int32 slot initially reads 0xffffffff (-1), and then every slot is overwritten. This differs from Ghostlite/6acc60406, where the 0xff default survives on unpriced rows. A reimplementation must therefore write all 336 PF latency slots; none fall back to a sentinel.
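The allocation arithmetic above can be cross-checked with a short sketch. The constants are the ones quoted on this page; the Python names (N_INSTR, ROW_BYTES and so on) are illustrative and are not symbols from the binary.

```python
# Sizes quoted on this page; the names are illustrative, not binary symbols.
N_INSTR = 336      # Performance::Instruction count
N_RES = 20         # Performance::Resource count
INT32 = 4          # bytes per latency slot / grid cell
ROW_BYTES = 24     # one vector<int> header: data, size, cap

latency_bytes = N_INSTR * INT32     # expected to match operator new(0x540)
grid_bytes = N_INSTR * ROW_BYTES    # expected to match operator new(0x1f80)
row_data_bytes = N_RES * INT32      # expected to match operator new(0x50)

# The ctor loop `for (i = 16; i != 8080; i += 24)` takes one step per row.
row_iterations = len(range(16, 8080, 24))

print(hex(latency_bytes), hex(grid_bytes), hex(row_data_bytes), row_iterations)
```

If any of the three products disagreed with the operator new arguments, the 336 × 20 shape would be wrong.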
NOTE: the latency-array memset value is 0xff here (255), confirmed by memset(v4, 255, 1344) in the ctor. This is distinct from the flat Jellyfish family, whose POD is memset to the 0x7fffffff INT_MAX sentinel. On PF the sentinel never reaches a consumer because all 336 slots are written.
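As a small illustration of what the 0xff byte fill means at the int32 level, the sketch below rebuilds the 1344-byte buffer and reinterprets it as 336 signed 32-bit integers. It assumes two's-complement int32 slots; the variable names are illustrative.

```python
import struct

# Reproduce memset(v4, 255, 1344): 1344 bytes, each set to 0xff.
raw = bytes([0xFF]) * 1344

# Reinterpret the buffer as 336 signed int32 values.
slots = struct.unpack("<336i", raw)

# A slot the ctor never overwrote would read as -1 (0xffffffff), not 255.
unwritten_value = slots[0]
all_same = all(s == unwritten_value for s in slots)
```

A reimplementation that wants to detect a missed slot can therefore test for -1 after filling the array.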
Sample cells (byte-anchored)
The ctor writes latency by [latency + 4*instr] and grid cells by [grid + 24*instr] then [row.data + 4*res]. Two latency slots and three grid cells fix the layout against the raw bytes:
latency[0x67] = 7 ; store [latency + 0x19C], 0x19C = 0x67*4 (rsqrt EUP push)
latency[0x7A] = 83 ; store [latency + 0x1E8], 0x1E8 = 0x7A*4 (matmul band base, format A)
grid[0x67][2] = 1 ; row 0x9A8 = 0x67*24, [row.data + 8] = res 2 (EUP-prep stage A)
grid[0x67][3] = 2 ; row 0x9A8, [row.data + 12] = res 3 (EUP-prep stage B)
grid[0x7A][9] = 8 ; row 0xB70 = 0x7A*24, [row.data + 36] = res 9 (matmul throughput)
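The byte offsets in the sample stores follow directly from the two strides. A minimal sketch, with hypothetical helper names, that recomputes them:

```python
LAT_STRIDE = 4     # int32 latency slot
ROW_STRIDE = 24    # vector<int> header per grid row
CELL_STRIDE = 4    # int32 cell inside row.data

def latency_offset(instr):
    """Byte offset of latency[instr] from the array base."""
    return LAT_STRIDE * instr

def grid_cell_offsets(instr, res):
    """(row header offset from grid base, cell offset from row.data)."""
    return ROW_STRIDE * instr, CELL_STRIDE * res

print(hex(latency_offset(0x67)), hex(latency_offset(0x7A)))
print([hex(x) for x in grid_cell_offsets(0x7A, 9)])
```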
The Read Path
Algorithm
GetResourceUsage @0x1c8c3880 is the single accessor for a grid cell. It takes the Instruction ordinal and the Resource column, bounds-checks both, computes the 24-byte row stride with a lea, and reads the cell:
function PufferfishPerformance_GetResourceUsage(perf, instr, res): // @0x1c8c3880
    if perf.grid_outer_count <= instr:        // [perf+0x20] = 336 ; outer bound
        BUG()                                 // trap (ud1 / BUG)
    row = perf.grid + 24 * instr              // [perf+0x18] + (3*instr)*8, formed with a lea
    if row.size <= res:                       // [row+8] = 20 ; inner bound
        BUG()
    return *(int*)(row.data + 4*res)          // grid[instr][res]
function PufferfishPerformance_GetLatency(perf, instr): // @0x1c8c3860
    if perf.latency_size <= instr:            // [perf+0x8] = 336
        BUG()
    return *(int*)(perf.latency + 4*instr)    // latency[instr]
Both bounds checks trap (BUG()) on overflow rather than returning a default — there is no safe out-of-range read. GetResources @0x1c8c3840 returns the kResources traversal order @0xb43cd94 (20 bytes, a permutation of {0..19}), the order in which a row is iterated to sum its per-port holds.
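The read path can be modeled in a few lines. This is a toy sketch, not the real class: PerformanceModel and its method names are invented here, IndexError stands in for the BUG() trap, and the explicit lower-bound test is an assumption that the binary's compares are unsigned (Python lists would otherwise accept negative indexes).

```python
class PerformanceModel:
    """Toy stand-in for the two-object layout described on this page."""

    def __init__(self, n_instr=336, n_res=20):
        self.latency = [-1] * n_instr                      # 0xff-filled int32 reads as -1
        self.grid = [[0] * n_res for _ in range(n_instr)]  # zero-initialised rows

    def get_latency(self, instr):
        if not 0 <= instr < len(self.latency):
            raise IndexError("instr out of range")         # stands in for BUG()
        return self.latency[instr]

    def get_resource_usage(self, instr, res):
        if not 0 <= instr < len(self.grid):                # outer bound (336)
            raise IndexError("instr out of range")
        row = self.grid[instr]
        if not 0 <= res < len(row):                        # inner bound (20)
            raise IndexError("res out of range")
        return row[res]

# Populate the sample cells quoted earlier on this page.
perf = PerformanceModel()
perf.latency[0x67] = 7
perf.grid[0x67][2] = 1
perf.grid[0x67][3] = 2
perf.grid[0x7A][9] = 8
```

Raising instead of returning a default keeps the sketch faithful to the point that there is no safe out-of-range read.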
GOTCHA: the outer index is the per-gen Performance::Instruction, not the raw LLO opcode. GetPufferfishInstruction @0x1c8a1fe0 classifies an LloValue into the Instruction ordinal (jump table @0xb43927c, index = opcode − 2), and the MXU band fans a single matmul/matprep opcode out to ~0x60 ordinals via a secondary latch-mode WORD table. A reimplementation that indexes the grid directly by LLO opcode mis-reads every MXU row.
The variant dispatch — TensorCore vs BarnaCore
LatencyTablePufferfish does not hold one Performance; it holds two singletons — pf_shared (PufferfishPerformance, [table+0x1d0]) and pf_bc_shared (PufferfishBarnaCorePerformance, [table+0x1d8]) — and prices an instruction through a variant<PufferfishPerformance::Instruction, PufferfishBarnaCorePerformance::Instruction>. ResourceUsageFromInstruction dispatches the variant:
// variant index 0: TensorCore op (@0x1c8a3180)
return PufferfishPerformance::GetResourceUsage(table.pf_shared, instr, res);
// variant index 1: BarnaCore op (@0x1c8a31a0)
if (instr.tag == 2):
    return PufferfishBarnaCorePerformance::GetResourceUsage(table.pf_bc_shared, instr, 0);
else:
    return 0;
So a single PF cost query routes to one of two grids by variant tag. This page documents variant 0 (the TensorCore grid). The BarnaCore grid (PufferfishBarnaCorePerformance::GetResourceUsage @0x1c8c4800, ctor near @0x1c8c38c0) is a separate 0x30-byte object with its own latency array and grid; it is the cost model for the BarnaCore embedding engine, which Pufferfish is the last generation to carry, and it is not dumped here.
QUIRK: Pufferfish is the only grid generation that prices two distinct cores from one LatencyTable. Viperfish onward have a single Performance per gen. A reimplementation that models PF with one grid will silently mis-cost the 13 BarnaCore (variant-1) opcodes that GetPufferfishInstruction tags with high-16 = 1.
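The two-grid routing can be sketched as a single function. The function name, the dictionary-based grids, and the BarnaCore cell value below are all made up for illustration; the tag == 2 test and the fixed column 0 simply mirror the dispatch pseudocode on this page, whose meaning is not established here.

```python
def resource_usage(variant_index, instr, res, tc_grid, bc_grid, tag=0):
    """Route a cost query to one of two grids by variant index."""
    if variant_index == 0:                  # TensorCore op
        return tc_grid[instr][res]
    if variant_index == 1:                  # BarnaCore op
        # Mirrors the pseudocode: only tag == 2 reads the grid, always column 0.
        return bc_grid[instr][0] if tag == 2 else 0
    raise ValueError("unknown variant index")

# TensorCore grid row for the matmul band base (res 9 = 8, as quoted on this page).
tc_grid = {0x7A: [0] * 20}
tc_grid[0x7A][9] = 8

# Hypothetical BarnaCore row; the value 3 is invented for the example.
bc_grid = {0x84: [3]}
```

A model with only tc_grid would have no way to answer the variant-1 query, which is the mis-costing the QUIRK warns about.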
The 20 Resource Columns
Purpose
The grid's inner axis is the PufferfishPerformance::Resource enum: the intra-op micro-pipeline reservation ports (EUP-prep, matmul/matprep throughput, cross-lane/transpose result, sync, etc.). The enum has no ToString in the binary, so the columns are named functionally, by reading which LLO-instruction class deposits cycles into each (the outer index being the Instruction ordinal that the classifier assigns to the opcode) and anchoring against the named accessor XposeXLUReservationLatency. These names are reimplementation-grade in meaning (which physical port each column reserves), not literal symbol strings.
NOTE: this 20-column Performance::Resource enum is a different, lower-level enum than the 23-slot per-bundle ResourceVector (resource-enum). The grid prices intra-op micro-pipeline-stage holds; the ResourceVector is the per-bundle functional-unit accumulator the higher-level cost model deposits into. kResources @0xb43cd94 gives the grid's column traversal order, not the ResourceVector slot order.
The columns
kResources @0xb43cd94 lists the 20 columns in traversal order, as hex bytes: 13 11 08 00 05 02 03 12 0a 06 09 0e 04 01 0b 10 07 0c 0d 0f (each of {0..19} exactly once). The table names each column by its dominant occupant LLO class (cell count, value-set, occupant):
| Col | Cells | Value(s) | Occupant LLO band (classifier-named) | Physical port (functional) |
|---|---|---|---|---|
| r0 | 1 | 5 | kDma (instr 0x32) | DMA address/issue port A |
| r1 | 1 | 5 | kDma (instr 0x32) | DMA address/issue port B |
| r2 | 6 | 1 | EUP push 0x67..0x6c (rsqrt/pow2/log2/tanh/recip/erf) | EUP transcendental-prep stage A |
| r3 | 6 | 2 | same EUP push band | EUP transcendental-prep stage B |
| r4 | 2 | 7, 9 | kVectorSetRngSeed (0x6d), kVectorPrng (0x6f) | RNG seed / PRNG setup port |
| r5 | 1 | 1 | kVectorEupResult (0x76 = EUP pop) | EUP-result-pop drain port |
| r6 | 1 | 8 | kVectorMatres (0x77) | matrix-result (Xlu) deposit port |
| r7 | 1 | 8 | kVectorXlaneResult/permute/transpose-result (0x78) | cross-lane / transpose result port |
| r8 | 1 | 2 | kVectorCmemResult (0x79) | Cmem-result port |
| r9 | 96 | 8, 16 | MXU matmul band (0x7a..0xd9; latch-mode ords) | MXU matmul throughput port (×96) |
| r10 | 20 | 7 | MXU matprep band (0xdc..0xef) | MXU matprep throughput port |
| r11 | 54 | 1, 8, 16 | matprep; SetSegmentPattern(0xf3); transpose-binary(0xfc); permute(0x104); rotate(0x106) | matprep / transpose-binary result A |
| r12 | 34 | 1, 8, 16 | same transpose/permute/rotate band | transpose / permute result B |
| r13 | 4 | 47, 55 | kVectorPermute(0x104), kVectorRotate(0x106) | permute/rotate extended-result stage |
| r14 | 10 | 18 | kVector{Add,Max,Min,..}ReduceF32 (0x108..0x111) | reduce-result stage A |
| r15 | 10 | 57 | same reduce band | reduce-result stage B |
| r16 | 11 | 1 | kVectorSyncFlag{Set,Add}* / kVectorWait* (0x113..) | sync-flag / wait port |
| r17 | 2 | 4 | kVectorCmemStore (0x135..0x136) | Cmem-store port |
| r18 | 3 | 4 | kVectorSetIar{Lane,Sublane,Raw} (0x14b..0x14d) | SetIar (index-addr-reg) port |
| r19 | 1 | 2 | kVectorCmemLoad (0x14f) | Cmem-load port |
Per-column cell counts (sum 265): r0:1 r1:1 r2:6 r3:6 r4:2 r5:1 r6:1 r7:1 r8:1 r9:96 r10:20 r11:54 r12:34 r13:4 r14:10 r15:10 r16:11 r17:2 r18:3 r19:1.
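Both column-level claims, that kResources is a permutation of {0..19} and that the per-column cell counts sum to 265, can be checked mechanically. The sketch below copies the byte string and the counts from this page; the variable names are illustrative.

```python
# kResources bytes as listed on this page (hex).
K_RESOURCES = bytes.fromhex(
    "13 11 08 00 05 02 03 12 0a 06 09 0e 04 01 0b 10 07 0c 0d 0f"
)

# Populated cells per column, r0..r19, copied from the table.
CELLS = [1, 1, 6, 6, 2, 1, 1, 1, 1, 96, 20, 54, 34, 4, 10, 10, 11, 2, 3, 1]

is_permutation = sorted(K_RESOURCES) == list(range(20))
total_cells = sum(CELLS)

print(is_permutation, total_cells)
```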
The matmul-throughput column (res 9) and the matrix-result column (res 6)
Res 9 is the single matmul-throughput port — 96 cells, value-set {8, 16}, occupying the entire MXU matmul band (Instruction 0x7a..0xd9). The whole per-format matmul throughput collapses into this one column on PF; the later gens spread it across a multi-stage matprep group (VF res 3 + r4..r7, GL/GF the wider EUP-AndPop FIFO). The cell value (8 for the narrow format, 16 for the wide) is the per-issue throughput hold a matmul of that format imposes.
Res 6 is the matrix-result deposit (kVectorMatres 0x77, value 8) — the PF analog of the conv R[2] Xlu term. It is the deposit column, but PF does not read it through a GetXluPathReservation accessor the way VF/GL/GF read their Xlu column. Instead, the conv/transpose Xlu reservation is priced by a separate XluConflictPenaltyTable — the structural divergence detailed below and on mxu-latency-pf.
QUIRK: the MXU matmul band ordinals 0x7a..0xd9 carry no jump-table arm in GetPufferfishInstruction; they are reached only via the secondary latch-mode WORD table that fans a single matmul/matprep opcode out across the band. The grid rows exist and carry cells, but the row identity is the (matmul-opcode, GainLatchMode) pair, not a directly-classified opcode. A reimplementation must reproduce the latch-mode fan-out, not just the direct-arm classifier.
The Latency Array
Both the latency array and the EUP push/pop edge are read straight from the ctor stores. The 336-entry array has 14 distinct values:
| Latency | Count | Meaning |
|---|---|---|
| 1 | 145 | cheap vector / scalar ops |
| 83 | 48 | MXU matmul base latency, format A |
| 101 | 48 | MXU matmul base latency, format B |
| 2 | 30 | sync / wait |
| 7 | 26 | EUP push + matprep |
| 126 | 16 | transpose-binary |
| 79 | 10 | reduce |
| 5 | 4 | (mid-cost) |
| 4 | 2 | (mid-cost) |
| 69 | 2 | permute |
| 77 | 2 | rotate |
| 3 | 1 | (single op) |
| 30 | 1 | (single mid-cost op) |
| 53 | 1 | Cmem-load |
The two MXU base latencies — 83 and 101 — are the two matmul data-format pipeline depths (the format-A vs format-B path); they sit at the head of the matmul band (latency[0x7A]=83, byte-confirmed). The EUP push latency is uniform at 7 across all six classified F32 EUP functions (rsqrt = pow2 = log2 = tanh = recip = pushErf, Instruction 0x67..0x6c); the EUP pop (Instruction 0x76) is 1.
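The latency histogram is self-checking: the counts should cover all 336 slots, and the two MXU base latencies together should cover the 96-ordinal matmul band. A short sketch using the values from the table, with an illustrative dictionary name:

```python
# latency value -> number of Instruction ordinals carrying it (from the table).
LATENCY_HISTOGRAM = {
    1: 145, 83: 48, 101: 48, 2: 30, 7: 26, 126: 16, 79: 10,
    5: 4, 4: 2, 69: 2, 77: 2, 3: 1, 30: 1, 53: 1,
}

total_slots = sum(LATENCY_HISTOGRAM.values())
distinct_values = len(LATENCY_HISTOGRAM)

# Matmul band 0x7a..0xd9 inclusive, split between the two data formats.
matmul_band_size = 0xD9 - 0x7A + 1
matmul_slots = LATENCY_HISTOGRAM[83] + LATENCY_HISTOGRAM[101]

print(total_slots, distinct_values, matmul_band_size, matmul_slots)
```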
The EUP push grid occupancy
Pufferfish is distinctive in that its EUP push reserves grid ports in addition to its 7-cycle latency: each EUP push row writes grid[push][2] = 1 and grid[push][3] = 2 — two EUP-prep micro-pipeline ports (stage A and stage B). Viperfish, by contrast, prices its EUP push (latency 6) through the latency array alone and reserves no grid cells. The PF EUP unit is also half-rate: PufferfishTarget::VectorEupReservationCycles @0x1d494cc0 returns 2 (VF returns 1).
| Gen | EUP push Instruction | Push latency | Pop Instruction | Pop latency | Push grid occupancy | VectorEupReservationCycles |
|---|---|---|---|---|---|---|
| PF | 0x67..0x6c | 7 | 0x76 | 1 | r2:1, r3:2 | 2 (half-rate) |
| VF | 0xcc..0xd2 | 6 | 0x168 | 1 | (none) | 1 |
NOTE: the push→pop dependency edge weight is GetLatency(push) = 7 on PF, returned unmodified; it is not scaled by VectorEupReservationCycles. The reservation (PF = 2, the half-rate EUP issue rate) is an orthogonal axis bounding push→push spacing, not the push→pop window; the two compose as a max, never a product (see EUP Latency Overview and EUP Per-Gen Integers). On VF the edge is 6 with reservation 1. A reimplementation that multiplies the 7-cycle latency by the 2-cycle reservation over-costs every PF transcendental; one that models the EUP as full-rate (reservation 1) on PF under-costs back-to-back chains.
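One way to read the max composition is sketched below. The two helper functions and the cycle-counting model are assumptions made for illustration and were not recovered from the binary; only the constants 7 and 2 come from this page.

```python
PUSH_LATENCY = 7       # GetLatency(push) on PF
EUP_RESERVATION = 2    # VectorEupReservationCycles on PF

def pop_issue(push_issue):
    """Earliest cycle the matching pop may issue: the push->pop edge, unscaled."""
    return push_issue + PUSH_LATENCY

def next_push_issue(prev_push_issue, operand_ready):
    """Earliest cycle for the next push: both constraints hold, so take the max."""
    return max(prev_push_issue + EUP_RESERVATION, operand_ready)

# Independent back-to-back pushes are spaced by the reservation only.
independent = next_push_issue(prev_push_issue=0, operand_ready=0)

# A push whose operand arrives late is bound by the operand, not by 2 * 7.
dependent = next_push_issue(prev_push_issue=0, operand_ready=8)
```

Under this reading the reservation never multiplies the latency; it only matters when pushes are issued closer together than two cycles.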
The Pufferfish Xlu Divergence
Every later grid generation reads an explicit Xlu/matrix-result column through a GetXluPathReservation accessor (VF res 0x0e, GL res 0x0f, GF res 0x10). Pufferfish does not. Its LatencyTablePufferfish ctor @0x1c8a1960 builds an XluConflictPenaltyTable (at the table's +0x18 inner object) via 10 SetXluConflictPenaltyBetween calls, and prices the conv/transpose Xlu reservation through XposeXLUReservationLatency @0x1c8a13e0:
function XposeXLUReservationLatency(table, mode, earlier_type, later_type, a, b): // @0x1c8a13e0
    CHECK(IsTranspose(earlier_type))                          // latency_table_pf.cc:62
    v = (b - a) + XluConflictPenaltyBetween(&table.penalty,   // +0x18 inner table
                                            earlier_type, later_type_lo, later_type_hi)
    if v < -5: v = -6                                         // clamp, so the result is never below 1
    return v + 7
The XluConflictPenaltyTable is a 3-axis integer table, indexed [XluInstrType][lo<6][hi<3]; XluConflictPenaltyBetween @0x1c8a0180 reads the entry at offset 72*type + 12*lo + 4*hi + 8, and stored values are penalty + 1. IsTranspose(type) is (type - 2) < 3, which selects XluInstrType ordinals {2,3,4} as the transpose ops (ordinals 0 and 1 are excluded only if the compare is unsigned). The 10 penalty pairs the PF ctor installs (e.g. (0→2)=56, (2→5)=96, (2→0)=86, each duplicated for the second hi plane) are the conflict cycles between a transpose-result producer and a later Xlu consumer.
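The offset formula, the transpose test, and the clamp can be sketched together. The function names are illustrative, the penalty is passed in directly as whatever XluConflictPenaltyBetween would return (the stored penalty + 1 encoding is not modeled), and the range check in is_transpose assumes the original compare is unsigned.

```python
def penalty_offset(xlu_type, lo, hi):
    """Offset of the [type][lo][hi] entry, per the formula on this page."""
    if not (0 <= lo < 6 and 0 <= hi < 3):
        raise IndexError("lo must be < 6 and hi must be < 3")
    return 72 * xlu_type + 12 * lo + 4 * hi + 8

def is_transpose(xlu_type):
    """(type - 2) < 3, read as an unsigned compare: true for 2, 3, 4."""
    return 0 <= xlu_type - 2 < 3

def xpose_xlu_reservation_latency(a, b, penalty):
    """(b - a) + penalty, clamped below at -6, plus 7."""
    v = (b - a) + penalty
    return max(v, -6) + 7

print(penalty_offset(2, 5, 2))
print([is_transpose(t) for t in range(6)])
print(xpose_xlu_reservation_latency(0, 0, 56), xpose_xlu_reservation_latency(100, 0, 0))
```

The clamp means the function never returns less than 1, however far apart a and b are.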
This is the per-gen structural difference: PF prices Xlu by a conflict matrix, VF/GL/GF by a grid column. It is the same architectural divide that separates the matmul occupancy model — Pufferfish has no separate MxuLatencyTable reservation object either, folding MXU occupancy into the grid (res 9) + this penalty table.
How an Opcode Reaches a Row — GetPufferfishInstruction
GetPufferfishInstruction @0x1c8a1fe0 (anonymous namespace) maps an LloValue* to a Performance::Instruction ordinal. It is a jump table @0xb43927c (index = LloOpcode − 2, bound 0x1ca) whose arms each execute mov eax, IMM32, where the low 16 bits are the Instruction ordinal and the high 16 bits are the variant tag (0 = TensorCore, 1 = BarnaCore). The bulk of the ~195 priced opcode arms resolve to variant-0 ordinals; exactly 13 arms set the high-16 variant tag to 1 (BarnaCore: opcodes 0x1ac-0x1ad, 0x1af-0x1b5, 0x1b8, 0x1c9/0x1ca, 0x1cc; e.g. kBarnaCoreVectorLoad 0x1c9 returns 0x10084 → ins 0x84, var1). The remaining 9 arms (e.g. 0x1ae) are LogMessageFatal "unsupported for Pufferfish" stubs. The MXU band is the exception: the matmul/matprep arm reads a secondary latch-mode WORD table that expands a single matmul/matprep opcode into the ~0x60-ordinal band the grid prices per data format.
| Stage | Mechanism | Anchor |
|---|---|---|
| LloValue → opcode | read WORD[value] | inline |
| opcode → Instruction + variant | jt @0xb43927c, idx = op − 2 | GetPufferfishInstruction @0x1c8a1fe0 |
| MXU opcode → band ordinal | secondary latch-mode WORD table (per GainLatchMode) | matmul arm in the classifier |
| Instruction → latency / cells | latency[instr] / grid[instr][res] | @0x1c8c3860 / @0x1c8c3880 |
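The packed classifier return value can be unpacked as sketched here. The helper names are hypothetical; the bit split (low 16 = ordinal, high 16 = variant tag), the opcode − 2 index, and the BarnaCore opcode list are the ones stated in this section.

```python
def decode_classifier_imm(imm32):
    """Split a classifier arm's IMM32 into (Instruction ordinal, variant tag)."""
    ordinal = imm32 & 0xFFFF
    variant = (imm32 >> 16) & 0xFFFF
    return ordinal, variant

def jump_table_index(opcode):
    """Jump-table slot for an LLO opcode (index = opcode - 2)."""
    return opcode - 2

# Opcodes whose arms carry variant tag 1, as listed in this section.
BARNACORE_OPCODES = [0x1AC, 0x1AD, *range(0x1AF, 0x1B6), 0x1B8, 0x1C9, 0x1CA, 0x1CC]

print(decode_classifier_imm(0x10084), len(BARNACORE_OPCODES))
```

The largest listed opcode, 0x1cc, lands on index 0x1ca, which agrees with the stated table bound.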
NOTE: the band/result-pop rows reached via a shared-epilogue jump (rather than a direct mov-IMM arm) were named functionally by occupant LLO class, not by following the jump chain ordinal-by-ordinal (MEDIUM confidence on those rows). The band identity (MXU matmul/matprep), its base latencies (83/101), and its throughput cells are byte-exact; the per-ordinal (opcode, latch_mode) mapping inside the band is described as a mechanism, not enumerated.
Function Map
| Function | Address | Role |
|---|---|---|
| PufferfishPerformance::PufferfishPerformance | 0x1c8be080 | ctor: fills 336-entry latency array + 336×20 grid (265 cells) |
| PufferfishPerformance::GetResourceUsage | 0x1c8c3880 | grid read: outer/inner bound + 24-B stride + row.data[res] |
| PufferfishPerformance::GetLatency | 0x1c8c3860 | latency[instr], bound [perf+8] |
| PufferfishPerformance::GetResources | 0x1c8c3840 | returns kResources traversal order (count 20) |
| kResources (PF) | 0xb43cd94 | 20-byte column traversal order, permutation of {0..19} |
| GetPufferfishInstruction | 0x1c8a1fe0 | LloValue → Instruction + variant tag (jt @0xb43927c) |
| LatencyTablePufferfish::LatencyTablePufferfish | 0x1c8a1960 | ctor: installs both grids + the XluConflictPenaltyTable |
| ResourceUsageFromInstruction (variant 0) | 0x1c8a3180 | TensorCore → PufferfishPerformance::GetResourceUsage |
| ResourceUsageFromInstruction (variant 1) | 0x1c8a31a0 | BarnaCore → PufferfishBarnaCorePerformance::GetResourceUsage |
| PufferfishBarnaCorePerformance::GetResourceUsage | 0x1c8c4800 | BarnaCore grid read (variant-1 ops) |
| XposeXLUReservationLatency | 0x1c8a13e0 | conv/transpose Xlu pricing via penalty table |
| XluConflictPenaltyBetween | 0x1c8a0180 | 3-axis penalty-table read [type][6][3] |
| XluConflictPenaltyTable::IsTranspose | 0x1c8a04e0 | (type - 2) < 3 |
| PufferfishTarget::VectorEupReservationCycles | 0x1d494cc0 | = 2 (half-rate EUP) |
| GetSharedPufferfishPerformance singleton | 0x22579a10 | pf_shared TensorCore grid instance |
Related Components
| Name | Relationship |
|---|---|
| performance-overview | the family framing: the shared grid layout + byte-identical read path |
| performance-vf / -gl-ghperf / -gf-ghperf | the wider 28/31-column grids that inherit this layout |
| mxu-latency-pf | the PF MXU occupancy correction: no separate MxuLatencyTable, grid res 9 + conflict-penalty |
| resource-enum | the 23-slot per-bundle ResourceVector, distinct from the 20-column Performance::Resource |
| matmul-mode-modifiers | the matmul format codes the MXU band rows are keyed on |
Cross-References
- Performance Overview - the two Performance architectures, the shared grid layout, and the resource-count progression 20→28→31→31
- Performance: VF - the 28-column Viperfish grid and the GetXluPathReservation res 0x0e PF lacks
- Performance: GL (GhPerf 476×31) - the Ghostlite grid, Xlu deposit res 0x0f
- Performance: GF (GhPerf 465×31) - the 6acc60406 grid, Xlu deposit res 0x10
- MXU Latency: PF - why Pufferfish has no separate MxuLatencyTable and prices MXU occupancy through this grid + the conflict-penalty table
- MXU Latency Overview - the per-gen MxuLatencyTable reservation model the later gens carry
- Resource Enum (23-slot) - the higher-level per-bundle ResourceVector, distinct from this micro-pipeline Resource enum
- MatmulMode & Modifiers - the MatmulDataFormat codes the matmul band ordinals encode
- MXU Slot - the LLO MXU instruction slot whose ops the matmul band (res 9) prices