Performance: GF (GhPerf 465×31)
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped. Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset; .data VMA − 0x400000 == file offset. Every cell integer, allocation size, and kResources byte on this page was read directly from the GF constructor and the binary's .rodata.
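The section map above reduces to one constant delta per section. A minimal sketch of that conversion, assuming only the three deltas stated on this page; the names SECTION_DELTA and vma_to_file_offset are illustrative and are not symbols from the binary:

```python
# Per-section (VMA - file offset) deltas, taken from the section map above.
SECTION_DELTA = {
    ".text": 0x0,
    ".rodata": 0x0,
    ".data.rel.ro": 0x200000,
    ".data": 0x400000,
}


def vma_to_file_offset(section: str, vma: int) -> int:
    """Translate a virtual address into a file offset for a mapped section."""
    delta = SECTION_DELTA[section]  # KeyError for sections this page does not map
    if vma < delta:
        raise ValueError(f"VMA {vma:#x} is below the {section} delta {delta:#x}")
    return vma - delta


# kResources is read from .rodata on this page, so its file offset equals its VMA.
print(hex(vma_to_file_offset(".rodata", 0xB43CDC4)))
```

Sections not listed in the map raise a KeyError rather than silently assuming a zero delta.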
Abstract
This page dumps the 6acc60406 (TPU7x / "GF" / gfc) GhostlitePerformance occupancy grid — the per-generation Instruction × Resource 2D table that prices how many cycles each LLO instruction holds each intra-op micro-pipeline port. It is the libtpu analog of an LLVM WriteRes/ProcResource matrix, except it is reconstructed by reading the constructor that fills it, not declared in a .td file. The grid is 465 × 31: 465 GhostlitePerformance::Instruction rows (the per-gen classifier output, == LloOpcode for every priced row) and 31 GhostlitePerformance::Resource columns (the EUP / cross-lane / Xlu / matrix-result micro-pipeline stages). Alongside it sits a flat 465-entry latency array indexed by the same Instruction. Both are heap objects owned by GfcCycleTable.
The GF grid is the gfc sibling of the Ghostlite (GL, v6e, 476-row) grid documented on Performance: GL. They are two instances of one GhostlitePerformance class: same 0x7c-byte 31-int rows, same byte-identical GetResourceUsage/GetLatency/GetResources methods, and the same single kResources static at @0xb43cdc4 (there is exactly one GhostlitePerformance::kResources symbol; GL and GF do not have distinct permutations). They differ in row count (465 vs 476; 6acc60406 drops 11 v6e shift/saturate and BarnaCore-wait opcodes), in cell values and base latencies (GF EUP/transcendental ops latency 212/204 vs GL 192/182), and in being built by a distinct constructor (sub_1C8D3740 for the GfcCycleTable instance vs GhostlitePerformanceC1 @0x1c8cbc80 for the GlcCycleTable instance). The one runtime divergence in the Xlu read is not in the grid but in the cost-table helper: GfcCycleTable::GetCyclesForThroughputHelper reads res 0x10 (16) while GlcCycleTable's reads res 0x0f (15), each indexing its own instance of the same 31-column grid layout.
This page is strictly the GhostlitePerformance grid. The companion GfcCycleTable MxuLatencyTable (the per-modifier × MxuResource reservation matrix, matmul/matpush throughput, and ComputeDmaLevels) is on MXU Latency: GF; it is referenced here and not duplicated. The single number this grid contributes to the cost model is the convolution R[2] Xlu term: grid[mxres-instr][res 0x10] = 4, completing the 6acc60406 conv cost triple that MXU Latency: GF supplies R[0]/R[1] for.
For reimplementation, the contract is:
- The GhostlitePerformance object layout (latency array + 2D grid of std::vector<int> rows), the GF allocation sizes (new 0x744 / new 0x2b98 / new 0x7c), and the 0x1d1 counts.
- The GetResourceUsage(instr, res) read path: two bounds checks, the 24-byte row stride, the direct row.data[res] read; byte-identical to the GL accessor and to its inline twin @0x1c8da1a0.
- The 31 resource columns named by occupant LLO class, the row-classification into instruction bands, and the Xlu deposit column (res 0x10 = 4).
- The 0xff-default fallback: only ~92 of 465 rows are written; the rest keep latency = 255 and are priced by the gen-invariant CycleTable::GetResource path, not the grid.
- The GF-vs-GL grid differences (row count, base latencies, the constructor split, and the per-helper Xlu res index; kResources is shared and is not one of the differences).
| Field | Value |
|---|---|
| Class | xla::ghostlite::GhostlitePerformance (GF variant; constructor unnamed/mis-symbolized) |
| Owner / allocation | GfcCycleTable this+0x10 — operator new(0x30) then sub_1C8D3740 |
| Ctor | @0x1c8d3740 — latency new 0x744 (465 int32, memset 0xff), grid new 0x2b98 (465 × 24), rows new 0x7c (31 int32) |
| Read path | GhostlitePerformance::GetResourceUsage @0x1c8d3700; inline twin @0x1c8da1a0 |
| Latency read | GhostlitePerformance::GetLatency @0x1c8d36e0 |
| Column order | GhostlitePerformance::GetResources @0x1c8d36c0 → kResources @0xb43cdc4 (31 bytes, perm of 0..30) — shared by GL and GF |
| Grid shape | 465 rows (0x1d1) × 31 resources (0x1f); 285 populated cells across 92 rows |
| Xlu deposit column | res 0x10 (16) — cell = 4 for every matrix-result / Cmem / transpose-result op (conv R[2]) |
| Default | unwritten rows: latency 0xff = 255, grid cells 0 — priced by CycleTable::GetResource, not the grid |
Object Layout and Build
Purpose
The grid is built exactly once, when GfcCycleTable is constructed. The cycle table heap-allocates two objects: the GhostlitePerformance grid at this+0x10 and the MxuLatencyTable at this+0x18. Each has its own constructor. This page covers the first.
// GfcCycleTable::GfcCycleTable @0x1c89eec0
v4 = operator new(0x30u); sub_1C8D3740(v4); // GhostlitePerformance grid → this+0x10
v6 = operator new(0xA0u); sub_1C8BB1C0(v6); // MxuLatencyTable → this+0x18
The 0x30-byte object is the six-qword GhostlitePerformance header (two {ptr, size, cap} triples); the grid body is allocated separately on the heap.
Structure
GhostlitePerformance is the shared grid-family layout (identical across PF/VF/GL/GF, only widths differ — see Performance Family Overview):
struct GhostlitePerformance { // 0x30-byte header, body on the heap
int32* latency; // +0x00 ; new 0x744 = 1860 B = 465 int32, memset 0xff
u64 latency_size; // +0x08 ; = 0x1d1 = 465
u64 latency_cap; // +0x10 ; = 0x1d1
vector<int>* grid; // +0x18 ; new 0x2b98 = 11160 B = 465 × 24
u64 grid_outer_count; // +0x20 ; = 0x1d1 = 465 (the GetResourceUsage outer bound)
u64 grid_outer_cap; // +0x28 ; = 0x1d1
// each grid row is a 24-B std::vector<int>:
// { int* data (new 0x7c = 124 B = 31 int32, zero-init), size = 0x1f = 31, cap = 0x1f }
};
The constructor sub_1C8D3740 does not decompile (Hex-Rays returns no cfunc for this 27 143-byte fill function); every fact below is read from its disassembly. The opening sequence is byte-exact:
; sub_1C8D3740 @0x1c8d3740 (head)
new(0x744) ; latency array → [rbx+0]
[rbx+0x10] = 0x1d1 ; latency cap = 465
memset(latency, 0xff, 0x744)
[rbx+0x08] = 0x1d1 ; latency size = 465
new(0x2b98) ; grid body → [rbx+0x18]
[rbx+0x28] = 0x1d1 ; grid cap = 465
loop r13 = 0x10 .. 0x2ba8 step 0x18: ; 465 rows, 24-B stride
new(0x7c) ; row data = 31 int32
vxorps + 4× vmovups ; zero-init 124 B
[row+0x10] = 0x1f ; row cap = 31
[row+0x08] = 0x1f ; row size = 31
[rbx+0x20] = 0x1d1 ; grid outer count = 465
; then 465 latency stores + 285 grid-cell stores follow
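The three allocation sizes and the loop bounds in the listing above follow from the grid shape alone. The sketch below re-derives them as a sanity check; the helper name allocation_sizes is hypothetical, and the exclusive loop end is an assumption about how the r13 range is written:

```python
INT32_BYTES = 4           # one latency entry or one grid cell
VECTOR_HEADER_BYTES = 24  # one std::vector<int> row header: {data, size, cap}


def allocation_sizes(rows: int, cols: int) -> dict:
    """Return the three operator-new sizes implied by a rows x cols grid."""
    return {
        "latency_bytes": rows * INT32_BYTES,
        "grid_header_bytes": rows * VECTOR_HEADER_BYTES,
        "row_bytes": cols * INT32_BYTES,
    }


gf_sizes = allocation_sizes(rows=0x1D1, cols=0x1F)

# The row-construction loop steps r13 from 0x10 to 0x2ba8 by 0x18: one pass per row.
loop_iterations = len(range(0x10, 0x2BA8, 0x18))

print({name: hex(size) for name, size in gf_sizes.items()}, loop_iterations)
```

The same helper applied to 476 rows gives the GL sizes 0x770 and 0x2ca0 quoted elsewhere on this page.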
NOTE: the constructor backs the cleanly-symbolized accessors xla::ghostlite::GhostlitePerformance::GetResourceUsage/GetResources/GetLatency, so the class is correctly named in the binary; only the GF-variant constructor is unnamed (sub_1C8D3740). The named ctor symbol _ZN3xla9ghostlite20GhostlitePerformanceC1Ev belongs to the GL variant (@0x1c8cbc80, 476 rows). The two are distinct functions with distinct allocation sizes (0x744/0x2b98 vs 0x770/0x2ca0) and distinct cell values; confirm them per-gen.
Store accounting
The constructor emits exactly 750 mov DWORD PTR [reg+disp], imm stores: 465 latency entries + 285 grid cells. Every latency store is guarded by cmp [rbx+8], idx; jbe <trap> and writes latency[idx] at displacement idx*4 from the array base, which fixes the latency-array index as a dense 0..464. The latency store for instruction 0x151, for example, is mov dword ptr [rax+0x544], 0x7f; since 0x544 = 0x151 * 4, it sets latency[0x151] = 127. This direct-offset addressing is the byte-level proof that the outer grid index is the LloOpcode for every priced row (the classifier GetGhostliteInstruction is an identity on these rows).
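The displacement-to-index step is plain arithmetic. A small sketch, assuming each latency store uses a displacement from the array base as in the example above; latency_index_from_disp is a hypothetical helper name:

```python
LATENCY_ENTRIES = 465
GRID_CELL_STORES = 285


def latency_index_from_disp(disp: int, entries: int = LATENCY_ENTRIES) -> int:
    """Recover the latency-array index from a store displacement in bytes."""
    if disp % 4 != 0:
        raise ValueError(f"displacement {disp:#x} is not int32-aligned")
    index = disp // 4
    if index >= entries:
        # Corresponds to the cmp [rbx+8], idx / jbe <trap> guard.
        raise IndexError(f"index {index} is outside the {entries}-entry array")
    return index


# mov dword ptr [rax+0x544], 0x7f  ->  latency[0x151] = 127
index = latency_index_from_disp(0x544)
total_stores = LATENCY_ENTRIES + GRID_CELL_STORES
print(hex(index), 0x7F, total_stores)
```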
The GetResourceUsage Read Path
Purpose
GetResourceUsage(instr, res) is the single accessor every grid reader goes through. It is byte-identical to the GL accessor and the rest of the grid family — two bounds checks, a lea-computed 24-byte row stride, a direct cell read — so one description covers GF, GL, VF, and PF.
Algorithm
// xla::ghostlite::GhostlitePerformance::GetResourceUsage @0x1c8d3700
__int64 GetResourceUsage(perf, u32 instr, u8 res) {
if (perf[0x20] <= instr) // outer bound: grid_outer_count = 465
BUG(); // ud2
row_base = perf[0x18]; // grid body ptr
v = 3 * instr; // lea rax+rax*2
if (row_base[8*v + 8] <= res) // inner bound: row.size = 31
BUG();
return *(u32*)(row_base[8*v] + 4*res); // grid[instr][res]
}
8 * (3 * instr) is the 24-byte row stride (each std::vector<int> header is 24 bytes); row_base[8*v] is the row's data pointer, row_base[8*v+8] is its size. The inline twin sub_1C8DA1A0 — reached from the cost-table throughput dispatch — is instruction-for-instruction identical.
GetLatency(instr) @0x1c8d36e0 is the simpler sibling: it compares instr against the latency size at [perf+8], traps if instr is out of range, and otherwise returns latency[instr], the pipeline depth to which the scheduler raises a true-dependency edge.
// xla::ghostlite::GhostlitePerformance::GetLatency @0x1c8d36e0
__int64 GetLatency(perf, u32 instr) {
if (perf[1] <= instr) BUG(); // latency_size = 465
return *(u32*)(perf[0] + 4*instr);
}
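The two accessors can be modeled without the pointer arithmetic. The sketch below is a hypothetical in-memory model, not the binary's layout: GridTrap stands in for the ud2 trap, Python lists stand in for the heap rows, and the 255 default follows the value this page reports for unwritten rows. Only the kVectorCmemResult row is filled, using the cells listed on this page.

```python
class GridTrap(Exception):
    """Stands in for the ud2 trap taken when a bounds check fails."""


class GhostlitePerformanceModel:
    """Illustrative model of the latency array and the 2D occupancy grid."""

    def __init__(self, rows: int = 465, cols: int = 31, default_latency: int = 255):
        self.latency = [default_latency] * rows
        self.grid = [[0] * cols for _ in range(rows)]

    def get_latency(self, instr: int) -> int:
        # Mirrors `if (latency_size <= instr) BUG();`, with instr treated as unsigned.
        if not 0 <= instr < len(self.latency):
            raise GridTrap(f"instr {instr:#x} outside latency array")
        return self.latency[instr]

    def get_resource_usage(self, instr: int, res: int) -> int:
        if not 0 <= instr < len(self.grid):   # outer bound: grid_outer_count
            raise GridTrap(f"instr {instr:#x} outside grid")
        row = self.grid[instr]                # binary: row header at grid + 24 * instr
        if not 0 <= res < len(row):           # inner bound: row.size
            raise GridTrap(f"res {res:#x} outside row")
        return row[res]


perf = GhostlitePerformanceModel()
perf.latency[0x151] = 127
for res, cycles in {16: 4, 17: 40, 18: 44, 19: 25, 20: 3}.items():
    perf.grid[0x151][res] = cycles

print(perf.get_resource_usage(0x151, 16), perf.get_latency(0x151))
```

Raising an exception instead of returning a default keeps the model's failure mode as loud as the trap in the binary.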
GOTCHA: the outer index is GhostlitePerformance::Instruction, the output of the per-gen classifier GetGhostliteInstruction @0x1c8b1740 (a binary search over a WORD table, range-seeded at 258, reached via GOT relocation), not the raw LLO opcode read off the instruction word. For the priced rows the classifier resolves to the same ordinal as the LloOpcode (proven by the latency[0x151] direct-offset store above), so the axis is the opcode for those rows; but a reimplementation that indexes the grid by the raw opcode without the classifier will mis-key the unpriced and MXU-fanned rows.
Column order — GetResources
GetResources() @0x1c8d36c0 returns the static kResources byte array @0xb43cdc4 (the lea target, edx = 0x1f = 31): the 31 resource indices in fill / traversal order. The bytes, read from the binary:
GhostlitePerformance kResources @0xb43cdc4 (31 B) — shared by GL and GF:
1d 1c 06 03 04 1a 07 08 09 0a 14 10 18 1e 0b 11 00 19 12 13 0c 15 0d 16 0e 17 1b 01 02 05 0f
This is a permutation of {0..30} (all 31 indices present, no repeats). The Xlu deposit index 0x10 is at traversal position 11, counting from 0, and the array ends in …02 05 0f, so res 0x0f is the last column visited. GetResourceLatency @0x1c8b1e60 iterates this array: it calls GetResources(), then loops GetResourceUsage(instr, kResources[i]) per column to reduce a row into a single per-instruction latency. The reduction applies a per-column overlap rule: most columns take a plain max, an MXU resource hits the fatal "Did not expected to have MXU resource for " log, and res 0x19 carries a special-cased FIFO-push/pop adjustment.
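The permutation claim can be checked mechanically from the 31 bytes quoted above. A short sketch; the helper traverse is illustrative and covers only the traversal order, not the per-column overlap rule:

```python
# The 31 kResources bytes, copied from the dump above.
K_RESOURCES = bytes.fromhex(
    "1d 1c 06 03 04 1a 07 08 09 0a 14 10 18 1e 0b 11 "
    "00 19 12 13 0c 15 0d 16 0e 17 1b 01 02 05 0f"
)

is_permutation = sorted(K_RESOURCES) == list(range(31))
xlu_position = K_RESOURCES.index(0x10)  # 0-based traversal position of the Xlu column
last_column = K_RESOURCES[-1]


def traverse(row):
    """Yield (res, cycles) pairs for a 31-int row in kResources order."""
    for res in K_RESOURCES:
        yield res, row[res]


print(is_permutation, xlu_position, hex(last_column))
```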
NOTE: there is a single xla::ghostlite::GhostlitePerformance::GetResources()::kResources symbol in the binary, at @0xb43cdc4. Both the GL (476-row) and GF (465-row) instances are the same class and return this same array from the same GetResources method; they do not have distinct per-gen permutations. The sibling classes have their own arrays at adjacent addresses: PufferfishPerformance::kResources @0xb43cd94 (20 B), ViperfishPerformance::kResources @0xb43cda8 (28 B). The address 0xb43cde3 is simply 0xb43cdc4 + 0x1f, the byte immediately past the 31-entry GhostlitePerformance array, not a second array. Bind the column order to @0xb43cdc4.
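The address arithmetic in this note can be replayed from the three addresses and sizes it quotes. A minimal sketch; the list name K_RESOURCES_ARRAYS is illustrative:

```python
# (symbol, address, size in bytes), as quoted in the note above.
K_RESOURCES_ARRAYS = [
    ("PufferfishPerformance::kResources", 0xB43CD94, 20),
    ("ViperfishPerformance::kResources", 0xB43CDA8, 28),
    ("GhostlitePerformance::kResources", 0xB43CDC4, 31),
]

# Each array ends exactly where the next one begins.
packed = all(
    addr + size == next_addr
    for (_, addr, size), (_, next_addr, _) in zip(K_RESOURCES_ARRAYS, K_RESOURCES_ARRAYS[1:])
)

# One byte past the end of the GhostlitePerformance array.
_, ghostlite_addr, ghostlite_size = K_RESOURCES_ARRAYS[-1]
end_of_ghostlite = ghostlite_addr + ghostlite_size

print(packed, hex(end_of_ghostlite))
```

With these numbers the three arrays turn out to be packed back to back, which is why 0xb43cde3 can look like the start of another array when it is only the end of this one.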
The 31 Resource Columns
Purpose
The grid's inner axis is GhostlitePerformance::Resource, a 31-value enum of intra-op EUP / cross-lane / Xlu / matrix-result micro-pipeline ports. There is no Resource::ToString in the binary, so the columns are named functionally — by reading which LLO-instruction class deposits cycles into each (the outer index being the opcode) and by anchoring against the two named accessors GetXluPathReservation and GetResourceLatency. These names are reimplementation-grade in meaning (what physical port each column reserves) but are not literal symbol names.
NOTE: this 31-wide GhostlitePerformance::Resource enum is a different, lower-level enum than the 23-slot per-bundle ResourceVector (Resource Enum). The grid prices the intra-op micro-pipeline-stage holds; the ResourceVector is the per-bundle functional-unit accumulator the higher-level cost model deposits into. kResources gives the grid's column traversal order, not the ResourceVector slot order.
The columns, by occupant LLO class
| col | role (functional name) | occupant LLO band (GF) | example cells |
|---|---|---|---|
| r0 | address / indexed load-store generation | kParameterAddress, kVectorLoadIndexed | 2 |
| r1 | vector load/store sublane-shuffle (Cmem) | kVectorLoadSublaneShuffle, kVectorLoadIndexed | 2 |
| r2 | scalar store / sync-flag (SFRF) | kScalarStore, kVectorSyncFlag{Add,Set}* | 1, 2 |
| r3..r6 | EUP transcendental-prep stages A–D | kVector{Subtract,Pack,Tanh,Pow2,Rsqrt,…} F32/Bf16 | 16/4/9/3, 20/8/13/7 |
| r7 | EUP gain-latch / AndPop setup | kVector{SigShft,Sinq,Cosq,Erf}F32AndPop | 52 |
| r8..r11 | EUP-result-pop FIFO stages A–D | kVector{Tanh,Pow2,Reciprocal,…}Bf16AndPop | 2/1/3/9, 6/5/7/13 |
| r12..r15 | cross-lane / transpose-result stages A–D | kVector{EupResult,XlaneResult}*, kVectorMultiply* | 35/34/40/18 |
| r16 | Xlu / matrix-result deposit port | kVectorCmemResult, kVectorMatresAdd, kVectorTransposeClear, kVectorMultiplyF32 | 4 |
| r17..r20 | mxres-result sub-stages A–D (cmem/matres extended) | kVectorCmemResult/MatresAdd/TransposeClear | 40/44/25/3 |
| r21..r24 | pack / extract / U64-multiply stages A–D | kVector{Xor,PseudoPack,MultiplyU64,Extract}* | 21/50/50/32 |
| r25 | scalar-move | kScalarMove | 1 |
| r26..r27 | vector-min (bf16 / u32) | kVectorMinimum{Bf16,U32} | 4, 1 |
| r28..r30 | BarnaCore scalar-sync-wait / public-access read | kVsetptstate, kBarnaCoreScalarWait{Lt,Ge,Gt,Eq,Ne}*, …SyncPublicAccessRead | 5/7/3 |
The Xlu/matrix-result deposit column (res 0x10) is the only column the convolution cost model reads through the throughput dispatch, and its cell is uniformly 4. Its role is rated CERTAIN because two independent anchors agree: the constructor disassembly stores 4 into it for every matrix-result op, and GfcCycleTable::GetCyclesForThroughputHelper passes exactly this column index (res 16) to GetResourceUsage in its Xlu cases. The named accessor GetXluPathReservation reads res 0x0f instead, matching the GL helper rather than this column.
The Xlu Deposit Column and the Throughput Read
Purpose
One column carries the matrix-result (Xlu) throughput that the convolution cost model reads as its R[2] term. On GF the cost-table helper reads res 0x10 (= 16); on GL the corresponding helper reads res 0x0f (= 15). This is a hardcoded difference between the two GetCyclesForThroughputHelper bodies, not a difference in the grid or in kResources (which are shared): each helper passes its own literal res to the same GetResourceUsage. The constructor writes 4 into the res-0x10 cell for every matrix-result / Cmem / transpose-result instruction:
; per matrix-result row in sub_1C8D3740:
mov dword ptr [row + 0x40], 4 ; row[0x40] = grid[instr][res 0x10] = 4
(0x40 = res 16 * 4 bytes.) The six convolution-relevant mxres rows write this cell at @0x1c8d7722 / 7866 / 79aa / 7bea / 7ce6 / 82ce; many more matrix-result rows in the cross-lane band (0x14d..0x158) write it too — 28 cells carry res 0x10 in total.
The cost-table binding
The GF cost path does not reach res 0x10 through LatencyTableGhostlite::GetXluPathReservation (that accessor reads res 0x0f — see the contrast below). It reaches it through the throughput dispatch GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400, whose Xlu-family cases call the inline GetResourceUsage twin with res = 16:
// GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 (Xlu/mxres cases)
case 23: instr = 344 /*0x158*/; goto xlu; // CT 0x17
case 27: instr = 351 /*0x15f*/; goto xlu; // CT 0x1b
case 28: instr = 337 /*0x151*/; goto xlu; // CT 0x1c
case 29: instr = 339 /*0x153*/; goto xlu; // CT 0x1d
case 30: instr = 341 /*0x155*/; goto xlu; // CT 0x1e
case 31: instr = 345 /*0x159*/; goto xlu; // CT 0x1f
xlu: result = sub_1C8DA1A0(perf /*GfcCycleTable+0x10*/, instr, 16); // res 0x10
So thru(CT 0x17/0x1b/0x1c/0x1d/0x1e/0x1f) = grid[instr][0x10] = 4 for all six. The six mxres rows and their full populated cells, byte-exact from the constructor:
| Instruction | LLO name | GetLatency | populated cells (res:cy) |
|---|---|---|---|
| 0x151 | kVectorCmemResult | 127 | r16:4 r17:40 r18:44 r19:25 r20:3 |
| 0x153 | kVectorMatresAdd | 101 | r16:4 r17:8 r18:11 r19:35 r20:3 |
| 0x155 | kVectorTransposeClear | 113 | r16:4 r17:4 r18:20 r19:39 r20:3 |
| 0x158 | kVectorMultiplyF32 | 127 | r12:35 r13:34 r14:40 r15:18 r16:4 |
| 0x159 | kVectorMultiplyU32 | 114 | r12:35 r13:34 r14:40 r15:18 r16:4 |
| 0x15f | kVectorXorU32 | 141 | r16:4 r21:21 r22:50 r23:50 r24:32 |
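The dispatch and the six rows above can be combined into a small lookup model. This is a sketch over the data on this page only: the dictionary names and throughput_cycles are hypothetical, and the real helper handles many more cases than the six Xlu ones shown here.

```python
XLU_RES = 0x10  # literal res index passed by the GF helper

# Xlu-family cases: cost-table case -> GhostlitePerformance::Instruction
CT_TO_INSTR = {
    0x17: 0x158, 0x1B: 0x15F, 0x1C: 0x151,
    0x1D: 0x153, 0x1E: 0x155, 0x1F: 0x159,
}

# instr -> (LLO name, latency, {res: cycles}), copied from the table above.
MXRES_ROWS = {
    0x151: ("kVectorCmemResult", 127, {16: 4, 17: 40, 18: 44, 19: 25, 20: 3}),
    0x153: ("kVectorMatresAdd", 101, {16: 4, 17: 8, 18: 11, 19: 35, 20: 3}),
    0x155: ("kVectorTransposeClear", 113, {16: 4, 17: 4, 18: 20, 19: 39, 20: 3}),
    0x158: ("kVectorMultiplyF32", 127, {12: 35, 13: 34, 14: 40, 15: 18, 16: 4}),
    0x159: ("kVectorMultiplyU32", 114, {12: 35, 13: 34, 14: 40, 15: 18, 16: 4}),
    0x15F: ("kVectorXorU32", 141, {16: 4, 21: 21, 22: 50, 23: 50, 24: 32}),
}


def throughput_cycles(ct: int) -> int:
    """Return grid[instr][0x10] for one of the six Xlu-family cost-table cases."""
    instr = CT_TO_INSTR[ct]
    _name, _latency, cells = MXRES_ROWS[instr]
    return cells.get(XLU_RES, 0)  # unwritten cells are zero in the grid


for ct in sorted(CT_TO_INSTR):
    print(hex(ct), MXRES_ROWS[CT_TO_INSTR[ct]][0], throughput_cycles(ct))
```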
CONTRAST: LatencyTableGhostlite::GetXluPathReservation @0x1c8b21c0 tail-calls GetResourceUsage(perf, GetGhostliteInstruction(v), 15), that is res 0x0f, after a direct-handled case for kVectorSetPermutePattern (opcode 0x8b = 139, returns 3·(is_transpose != 0) + 1). This LatencyTableGhostlite accessor (shared ghostlite namespace) is not the GF convolution path; the GF conv R[2] is read by GfcCycleTable::GetCyclesForThroughputHelper with res 0x10 (16), while GlcCycleTable::GetCyclesForThroughputHelper @0x1c89ed20 reads res 0x0f (15). The Xlu column index is therefore bound per cost-table helper, not per kResources array: GF helper 0x10, GL helper 0x0f, both indexing the same 31-column grid layout.
The conv R[2] term
The 6acc60406 convolution cost triple is now fully numeric. The other two terms come from the GF MXU reservation matrix:
- R[0] matpush = matpush_count · array[8] = 2 (bf16) / 4 (fp8)
- R[1] matmul = op_count · array[3] · 0.5 / Target / EPF, with array[3] = 4 (bf16) / 8 (fp8)
- R[2] Xlu = grid[mxres-instr][0x10] · ChunksPerTile · rem = 4 · ChunksPerTile · rem (dtype-independent)
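The three terms can be written out as one function. This is a sketch under stated assumptions: it reads the R[0] line as matpush_count times a per-dtype array[8] coefficient of 2 (bf16) or 4 (fp8), it treats Target and EPF as opaque divisors taken verbatim from the R[1] formula, and the names conv_cost_terms and MXU_COEFFS are hypothetical.

```python
XLU_CELL = 4  # grid[mxres-instr][0x10], dtype-independent

# Per-dtype MXU coefficients quoted above: array[8] (matpush) and array[3] (matmul).
MXU_COEFFS = {
    "bf16": {"matpush": 2, "matmul": 4},
    "fp8": {"matpush": 4, "matmul": 8},
}


def conv_cost_terms(dtype, matpush_count, op_count, target, epf, chunks_per_tile, rem):
    """Return (R[0], R[1], R[2]) for the formulas listed above."""
    coeffs = MXU_COEFFS[dtype]
    r0 = matpush_count * coeffs["matpush"]                # assumed reading of R[0]
    r1 = op_count * coeffs["matmul"] * 0.5 / target / epf
    r2 = XLU_CELL * chunks_per_tile * rem
    return r0, r1, r2


# Unit inputs expose the bare per-dtype coefficients.
print(conv_cost_terms("bf16", 1, 2, 1, 1, 1, 1))
print(conv_cost_terms("fp8", 1, 2, 1, 1, 1, 1))
```

Only R[2] comes from the grid on this page; the other two coefficients belong to the MXU reservation matrix and are included here for context.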
Row Classification and Column Population
Purpose
Only ~92 of the 465 rows are populated; the grid is sparse. The populated rows cluster into recognizable instruction bands that mirror the EUP / matrix-result micro-pipeline. This section gives the band structure (not all 465 rows — the dense bands are the EUP/transcendental and matrix-result clusters, opcodes 0x121..0x16x) and the per-column cell counts.
The bands (GF)
| band | rows (GhPerf::Instruction) | columns | latency | role |
|---|---|---|---|---|
| address / load-store / sync | 0x2d, 0x30, 0x32, 0x78..0x7f | r0..r2 | 1–3 | address gen, scalar store, sync-flag |
| EUP transcendental-prep | 0x121..0x13b | r3..r6 | 212 / 204 | per-dtype prep stages (carry {4 bf16 / 8 fp8}) |
| EUP gain-push setup | 0x140..0x143 | r7 | 1 | AndPop gain-latch setup (52) |
| EUP-result-pop FIFO | 0x144..0x14b | r8..r11 | 4 / 8 | drain stages after a transcendental |
| cross-lane / transpose result | 0x14c..0x159 | r12..r15 | 1, 114–127 | permute / rotate / multiply result stages |
| Xlu / matrix-result deposit | 0x14d..0x158 (matres ops) | r16 | — | the conv R[2] cell (= 4) |
| mxres-result sub-stages | 0x151/0x153/0x155/0x157 | r17..r20 | 101–141 | cmem / matres / transpose extended |
| pack / extract / U64 | 0x15f..0x168 | r21..r24 | 141, 145 | Xor/PseudoPack/MultiplyU64/Extract |
| tail (move / min / BarnaCore) | 0x17b, 0x1a1..0x1b6 | r25..r30 | 1, 3, 7 | scalar-move, vector-min, BarnaCore scalar-sync-wait |
Per-column cell counts (GF, total 285)
r0:2 r1:2 r2:7 r3:15 r4:27 r5:27 r6:27 r7:4 r8:8 r9:8 r10:8 r11:8
r12:11 r13:11 r14:11 r15:11 r16:28 r17:4 r18:4 r19:4 r20:4
r21:10 r22:10 r23:10 r24:10 r25:1 r26:1 r27:1 r28:4 r29:6 r30:1
The EUP-prep stages r4/r5/r6 (27 cells each) and the Xlu deposit r16 (28 cells) are the most-populated columns — the grid is overwhelmingly an EUP/transcendental + matrix-result pricing table. Note the EUP-prep column r4 carries the same {4 bf16 / 8 fp8} per-dtype magnitude as the MxuLatencyTable matmul cell array[3] (MXU Latency: GF): the two cost tables agree on the per-dtype throughput integers.
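The per-column counts above can be totalled as a quick consistency check against the 285-cell figure. A minimal sketch; the list is transcribed from the counts on this page:

```python
# Populated-cell count per column r0..r30, transcribed from the listing above.
CELLS_PER_COLUMN = [
    2, 2, 7, 15, 27, 27, 27, 4, 8, 8, 8, 8,   # r0..r11
    11, 11, 11, 11, 28, 4, 4, 4, 4,           # r12..r20
    10, 10, 10, 10, 1, 1, 1, 4, 6, 1,         # r21..r30
]

total_cells = sum(CELLS_PER_COLUMN)
busiest_column = max(range(len(CELLS_PER_COLUMN)), key=CELLS_PER_COLUMN.__getitem__)
eup_prep_cells = sum(CELLS_PER_COLUMN[3:7])  # r3..r6

print(total_cells, f"r{busiest_column}", eup_prep_cells)
```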
The 0xff-Default Fallback
Purpose
The latency array is memset to 0xff (255) before any stores. On the wide GhPerf generations (GL/GF) only the priced rows are subsequently overwritten — ~92 of 465 on GF. The other ~373 rows keep latency = 255 and their grid rows stay all-zero. Those instructions are not priced by this grid; their cost comes from a different path.
The fallback path
An instruction whose GhostlitePerformance::Instruction row is unwritten resolves its cost through the gen-invariant CycleTable::GetResource op→slot table (@0x1c89ce20, table @0xb438aec) plus the per-gen GetCyclesForThroughput, deposited into the 23-slot ResourceVector (Resource Enum); the JfCycleTable default for an unpriced op is 1 cycle. The grid is consulted only for the EUP / matrix-result / cross-lane family it actually prices.
GOTCHA: the 0xff default is a real, observable value, not a sentinel the read path filters. GetLatency(instr) for an unwritten row returns 255, a deliberately large pipeline depth that keeps an unrecognized op from being scheduled aggressively. A reimplementation that zero-inits the latency array (instead of 0xff-filling it) will under-cost every instruction the grid does not explicitly price, with no bounds error to signal the mistake. Initialize to 255.
GF vs GL — the Two GhPerf Grids
The GF (6acc60406, v7) and GL (Ghostlite, v6e) grids are the same GhostlitePerformance class with the same methods and the same single kResources array, but they are two distinct heap instances built by two distinct constructors:
| | GF (6acc60406 / TPU7x) | GL (Ghostlite / v6e) |
|---|---|---|
| Constructor | sub_1C8D3740 (unnamed) | GhostlitePerformanceC1 @0x1c8cbc80 |
| Latency alloc | new 0x744 → 465 int32 | new 0x770 → 476 int32 |
| Grid alloc | new 0x2b98 → 465 × 24 | new 0x2ca0 → 476 × 24 |
| Row width | new 0x7c = 31 | new 0x7c = 31 |
| Populated | 285 cells / 92 rows | 358 cells / 132 rows |
| kResources | @0xb43cdc4 (one shared array) | @0xb43cdc4 (same array) |
| Xlu res read by helper | res 0x10 (GfcCycleTable helper) | res 0x0f (GlcCycleTable helper) |
| EUP/transcendental latency | 212 (bf16) / 204 (fp8/F32) | 192 / 182 |
| Owner cycle table | GfcCycleTable @0x1c89eec0 | GlcCycleTable @0x1c89e7e0 |
The +11 rows GL has (476 vs 465) follow from the GL ctor's new 0x770/new 0x2ca0 (476 entries) vs GF's new 0x744/new 0x2b98 (465); the GL ctor writes latency[465] at [rax+0x744] — an index that is out of bounds for GF's 465-entry array. The extra opcodes are the v6e shift/saturate and BarnaCore-wait rows that lie past GF's last valid opcode; 6acc60406 drops them. The two share the class contract (object layout, the three accessor methods, the single kResources, the 31-wide enum, the EUP/matrix-result band structure) and the conv R[2] value of 4; what differs is the cell integers, base latencies, row count, and which res literal each cost-table helper passes to the shared GetResourceUsage.
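The row-count difference and the out-of-bounds observation both follow from the two latency allocation sizes. A short sketch, assuming 4-byte entries as stated in the layout; entries_from_alloc is a hypothetical helper:

```python
def entries_from_alloc(latency_alloc_bytes: int) -> int:
    """Number of int32 latency entries in an allocation of the given size."""
    return latency_alloc_bytes // 4


gf_rows = entries_from_alloc(0x744)
gl_rows = entries_from_alloc(0x770)
extra_gl_rows = gl_rows - gf_rows

# The GL ctor's store at [rax+0x744] targets this index.
gl_store_index = 0x744 // 4
valid_for_gl = gl_store_index < gl_rows
valid_for_gf = gl_store_index < gf_rows

print(gf_rows, gl_rows, extra_gl_rows, gl_store_index, valid_for_gl, valid_for_gf)
```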
Worked Example — a 6acc60406 bf16 Conv Xlu Deposit
A 6acc60406 bf16 convolution prices its matrix-result (Xlu) term by reading kVectorCmemResult (Instruction 0x151, the conv mxres op for CT 0x1c):
- The cost-table dispatch GfcCycleTable::GetCyclesForThroughputHelper @0x1c89f400 case 28 sets instr = 0x151, res = 16.
- GetResourceUsage(perf, 0x151, 16) bounds-checks 0x151 < 465 and 16 < 31, computes row = grid + 0x151*24, and returns row.data[16] = 4.
- The conv cost model multiplies: R[2] = 4 · ChunksPerTile · rem.
- Combined with the MXU reservation matrix (MXU Latency: GF): R[0] = 2, R[1] = 4 for bf16, so the triple is {2, 4, 4}.
The op's GetLatency(0x151) = 127 separately bounds its pipeline depth on dependency edges. For fp8 the matmul/matpush halves double to {4, 8} but the Xlu cell stays 4 — the grid cell is dtype-independent.
Function and Data Map
| Name | Address | Role |
|---|---|---|
| GfcCycleTable::GfcCycleTable | 0x1c89eec0 | allocs 0x30 perf grid + 0xa0 MxuLatency |
| GF GhostlitePerformance ctor | 0x1c8d3740 | fills 465 latency + 285 grid cells (unnamed; no decompile) |
| GhostlitePerformance::GetResourceUsage | 0x1c8d3700 | grid[instr][res]; outer [+0x20], 24-B stride, inner [row+8] |
| inline GetResourceUsage twin | 0x1c8da1a0 | byte-identical; the CT-throughput call target |
| GhostlitePerformance::GetLatency | 0x1c8d36e0 | latency[instr], bound [+0x8] |
| GhostlitePerformance::GetResources | 0x1c8d36c0 | returns kResources |
| GhostlitePerformance::…::kResources | 0xb43cdc4 | 31-byte column traversal order (shared GL/GF) |
| GfcCycleTable::GetCyclesForThroughputHelper | 0x1c89f400 | Xlu cases → GetResourceUsage(instr, 16) |
| GetGhostliteInstruction | 0x1c8b1740 | LloOpcode → GhPerf::Instruction (binary search) |
| LatencyTableGhostlite::GetResourceLatency | 0x1c8b1e60 | iterates all 31 columns, per-column overlap reduce |
| LatencyTableGhostlite::GetXluPathReservation | 0x1c8b21c0 | ghostlite accessor; reads res 0x0f, not the GF conv path |
| CycleTable::GetResource | 0x1c89ce20 | fallback op→slot table for 0xff-default rows |
Cross-References
- Performance Family Overview - the grid-family object layout, the shared GetResourceUsage read path, and the resource-count progression
- Performance: GL (GhPerf 476×31) - the Ghostlite v6e grid; +11 rows, Xlu deposit res 0x0f, base latency 192/182
- MXU Latency: GF (6acc60406) - the MxuLatencyTable reservation matrix, the conv R[0]/R[1] terms, and ComputeDmaLevels
- Resource Enum - the 23-slot per-bundle ResourceVector, distinct from the 31-wide GhostlitePerformance::Resource micro-pipeline enum
- MXU Slot - the physical MXU sub-units the matrix-result resource columns reserve