Performance: GL (GhPerf 476×31)
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped; every symbol is a demangled C++ name. .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset. Other versions will differ.
Abstract
xla::ghostlite::GhostlitePerformance is the Ghostlite (TpuVer 4, v6e) instance of the grid-family Performance object framed in performance-overview. It is one of two GhostlitePerformance constructors internally nicknamed GhPerf — the named v6e variant documented here and the mis-symbolized v7 (6acc60406) variant on performance-gf-ghperf. It prices each LLO instruction two ways: a flat latency array indexed by GhostlitePerformance::Instruction, read by GetLatency to set the depth of a true-dependency edge; and a 2D Instruction × Resource occupancy grid, read by GetResourceUsage(instr, res) to charge how many cycles the instruction holds each intra-op micro-pipeline port. The grid is the libtpu analog of an LLVM SchedMachineModel ProcResource/WriteRes table, reconstructed by decoding the constructor that fills it.
The Ghostlite grid is 476 × 31: 476 instruction rows (the v6e GhostlitePerformance::Instruction cardinality) and 31 resource columns (GhostlitePerformance::Resource, the EUP/Xlu/MXU-result micro-pipeline ports). The object layout, the GetResourceUsage read path with its two bounds checks and 24-byte row stride, and the resource count 31 are all shared with the Viperfish grid and the GF (6acc60406) twin; the constructor — _ZN3xla9ghostlite20GhostlitePerformanceC1Ev @0x1c8cbc80, the one GhPerf constructor that still carries a clean symbol — and the per-cell integers are v6e-specific. The grid does not price the MXU matmul/matpush reservation; that is a separate MxuLatencyTable (mxu-latency-gl). The GhPerf grid and the MxuLatencyTable agree on the per-dtype throughput magnitudes ({4 bf16 / 8 fp8}) but are read for different questions and live in different sub-objects of the owning GlcCycleTable.
This page documents the Ghostlite grid: the object the constructor builds; the byte-identical GetResourceUsage/GetLatency read paths; how an LLO opcode reaches a grid row through the GetGhostliteInstruction @0x1c8b1740 classifier and its MXU/permute/transpose latch-mode fan-out; the 31 resource columns described by their occupant instruction class instead of dumped cell-by-cell; the Xlu/matrix-result deposit column (GL res 0x0f) that the convolution cost model reads as its R[2] term; and the 0xff-default fallback path that the ~344 unpriced rows take. The full per-cell dumps of the populated rows are described by band, not transcribed in full.
For reimplementation, the contract is:
- The GhostlitePerformance object layout: a heap latency array (new 0x770, 476 int32, memset 0xff) and a heap 2D grid (new 0x2ca0, 476 × 24-byte std::vector<int>), each row a 31-wide zero-init vector.
- The GetResourceUsage(instr, res) and GetLatency(instr) read paths, including the two bounds checks (outer at [this+0x20], inner at [row+8]) and the 24-byte row stride.
- The opcode → row classifier GetGhostliteInstruction: the binary-search WORD remap table plus the explicit MXU/permute/transpose/compare switch that fans one opcode into many rows.
- The 31 resource columns by band, and the Xlu/matrix-result deposit column GL res 0x0f (cell = 4 for every matrix-result op) that GetXluPathReservation reads as the conv R[2] cycle.
- The 0xff-default fallback: the ~344 unpriced rows keep the memset sentinel and are priced through a separate CycleTable path, not the grid.
| Field | Value |
|---|---|
| Class | xla::ghostlite::GhostlitePerformance (internally "GhPerf", v6e variant) |
| Constructor | GhostlitePerformanceC1Ev @0x1c8cbc80 (31284 B; clean symbol; latency new 0x770, grid new 0x2ca0) |
| Read path | GhostlitePerformance::GetResourceUsage @0x1c8d3700; GetLatency @0x1c8d36e0; GetResources @0x1c8d36c0 |
| Grid shape | 476 rows (Instruction, count 0x1dc) × 31 columns (Resource); 358 populated cells across 132 rows |
| Latency array | new 0x770 = 1904 B = 476 int32; memset 0xff (255 default for ~344 unpriced rows) |
| Row classifier | ghostlite::GetGhostliteInstruction @0x1c8b1740 (WORD remap + MXU latch-mode fan-out) |
| Xlu deposit column | GL res 0x0f (15); cell = 4 for every matrix-result/cmem/transpose-result op |
| kResources order | @0xb43cdc4 (31 bytes, all of {0..30}; 0x0f last) |
| Owning CycleTable | GlcCycleTable @0x1c89e7e0 — grid at this+0x10, MxuLatencyTable at this+0x18 |
| Source file | …/jellyfish/target/ghostlite/latency_table_gl.cc (CHECK/LogFatal anchors) |
The Performance Object
Purpose
The constructor builds two heap allocations and fills only the priced cells. Everything the read path touches — the latency array, its count, the grid pointer, the outer count, and each row's std::vector<int> header — is laid out by GhostlitePerformanceC1Ev @0x1c8cbc80. The unpriced rows are deliberately left at a sentinel so a downstream caller can tell "this op is not priced by the grid" from "this op holds zero cycles".
Structure
The object is the grid-family layout shared by Pufferfish, Viperfish, and both GhPerf variants; only the widths differ (Ghostlite uses 476 rows / 31 columns):
struct GhostlitePerformance { // built by GhostlitePerformanceC1Ev @0x1c8cbc80
int32* latency; // +0x00 ; new 0x770 (476 int32), memset 0xff
u64 latency_size; // +0x08 ; = 0x1dc (476) — GetLatency bound
u64 latency_cap; // +0x10 ; = 0x1dc (476)
vector<int>* grid; // +0x18 ; new 0x2ca0 (476 × 24-byte vector<int>)
u64 grid_outer_count; // +0x20 ; = 0x1dc (476) — GetResourceUsage outer bound
u64 grid_outer_cap; // +0x28 ; = 0x1dc (476)
// each grid row (24-byte std::vector<int>): { int* data (new 0x7c = 31 int32, zero-init),
// size = 0x1f (31), cap = 0x1f (31) }
};
The constructor prologue is byte-confirmed at @0x1c8cbc80:
function GhostlitePerformanceC1Ev(this): // @0x1c8cbc80
    this.latency = new(0x770) // @0x1c8cbc96: 1904 B = 476 int32
    this.latency_cap = 0x1dc // [this+0x10] = 476
    memset(this.latency, 0xff, 0x770) // @0x1c8cbcb3: sentinel fill
    this.latency_size = 0x1dc // [this+0x08] = 476
    this.grid = new(0x2ca0) // @0x1c8cbcd2: 11424 B = 476 × 24
    this.grid_outer_cap = 0x1dc // [this+0x28] = 476
    for r in 0 .. 475: // loop bound r15 == 0x2cb0 (0x10 + 476*24)
        row.data = new(0x7c) // @0x1c8cbcf8: 124 B = 31 int32
        zero(row.data, 0x7c) // vxorps + vmovups ×4 (@0x1c8cbd02..)
        row.size = 0x1f ; row.cap = 0x1f // [row+8]=[row+0x10]=31
    this.grid_outer_count = 0x1dc // [this+0x20] = 476
    // ... then 834 DWORD-immediate stores: 476 latency entries + 358 grid cells
After the headers are set, the body issues exactly 834 mov DWORD PTR [addr], imm stores: 476 write every latency slot (the first three, latency[0]=1, latency[1]=1, latency[2]=2, are stored at @0x1c8cbd50/64/79 into [this+0x00]), and the remaining 358 write the populated grid cells. The store count being exactly 476 + 358 is the integrity check that every store was classified.
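The shape arithmetic above can be checked mechanically. The following is a minimal Python sketch, not the binary's code: `GhPerfModel`, `allocation_sizes`, and `stores_fully_classified` are hypothetical names introduced here, and the sentinel is modelled as the value 0xff that this page reports.

```python
from dataclasses import dataclass, field

NUM_INSTRUCTIONS = 0x1DC  # 476 rows (Instruction cardinality)
NUM_RESOURCES = 0x1F      # 31 columns (Resource cardinality)
SENTINEL = 0xFF           # value this page reports for an unwritten latency slot
VECTOR_HEADER_BYTES = 24  # one std::vector<int> header: data, size, cap
INT32_BYTES = 4


def _new_latency():
    # Every slot starts at the sentinel, mirroring the memset before any store.
    return [SENTINEL] * NUM_INSTRUCTIONS


def _new_grid():
    # One independent zero-filled row per instruction (no shared row objects).
    return [[0] * NUM_RESOURCES for _ in range(NUM_INSTRUCTIONS)]


@dataclass
class GhPerfModel:
    """Hypothetical stand-in for the two heap allocations the ctor builds."""
    latency: list = field(default_factory=_new_latency)
    grid: list = field(default_factory=_new_grid)


def allocation_sizes():
    """Byte sizes implied by the shape, to compare with the new() arguments."""
    return {
        "latency": NUM_INSTRUCTIONS * INT32_BYTES,
        "grid_headers": NUM_INSTRUCTIONS * VECTOR_HEADER_BYTES,
        "row": NUM_RESOURCES * INT32_BYTES,
    }


def stores_fully_classified(latency_stores, grid_stores, total=834):
    """Integrity check: every immediate store is either a latency or a grid cell."""
    return latency_stores + grid_stores == total
```

With the counts quoted above, `allocation_sizes()` should reproduce the three `new` arguments (0x770, 0x2ca0, 0x7c), and `stores_fully_classified(476, 358)` should hold.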
QUIRK — the GL constructor at
@0x1c8cbc80 carries the clean symbol _ZN3xla9ghostlite20GhostlitePerformanceC1Ev, but the GF (6acc60406) GhPerf constructor at @0x1c8d3740 is mis-symbolized in the binary (it carries the placeholder symbol sub_1C8D3740); it is the GfcCycleTable-allocated variant, structurally the same GhostlitePerformance layout (31-wide rows) with a 465-row instruction set instead of 476. The two are distinct constructors with distinct cell values and distinct base latencies (GL EUP/transcendental 192/182 vs GF 212/204), not a shared instance. See performance-gf-ghperf.
The memset-0xff default
The latency array is memset to 0xff (255) before any cell is written. Unlike Pufferfish and Viperfish, where every latency slot is subsequently overwritten and the default never survives, Ghostlite writes only the ~132 priced rows. The remaining ~344 instructions keep 0xff = 255 in the latency array, and their entire 31-wide grid row stays at the zero-init value. Those unpriced instructions are not priced by the grid: their cost comes from the gen-invariant CycleTable::GetResource path (@0x1c89ce20) or the JfCycleTable default of 1 cycle. A reimplementation must keep 255 as a distinguishable sentinel: a caller that reads a 0xff latency knows to fall back, whereas a zero would be indistinguishable from a real zero-latency op.
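A minimal Python sketch of the fallback decision follows. `priced_latency` and `fallback_cost` are hypothetical names; `fallback_cost` stands in for the separate CycleTable path and is not modelled. The sentinel is a named constant on purpose: a byte-wise memset over 32-bit slots would read back as 0xffffffff per slot, not 255, so which value a caller compares against is an assumption here.

```python
SENTINEL = 0xFF  # assumed sentinel value, as this page reports it


def priced_latency(latency, instr, fallback_cost):
    """Return the grid latency, or defer to the fallback path for unpriced rows.

    latency       -- flat list indexed by Instruction ordinal
    fallback_cost -- hypothetical callable standing in for the CycleTable path
    """
    if not 0 <= instr < len(latency):
        raise IndexError(f"instruction {instr} out of range")
    value = latency[instr]
    if value == SENTINEL:
        # The row was never written by the constructor: not priced by the grid.
        return fallback_cost(instr)
    # Any other value, including 0, is a real latency.
    return value
```

For example, with `latency = [1, SENTINEL, 0]` and a fallback that returns 1 cycle, index 1 takes the fallback while index 2 returns a genuine zero, which is the distinction the sentinel exists to preserve.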
GOTCHA — the grid is built per
GlcCycleTable, and GlcCycleTable is constructed from a tpu::TpuVersion (the only caller of this ctor besides GlcCycleTable is LatencyTableGhostlite::LatencyTableGhostlite(tpu::TpuVersion)). The 476-row Ghostlite set and the 465-row 6acc60406 set are selected by which CycleTable subclass is instantiated, not by a runtime branch inside one shared object. Do not key the grid by a single global instruction enum across generations.
The GetResourceUsage Read Path
Purpose
GetResourceUsage(instr, res) is the single accessor the throughput model and the GetXluPathReservation accessor go through to read one grid cell. It is byte-identical across Pufferfish, Viperfish, and both GhPerf variants — the same two bounds checks and the same lea-computed 24-byte row stride — which is why one description covers all four grid generations.
Algorithm
The decompiled body at @0x1c8d3700 is short and exact:
function GhostlitePerformance::GetResourceUsage(this, instr, res): // @0x1c8d3700
    if this.grid_outer_count <= instr: // [this+0x20] ; outer bound = 476
        BUG() // ud2: out-of-range instruction
    row_base = this.grid // [this+0x18]
    row = row_base + instr*24 // v4 = 3*instr ; row = base + 8*v4 → 24-byte stride
    if row.size <= res: // [row+8] ; inner bound = row width (31)
        BUG()
    return row.data[res] // *(u32*)([row+0] + 4*res) = grid[instr][res]
GetLatency(instr) @0x1c8d36e0 is the simpler sibling — latency[instr] bounded by [this+0x08] (= 476) — returning the instruction's pipeline depth (the value the scheduler raises a true-dependency edge to):
function GhostlitePerformance::GetLatency(this, instr): // @0x1c8d36e0
    if this.latency_size <= instr: // [this+0x08]
        BUG()
    return this.latency[instr] // *(u32*)([this+0x00] + 4*instr)
GetResources() @0x1c8d36c0 returns the kResources traversal order — a .rodata byte array at @0xb43cdc4 listing the 31 resource indices in fill order: 1d 1c 06 03 04 1a 07 08 09 0a 14 10 18 1e 0b 11 00 19 12 13 0c 15 0d 16 0e 17 1b 01 02 05 0f. All of {0..30} appear exactly once; the Xlu deposit column 0x0f is last in traversal order. GetResourceLatency @0x1c8b1e60 iterates this array to sum a row (see below).
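The two read paths and the traversal-order claim can be sketched in Python as follows. This is an illustrative model only: `GridTrap`, `get_resource_usage`, `get_latency`, and `is_full_permutation` are hypothetical names, and the exception stands in for the `ud2` trap. The byte string is copied from the kResources listing above.

```python
# kResources traversal order, transcribed from the 31-byte listing above.
K_RESOURCES = bytes.fromhex(
    "1d 1c 06 03 04 1a 07 08 09 0a 14 10 18 1e 0b 11 "
    "00 19 12 13 0c 15 0d 16 0e 17 1b 01 02 05 0f"
)


class GridTrap(Exception):
    """Stands in for the trap taken on an out-of-range index."""


def get_resource_usage(grid, instr, res):
    """grid[instr][res] with the outer and inner bounds checks."""
    if not 0 <= instr < len(grid):   # outer bound, [this+0x20] in the binary
        raise GridTrap(f"instruction {instr} out of range")
    row = grid[instr]                # the binary steps rows by a 24-byte stride
    if not 0 <= res < len(row):      # inner bound, [row+8] in the binary
        raise GridTrap(f"resource {res} out of range")
    return row[res]


def get_latency(latency, instr):
    """latency[instr] with its single bounds check."""
    if not 0 <= instr < len(latency):
        raise GridTrap(f"instruction {instr} out of range")
    return latency[instr]


def is_full_permutation(order, width=31):
    """True when every column index in 0..width-1 appears exactly once."""
    return sorted(order) == list(range(width))
```

Negative indices are rejected explicitly because Python would otherwise wrap them, whereas the binary's comparison treats the index as unsigned. `is_full_permutation(K_RESOURCES)` checks the statement that all of {0..30} appear exactly once.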
GOTCHA — the OUTER index is the v6e
GhostlitePerformance::Instruction, not the raw LLO opcode. Every populated row's index resolves to a coherent LLO opcode (so the axis is the opcode for the priced rows: row 0x2d = kParameterAddress, 0x77 = kScalarLoad, 0x16b = kScalarCompare, 0x1db = a barnacore-wait extension), but the mapping is the per-gen classifier GetGhostliteInstruction, and the MXU/permute band fans a single opcode out to many ordinals via a latch-mode lookup. A reimplementation that indexes the grid directly by LLO opcode will mis-read every MXU row.
How an opcode reaches a row — GetGhostliteInstruction
The OUTER index is produced by ghostlite::GetGhostliteInstruction @0x1c8b1740. It first does a branchless binary search over a 258-entry WORD remap table (a (LloOpcode, GhPerf::Instruction) pair table reached GOT-relative), and on a hit returns the paired remap word. Opcodes that miss the remap table fall through to an explicit switch that handles the bands a single opcode cannot map 1:1 — the MXU latch ops, the permute/transpose ops, and the compare op — fanning each into many rows:
function GetGhostliteInstruction(value): // @0x1c8b1740, latency_table_gl.cc
    opcode = value.opcode // WORD[value]
    // 1. Branchless binary search over the 258-entry WORD remap table (GOT-relative).
    hit = bsearch(remap_table, opcode) // 0x1c8b1762..17a1
    if hit valid && opcode >= hit.key:
        return hit.remapped_instruction // paired WORD
    // 2. Fall-through switch on the raw opcode (jt @0xb43b34c, bound opcode-1 ≤ 0xa5).
    switch (opcode):
        case 1,2,3,4: // kVectorReadIar family, keyed on iar()
            CHECK(iar.has_value()) // gl.cc:603/609/615/621
            return base ± (iar != 0) // e.g. 471 - (iar==0)
        case 139: // kVectorSetPermutePattern (0x8b)
            return set_permute_pattern_mode==0 ? 363 : 364
        case 141,143,145,147,149: // matpush latch band
            mode = latch_mode()
            CHECK(target.SupportsGainLatchMode(mode)) // gl.cc:477
            return word_B43BDF4[mode] // latch-mode → row WORD
        case 142,144,146,148,150: // matprep latch band
            return latch_table[latch_mode() - 10] // gl.cc:514
        case 155,156,163: // matmul band A, keyed on data format
            return matmul_table_A[matmul_data_format() - 1]
        case 159,160,164: // matmul band B
            return matmul_table_B[matmul_data_format() - 1]
        case 166: // transpose, keyed on vxpose_mode()
            return {0:368, 1:370, 2:372}[vxpose_mode()]
        case 359: // compare, keyed on comparison()
            return compare_table[comparison.kind][comparison.dir]
        default: LogFatal("Operation not supported") // gl.cc:761
The fall-through switch is the MXU/permute/transpose/compare fan-out. A single matmul opcode (155/156/163) maps to up to eight rows by MatmulDataFormat; a single latch opcode (141/143/145/147/149) maps to a row by GainLatchMode through word_B43BDF4[]; the permute opcode 139 (0x8b) maps to one of two rows by set_permute_pattern_mode(). This is why the grid has 476 Instruction rows but the LLO ISA has far fewer MXU opcodes: the latch/format modifiers expand them. The source file is latency_table_gl.cc, and the CHECK/LogFatal line numbers (477, 514, 603–621, 761) are the in-binary anchors.
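The two-stage lookup can be sketched in Python as below. This is a reduced model under stated assumptions: `classify` is a hypothetical name, `REMAP` is a two-entry placeholder for the 258-entry table (its real contents are not reproduced here), and only the two fan-out cases whose row numbers appear in the pseudocode (opcodes 139 and 166) are modelled.

```python
from bisect import bisect_left

# Placeholder remap table of sorted (opcode, row) pairs. The real table has
# 258 entries; these two identity pairs are assumptions for illustration.
REMAP = [(0x2D, 0x2D), (0x77, 0x77)]
REMAP_KEYS = [key for key, _ in REMAP]


def classify(opcode, *, permute_mode=None, vxpose_mode=None):
    """Map an LLO opcode (plus modifiers) to an Instruction row."""
    # Stage 1: binary search over the sorted remap keys.
    i = bisect_left(REMAP_KEYS, opcode)
    if i < len(REMAP_KEYS) and REMAP_KEYS[i] == opcode:
        return REMAP[i][1]
    # Stage 2: explicit fan-out keyed on a per-instruction modifier.
    if opcode == 139:  # kVectorSetPermutePattern (0x8b)
        return 363 if permute_mode == 0 else 364
    if opcode == 166:  # transpose, keyed on the vxpose mode
        return {0: 368, 1: 370, 2: 372}[vxpose_mode]
    # No default row: an unhandled opcode is a hard error.
    raise ValueError("Operation not supported")
```

The final `raise` mirrors the classifier's behaviour of trapping rather than returning a default, so callers cannot confuse an unclassifiable opcode with an unpriced row.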
QUIRK — the classifier traps (
LogFatal) on any opcode it does not handle rather than returning a default. So the grid OUTER index is total over the priced opcode set but partial over the LLO ISA: an opcode that is neither in the 258-entry remap table nor in the fall-through switch is a hard error, not a 0xff-default lookup. The 0xff default is the latency-array fallback for instructions that do classify to a row but whose row was never written, not a classifier fallback.
The Resource Columns
Purpose
The grid's INNER axis is the GhostlitePerformance::Resource enum: 31 intra-op EUP/MXU/Xlu micro-pipeline reservation ports. The enum has no ToString in the binary, so the columns are named functionally — by reading which LLO-instruction class deposits cycles into each one (the OUTER index being the opcode) and by anchoring against the two named accessors GetXluPathReservation and GetResourceLatency. These names are reimplementation-grade in meaning (which physical port each column reserves) but are not literal symbol names (MEDIUM).
NOTE — the per-gen
Resource enum (31 columns) is a different, lower-level enum than the 23-slot per-bundle ResourceVector (resource-enum). The grid prices the intra-op micro-pipeline-stage holds; the ResourceVector is the per-bundle functional-unit accumulator the higher-level cost model deposits into. kResources @0xb43cdc4 gives the grid's column traversal order, not the ResourceVector slot order.
The column bands
The 31 columns group into recognizable bands that track the EUP/Xlu/MXU-result micro-pipeline. The table below names each band by its occupant LLO class rather than dumping all 358 cells. Column indices are GL-specific (the GF (6acc60406) variant has a +1 shift on the result bands because its EUP-prep group is 4 columns wide vs GL's 3):
| GL cols | Band | Occupant LLO class | Typical cells |
|---|---|---|---|
| r0–r2 | Address / load-store / sync | kVectorStoreIndexed/Masked, kVectorCmemStore, kScalarLoad/Store | 2–3 |
| r3–r5 | EUP transcendental-prep | kVector{Subtract,Compose,Pack,Tanh,Pow2,Rsqrt,…} F32/Bf16 | {13,4,3} / {17,8,7} / {25,16,15} |
| r6 | EUP gain-push setup | kVectorMultiplyU32/U16 | 48 |
| r7–r10 | EUP-result-pop FIFO (4 stages) | kVector{…}Bf16AndPop, kVectorXorU32/AndU32 | {2,1,3,19} / {6,5,7,39} |
| r11–r14 | Cross-lane / transpose result (4 stages) | kScalar{Compare,AddCarry,Multiply} | {35,34,40,18} |
| r15 | Xlu / matrix-result deposit | kScalarMultiplyU32/F32, kScalarAddS32, kScalarSubtractF32 (matres) | 4 |
| r16–r19 | mxres-result sub-stages (4) | matres-result extensions (kScalar{Multiply,Add,Subtract} F32/S32) | {40,44,25,3} / {8,11,35,3} |
| r20–r23 | Pack / extract / U64 stages (4) | kScalar{Ceil,CLZ}, kVectorCLZ, extract/U64 ops | {21,25}/{48,50}/{50,54}/{32,36} |
| r24 | Mask-move | kVectorMaskMove | 1 |
| r25 | Shift / saturate (v6e-only band) | kVectorShift{RightLogical,RightArithmetic,LeftLogical}, kScalar*Max | 1 (×14) |
| r26–r27 | BarnaCore scatter-gradients | kBarnaCore{Global,Local}ScatterGradients | 5 / 4 |
| r28–r30 | BarnaCore scalar-sync-wait tail | barnacore-wait band (opcodes 0x1d0..0x1db) | 5 / 7 / 3 |
The defining column is r15 (0x0f): the Xlu / matrix-result deposit port. Its cell is 4 for every matrix-result, cmem-result, and transpose-result op — kScalarMultiplyU32 (0x16f), kScalarMultiplyF32 (0x170), kScalarAddS32 (0x172), kScalarSubtractF32 (0x174). This is the value the convolution cost model reads as its R[2] Xlu term. The full per-cell rows are not transcribed here; the structure above lets a reimplementer reconstruct any populated row from its band and its occupant class.
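As a cross-check that the band table partitions the 31 columns with no gap or overlap, here is a small Python sketch. The band labels are the functional names from the table (not symbol names), and `band_of` and `covers_all_columns` are hypothetical helpers.

```python
# (first column, last column, functional band name) for the GL grid,
# transcribed from the band table above.
BANDS = [
    (0, 2, "address / load-store / sync"),
    (3, 5, "EUP transcendental-prep"),
    (6, 6, "EUP gain-push setup"),
    (7, 10, "EUP-result-pop FIFO"),
    (11, 14, "cross-lane / transpose result"),
    (15, 15, "Xlu / matrix-result deposit"),
    (16, 19, "mxres-result sub-stages"),
    (20, 23, "pack / extract / U64 stages"),
    (24, 24, "mask-move"),
    (25, 25, "shift / saturate"),
    (26, 27, "BarnaCore scatter-gradients"),
    (28, 30, "BarnaCore scalar-sync-wait tail"),
]


def band_of(col):
    """Return the functional band name for a GL column index."""
    for first, last, name in BANDS:
        if first <= col <= last:
            return name
    raise IndexError(f"column {col} is outside the 31-wide grid")


def covers_all_columns(width=31):
    """True when the bands list each column 0..width-1 exactly once, in order."""
    listed = [c for first, last, _ in BANDS for c in range(first, last + 1)]
    return listed == list(range(width))
```

A lookup such as `band_of(0x0F)` returns the Xlu / matrix-result deposit band, the column the convolution cost model reads.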
The matrix-result / Xlu deposit column
GetXluPathReservation @0x1c8b21c0 is the dedicated accessor that reads exactly column 0x0f. It special-cases the permute-pattern opcode and otherwise tail-calls GetResourceUsage on res 15:
function LatencyTableGhostlite::GetXluPathReservation(this, value): // @0x1c8b21c0
    if value.opcode == 0x8b: // kVectorSetPermutePattern, handled directly
        return 3 * (is_transpose(value) ? 1 : 0) + 1
    instr = GetGhostliteInstruction(value)
    return GhostlitePerformance::GetResourceUsage(this.perf /*[this+0x1d0]*/, instr, 15) // res 0x0f
This decompiled body is the anchor that fixes the Xlu column at 0x0f for Ghostlite: the res = 15 literal is the third argument to GetResourceUsage, and the [this+0x1d0] (= [this + 58*8]) is the GhostlitePerformance sub-object inside the LatencyTableGhostlite. The Xlu cell (= 4) is the GL value for the conv cost triple R[2]; the matpush and matmul halves come from the separate mxu-latency-gl reservation matrix. The progression across generations is res 6 (PF, conflict-penalty) → res 0x0e (VF) → res 0x0f (GL) → res 0x10 (GF), tracking the MXU geometry.
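The accessor reduces to a special case plus one column read, which the following Python sketch illustrates. `xlu_path_reservation` is a hypothetical name, the `classify` argument stands in for GetGhostliteInstruction, and the grid is a plain list of rows.

```python
XLU_COLUMN = 0x0F              # GL Xlu / matrix-result deposit column
PERMUTE_PATTERN_OPCODE = 0x8B  # kVectorSetPermutePattern


def xlu_path_reservation(opcode, is_transpose, classify, grid):
    """Cycles reserved on the Xlu path for one instruction.

    classify -- callable mapping an opcode to a grid row (stand-in classifier)
    grid     -- list of 31-wide rows
    """
    if opcode == PERMUTE_PATTERN_OPCODE:
        # Handled directly: 4 when the pattern is a transpose, otherwise 1.
        return 3 * (1 if is_transpose else 0) + 1
    row = classify(opcode)
    return grid[row][XLU_COLUMN]
```

Passing the classifier in as a parameter keeps the sketch independent of any particular remap table; the permute-pattern branch never consults the grid at all.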
GetResourceLatency — the row-summing consumer
GetResourceLatency @0x1c8b1e60 is the consumer that turns the grid into a hazard-cycle count for an instruction pair. It calls GetResources() for the 31-column traversal order, and for each column reads GetResourceUsage(instr, res) for both instructions, accumulating a max over the columns both instructions touch:
function LatencyTableGhostlite::GetResourceLatency(this, producer, consumer): // @0x1c8b1e60
    pi = GetGhostliteInstruction(producer)
    ci = GetGhostliteInstruction(consumer)
    cols = GetResources() // kResources @0xb43cdc4, 31 entries
    result = 0
    for res in cols: // traversal order
        a = GetResourceUsage(perf, pi, res)
        if a == 0: continue
        if GetResourceUsage(perf, ci, res) == 0: continue
        switch (res):
            case 0,1,0x0f,0x18,0x1c,0x1d,0x1e: // "max" columns: take the larger hold
                result = max(result, a)
            case 0x19: // shift/saturate band: +1 for one opcode range
                result = max(result, a + (consumer.opcode-389 < 14))
            case 0x1b: // FIFO column: only if push/pop same FIFO
                if LloInstructionsPushOrPopSameFifo(producer, consumer):
                    result = max(result, a)
            case 2: LogFatal("Unimplemented") // gl.cc:1040
            case 3..0x0a, 0x1a: LogFatal("Did not expect MXU resource") // gl.cc:1066
    return result
The per-column switch is the proof that the 31 columns are not interchangeable: columns 0..1, 0x0f (Xlu), 0x18, 0x1c..0x1e are simple "take the larger hold" columns; column 0x1b is a FIFO column that only counts when both ops touch the same FIFO (LloInstructionsPushOrPopSameFifo); columns 3..0x0a and 0x1a are MXU-reservation columns that this path explicitly refuses to handle ("Did not expected to have MXU resource for ", gl.cc:1066) — those are priced by the MxuLatencyTable, not here. Column 0x19 is the v6e shift/saturate band that adds one extra cycle for a specific consumer opcode range (opcode - 389 < 14).
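The per-column dispatch can be sketched in Python as follows. Two points are assumptions rather than confirmed behaviour: the `opcode - 389 < 14` test is read as the usual unsigned range-check idiom (389 ≤ opcode ≤ 402), and columns the pseudocode does not list are treated as contributing nothing. `resource_latency` and its parameters are hypothetical names operating on two already-classified rows.

```python
MAX_COLUMNS = {0x00, 0x01, 0x0F, 0x18, 0x1C, 0x1D, 0x1E}
MXU_COLUMNS = set(range(0x03, 0x0B)) | {0x1A}
SHIFT_COLUMN = 0x19
FIFO_COLUMN = 0x1B


def resource_latency(prod_row, cons_row, order=range(31),
                     consumer_opcode=0, same_fifo=False):
    """Max hold over the columns that both rows occupy."""
    result = 0
    for res in order:
        a = prod_row[res]
        if a == 0 or cons_row[res] == 0:
            continue  # column not shared by both instructions
        if res in MAX_COLUMNS:
            result = max(result, a)
        elif res == SHIFT_COLUMN:
            # Assumed unsigned range check: one extra cycle for 389..402.
            bump = 1 if 389 <= consumer_opcode <= 402 else 0
            result = max(result, a + bump)
        elif res == FIFO_COLUMN:
            if same_fifo:
                result = max(result, a)
        elif res == 2:
            raise NotImplementedError("Unimplemented")
        elif res in MXU_COLUMNS:
            raise RuntimeError("MXU resource reached the pairwise hazard path")
        # Assumption: any other column is ignored by this sketch.
    return result
```

Because the accumulator is a max, the traversal order does not change the result in this sketch, so `order` defaults to plain column order; passing the kResources order would give the same value.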
NOTE — the
LogFatal on the MXU columns (3..0x0a, 0x1a) is a deliberate assertion, not dead code: a producer/consumer pair that both deposit into an MXU column should never reach GetResourceLatency, because the MXU hazard is computed by the separate MxuOpHoldIssues recurrence over the MxuLatencyTable. The grid columns 3..0x0a carry cells (the EUP-prep band) but are read by other paths, never by this pairwise hazard sum.
The 476-vs-465 Instruction-Set Delta
Ghostlite (v6e) has 476 Instruction rows; 6acc60406 (v7) has 465 — a +11 v6e surplus. The extra v6e opcodes surface in two places in the grid: (a) the GL-only populated r25 shift/saturate band (14 cells, the shift-right/left and scalar/vector-max overflow-saturate ops 0x19a..0x1bb, which GF either fuses or routes elsewhere — GF's r25-equivalent has a single cell); and (b) the extended BarnaCore-wait opcodes 0x1d0..0x1db, which are past the 461-entry LLO opcode-name range (the last named LLO opcode is 0x1cc = kBarnaCoreVectorStore). So the +11 v6e-extra instructions are the additional shift/saturate plus barnacore-wait variants; 6acc60406 drops them for a smaller 465-row set. The names of the 0x1d0..0x1db rows are inferred from their value pattern matching the GF barnacore-wait band (MEDIUM); they were not individually resolved.
Worked Example — the conv R[2] Xlu cell
A Ghostlite convolution cost computation reads the Xlu throughput per matrix-result op. GetXluPathReservation @0x1c8b21c0 classifies the op via GetGhostliteInstruction, then reads GetResourceUsage(perf, instr, 15). For each matres-result op the grid holds grid[instr][0x0f] = 4:
opcode kScalarMultiplyF32 (0x170) → GhPerf row 0x170, lat 123 → grid[..][0x0f] = 4
opcode kScalarAddS32 (0x172) → GhPerf row 0x172, lat 97 → grid[..][0x0f] = 4
opcode kScalarSubtractF32 (0x174) → GhPerf row 0x174, lat 109 → grid[..][0x0f] = 4
So the GL conv R[2] Xlu term is 4 · ChunksPerTile · rem. The R[0] matpush and R[1] matmul halves come from the mxu-latency-gl reservation matrix (matpush {2 bf16 / 4 fp8}, matmul {4 bf16 / 8 fp8}). The per-op base latency (123/97/109 here, and 192/182 for the EUP-prep band) is the v6e silicon's pipeline depth — lower than 6acc60406's 212/204, the visible silicon-generation divergence. The Xlu cell 4 itself is dtype-independent and matches the 6acc60406 value; only the surrounding latencies differ.
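The R[2] arithmetic is a single product, sketched below in Python. `conv_xlu_term` is a hypothetical helper; `chunks_per_tile` and `rem` are taken as opaque inputs because this page does not define how they are derived. The row table repeats the three worked rows above.

```python
XLU_CELL = 4  # grid[instr][0x0f] for every matrix-result op on GL

# Worked rows from the example above: row ordinal -> base latency.
WORKED_ROWS = {0x170: 123, 0x172: 97, 0x174: 109}


def conv_xlu_term(chunks_per_tile, rem, xlu_cell=XLU_CELL):
    """R[2] = xlu_cell * ChunksPerTile * rem, with both factors taken as given."""
    return xlu_cell * chunks_per_tile * rem
```

The base latencies in `WORKED_ROWS` do not enter the product; they are listed only to make the point that the Xlu cell is the same for all three rows while the latencies differ.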
Function Map
| Function | Address | Role |
|---|---|---|
| ghostlite::GhostlitePerformance::GhostlitePerformanceC1Ev | 0x1c8cbc80 | GL grid ctor: latency new 0x770, grid new 0x2ca0; 834 stores (476 lat + 358 grid) |
| ghostlite::GhostlitePerformance::GetResourceUsage | 0x1c8d3700 | grid read: outer bound [this+0x20], 24-byte stride, inner bound [row+8] |
| ghostlite::GhostlitePerformance::GetLatency | 0x1c8d36e0 | latency read: latency[instr], bound [this+0x08] |
| ghostlite::GhostlitePerformance::GetResources | 0x1c8d36c0 | returns kResources @0xb43cdc4 (31-byte traversal order) |
| ghostlite::GetGhostliteInstruction | 0x1c8b1740 | LLO opcode → Instruction row; WORD bsearch + MXU latch fan-out (jt @0xb43b34c) |
| ghostlite::LatencyTableGhostlite::GetXluPathReservation | 0x1c8b21c0 | reads Xlu column res 0x0f; conv R[2] accessor |
| ghostlite::LatencyTableGhostlite::GetResourceLatency | 0x1c8b1e60 | pairwise hazard sum over the 31 columns; per-column switch |
| GlcCycleTable::GlcCycleTable | 0x1c89e7e0 | owning CycleTable: grid at this+0x10, MxuLatencyTable at this+0x18 |
| CycleTable::GetResource | 0x1c89ce20 | fallback cost path for the 0xff-default unpriced rows |
Related Components
| Name | Relationship |
|---|---|
| performance-overview | the family framing: flat vs grid Performance, the shared read path, the resource progression |
| performance-gf-ghperf | the 6acc60406 (v7) GhPerf twin: 465×31, Xlu res 0x10, base latency 212/204 |
| mxu-latency-gl | the separate Ghostlite MxuLatencyTable (matmul/matpush reservation) that co-exists with this grid |
| resource-enum | the 23-slot per-bundle ResourceVector, distinct from the 31-wide GhostlitePerformance::Resource |
| slot-mxu | the physical MXU sub-units the resource columns reserve |
Cross-References
- Performance Family Overview: the grid-family object layout, the shared GetResourceUsage/GetLatency read paths, and the 7→7→20→28→31→31 resource progression
- Performance: GF (GhPerf 465×31): the 6acc60406 (v7) GhPerf twin; Xlu deposit res 0x10, base latency 212/204, and the 476-vs-465 instruction-set delta
- MXU Latency: GL (Ghostlite): the separate v6e MxuLatencyTable reservation matrix; the conv R[0]/R[1] matpush/matmul halves to this grid's R[2]
- Resource Enum (23-slot): the per-bundle ResourceVector, distinct from the per-gen Performance::Resource micro-pipeline columns
- MXU Slot: the LLO MXU instruction slot whose latch-mode/format modifiers drive the GetGhostliteInstruction fan-out