IarsPerTensorCore and the Non-MXU Cost Band

Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped — full C++ symbols). Other versions differ. Every integer below was read out of .rodata (LUTs and embedded chip-parts protos) by hand and cross-checked against the IDA decompile; status is in the Confidence columns.

Abstract

This page consolidates three per-generation cost-model quantities that share one root: a count and a classifier surface, all keyed by the same CycleTable::Instruction enum the MXU throughput tables use. First, the IarsPerTensorCore value — the count of Index-Add Registers (IAR0/IAR1) per TensorCore that bounds the matpush/indexed-load addressing path and feeds the cost model. Second, the non-MXU CycleTable band (Instruction 0x11..0x20) on Jellyfish/Dragonfish — the vector-ALU, cross-lane (Xlu), and EUP cost cells whose values are pinned (on MXU Latency: JF / DF) but whose producer is documented here. Third, the Viperfish matprep-stage ordinal binding — the mechanism that turns a matprep opcode into a reservation array.

The unifying fact is that IarsPerTensorCore is not a code constant. It is DWORD[Target+0x4a8], written once by the shared Target::Init from a VectorIsa field of the embedded per-generation *_chip_parts.binarypb proto, loaded at runtime via the embed:// filesystem. Extracted from the embedded blobs, the value is 2 on every generation — the IAR file is gen-stable (IAR0/IAR1), matching the 1-bit IarField ceiling in the ISA encoding. The version pin above is therefore necessary but not sufficient: a chip whose proto differs would change the count.

For reimplementation, the contract is:

  • IarsPerTensorCore = DWORD[Target+0x4a8], accessor Target::IarsPerTensorCore (0x1d617280); sole writer Target::Init (0x1d60fc20) from vector_isa() field 7; value 2 all gens.
  • The JF/DF non-MXU band Instruction 0x11..0x20 is emitted by the HLO-level cost model (CostModel::RecordHloCycles + cost_model_util::Record*) as direct ordinal immediates keyed on HLO opcode / PrimitiveType / transfer role — there is no LLO→Instruction classifier for it (the MXU classifier CycleTableInstruction covers only 0x00..0x10).
  • The VF matprep stages do not carry standalone classifier ordinals: the matprep opcodes 0x97..0x9a FATAL in GetViperfishInstruction, and the reservation is produced by MxuLatencyTable::GetResourceUsage keyed on a MatpushModifier{MatmulDataFormat, is_transpose, Msr}.

| Item | Value |
| --- | --- |
| IarsPerTensorCore | DWORD[Target+0x4a8] = 2 all gens (IAR0/IAR1, gen-stable) |
| Accessor | Target::IarsPerTensorCore 0x1d617280 (return *(uint32*)(this+0x4a8)) |
| Sole writer | Target::Init 0x1d60fc20 from TpuSequencerParts::vector_isa() field 7 |
| Proto source | embed://tpu_chip_parts/<version>_chip_parts.binarypb, VectorIsa.f7 |
| ISA ceiling | 1-bit IarField (0x1ee3b380, >>13 & 1) ⇒ 2 IARs |
| Non-MXU band | CycleTable::Instruction 0x11..0x20 (vector-ALU / Xlu / EUP) |
| Non-MXU producer | CostModel::RecordHloCycles 0x130bbfe0 + cost_model_util::Record* (direct ordinal immediates) |
| VF matprep binding | MxuLatencyTable::GetResourceUsage 0x1c8ae5c0, FlatHashMap<MatpushModifier, array<int,19>> |
| VF matprep opcodes | 0x97..0x9a FATAL in GetViperfishInstruction 0x1c8a3300 |

IarsPerTensorCore — Source and Per-Gen Value

The Accessor and the Sole Writer

IarsPerTensorCore is a one-instruction accessor reading offset 0x4a8 of the Target object (298 * 4 = 0x4a8):

// xla::jellyfish::Target::IarsPerTensorCore  @ 0x1d617280  (verified)
__int64 Target::IarsPerTensorCore(this) {
    return *((unsigned int*)this + 298);          // DWORD[Target + 0x4a8]
}

The sole writer of Target+0x4a8 is the shared Target::Init — no per-generation *Target constructor writes it. In the register-count block of Init, the field is read from the VectorIsa C++ sub-struct (embedded at TpuSequencerParts+0x1c), gated by the VectorIsa-present byte:

// xla::jellyfish::Target::Init  @ 0x1d60fc20  (verified — the +0x4a8 store)
v93 = TpuSequencerParts::vector_isa(seqparts);    // = &seqparts[+0x1c]
if ( *(uint8_t*)(v93 + 0x18) == 1 ) {             // VectorIsa-present gate (seqparts+0x34)
    v94 = *(uint32_t*)(v93 + 0x14);               // IarsPerTensorCore  (seqparts+0x30 = VectorIsa.f7)
    *((uint32_t*)Target + 298) = v94;             // store at Target+0x4a8   ← the sole writer
    *((uint64_t*)Target + ...) = *(uint64_t*)(v93 + 0xc);   // adjacent MxusPerTensorCore qword
}

TpuSequencerParts::FromProto packs the proto VectorIsa submessage (six int32 fields, proto field numbers 2..7) into the C++ struct, with field 7 landing at struct +0x30 → Target+0x4a8. So IarsPerTensorCore is the VectorIsa field-7 proto value, loaded at runtime by TpuChipParts::DefaultsForVersion via tsl::ReadBinaryProto("embed://tpu_chip_parts/<version>_chip_parts.binarypb").
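The offset arithmetic and the present-byte gate can be checked with a small sketch. The offsets are the ones quoted from the decompile; the type names `VectorIsaSketch` and `TargetSketch` are hypothetical, and the zero default for an absent VectorIsa is an assumption, since this page does not say what the binary leaves in the field in that case.

```cpp
#include <cassert>
#include <cstdint>

// Offsets quoted from the Target::Init decompile above.
constexpr std::uint32_t kVectorIsaInSeqParts = 0x1c;  // TpuSequencerParts -> VectorIsa
constexpr std::uint32_t kIarCountInVectorIsa = 0x14;  // VectorIsa -> field 7
constexpr std::uint32_t kPresentInVectorIsa  = 0x18;  // VectorIsa -> present byte
constexpr std::uint32_t kIarsDwordIndex      = 298;   // (uint32*)Target + 298

static_assert(kVectorIsaInSeqParts + kIarCountInVectorIsa == 0x30,
              "field 7 sits at seqparts+0x30");
static_assert(kVectorIsaInSeqParts + kPresentInVectorIsa == 0x34,
              "present byte sits at seqparts+0x34");
static_assert(kIarsDwordIndex * sizeof(std::uint32_t) == 0x4a8,
              "dword index 298 is byte offset 0x4a8");

// Hypothetical mirror of the packed sub-struct; only two fields are modelled.
struct VectorIsaSketch {
    std::uint32_t iar_count;  // proto field 7
    bool present;             // the VectorIsa-present byte
};

class TargetSketch {
public:
    // Mirrors the gate in Init: copy the count only when VectorIsa is present.
    void Init(const VectorIsaSketch& isa) {
        if (isa.present) iars_per_tensor_core_ = isa.iar_count;
    }
    std::uint32_t IarsPerTensorCore() const { return iars_per_tensor_core_; }

private:
    // Stands in for DWORD[Target+0x4a8]; the zero default is an assumption.
    std::uint32_t iars_per_tensor_core_ = 0;
};
```

Calling `Init` with `{2, true}` makes the accessor return 2; with `present == false` the field keeps its default in this sketch.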

The Extracted Per-Gen Values

The per-gen blobs are embedded uncompressed in .rodata. Parsing the VectorIsa submessage out of each blob (the f2=128, f3=8 prefix locates it; f7 is the IAR count) gives, byte-exact:

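| Gen (codename) | lane_count f2 | sublane_count f3 | mxu_count f5 | xlu_count f6 | iar_count f7 = IarsPerTensorCore |
| --- | --- | --- | --- | --- | --- |
| v2 Jellyfish | 128 | 8 | 1 | 1 | 2 |
| v3 Dragonfish | 128 | 8 | 2 | 1 | 2 |
| v4 Pufferfish | 128 | 8 | 4 | 2 | 2 |
| v5p Viperfish | 128 | 8 | 4 | 3 | 2 |
| v6e Ghostlite | 128 | 8 | 2 | 2 | 2 |
| v7 6acc60406 | 128 | 8 | 2 | 2 | 2 |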
So IarsPerTensorCore = 2 on every generation — the IAR file does not grow v2→v7. This is exactly consistent with the ISA-encoding ceiling: the SetIar slot's IarField accessor (0x1ee3b380) is >>13 & 1, a single bit, which can address only IAR0/IAR1. The VectorIsa field names are not inferred — TpuSequencerParts::FromProto (0x20b30700) validates each by name ("Invalid lane_count in vector_isa field", "Invalid sublane_count …", "Invalid mxu_count …", "Invalid xlu_count …", "Invalid iar_count …"), so f2=lane_count=128 (constant across all gens), f5=mxu_count (1/2/4/4/2/2), and f6=xlu_count (1/1/2/3/2/2) are CONFIRMED, not merely positional.
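The one-bit ceiling can be written out as a short sketch. The `>>13 & 1` shape is the accessor quoted above; the 64-bit width of the slot word and the helper `IarIndexValid` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Accessor shape quoted above: a single bit at position 13 of the slot word.
// The 64-bit word type is an assumption of this sketch.
constexpr unsigned IarField(std::uint64_t slot_word) {
    return static_cast<unsigned>((slot_word >> 13) & 1u);
}

constexpr unsigned kIarFieldBits = 1;
constexpr unsigned kMaxAddressableIars = 1u << kIarFieldBits;  // IAR0 and IAR1

// Hypothetical bound check: an index must fit both the proto-supplied count
// and the encoding ceiling.
inline bool IarIndexValid(unsigned iar, unsigned iars_per_tensor_core) {
    return iar < iars_per_tensor_core && iar < kMaxAddressableIars;
}
```

Even if a chip-parts proto reported a larger count, a one-bit field could still name only two registers, which is why the second condition is kept separate from the first.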

NOTE — this value is data, not code. The accessor and the 1-bit IarField are immutable code; the value is a runtime proto field. The "2 all gens" claim holds for this binary's embedded chip-parts blobs only. A reimplementation must read it from the per-gen proto, not bake it in. See Chip-Parts Binarypb and ../isa/slot-matprep-iar-latch.md for the IAR value-field layout.
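A reimplementation that reads the count from the proto needs only a varint walk over the VectorIsa submessage. The following is a minimal sketch, assuming the fields are encoded as ordinary protobuf varints (wire type 0); under that assumption the f2=128, f3=8 prefix is the byte sequence 10 80 01 18 08. The sample bytes are synthesized from the Viperfish row of the table, not copied from the binary, and omit field 4 and any outer message framing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Decode one base-128 varint starting at pos; returns false on truncation.
inline bool ReadVarint(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                       std::uint64_t& out) {
    out = 0;
    for (unsigned shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
        const std::uint8_t b = buf[pos++];
        out |= static_cast<std::uint64_t>(b & 0x7f) << shift;
        if ((b & 0x80) == 0) return true;  // last byte of this varint
    }
    return false;
}

// Parse a flat message of varint fields into field_number -> value.
// Returns false on a non-varint wire type or truncated input.
inline bool ParseVarintFields(const std::vector<std::uint8_t>& buf,
                              std::map<std::uint32_t, std::uint64_t>& fields) {
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t tag = 0;
        std::uint64_t value = 0;
        if (!ReadVarint(buf, pos, tag)) return false;
        if ((tag & 7) != 0) return false;  // only wire type 0 is handled
        if (!ReadVarint(buf, pos, value)) return false;
        fields[static_cast<std::uint32_t>(tag >> 3)] = value;
    }
    return true;
}

// Synthetic Viperfish VectorIsa bytes: f2=128, f3=8, f5=4, f6=3, f7=2.
inline std::vector<std::uint8_t> SyntheticViperfishVectorIsa() {
    return {0x10, 0x80, 0x01, 0x18, 0x08, 0x28, 0x04, 0x30, 0x03, 0x38, 0x02};
}
```

After a successful parse, `fields[7]` is the value to store as IarsPerTensorCore. A production reader would use the generated proto classes rather than a hand-rolled walk.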

The Consolidated Per-Gen Register-File Table

IarsPerTensorCore is the last column of the per-gen register-file block at Target+0x498..+0x4a8, all written by the same Target::Init from the same chip-parts blobs. The four register counts are re-derived byte-exact from the embedded protos:

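| Gen | SREG +0x498 | VREG +0x49c | VMREG +0x4a0 | PREG +0x4a4 | IARS +0x4a8 |
| --- | --- | --- | --- | --- | --- |
| v2 Jellyfish | 32 | 32 | 8 | 15 | 2 |
| v3 Dragonfish | 32 | 32 | 8 | 15 | 2 |
| v4 Pufferfish | 32 | 32 | 8 | 15 | 2 |
| v5p Viperfish | 32 | 64 | 16 | 14 | 2 |
| v6e Ghostlite | 32 | 64 | 16 | 14 | 2 |
| v7 6acc60406 | 32 | 64 | 16 | 14 | 2 |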

(Field offsets byte-exact from Target::Init + RegisterCount; values from the embedded chip-parts protos. The SREG/VREG/VMREG/PREG order follows Init's seq-0 type dispatch.)
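The block layout can be expressed as a struct whose field offsets are checked at compile time. This is a sketch: `RegisterFileBlock`, the `Gen` enumerators, and `RegisterFileFor` are hypothetical names, and the values are transcribed from the table above rather than read from a proto.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Five consecutive dwords, placed at Target+0x498 in the binary.
struct RegisterFileBlock {
    std::uint32_t sreg;
    std::uint32_t vreg;
    std::uint32_t vmreg;
    std::uint32_t preg;
    std::uint32_t iars;
};

constexpr std::size_t kBlockBase = 0x498;
static_assert(kBlockBase + offsetof(RegisterFileBlock, vreg) == 0x49c, "VREG");
static_assert(kBlockBase + offsetof(RegisterFileBlock, vmreg) == 0x4a0, "VMREG");
static_assert(kBlockBase + offsetof(RegisterFileBlock, preg) == 0x4a4, "PREG");
static_assert(kBlockBase + offsetof(RegisterFileBlock, iars) == 0x4a8, "IARS");

// Hypothetical generation index, in table order.
enum class Gen { kJellyfish, kDragonfish, kPufferfish, kViperfish, kGhostlite, kV7 };
constexpr std::size_t kNumGens = 6;

// Values transcribed from the table above (this binary's blobs only).
constexpr RegisterFileBlock kRegisterFiles[kNumGens] = {
    {32, 32, 8, 15, 2},   // v2 Jellyfish
    {32, 32, 8, 15, 2},   // v3 Dragonfish
    {32, 32, 8, 15, 2},   // v4 Pufferfish
    {32, 64, 16, 14, 2},  // v5p Viperfish
    {32, 64, 16, 14, 2},  // v6e Ghostlite
    {32, 64, 16, 14, 2},  // v7
};

inline const RegisterFileBlock& RegisterFileFor(Gen g) {
    const std::size_t i = static_cast<std::size_t>(g);
    assert(i < kNumGens);
    return kRegisterFiles[i];
}
```

The hard-coded array is for cross-checking only; as noted earlier, a reimplementation should populate the block from the per-gen proto.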


The JF/DF Non-MXU CycleTable Band (Instruction 0x11..0x20)

The Producer

The MXU classifier CycleTableInstruction (0x1c89ca80) maps only the MXU band — matmul opcodes 0x8d..0x96 and matprep/matpush opcodes 0x9b..0xa5 — into CycleTable::Instruction 0x00..0x10. There is no LLO→Instruction classifier for ordinals 0x11..0x20. Those ordinals are emitted by the HLO-level cost model as direct ordinal immediates (mov esi, 0xNN) keyed on HLO opcode, PrimitiveType, or memory-transfer role, then passed to GetCyclesForThroughput (JfCycleTable vtable +0x10) for the cycle value and ResourceVector::Acc(Resource, cycles) for the accumulator slot:

  • CostModel::RecordHloCycles (0x130bbfe0) — per-HLO entry; switches on HloOpcode and emits 0x12/0x13/0x14 for elementwise vector ops.
  • cost_model_util::Record*: RecordPackAndStCycles (0x13844740, 0x16), RecordBroadcastSublaneChunkCycles (0x138448c0, 0x17/0x1b), RecordMemXferCyclesImpl (0x13844e80); Broadcast::RecordCostCycles (0x136e6940, 0x17); the Record{Conv,DepthwiseConv}KernelCycles / RecordConvolutionCycles / SpatialMajorConvolution::CalculateWindowMxuCycles emitters (0x16/0x19/0x1c/0x1f).
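The emit-then-accumulate pattern can be sketched as follows. All type and function names here are hypothetical stand-ins; the slot numbering R[2]..R[6] and the two cells (0x17 priced at 8 cycles on Xlu, 0x12 priced at 1 cycle on VectorAlu1) are the values quoted elsewhere on this page, and the total of seven slots is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// ResourceVector slots R[2]..R[6]; the total slot count is an assumption.
enum Resource : std::size_t {
    kXlu = 2,
    kVectorAlu0 = 3,
    kVectorAlu1 = 4,
    kVectorAluAny = 5,
    kVectorEup = 6,
    kNumResources = 7
};

struct ResourceVectorSketch {
    std::array<double, kNumResources> r{};  // zero-initialised accumulators
    void Acc(Resource res, double cycles) {
        assert(res < kNumResources);
        r[res] += cycles;
    }
};

// Stand-in for the cycle table; only the two cells traced on this page are
// filled in, everything else returns 0 in this sketch.
struct CycleTableSketch {
    double GetCyclesForThroughput(unsigned instr) const {
        switch (instr) {
            case 0x12: return 1.0;  // elementwise vector-ALU cell
            case 0x17: return 8.0;  // broadcast / cross-lane cell
            default:   return 0.0;
        }
    }
};

// The producer passes the ordinal as a literal; no opcode classifier runs.
inline void RecordBroadcast(const CycleTableSketch& t, ResourceVectorSketch& rv) {
    rv.Acc(kXlu, t.GetCyclesForThroughput(0x17));
}

inline void RecordElementwise(const CycleTableSketch& t, ResourceVectorSketch& rv) {
    rv.Acc(kVectorAlu1, t.GetCyclesForThroughput(0x12));
}
```

One broadcast followed by two elementwise ops leaves 8 cycles in the Xlu slot and 2 in the VectorAlu1 slot.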

The Band Table

The offsetLUT (0xb438b70) and resLUT (0xb438aec) were re-read byte-exact; the cycle value is the priced PerformanceJf cell under the valid mask 0x19FFC0821. JF == DF for the entire band (none of these offsets is the DF-override +0x28/+0x2c).

| Instr | Res | ResourceVector | priced | PerfOff | JF/DF cyc | emitting context |
| --- | --- | --- | --- | --- | --- | --- |
| 0x11 | r6 | VectorEup | no | 0x0 | 1 | unpriced (SparseCore decomposer) |
| 0x12 | r4 | VectorAlu1 | yes | 0x33c | 1 | elementwise vector-ALU (RecordHloCycles) |
| 0x13 | r4 | VectorAlu1 | yes | 0x340 | 1 | elementwise vector-ALU (RecordHloCycles) |
| 0x14 | r3 | VectorAlu0 | yes | 0x344 | 1 | vector-ALU lane0 (reduction-fn) |
| 0x15 | r5 | VectorAluAny | yes | 0x39c | 1 | vector-ALU any (sublane reduce-window) |
| 0x16 | r5 | VectorAluAny | yes | 0x398 | 1 | pack-and-store / reduce-window / conv-kernel |
| 0x17 | r2 | Xlu | yes | 0x954 | 8 | broadcast / cross-lane (Broadcast::RecordCostCycles) |
| 0x18 | r6 | VectorEup | yes | 0x3f8 | 1 | EUP / transcendental (RecordHloCycles) |
| 0x19 | r5 | VectorAluAny | yes | 0x368 | 1 | vector-ALU any (RecordConvolutionCycles) |
| 0x1a | r6 | VectorEup | yes | 0x3f4 | 1 | EUP / transcendental (RecordHloCycles) |
| 0x1b | r2 | Xlu | yes | 0x960 | 8 | cross-lane reduce-window |
| 0x1c | r2 | Xlu | yes | 0x94c | 8 | conv-kernel cross-lane |
| 0x1d | r2 | Xlu | no | 0x0 | 1 | unpriced |
| 0x1e | r2 | Xlu | no | 0x0 | 1 | unpriced |
| 0x1f | r2 | Xlu | yes | 0x958 | 8 | spatial-conv window-MXU |
| 0x20 | r5 | VectorAluAny | yes | 0x39c | 1 | vector-ALU any (shares 0x39c with 0x15) |

The four 8-cycle Xlu cells (0x17/0x1b/0x1c/0x1f) are the cross-lane throughput ports (broadcast-sublane, reduce-window, conv-kernel-window, spatial-conv-window). The 1-cycle cells are vector-ALU (0x12..0x16, 0x19, 0x20) or VectorEup (0x18/0x1a). The Res column is the ResourceVector slot R[2]..R[6] (Xlu / VectorAlu0 / VectorAlu1 / VectorAluAny / VectorEup).

QUIRK — Instr 0x15 and 0x20 share offset 0x39c. Both read the same VectorAluAny cell (value 1); the producer chooses which ordinal to emit (sublane-reduce vs reshape/sparse-core) but they price identically. A reimplementation that assumes distinct offsets per ordinal will allocate a redundant cell.
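The priced column and the shared-offset quirk can both be checked mechanically. The sketch below assumes the valid mask carries one bit per Instruction ordinal; the source only names it a "valid mask", but reading it bitwise reproduces the priced column for 0x11..0x20. `BandCell` and the helper names are hypothetical, and the rows are transcribed from the band table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::uint64_t kValidMask = 0x19FFC0821ULL;

// Assumption: bit N of the mask set means ordinal N is priced.
constexpr bool IsPriced(unsigned instr) {
    return instr < 64 && ((kValidMask >> instr) & 1u) != 0;
}

struct BandCell {
    unsigned instr;             // CycleTable::Instruction ordinal
    unsigned resource_slot;     // ResourceVector slot R[n]
    std::uint32_t perf_offset;  // offset into PerformanceJf
};

// Priced rows of the band table above.
constexpr BandCell kPricedBand[] = {
    {0x12, 4, 0x33c}, {0x13, 4, 0x340}, {0x14, 3, 0x344}, {0x15, 5, 0x39c},
    {0x16, 5, 0x398}, {0x17, 2, 0x954}, {0x18, 6, 0x3f8}, {0x19, 5, 0x368},
    {0x1a, 6, 0x3f4}, {0x1b, 2, 0x960}, {0x1c, 2, 0x94c}, {0x1f, 2, 0x958},
    {0x20, 5, 0x39c},
};

// Number of distinct PerformanceJf cells the priced band actually touches.
inline std::size_t DistinctOffsets() {
    std::set<std::uint32_t> seen;
    for (const BandCell& c : kPricedBand) seen.insert(c.perf_offset);
    return seen.size();
}
```

Thirteen priced ordinals map to twelve distinct cells, so keying storage by offset instead of by ordinal avoids the redundant allocation.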


The Viperfish Matprep-Stage Ordinal Binding

Matprep Opcodes FATAL in the Classifier

On Viperfish the matprep opcodes (0x97..0x9a) are not classified into a standalone Instruction ordinal — they hit the FATAL arm of GetViperfishInstruction (0x1c8a3300, opcode jump table 0xb43a104, default arm 0x1c8a3e6a). The matmul opcode 0x9b instead reads matmul_data_format() and indexes the VF matmul-format WORD table @0xa2d05c0, read byte-exact as {0xd4, 0xda, 0xf8, 0xfe, 0xe0, 0xe6, 0xec, 0xf2} (fmt 1..8 = f32/bf16/fp8e5m2/fp8e4m3/u8/s8/u4/s4):

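| MatmulDataFormat | dtype | VF instr ordinal (@0xa2d05c0) |
| --- | --- | --- |
| 1 | f32 | 0xd4 |
| 2 | bf16 | 0xda |
| 3 | f8e5m2→bf16 | 0xf8 |
| 4 | f8e4m3→bf16 | 0xfe |
| 5 | u8 | 0xe0 |
| 6 | s8 | 0xe6 |
| 7 | u4 | 0xec |
| 8 | s4 | 0xf2 |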
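The format-to-ordinal lookup is a plain indexed read of the WORD table, with format 1 at index 0. A minimal sketch follows; the function name is hypothetical, and the range assertion is an assumption, since this page does not say how the binary handles an out-of-range format.

```cpp
#include <cassert>
#include <cstdint>

// WORD table contents as quoted above (fmt 1..8).
constexpr std::uint16_t kVfMatmulFormatOrdinal[8] = {
    0xd4,  // 1: f32
    0xda,  // 2: bf16
    0xf8,  // 3: f8e5m2
    0xfe,  // 4: f8e4m3
    0xe0,  // 5: u8
    0xe6,  // 6: s8
    0xec,  // 7: u4
    0xf2,  // 8: s4
};

// Equivalent of reading WORD[table + (fmt - 1) * 2].
inline std::uint16_t VfMatmulOrdinal(unsigned matmul_data_format) {
    assert(matmul_data_format >= 1 && matmul_data_format <= 8);
    return kVfMatmulFormatOrdinal[matmul_data_format - 1];
}
```

`VfMatmulOrdinal(2)` yields 0xda, the bf16 pure-matmul ordinal.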

The Reservation-Table Producer

The matprep stages are produced by MxuLatencyTable::GetResourceUsage (0x1c8ae5c0), which VfCycleTable::GetCyclesForThroughput (0x1c89e2c0) wraps — confirmed in the decompile, where case 0 calls viperfish::MxuLatencyTable::GetResourceUsage(&result, ..., 212, 3, 0). GetResourceUsage dispatches on the Resource arg and the matmul ordinal, builds a Modifier key, and looks it up in a FlatHashMap<Modifier, std::array<int,19>> — the 19-entry array is indexed by MxuResource (SetReservations<...> writes array[MxuResource]=val with bound 0x13 = 19). The matprep 4-stage systolic-feed occupancy (r4..r7) is four of those nineteen MxuResource ports.

The four modifier keys, each built by SetReservations<...> in the ViperfishPerformance constructor:

| Modifier | key fields | role |
| --- | --- | --- |
| MatmulModifier | {MatmulDataFormat} | the pure matmul step |
| MatpushModifier | {MatmulDataFormat, is_transpose, Msr} | the matprep/matpush stages |
| MatresModifier | (matmul-result key) | matmul-result stages |
| VlxmrModifier | (vector-load matrix-result key) | vector-load matrix-result |

The MatpushModifier is the matprep-opcode → reservation binding: a matprep stage's reservation is keyed by {MatmulDataFormat × transpose-of-gains × matpush-Msr-stage} → a 19-entry MxuResource reservation array. The matprep opcode carries a Modifier that indexes the table, not a standalone ordinal. This is the VF realization of the same job JF folds into a flat offset-LUT cell. See Matmul-Mode Modifiers and MXU Latency: VF.
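The modifier-keyed reservation table can be sketched with standard containers. Here `std::unordered_map` stands in for the binary's FlatHashMap, the field types and the hash function are assumptions, and all `*Sketch` names are hypothetical; only the three key fields and the 19-entry array bound come from the text above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

constexpr std::size_t kNumMxuResources = 19;  // bound 0x13 quoted above
using Reservation = std::array<int, kNumMxuResources>;

// Key fields as listed in the table; the concrete types are assumptions.
struct MatpushModifierSketch {
    int matmul_data_format;
    bool is_transpose;
    int msr;
    bool operator==(const MatpushModifierSketch& o) const {
        return matmul_data_format == o.matmul_data_format &&
               is_transpose == o.is_transpose && msr == o.msr;
    }
};

// Illustrative hash only; the binary's hash is not described on this page.
struct MatpushModifierHash {
    std::size_t operator()(const MatpushModifierSketch& m) const {
        return std::hash<int>()(m.matmul_data_format * 64 + m.msr * 2 +
                                (m.is_transpose ? 1 : 0));
    }
};

class MatpushTableSketch {
public:
    // Analogue of SetReservations: write one MxuResource entry for a key.
    void SetReservation(const MatpushModifierSketch& key,
                        std::size_t mxu_resource, int value) {
        assert(mxu_resource < kNumMxuResources);
        table_[key][mxu_resource] = value;  // new rows start zero-filled
    }
    // Returns the 19-entry row for a key, or nullptr if none was registered.
    const Reservation* Find(const MatpushModifierSketch& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<MatpushModifierSketch, Reservation, MatpushModifierHash> table_;
};
```

The point of the structure is that the lookup key is the modifier tuple, so two matprep stages that differ only in transpose or Msr resolve to different reservation rows without any per-opcode ordinal.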

QUIRK — the matprep cost representation migrated across gens. JF folds the transpose-of-gains into the matmul opcode (flat cell); PF folds matprep into the Latch ops; VF FATALs the matprep opcodes and uses the MatpushModifier-keyed array<19> reservation; GL/GF give each matprep variant a fixed binary-search perf row (with a smaller array<11> reservation). A single matprep cost model across gens mis-prices four of the five. See ../isa/slot-matprep-iar-latch.md.


Worked Examples

Three end-to-end traces, each exercising one of the three surfaces above.

IarsPerTensorCore on Viperfish

A Viperfish device constructs a ViperfishTarget, whose base chain calls the shared Target::Init (0x1d60fc20). By that point Init has already obtained the TpuSequencerParts built from viperfish_chip_parts.binarypb (loaded via embed://). It reads the present byte at vector_isa()+0x18 (value 1), then reads vector_isa()+0x14 = VectorIsa.f7 = 2 and stores it at Target+0x4a8. A later IarsPerTensorCore() returns 2, and the IAR-setter helper bounds iar_value < 2 (IAR0/IAR1). No ViperfishTarget constructor writes +0x4a8 itself; the value is purely the proto field.

A JF/DF Broadcast Cost

An HLO broadcast routes through CostModel::RecordHloCycles to Broadcast::RecordCostCycles (0x136e6940), which emits the ordinal immediate Instr 0x17 and calls GetCyclesForThroughput(0x17). That reads PerformanceJf[offsetLUT[0x17]] = PerformanceJf[0x954] = 8 cycles, then ResourceVector::Acc(r2 = Xlu, 8.0). So a JF/DF broadcast costs 8 cross-lane (Xlu) cycles. An elementwise add instead emits Instr 0x12 → PerformanceJf[0x33c] = 1 cycle on r4 = VectorAlu1.

A VF bf16 Matprep Stage

A bf16 kVectorMatmul (opcode 0x9b, MatmulDataFormat = 2) routes through GetViperfishInstruction, whose 0x9b arm reads WORD[0xa2d05c0 + 1*2] = 0xda (the bf16 pure-matmul ordinal). Its matprep companion is produced by MxuLatencyTable::GetResourceUsage(0xda, Resource, gains) (0x1c8ae5c0): the 0xda branch builds MatmulModifier{DataFormat = 2} for the matmul and MatpushModifier{DataFormat, is_transpose, Msr = LatchOpcodeToMsr(0x97..0x9a)} for the prep stages, then FlatHashMap<MatpushModifier, array<int,19>>::find returns the 19-entry MxuResource reservation. The matprep-stage columns r4..r7 = {4, 12, 20, 28} — the four systolic-feed stages. The bf16 matprep stage's cost is the MatpushModifier-keyed array<19>, never a standalone ordinal.


| Component | Relationship |
| --- | --- |
| Target::IarsPerTensorCore 0x1d617280 | reads DWORD[Target+0x4a8] (= 2 all gens) |
| Target::Init 0x1d60fc20 | sole writer of +0x4a8 from VectorIsa.f7 |
| TpuChipParts::DefaultsForVersion 0x20b1b040 | loads embed://...<version>_chip_parts.binarypb |
| CostModel::RecordHloCycles 0x130bbfe0 | emits the non-MXU Instr 0x11..0x20 ordinals |
| CycleTableInstruction 0x1c89ca80 | the MXU-only classifier (Instr 0x00..0x10) |
| GetViperfishInstruction 0x1c8a3300 | matprep opcodes 0x97..0x9a FATAL here |
| MxuLatencyTable::GetResourceUsage 0x1c8ae5c0 | the VF Modifier → array<19> matprep reservation |

Cross-References