IarsPerTensorCore and the Non-MXU Cost Band
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5
89edbbe81c5b328a958fe628a9f2207d, not stripped, full C++ symbols). Other versions differ. Every integer below was read out of .rodata (LUTs and embedded chip-parts protos) by hand and cross-checked against the IDA decompile; status is in the Confidence columns.
Abstract
This page consolidates three per-generation cost-model quantities that share one root: a count and two classifier surfaces, all tied to the same CycleTable::Instruction enum the MXU throughput tables use. First, the IarsPerTensorCore value: the count of Index-Add Registers (IAR0/IAR1) per TensorCore that bounds the matpush/indexed-load addressing path and feeds the cost model. Second, the non-MXU CycleTable band (Instruction 0x11..0x20) on Jellyfish/Dragonfish: the vector-ALU, cross-lane (Xlu), and EUP cost cells whose values are pinned on the MXU Latency: JF / DF page and whose producer is documented here. Third, the Viperfish matprep-stage ordinal binding: the mechanism that turns a matprep opcode into a reservation array.
The unifying fact is that IarsPerTensorCore is not a code constant. It is DWORD[Target+0x4a8], written once by the shared Target::Init from a VectorIsa field of the embedded per-generation *_chip_parts.binarypb proto, loaded at runtime via the embed:// filesystem. Extracted from the embedded blobs, the value is 2 on every generation — the IAR file is gen-stable (IAR0/IAR1), matching the 1-bit IarField ceiling in the ISA encoding. The version pin above is therefore necessary but not sufficient: a chip whose proto differs would change the count.
For reimplementation, the contract is:
- IarsPerTensorCore = DWORD[Target+0x4a8], accessor Target::IarsPerTensorCore (0x1d617280); sole writer Target::Init (0x1d60fc20) from vector_isa() field 7; value 2 all gens.
- The JF/DF non-MXU band Instruction 0x11..0x20 is emitted by the HLO-level cost model (CostModel::RecordHloCycles + cost_model_util::Record*) as direct ordinal immediates keyed on HLO opcode / PrimitiveType / transfer role. There is no LLO→Instruction classifier for it (the MXU classifier CycleTableInstruction covers only 0x00..0x10).
- The VF matprep stages do not carry standalone classifier ordinals: the matprep opcodes 0x97..0x9a FATAL in GetViperfishInstruction, and the reservation is produced by MxuLatencyTable::GetResourceUsage keyed on a MatpushModifier {MatmulDataFormat, is_transpose, Msr}.
| Item | Value |
|---|---|
| IarsPerTensorCore | DWORD[Target+0x4a8] = 2 all gens (IAR0/IAR1, gen-stable) |
| Accessor | Target::IarsPerTensorCore 0x1d617280 (return *(uint32*)(this+0x4a8)) |
| Sole writer | Target::Init 0x1d60fc20 from TpuSequencerParts::vector_isa() field 7 |
| Proto source | embed://tpu_chip_parts/<version>_chip_parts.binarypb, VectorIsa.f7 |
| ISA ceiling | 1-bit IarField (0x1ee3b380, >>13 & 1) ⇒ 2 IARs |
| Non-MXU band | CycleTable::Instruction 0x11..0x20 (vector-ALU / Xlu / EUP) |
| Non-MXU producer | CostModel::RecordHloCycles 0x130bbfe0 + cost_model_util::Record* (direct ordinal immediates) |
| VF matprep binding | MxuLatencyTable::GetResourceUsage 0x1c8ae5c0, FlatHashMap<MatpushModifier, array<int,19>> |
| VF matprep opcodes | 0x97..0x9a FATAL in GetViperfishInstruction 0x1c8a3300 |
IarsPerTensorCore — Source and Per-Gen Value
The Accessor and the Sole Writer
IarsPerTensorCore is a one-instruction accessor reading offset 0x4a8 of the Target object (298 * 4 = 0x4a8):
// xla::jellyfish::Target::IarsPerTensorCore @ 0x1d617280 (verified)
__int64 Target::IarsPerTensorCore(this) {
return *((unsigned int*)this + 298); // DWORD[Target + 0x4a8]
}
The sole writer of Target+0x4a8 is the shared Target::Init — no per-generation *Target constructor writes it. In the register-count block of Init, the field is read from the VectorIsa C++ sub-struct (embedded at TpuSequencerParts+0x1c), gated by the VectorIsa-present byte:
// xla::jellyfish::Target::Init @ 0x1d60fc20 (verified — the +0x4a8 store)
v93 = TpuSequencerParts::vector_isa(seqparts); // = &seqparts[+0x1c]
if ( *(uint8_t*)(v93 + 0x18) == 1 ) { // VectorIsa-present gate (seqparts+0x34)
v94 = *(uint32_t*)(v93 + 0x14); // IarsPerTensorCore (seqparts+0x30 = VectorIsa.f7)
*((uint32_t*)Target + 298) = v94; // store at Target+0x4a8 ← the sole writer
*((uint64_t*)Target + ...) = *(uint64_t*)(v93 + 0xc); // adjacent MxusPerTensorCore qword
}
TpuSequencerParts::FromProto packs the proto VectorIsa submessage (six int32 fields, proto field numbers 2..7) into the C++ struct, with field 7 landing at TpuSequencerParts+0x30 (VectorIsa+0x14), the DWORD that Target::Init copies to Target+0x4a8. So IarsPerTensorCore is the VectorIsa field-7 proto value, loaded at runtime by TpuChipParts::DefaultsForVersion via tsl::ReadBinaryProto("embed://tpu_chip_parts/<version>_chip_parts.binarypb").
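The gated store and the accessor can be modeled over plain byte buffers. The following is a minimal sketch, not the binary's code: the buffer sizes, the little-endian layout, the function names, and the "leave the field untouched when the gate is clear" behavior are assumptions made for illustration; only the offsets come from the listings above.

```python
import struct

VECTOR_ISA_OFF = 0x1C    # TpuSequencerParts::vector_isa() = &seqparts[+0x1c]
PRESENT_OFF = 0x18       # VectorIsa-present byte (seqparts+0x34)
IAR_COUNT_OFF = 0x14     # VectorIsa.f7 (seqparts+0x30)
TARGET_IARS_OFF = 0x4A8  # 298 * 4


def init_iars(seqparts: bytes, target: bytearray) -> None:
    """Model the gated +0x4a8 store in Target::Init."""
    base = VECTOR_ISA_OFF
    if seqparts[base + PRESENT_OFF] == 1:
        (iars,) = struct.unpack_from("<I", seqparts, base + IAR_COUNT_OFF)
        struct.pack_into("<I", target, TARGET_IARS_OFF, iars)
    # Assumption: when the gate is clear, the field is left as it was.


def iars_per_tensor_core(target: bytes) -> int:
    """Model Target::IarsPerTensorCore: return the DWORD at Target+0x4a8."""
    return struct.unpack_from("<I", target, TARGET_IARS_OFF)[0]


# Synthetic buffers: sizes are arbitrary, only the offsets come from the text.
seqparts = bytearray(0x40)
seqparts[0x34] = 1                         # present byte
struct.pack_into("<I", seqparts, 0x30, 2)  # VectorIsa.f7
target = bytearray(0x4B0)
init_iars(seqparts, target)
print(iars_per_tensor_core(target))
```

With the synthetic buffers above the accessor returns 2, the value the proto field carries.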
The Extracted Per-Gen Values
The per-gen blobs are embedded uncompressed in .rodata. Parsing the VectorIsa submessage out of each blob (the f2=128, f3=8 prefix locates it; f7 is the IAR count) gives, byte-exact:
| Gen (codename) | lane_count f2 | sublane_count f3 | mxu_count f5 | xlu_count f6 | iar_count f7 = IarsPerTensorCore |
|---|---|---|---|---|---|
| v2 Jellyfish | 128 | 8 | 1 | 1 | 2 |
| v3 Dragonfish | 128 | 8 | 2 | 1 | 2 |
| v4 Pufferfish | 128 | 8 | 4 | 2 | 2 |
| v5p Viperfish | 128 | 8 | 4 | 3 | 2 |
| v6e Ghostlite | 128 | 8 | 2 | 2 | 2 |
| v7 6acc60406 | 128 | 8 | 2 | 2 | 2 |
So IarsPerTensorCore = 2 on every generation — the IAR file does not grow v2→v7. This is exactly consistent with the ISA-encoding ceiling: the SetIar slot's IarField accessor (0x1ee3b380) is >>13 & 1, a single bit, which can address only IAR0/IAR1. The VectorIsa field names are not inferred — TpuSequencerParts::FromProto (0x20b30700) validates each by name ("Invalid lane_count in vector_isa field", "Invalid sublane_count …", "Invalid mxu_count …", "Invalid xlu_count …", "Invalid iar_count …"), so f2=lane_count=128 (constant across all gens), f5=mxu_count (1/2/4/4/2/2), and f6=xlu_count (1/1/2/3/2/2) are CONFIRMED, not merely positional.
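Reading f7 out of a VectorIsa submessage only needs a protobuf varint walker. The sketch below is illustrative: the byte string is a synthetic message assembled from the Viperfish table row (f4 is omitted because this page does not list its value), not bytes copied from the binary, and the helper names are hypothetical.

```python
from __future__ import annotations


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_varint_fields(msg: bytes) -> dict[int, int]:
    """Collect field_number -> value for wire-type-0 (varint) fields."""
    fields: dict[int, int] = {}
    pos = 0
    while pos < len(msg):
        tag, pos = read_varint(msg, pos)
        if tag & 7 != 0:
            raise ValueError("non-varint field; not handled by this sketch")
        value, pos = read_varint(msg, pos)
        fields[tag >> 3] = value
    return fields


# Synthetic VectorIsa body: f2=128, f3=8, f5=4, f6=3, f7=2 (f4 omitted).
# The leading 10 80 01 18 08 bytes are the f2=128, f3=8 prefix.
vector_isa = bytes([0x10, 0x80, 0x01, 0x18, 0x08,
                    0x28, 0x04, 0x30, 0x03, 0x38, 0x02])
fields = parse_varint_fields(vector_isa)
print(fields[7])  # iar_count
```

A real extraction would first locate the submessage inside the enclosing chip-parts message; that outer framing is not covered here.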
NOTE: this value is data, not code. The accessor and the 1-bit IarField are immutable code; the value is a runtime proto field. The "2 all gens" claim holds for this binary's embedded chip-parts blobs only. A reimplementation must read it from the per-gen proto, not bake it in. See Chip-Parts Binarypb and ../isa/slot-matprep-iar-latch.md for the IAR value-field layout.
The Consolidated Per-Gen Register-File Table
IarsPerTensorCore is the last column of the per-gen register-file block at Target+0x498..+0x4a8, all written by the same Target::Init from the same chip-parts blobs. The four register counts are re-derived byte-exact from the embedded protos:
| Gen | SREG +0x498 | VREG +0x49c | VMREG +0x4a0 | PREG +0x4a4 | IARS +0x4a8 |
|---|---|---|---|---|---|
| v2 Jellyfish | 32 | 32 | 8 | 15 | 2 |
| v3 Dragonfish | 32 | 32 | 8 | 15 | 2 |
| v4 Pufferfish | 32 | 32 | 8 | 15 | 2 |
| v5p Viperfish | 32 | 64 | 16 | 14 | 2 |
| v6e Ghostlite | 32 | 64 | 16 | 14 | 2 |
| v7 6acc60406 | 32 | 64 | 16 | 14 | 2 |
(Field offsets byte-exact from Target::Init + RegisterCount; values from the embedded chip-parts protos. The SREG/VREG/VMREG/PREG order follows Init's seq-0 type dispatch.)
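For a reimplementation it can help to hold the block as data and derive the offsets from the column order. A minimal sketch, assuming consecutive 4-byte fields starting at Target+0x498 (consistent with the offsets in the table header); the names are hypothetical and the counts are transcribed from the table.

```python
REGFILE_BASE = 0x498
COLUMNS = ("SREG", "VREG", "VMREG", "PREG", "IARS")

# Per-generation counts, transcribed from the table above.
COUNTS = {
    "v2 Jellyfish": (32, 32, 8, 15, 2),
    "v3 Dragonfish": (32, 32, 8, 15, 2),
    "v4 Pufferfish": (32, 32, 8, 15, 2),
    "v5p Viperfish": (32, 64, 16, 14, 2),
    "v6e Ghostlite": (32, 64, 16, 14, 2),
    "v7 6acc60406": (32, 64, 16, 14, 2),
}


def field_offset(column: str) -> int:
    """Offset of a register-count field inside the Target object."""
    return REGFILE_BASE + 4 * COLUMNS.index(column)


def regfile(gen: str) -> dict[str, int]:
    """Register counts for one generation, keyed by column name."""
    return dict(zip(COLUMNS, COUNTS[gen]))


print(hex(field_offset("IARS")), regfile("v5p Viperfish"))
```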
The JF/DF Non-MXU CycleTable Band (Instruction 0x11..0x20)
The Producer
The MXU classifier CycleTableInstruction (0x1c89ca80) maps only the MXU band — matmul opcodes 0x8d..0x96 and matprep/matpush opcodes 0x9b..0xa5 — into CycleTable::Instruction 0x00..0x10. There is no LLO→Instruction classifier for ordinals 0x11..0x20. Those ordinals are emitted by the HLO-level cost model as direct ordinal immediates (mov esi, 0xNN) keyed on HLO opcode, PrimitiveType, or memory-transfer role, then passed to GetCyclesForThroughput (JfCycleTable vtable +0x10) for the cycle value and ResourceVector::Acc(Resource, cycles) for the accumulator slot:
- CostModel::RecordHloCycles (0x130bbfe0): per-HLO entry; switches on HloOpcode and emits 0x12/0x13/0x14 for elementwise vector ops.
- cost_model_util::Record*: RecordPackAndStCycles (0x13844740, 0x16), RecordBroadcastSublaneChunkCycles (0x138448c0, 0x17/0x1b), RecordMemXferCyclesImpl (0x13844e80); Broadcast::RecordCostCycles (0x136e6940, 0x17); the Record{Conv,DepthwiseConv}KernelCycles / RecordConvolutionCycles / SpatialMajorConvolution::CalculateWindowMxuCycles emitters (0x16/0x19/0x1c/0x1f).
The Band Table
The offsetLUT (0xb438b70) and resLUT (0xb438aec) were re-read byte-exact; the cycle value is the priced PerformanceJf cell under the valid mask 0x19FFC0821. JF == DF for the entire band (none of these offsets is the DF-override +0x28/+0x2c).
| Instr | Res | ResourceVector | priced | PerfOff | JF/DF cyc | emitting context |
|---|---|---|---|---|---|---|
| 0x11 | r6 | VectorEup | no | 0x0 | 1 | unpriced (SparseCore decomposer) |
| 0x12 | r4 | VectorAlu1 | yes | 0x33c | 1 | elementwise vector-ALU (RecordHloCycles) |
| 0x13 | r4 | VectorAlu1 | yes | 0x340 | 1 | elementwise vector-ALU (RecordHloCycles) |
| 0x14 | r3 | VectorAlu0 | yes | 0x344 | 1 | vector-ALU lane0 (reduction-fn) |
| 0x15 | r5 | VectorAluAny | yes | 0x39c | 1 | vector-ALU any (sublane reduce-window) |
| 0x16 | r5 | VectorAluAny | yes | 0x398 | 1 | pack-and-store / reduce-window / conv-kernel |
| 0x17 | r2 | Xlu | yes | 0x954 | 8 | broadcast / cross-lane (Broadcast::RecordCostCycles) |
| 0x18 | r6 | VectorEup | yes | 0x3f8 | 1 | EUP / transcendental (RecordHloCycles) |
| 0x19 | r5 | VectorAluAny | yes | 0x368 | 1 | vector-ALU any (RecordConvolutionCycles) |
| 0x1a | r6 | VectorEup | yes | 0x3f4 | 1 | EUP / transcendental (RecordHloCycles) |
| 0x1b | r2 | Xlu | yes | 0x960 | 8 | cross-lane reduce-window |
| 0x1c | r2 | Xlu | yes | 0x94c | 8 | conv-kernel cross-lane |
| 0x1d | r2 | Xlu | no | 0x0 | 1 | unpriced |
| 0x1e | r2 | Xlu | no | 0x0 | 1 | unpriced |
| 0x1f | r2 | Xlu | yes | 0x958 | 8 | spatial-conv window-MXU |
| 0x20 | r5 | VectorAluAny | yes | 0x39c | 1 | vector-ALU any (shares 0x39c with 0x15) |
The four 8-cycle Xlu cells (0x17/0x1b/0x1c/0x1f) are the cross-lane throughput ports (broadcast-sublane, reduce-window, conv-kernel-window, spatial-conv-window). The 1-cycle cells are vector-ALU (0x12..0x16, 0x19, 0x20) or VectorEup (0x18/0x1a). The Res column is the ResourceVector slot R[2]..R[6] (Xlu / VectorAlu0 / VectorAlu1 / VectorAluAny / VectorEup).
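The priced column can be cross-checked against the valid mask 0x19FFC0821. The sketch below assumes that bit i of the mask corresponds to CycleTable::Instruction ordinal i; that mapping is an inference, though it reproduces the priced/unpriced split of the band table exactly.

```python
VALID_MASK = 0x19FFC0821
BAND = range(0x11, 0x21)  # Instruction 0x11..0x20 inclusive


def is_priced(instr: int) -> bool:
    """True when the ordinal's bit is set in the valid mask (assumed mapping)."""
    return bool((VALID_MASK >> instr) & 1)


unpriced = [hex(i) for i in BAND if not is_priced(i)]
print(unpriced)  # ['0x11', '0x1d', '0x1e']
```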
QUIRK: Instr 0x15 and 0x20 share offset 0x39c. Both read the same VectorAluAny cell (value 1); the producer chooses which ordinal to emit (sublane-reduce vs reshape/sparse-core) but they price identically. A reimplementation that assumes distinct offsets per ordinal will allocate a redundant cell.
The Viperfish Matprep-Stage Ordinal Binding
Matprep Opcodes FATAL in the Classifier
On Viperfish the matprep opcodes (0x97..0x9a) are not classified into a standalone Instruction ordinal — they hit the FATAL arm of GetViperfishInstruction (0x1c8a3300, opcode jump table 0xb43a104, default arm 0x1c8a3e6a). The matmul opcode 0x9b instead reads matmul_data_format() and indexes the VF matmul-format WORD table @0xa2d05c0, read byte-exact as {0xd4, 0xda, 0xf8, 0xfe, 0xe0, 0xe6, 0xec, 0xf2} (fmt 1..8 = f32/bf16/fp8e5m2/fp8e4m3/u8/s8/u4/s4):
| MatmulDataFormat | dtype | VF instr ordinal (@0xa2d05c0) |
|---|---|---|
| 1 | f32 | 0xd4 |
| 2 | bf16 | 0xda |
| 3 | f8e5m2→bf16 | 0xf8 |
| 4 | f8e4m3→bf16 | 0xfe |
| 5 | u8 | 0xe0 |
| 6 | s8 | 0xe6 |
| 7 | u4 | 0xec |
| 8 | s4 | 0xf2 |
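The format-to-ordinal lookup is a 1-based index into a table of 16-bit words. A minimal sketch, assuming little-endian storage and treating an out-of-range format as an error (the binary's handling of that case is not described on this page); the function name is hypothetical.

```python
import struct

# The eight WORD entries at 0xa2d05c0, packed as little-endian 16-bit values.
VF_MATMUL_FORMAT_TABLE = struct.pack(
    "<8H", 0xD4, 0xDA, 0xF8, 0xFE, 0xE0, 0xE6, 0xEC, 0xF2
)


def vf_matmul_ordinal(matmul_data_format: int) -> int:
    """Return the VF instruction ordinal for MatmulDataFormat 1..8."""
    if not 1 <= matmul_data_format <= 8:
        raise ValueError("MatmulDataFormat outside 1..8")  # assumed handling
    offset = (matmul_data_format - 1) * 2  # WORD index is format - 1
    return struct.unpack_from("<H", VF_MATMUL_FORMAT_TABLE, offset)[0]


print(hex(vf_matmul_ordinal(2)))  # bf16
```

Format 1 (f32) yields 0xd4, which is the decimal 212 seen as an argument in the GetResourceUsage call quoted from the decompile.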
The Reservation-Table Producer
The matprep stages are produced by MxuLatencyTable::GetResourceUsage (0x1c8ae5c0), which VfCycleTable::GetCyclesForThroughput (0x1c89e2c0) wraps — confirmed in the decompile, where case 0 calls viperfish::MxuLatencyTable::GetResourceUsage(&result, ..., 212, 3, 0). GetResourceUsage dispatches on the Resource arg and the matmul ordinal, builds a Modifier key, and looks it up in a FlatHashMap<Modifier, std::array<int,19>> — the 19-entry array is indexed by MxuResource (SetReservations<...> writes array[MxuResource]=val with bound 0x13 = 19). The matprep 4-stage systolic-feed occupancy (r4..r7) is four of those nineteen MxuResource ports.
The four modifier keys, each built by SetReservations<...> in the ViperfishPerformance constructor:
| Modifier | key fields | role |
|---|---|---|
| MatmulModifier | {MatmulDataFormat} | the pure matmul step |
| MatpushModifier | {MatmulDataFormat, is_transpose, Msr} | the matprep/matpush stages |
| MatresModifier | (matmul-result key) | matmul-result stages |
| VlxmrModifier | (vector-load matrix-result key) | vector-load matrix-result |
The MatpushModifier is the matprep-opcode → reservation binding: a matprep stage's reservation is keyed by {MatmulDataFormat × transpose-of-gains × matpush-Msr-stage} → a 19-entry MxuResource reservation array. The matprep opcode carries a Modifier that indexes the table, not a standalone ordinal. This is the VF realization of the same job JF folds into a flat offset-LUT cell. See Matmul-Mode Modifiers and MXU Latency: VF.
QUIRK: the matprep cost representation migrated across gens. JF folds the transpose-of-gains into the matmul opcode (flat cell); PF folds matprep into the Latch ops; VF FATALs the matprep opcodes and uses the MatpushModifier-keyed array<19> reservation; GL/GF give each matprep variant a fixed binary-search perf row (with a smaller array<11> reservation). A single matprep cost model across gens mis-prices four of the five. See ../isa/slot-matprep-iar-latch.md.
Worked Examples
Three end-to-end traces, each exercising one of the three surfaces above.
IarsPerTensorCore on Viperfish
A Viperfish device constructs a ViperfishTarget, whose base chain calls the shared Target::Init (0x1d60fc20). Init has already obtained the viperfish_chip_parts.binarypb TpuSequencerParts (loaded via embed://). It reads vector_isa()+0x18 (the present byte) — 1 — then vector_isa()+0x14 = VectorIsa.f7 = 2, and stores it at Target+0x4a8. A later IarsPerTensorCore() returns 2; the IAR-setter helper bounds iar_value < 2 (IAR0/IAR1). No ViperfishTarget constructor writes +0x4a8 itself — the value is purely the proto field.
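The 1-bit ceiling and the setter bound can be sketched together. This is an illustration only: the function names are hypothetical, and nothing is assumed about the slot word beyond the `>>13 & 1` extraction quoted for IarField.

```python
IAR_FIELD_SHIFT = 13
IAR_FIELD_MASK = 0x1  # single bit, so only IAR0 and IAR1 are addressable


def iar_field(slot_word: int) -> int:
    """Extract the IAR index the way the IarField accessor is described."""
    return (slot_word >> IAR_FIELD_SHIFT) & IAR_FIELD_MASK


def check_iar(iar_value: int, iars_per_tensor_core: int) -> None:
    """Model the setter bound iar_value < IarsPerTensorCore."""
    if not 0 <= iar_value < iars_per_tensor_core:
        raise ValueError(f"IAR{iar_value} out of range")


check_iar(iar_field(1 << 13), iars_per_tensor_core=2)  # IAR1 is accepted
print(sorted({iar_field(w) for w in range(1 << 15)}))
```

Whatever the other bits hold, the extracted index is 0 or 1, which is why a proto value above 2 could not be addressed by this encoding.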
A JF/DF Broadcast Cost
An HLO broadcast routes through CostModel::RecordHloCycles to Broadcast::RecordCostCycles (0x136e6940), which emits the ordinal immediate Instr 0x17 and calls GetCyclesForThroughput(0x17). That reads PerformanceJf[offsetLUT[0x17]] = PerformanceJf[0x954] = 8 cycles, then ResourceVector::Acc(r2 = Xlu, 8.0). So a JF/DF broadcast costs 8 cross-lane (Xlu) cycles. An elementwise add instead emits Instr 0x12 → PerformanceJf[0x33c] = 1 cycle on r4 = VectorAlu1.
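The same trace can be replayed over a table-driven model. A minimal sketch: the tuples are transcribed from the band table, the 23-slot ResourceVector is represented as a plain list, and `record` is a hypothetical stand-in for the GetCyclesForThroughput plus ResourceVector::Acc pair rather than either function's real signature.

```python
# ordinal -> (resource slot, PerformanceJf offset, JF/DF cycles), from the band table.
BAND = {
    0x11: (6, 0x000, 1), 0x12: (4, 0x33C, 1), 0x13: (4, 0x340, 1),
    0x14: (3, 0x344, 1), 0x15: (5, 0x39C, 1), 0x16: (5, 0x398, 1),
    0x17: (2, 0x954, 8), 0x18: (6, 0x3F8, 1), 0x19: (5, 0x368, 1),
    0x1A: (6, 0x3F4, 1), 0x1B: (2, 0x960, 8), 0x1C: (2, 0x94C, 8),
    0x1D: (2, 0x000, 1), 0x1E: (2, 0x000, 1), 0x1F: (2, 0x958, 8),
    0x20: (5, 0x39C, 1),
}
SLOT_NAMES = {2: "Xlu", 3: "VectorAlu0", 4: "VectorAlu1",
              5: "VectorAluAny", 6: "VectorEup"}
NUM_SLOTS = 23  # size of the ResourceVector


def record(instr: int, resource_vector: list[float]) -> None:
    """Look up the ordinal's cycles and accumulate them on its resource slot."""
    slot, _perf_offset, cycles = BAND[instr]
    resource_vector[slot] += float(cycles)


rv = [0.0] * NUM_SLOTS
record(0x17, rv)  # broadcast
record(0x12, rv)  # elementwise add
print({SLOT_NAMES[s]: rv[s] for s in SLOT_NAMES if rv[s]})
```

After the two calls the vector holds 8.0 on Xlu and 1.0 on VectorAlu1, matching the trace above.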
A VF bf16 Matprep Stage
A bf16 kVectorMatmul (opcode 0x9b, MatmulDataFormat = 2) routes through GetViperfishInstruction, whose 0x9b arm reads WORD[0xa2d05c0 + 1*2] = 0xda (the bf16 pure-matmul ordinal). Its matprep companion is produced by MxuLatencyTable::GetResourceUsage(0xda, Resource, gains) (0x1c8ae5c0): the 0xda branch builds MatmulModifier{DataFormat = 2} for the matmul and MatpushModifier{DataFormat, is_transpose, Msr = LatchOpcodeToMsr(0x97..0x9a)} for the prep stages, then FlatHashMap<MatpushModifier, array<int,19>>::find returns the 19-entry MxuResource reservation. The matprep-stage columns r4..r7 = {4, 12, 20, 28} — the four systolic-feed stages. The bf16 matprep stage's cost is the MatpushModifier-keyed array<19>, never a standalone ordinal.
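The modifier-keyed lookup can be approximated with a dictionary keyed by a tuple type. This sketch is heavily hedged: the `is_transpose=False` and `msr=0` key values are placeholders, every array entry other than r4..r7 is filled with 0 because this page does not give it, and the helper names are hypothetical. Only the key shape, the 19-entry bound, and the r4..r7 values come from the text.

```python
from typing import NamedTuple

NUM_MXU_RESOURCES = 19  # SetReservations bound 0x13


class MatpushModifier(NamedTuple):
    data_format: int    # MatmulDataFormat
    is_transpose: bool
    msr: int


def make_reservation(stages: dict[int, int]) -> list[int]:
    """Build a 19-entry array, writing array[MxuResource] = val with a bounds check."""
    array = [0] * NUM_MXU_RESOURCES  # unknown entries left as 0 (placeholder)
    for resource, value in stages.items():
        if not 0 <= resource < NUM_MXU_RESOURCES:
            raise IndexError(f"MxuResource {resource} out of range")
        array[resource] = value
    return array


# bf16 (format 2); is_transpose and msr are placeholder values.
key = MatpushModifier(data_format=2, is_transpose=False, msr=0)
RESERVATIONS = {key: make_reservation({4: 4, 5: 12, 6: 20, 7: 28})}

print(RESERVATIONS[key][4:8])  # the four systolic-feed stages r4..r7
```

Using an immutable tuple as the key mirrors the idea of hashing the whole modifier, so two matprep stages that differ only in Msr resolve to different reservation arrays.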
Related Components
| Component | Relationship |
|---|---|
| Target::IarsPerTensorCore 0x1d617280 | reads DWORD[Target+0x4a8] (= 2 all gens) |
| Target::Init 0x1d60fc20 | sole writer of +0x4a8 from VectorIsa.f7 |
| TpuChipParts::DefaultsForVersion 0x20b1b040 | loads embed://...<version>_chip_parts.binarypb |
| CostModel::RecordHloCycles 0x130bbfe0 | emits the non-MXU Instr 0x11..0x20 ordinals |
| CycleTableInstruction 0x1c89ca80 | the MXU-only classifier (Instr 0x00..0x10) |
| GetViperfishInstruction 0x1c8a3300 | matprep opcodes 0x97..0x9a FATAL here |
| MxuLatencyTable::GetResourceUsage 0x1c8ae5c0 | the VF Modifier → array<19> matprep reservation |
Cross-References
- MXU Latency: JF / DF: the JF/DF throughput table that prices this non-MXU band, and the LatencyTableJellyfish copy map.
- JfCycleTable: the full offsetLUT / resLUT transcription and the MXU classifier LUTs.
- Resource Enum: the 23-slot ResourceVector whose slots R[2]..R[6] this band occupies.
- Matmul-Mode Modifiers: the VF MatpushModifier-keyed array<int,19> reservation table in full.
- MXU Latency: VF: the Viperfish reservation model that wraps GetResourceUsage.
- VfCycleTable: the VF throughput reader that dispatches into MxuLatencyTable::GetResourceUsage.
- ../isa/slot-matprep-iar-latch.md: the IAR value-field layout, the 1-bit IarField encoding, and the per-gen matprep cost divergence.
- ../targets/chip-parts-binarypb.md: the embedded per-gen chip_parts.binarypb proto and the embed:// load path that supplies IarsPerTensorCore.
- Consolidated Per-Gen Counts (overview): the cost-model index that links the per-gen Performance/CycleTable pages.