kDeviceTypeInfo Spec-Constants

All addresses, offsets, and field values on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

xprof::kDeviceTypeInfo (0x1C60480) is the profiler's master per-DeviceType spec table: a 17-entry array of a fixed 0x448-byte (1096-byte) struct, indexed directly by the xprof::DeviceType ordinal. Each entry packs the device's clocks (kHz), per-chip core/tile geometry, two DVFS frequency ladders, and roughly forty IEEE-754 doubles of per-generation hardware spec constants — peak compute by precision, effective and per-memory-space bandwidth, latency, voltage, and firmware power/thermal coefficients.

The table is a compile-time static const aggregate, not a runtime-populated structure. Its symbol _ZN5xprofL15kDeviceTypeInfoE has internal linkage (the mangled L), it lives in .lrodata inside a read-only-executable PT_LOAD segment, and a .rela.dyn scan finds zero relocations across its whole extent — so it has no pointer fields and no runtime writer. The bytes are frozen at link time. Identical-code-folding leaves up to 13 byte-identical copies of the array, one per template/pass instantiation that references it; every consumer reads its own copy through a GOT-relative movabs base plus idx * 0x448.
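
The indexing arithmetic stated above can be checked with a minimal sketch. It uses only the base address, stride, and entry count quoted on this page; the function name entry_address is hypothetical and is not a symbol from the binary.

```python
TABLE_VMA = 0x1C60480   # kDeviceTypeInfo base address, as quoted on this page
ENTRY_SIZE = 0x448      # 1096 bytes per DeviceType entry
ENTRY_COUNT = 17        # consumers bounds-check ordinal < 0x11


def entry_address(ordinal: int, base: int = TABLE_VMA) -> int:
    """Return base + ordinal * 0x448, rejecting out-of-range ordinals."""
    if not 0 <= ordinal < ENTRY_COUNT:
        raise IndexError(f"DeviceType ordinal {ordinal} is outside 0..16")
    return base + ordinal * ENTRY_SIZE


TABLE_BYTES = ENTRY_COUNT * ENTRY_SIZE   # total extent of the array
TABLE_END = TABLE_VMA + TABLE_BYTES      # first byte past the last entry

print(hex(entry_address(12)))  # entry for DeviceType 12
print(TABLE_BYTES, hex(TABLE_END))
```

Passing a different base models one of the other copies of the array, since every copy shares the same stride.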

This page documents:

  • The per-DeviceType struct: the field-offset map, classified into clocks, geometry, peak-compute doubles, DVFS ladders, and the host-side-only spec block.
  • The producer: why the table is a frozen compile-time const independent of the chip_parts capability blobs, joined to them only at the DeviceType ordinal.
  • The roofline readers: which fields each profiler consumer reads, and how the device-capability stats are stamped onto the XPlane.
  • The DeviceType → codename binding: the ordinal-to-silicon-generation map that gives each struct entry a name.

Symbol: _ZN5xprofL15kDeviceTypeInfoE @ 0x1C60480 (.lrodata, internal linkage)
Shape: 17 entries × 0x448 B = 18632 B; ends 0x1C64D48; up to 13 ICF copies
Producer: none (compile-time static const; zero relocs, no runtime writer)
Clock readers: GtcSpanConverter (0xF2CB6E0, +0x04); GetJobInfoFromResponse (0xF2C9AC0, +0x04/+0x50/+0x2F8 ×1000)
Compute reader: XProfTpuCostAnalysis::HandleConvolution (0xF58E7C0, +0x60/+0x78/+0x80)
Geometry readers: HostCoreId::ToDeviceOrdinal (0xF69C580); GetCostAdjustmentFunction (0xF58EFC0, +0x2C4)
Codename binding: DeviceTypeFromDeviceIdentifiers (0xF6993A0); DeviceTypeString (0xF69C7C0)

The Per-DeviceType Struct

The struct is indexed by the DeviceType ordinal: a consumer computes base + ordinal * 0x448, after a bounds check ordinal < 0x11 (ud1 trap otherwise). The GtcSpanConverter constructor is the canonical reader and pins the stride byte-exact:

GtcSpanConverter::GtcSpanConverter(DeviceType dt):     // sub_F2CB6E0
    if (dt >= 0x11) BUG()                               //  17 entries
    base = GOT(0x224C2980) + (-8525971 .. )             //  -> kDeviceTypeInfo @ 0x1C60480
    converter[0] = base[dt * 274 + 1]                   //  274 int32 cols = 0x448; col 1 = +0x04 (kHz)
    converter[1..3] = 0

274 * 4 = 0x448, confirming 274 int32 columns per entry. The field-offset map below classifies each populated field. Status legend: C = a .text reader was disassembled; P = byte-exact value, no in-binary reader pinned; I = semantics inferred from per-generation value scaling against an independent source.

| off | type | field | meaning / consumer |
| --- | --- | --- | --- |
| +0x00 | i32 | core_multi_flag | BYTE-read by GetJobInfoFromResponse (cmp BYTE,1); 1 on multi-core / SC gens |
| +0x04 | i32 | gtc_freq_khz | GTC clock; GtcSpanConverter divisor + GetJobInfoFromResponse ×1000 → Hz |
| +0x08 | i32 | gtc_ts_width_bits | GTC timestamp width {48,45,64}; matches the trace-codec GetBits64 widths |
| +0x0C | i32 | cores_per_chip | divisor in ToDeviceOrdinal/HandleConvolution/GetCostAdjustment/utilization |
| +0x10/+0x14 | i32 | logical_devices_a/b | ToDeviceOrdinal idiv [+0x10 or +0x14] / [+0x0C]; +0x14 = SparseCore count on SC gens |
| +0x18/+0x1C | i32 | geom_c / tile_count | per-chip multiplier / escalating tile-engine count |
| +0x20 | i32 | sc_present_flag | 1 on the 45-bit SC gens (DT10..13) |
| +0x28 | i32[8] | dvfs_ladder_A_khz | 8-point frequency ladder, DT12-only populated |
| +0x50 | i32 | tensorcore_clk_khz | TensorCore/compute clock; GetJobInfoFromResponse ×1000 → Hz |
| +0x58 | f64 | peak_bf16_per_LD | per-LD/sustained bf16 rate; no in-binary reader |
| +0x60 | f64 | peak_flops_bf16 | per-precision peak (TFLOP/s); HandleConvolution + XPlane stat 0x62 |
| +0x68 | f64 | peak_flops_int8_v7x | v7x-only alternate int8 slot (1992.0) |
| +0x78 | f64 | peak_flops_int8/fp8 | ≈2× +0x60; HandleConvolution (integer/fp8 element types) |
| +0x80 | f64 | peak_flops_int4/fp4 | ≈4× +0x60; HandleConvolution (int4/fp4 element types) |
| +0xB8..+0xC8 | f64 | eff_hbm_bw_0..2 | effective HBM bandwidth (GB/s class); host-side only |
| +0xD0 | f64 | peak_hbm_bw | HBM bandwidth; XPlane stat 0x63 (×1.073741824 = 2^30/10^9, GiB/s → GB/s); scaled value is byte-exact vs chip_parts HBM TB/s |
| +0xD8..+0xF0 | f64 | mem_latency_0..3 | latency/cycle class; host-side only |
| +0xF8..+0x130 | f64 | per-mem-space bandwidth | +0x100/+0x108 (SRAM/VMEM bw, stats 0x66/0x67); +0x120/+0x128 (CMEM bw, stats 0x64/0x65, v4-only); +0xF8/+0x110/+0x118/+0x130 host-side only |
| +0x138..+0x150 | f64 | rate_b_0..3 | secondary throughput rate (v4+); host-side only |
| +0x178..+0x190 | f64 | rate_c_0..3 | secondary peak-rate table (per-precision); host-side only |
| +0x2B8 | i32 | packed_geom | packed {a,b,a,b} byte descriptor; BYTE-read by ProcessCounter |
| +0x2C4 | i32 | megacore_flag | GetCostAdjustment cmp BYTE[+0x2C4],1; XPlane stat 0x6B (has_megacore) |
| +0x2C8 | i32 | perf-counter-set mask | ConvertTpuTraceToXPlane → GetPerformanceCounterNames<28> (v7x) |
| +0x2D0 | i32[8] | dvfs_ladder_B_khz | second 8-point ladder, DT12-only populated |
| +0x2F8 | i32 | sparsecore_clk_khz | SparseCore clock; GetJobInfoFromResponse ×1000 → Hz |
| +0x300/+0x308 | f64 | core_voltage/power_0/1 | voltage/power class; host-side only |
| +0x340 | i32 | sc_lane_count | 16 on SC gens; read at 18 SparseCore subscriber sites |
| +0x348/+0x350/+0x358 | i32 | perf-counter-set masks | GetPerformanceCounterNames<28> (v7x) |
| +0x360..+0x378 | f64 | power/thermal coeffs | ConvertFirmwareTraceEntriesToXPlane → FirmwareEventBuilder → "power"/"temperature" stats |
| +0x380..+0x398 | u64 | firmware-event ulongs | FirmwareEventBuilder's four unsigned long args (mangled m m m m; +0x380 is the first GP arg) |
| +0x438/+0x440 | i32 | perf-counter-set bases | +0x438 = ICR set → GetPerformanceCounterNames<12>; +0x440 = CMNUR/HBM set → <3> (v7x; nonzero DT12 only) |
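
As a rough illustration of how the offset map could be applied to a raw dump of the table, the sketch below decodes a handful of fields with Python's struct module. It assumes little-endian storage (x86-64) and that the caller has already extracted the table bytes. The names FIELDS and decode_entry are hypothetical, and the demo buffer is synthetic, filled only with v7x values quoted elsewhere on this page.

```python
import struct

ENTRY_SIZE = 0x448

# (byte offset, struct format) for a few fields from the offset map.
FIELDS = {
    "gtc_freq_khz": (0x04, "<i"),
    "gtc_ts_width_bits": (0x08, "<i"),
    "tensorcore_clk_khz": (0x50, "<i"),
    "peak_hbm_bw": (0xD0, "<d"),
    "sparsecore_clk_khz": (0x2F8, "<i"),
}


def decode_entry(table: bytes, ordinal: int) -> dict:
    """Decode the selected fields of one DeviceType entry from raw table bytes."""
    start = ordinal * ENTRY_SIZE
    if ordinal < 0 or start + ENTRY_SIZE > len(table):
        raise IndexError("ordinal outside the supplied table bytes")
    return {
        name: struct.unpack_from(fmt, table, start + off)[0]
        for name, (off, fmt) in FIELDS.items()
    }


# Synthetic stand-in for a dump: 17 zeroed entries, DT12 fields filled in.
demo = bytearray(17 * ENTRY_SIZE)
dt12 = 12 * ENTRY_SIZE
struct.pack_into("<i", demo, dt12 + 0x04, 833000)
struct.pack_into("<i", demo, dt12 + 0x08, 45)
struct.pack_into("<i", demo, dt12 + 0x50, 1900000)
struct.pack_into("<d", demo, dt12 + 0xD0, 3433.0)
struct.pack_into("<i", demo, dt12 + 0x2F8, 1750000)

entry = decode_entry(bytes(demo), 12)
print(entry)
```

Extending FIELDS with further rows of the map is mechanical; the i32[8] ladders would use the format "<8i" and return a tuple rather than a scalar.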

NOTE — the table carries no pointer fields (zero relocations across [0x1C60480, 0x1C64D48)). The device codename string and the trace-codec factory are keyed separately by the same captured device identity: DeviceTypeString's pointer array at 0x21772F00 (indexed by ordinal) for the name, and the per-family DeviceIdentifiers std::map factory for the codec. The ordinal selects the clock/spec (this struct); the PCI tuple selects the codec.

NOTE — the +0x438/+0x440 tail holds v7x perf-counter-set enum bases, not roofline doubles. Mapping the six GetPerformanceCounterNames call-site GOT displacements back through base + ordinal*0x448 resolves +0x438 to the <12>-set (ICR/router) base and +0x440 to the <3>-set (CMNUR/HBM) base — both nonzero on DeviceType 12 only, high dword zero (no pointer). The four <28> (TensorCore/SparseCore) sets are at +0x2C8/+0x348/+0x350/+0x358.

DVFS Ladders

The two 8-entry int32[8] ladders at +0x28 (A) and +0x2D0 (B) are kHz operating-point tables, populated only on DT12 (TPU v7x); older generations leave them all-zero and use the fixed scalar clocks at +0x50 and +0x2F8. No in-binary iterator over either ladder was found — the host DVFS/power model consumes them — so the core-vs-SC labeling is positional:

  • Ladder A (+0x28): {1.60, 1.70, 1.80, 1.90, 2.00, 2.05, 2.10, 2.20} GHz. Tops out above the v7x TensorCore clock (+0x50 = 1900 MHz) — the core/TensorCore-domain ladder with turbo/boost states. (I)
  • Ladder B (+0x2D0): {1.40, 1.50, 1.60, 1.75, 1.75, 1.80, 1.85, 1.90} GHz. Its repeated 1.75 GHz point equals the v7x SparseCore clock (+0x2F8 = 1750 MHz) — the SC/memory-domain ladder. (I)
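
The positional argument above can be restated as a small check. This sketch assumes the stored kHz values are exactly the GHz figures listed above scaled by 10^6, which the two-decimal rounding on this page does not guarantee; the variable names are illustrative.

```python
# Ladder values in kHz, transcribed from the GHz lists above (assumed exact).
LADDER_A_KHZ = [1_600_000, 1_700_000, 1_800_000, 1_900_000,
                2_000_000, 2_050_000, 2_100_000, 2_200_000]
LADDER_B_KHZ = [1_400_000, 1_500_000, 1_600_000, 1_750_000,
                1_750_000, 1_800_000, 1_850_000, 1_900_000]

TENSORCORE_CLK_KHZ = 1_900_000   # +0x50 on DT12
SPARSECORE_CLK_KHZ = 1_750_000   # +0x2F8 on DT12

# Ladder A points above the fixed TensorCore clock: the candidate boost states.
boost_points = [f for f in LADDER_A_KHZ if f > TENSORCORE_CLK_KHZ]

# Ladder B points equal to the fixed SparseCore clock.
sc_matches = LADDER_B_KHZ.count(SPARSECORE_CLK_KHZ)

print(boost_points, sc_matches)
```

Both observations are consistent with the (I) labeling, but neither identifies a reader, so the core-vs-SC assignment remains an inference.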

The Producer: A Frozen Compile-Time Const

There is no runtime producer function. The symbol is a LOCAL OBJECT of size 18632 in section .lrodata (flags AMSl, no write bit), mapped by a PT_LOAD segment with p_flags = R E. Any runtime store to 0x1C60480 would fault in a read-only-executable segment, and a .rela.dyn scan returns zero relocations in the table's range — so it has no pointer fields and cannot be patched at load. The bytes are a hand-written / codegen'd static const DeviceTypeInfo[17] emitted by the C++ front end. The up-to-13 byte-identical copies (each referenced through a per-copy movabs base = copy_VMA − GOT(0x224C2980)) are identical-code-folding artifacts of the distinct template/pass instantiations.
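
The relocation scan described above amounts to a range test over the .rela.dyn entries. The sketch below assumes the section bytes have already been extracted and follow the standard Elf64_Rela layout (r_offset, r_info, r_addend, 24 bytes, little-endian). The function name relocs_in_range is hypothetical and the demo input is synthetic, not data from the binary.

```python
import struct

RELA_ENTRY = struct.Struct("<QQq")  # Elf64_Rela: r_offset, r_info, r_addend

TABLE_START = 0x1C60480
TABLE_END = 0x1C64D48  # exclusive


def relocs_in_range(rela: bytes, start: int, end: int) -> list:
    """Return the r_offset of every relocation that lands in [start, end)."""
    hits = []
    for r_offset, _r_info, _r_addend in RELA_ENTRY.iter_unpack(rela):
        if start <= r_offset < end:
            hits.append(r_offset)
    return hits


# Synthetic section: one entry just below the table, one exactly at its end.
clean = RELA_ENTRY.pack(TABLE_START - 8, 8, 0) + RELA_ENTRY.pack(TABLE_END, 8, 0)
# Positive control: an entry that does land inside the table.
dirty = clean + RELA_ENTRY.pack(TABLE_START + 0x10, 8, 0)

print(relocs_in_range(clean, TABLE_START, TABLE_END))
print([hex(o) for o in relocs_in_range(dirty, TABLE_START, TABLE_END)])
```

An empty result over the real section is what supports the no-pointer-fields claim: a pointer field in a position-independent shared object would need a relocation inside the table's range.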

The peak-compute doubles are an independent xprof spec table, not a byte-frozen copy of the chip_parts FlopsPerSecond constants. Cross-validation against the decoded chip_parts numbers shows the bf16 and all v5p slots diverge, while only the v6e int8/int4 slots coincide — because both derive from the same systolic_dim² × clk hardware pedigree, not because one was copied from the other:

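| slot | kDeviceTypeInfo (TFLOP/s) | chip_parts /1e12 | ratio | match |
| --- | --- | --- | --- | --- |
| v5p bf16 +0x60 | 236.7 | 197.0 | 1.20 | no |
| v5p int8 +0x78 | 466.2 | 394.0 | 1.18 | no |
| v5p int4 +0x80 | 925.2 | 788.0 | 1.17 | no |
| v6e bf16 +0x60 | 946.7 | 918.0 | 1.03 | no |
| v6e int8 +0x78 | 1835.0 | 1835.0 | 1.00 | yes |
| v6e int4 +0x80 | 3670.0 | 3670.0 | 1.00 | yes |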

So chip_parts.binarypb is the compiler/Target spec source and kDeviceTypeInfo is the parallel xprof profiler spec source; the two are joined only at the DeviceType ordinal and are not interchangeable.
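
The ratio and match columns of the cross-validation table can be recomputed from its two value columns. This is a minimal sketch using only the numbers quoted above; exact float equality stands in for the byte-level comparison, which is an assumption of this example.

```python
# (slot, kDeviceTypeInfo TFLOP/s, chip_parts FlopsPerSecond / 1e12)
ROWS = [
    ("v5p bf16 +0x60", 236.7, 197.0),
    ("v5p int8 +0x78", 466.2, 394.0),
    ("v5p int4 +0x80", 925.2, 788.0),
    ("v6e bf16 +0x60", 946.7, 918.0),
    ("v6e int8 +0x78", 1835.0, 1835.0),
    ("v6e int4 +0x80", 3670.0, 3670.0),
]

# Recompute ratio (two decimals) and the match flag for every slot.
results = [
    (slot, round(xprof / chip, 2), xprof == chip)
    for slot, xprof, chip in ROWS
]

for slot, ratio, match in results:
    print(f"{slot:16s} ratio={ratio:.2f} match={'yes' if match else 'no'}")
```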


Roofline Readers

The profiler does not evaluate a roofline in-binary; it stamps the roofline inputs — a device-capability stat record — onto the XPlane, from which the out-of-binary host xprof/roofline tool computes the ceiling. The readers:

  • Clocks → device-info. GetJobInfoFromResponse (0xF2C9AC0) checks core_multi_flag (+0x00), then lifts gtc_freq_khz (+0x04), tensorcore_clk_khz (+0x50), and sparsecore_clk_khz (+0x2F8) ×1000 into the Task proto's gtc_freq_hz/tensor_core_freq_hz/sparse_core_freq_hz fields. The decompiled site reads [base + 1096*ord + col] with col[1]=+0x04, col[20]=+0x50, col[190]=+0x2F8, each 1000LL *.
  • Peak compute → cost model. XProfTpuCostAnalysis::HandleConvolution (0xF58E7C0) starts at DBL_MAX, maps each conv operand's Shape::element_type through a 28-entry jump table to load +0x60 (bf16), +0x78 (int8/fp8), or +0x80 (int4/fp4), takes the running vminsd (the slowest operand precision dominates), then divides by the core count. This is the byte-exact proof that +0x60/+0x78/+0x80 are the per-precision peak-FLOPS table.
  • Geometry → ordinal / cost. HostCoreId::ToDeviceOrdinal (0xF69C580) divides logical_devices_a/b (+0x10/+0x14) by cores_per_chip (+0x0C) to map (host, core) to a device ordinal. GetCostAdjustmentFunction (0xF58EFC0) reads megacore_flag (+0x2C4, cmp BYTE,1) to enable AdjustCostForMegacoreFunction and cores_per_chip (+0x0C).
  • Device-capability XStats → XPlane. ConvertTpuTraceToXPlaneV2 stamps +0x60 (peak TFLOP/s, stat 0x62), +0xD0/+0x100/+0x108/+0x120/+0x128 (per-memory-space bandwidth, stats 0x63..0x67, each ×1.073741824 = 2^30/10^9, which rescales a GiB/s figure to base-10 GB/s), and +0x2C4 (has_megacore, stat 0x6B). ConvertTpuTraceToXPlane feeds six perf-counter-set bases to GetPerformanceCounterNames<N>: +0x2C8/+0x348/+0x350/+0x358 as the four <28> (TensorCore/SparseCore) sets, +0x438 as the <12> (ICR/router) set, and +0x440 as the <3> (CMNUR/HBM) set. ConvertFirmwareTraceEntriesToXPlane feeds the +0x360..+0x378 doubles and the +0x380..+0x398 ulongs into FirmwareEventBuilder, which stamps "power"/"temperature" stats.
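
Two of the reader behaviours listed above are simple enough to sketch: the running-minimum peak selection in HandleConvolution and the ×1000 clock lift in GetJobInfoFromResponse. The precision-class labels and function names below are hypothetical, the 28-entry element-type jump table is not reproduced, and cores_per_chip=1 in the demo is illustrative rather than a value read from the table.

```python
import sys

# Byte offsets of the three per-precision peak slots named on this page.
PEAK_SLOT = {"bf16": 0x60, "int8_fp8": 0x78, "int4_fp4": 0x80}

# int32 column indices of the three clocks; byte offset = column * 4.
CLOCK_COLUMN = {
    "gtc_freq_hz": 1,
    "tensor_core_freq_hz": 20,
    "sparse_core_freq_hz": 190,
}


def conv_peak_per_core(operand_classes, peaks, cores_per_chip):
    """Running minimum over operand precisions, then divide by core count."""
    peak = sys.float_info.max  # stands in for DBL_MAX
    for cls in operand_classes:
        peak = min(peak, peaks[PEAK_SLOT[cls]])
    return peak / cores_per_chip


def clocks_to_hz(clocks_khz):
    """Lift kHz clock fields to Hz, mirroring the 1000LL * multiply."""
    return {name: 1000 * khz for name, khz in clocks_khz.items()}


# v6e peaks from the cross-validation table, keyed by slot offset.
V6E_PEAKS = {0x60: 946.7, 0x78: 1835.0, 0x80: 3670.0}
mixed = conv_peak_per_core(["bf16", "int8_fp8"], V6E_PEAKS, cores_per_chip=1)

# v7x clocks in kHz, as quoted on this page.
v7x_hz = clocks_to_hz({
    "gtc_freq_hz": 833000,
    "tensor_core_freq_hz": 1900000,
    "sparse_core_freq_hz": 1750000,
})
print(mixed, v7x_hz)
```

With a bf16 operand and an int8 operand the bf16 slot wins, which is the "slowest operand precision dominates" behaviour described for the cost model.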

QUIRK: the +0xD0 HBM-bandwidth member, after the ×1.073741824 scaling, is byte-exact to the chip_parts HBM bandwidth (v7x 3433 → 3686.2 GB/s = 3.686 TB/s; v6e 1525.5 → 1638.0 GB/s = 1.638 TB/s). The factor is 2^30/10^9, so the stored value is a GiB/s figure and the stamped value is base-10 GB/s. The +0xB8 group is not the authoritative HBM-bandwidth stamp; it is a separate effective-bandwidth spec with no in-binary reader. A reimplementation reading +0xB8 as the HBM peak would be reading the wrong field.
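
The two worked figures in the QUIRK can be reproduced directly. This sketch applies the 1.073741824 factor (2^30 / 10^9) to the +0xD0 values quoted above; the function name is illustrative.

```python
SCALE = 2**30 / 10**9  # 1.073741824, the factor applied before stamping


def stamped_bandwidth(raw: float) -> float:
    """Scale a stored bandwidth double the way the XPlane stamp does."""
    return raw * SCALE


v7x = stamped_bandwidth(3433.0)
v6e = stamped_bandwidth(1525.5)

# Rounded to one decimal, then expressed in thousands for the TB/s comparison.
print(round(v7x, 1), round(v7x / 1000, 3))
print(round(v6e, 1), round(v6e / 1000, 3))
```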


DeviceType → Codename Binding

The kDeviceTypeInfo index is the 1-based xprof::DeviceType enum. DeviceTypeFromDeviceIdentifiers (0xF6993A0) maps a captured 12-byte PCI tuple (all vendor_id == 0x1AE0, Google) to the ordinal, and DeviceTypeString (0xF69C7C0) maps the ordinal to the public name (return DeviceTypeString[ord-1], default "Cloud TPU" for ord-1 > 0xC). The eight real silicon generations:

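| ordinal | public name | codename | family | GTC clk (kHz) | ts width (bits) | compute clk (kHz) |
| --- | --- | --- | --- | --- | --- | --- |
| 3 | TPU v2 | Jellyfish | jxc | 700000 | 48 | 700000 |
| 5 | TPU v3 | Dragonfish | jxc | 700000 | 48 | 940000 |
| 7 | TPU v4 | Pufferfish | pxc::pfc | 700000 | 48 | 1050000 |
| 8 | TPU v4 Lite | Puffylite | pxc::plc | 700000 | 48 | 1050000 |
| 10 | TPU v5 | Viperfish (v5p) | vxc::vfc | 800000 | 45 | 1750000 |
| 11 | TPU v5 Lite | Viperlite (v5e) | vxc::vlc | 800000 | 45 | 1500000 |
| 12 | TPU v7x | 6acc60406 (gfc) | gxc::gfc | 833000 | 45 | 1900000 |
| 13 | TPU v6 Lite | Ghostlite (v6e) | gxc::glc | 800000 | 45 | 1750000 |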

DeviceTypeFromDeviceIdentifiers matches each codename's kXxxChipIdentifiers tuple in turn (Jellyfish→3, Dragonfish→5, Puffylite→8, the three Pufferfish B0 SKUs→7, the four Viperlite SKUs→11, the two Viperfish SKUs→10) and dispatches the two Ghost families via IsGlc→13 and IsGfc→12. Ordinals 1/2/4/6/9/14..16 are the host-GPU plane and "Cloud TPU" placeholder/reserved slots (DeviceType 9 is a reserved 64-bit-timestamp, 1.333 GHz slot with no PCI tuple). DeviceTypeToHardwareType (0xF69C7A0) confirms the split: the eight named gens map to hardware-type 3 (TPU), the placeholders to 0/1, the GPU plane to 2.
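
The ordinal-to-name binding for the eight named generations can be sketched as a lookup. This is a simplification under stated assumptions: the binary indexes a pointer array with ord-1, whereas this example uses a dict and returns None for in-range placeholder ordinals because their strings are not listed on this page. The treatment of ordinal 0 as falling through to the default is also an assumption. The function name device_type_string is hypothetical.

```python
# ordinal -> (public name, codename), from the generation table above.
NAMED_GENERATIONS = {
    3: ("TPU v2", "Jellyfish"),
    5: ("TPU v3", "Dragonfish"),
    7: ("TPU v4", "Pufferfish"),
    8: ("TPU v4 Lite", "Puffylite"),
    10: ("TPU v5", "Viperfish (v5p)"),
    11: ("TPU v5 Lite", "Viperlite (v5e)"),
    12: ("TPU v7x", "6acc60406 (gfc)"),
    13: ("TPU v6 Lite", "Ghostlite (v6e)"),
}

DEFAULT_NAME = "Cloud TPU"


def device_type_string(ordinal: int):
    """Return the public name for a DeviceType ordinal.

    Ordinals with ord-1 > 0xC get the default name, as described above.
    Ordinal 0 is treated the same way (an assumption of this sketch).
    In-range ordinals without a listed name return None.
    """
    if ordinal < 1 or ordinal - 1 > 0xC:
        return DEFAULT_NAME
    named = NAMED_GENERATIONS.get(ordinal)
    return named[0] if named else None


print(device_type_string(12), device_type_string(16), device_type_string(9))
```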

NOTE — the profiler's DeviceType ordinal is a third numbering distinct from both TpuVersion (0-based internal) and TpuVersionProto (1-based wire). It is keyed off the captured PCI tuple, not off TpuVersion. Use the Codename Matrix to reconcile the three.


| Name | Relationship |
| --- | --- |
| GtcSpanConverter (0xF2CB6E0) | canonical +0x04 GTC-clock reader; pins the 0x448 stride and the 17-entry bound |
| GetJobInfoFromResponse (0xF2C9AC0) | lifts the three clocks ×1000 into the Task device-info proto |
| HandleConvolution (0xF58E7C0) | reads +0x60/+0x78/+0x80 peak FLOPS for the cost model |
| DeviceTypeFromDeviceIdentifiers (0xF6993A0) | PCI tuple → DeviceType ordinal → codename binding |
| FirmwareEventBuilder (0xF284D20) | consumes the +0x360 power/thermal coefficient bundle |

Cross-References

  • Codename Matrix — reconciles the DeviceType ordinal with TpuVersion and the codename roster
  • Per-DeviceType Struct — the full 0x448 field dump and unit/wrap-period proof
  • kDeviceTypeInfo Producer / Readers — the compile-time-const producer argument and the XPlane device-capability stamp path
  • v7x Perf-Counters — owns the +0x2C8/+0x348/+0x350/+0x358/+0x438/+0x440 perf-counter-set fields and the DeviceType == 12 gate
  • chip_parts.binarypb — the parallel compiler-side per-gen spec source; FlopsPerSecond cross-validated against +0x60/+0x78/+0x80