Profiling and Telemetry

All addresses and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions will differ.

Abstract

libtpu carries a complete copy of the TensorFlow/XLA profiling stack — the same XSpace/XPlane/XLine/XEvent/XStat model that JAX, PyTorch-XLA, and TensorBoard already speak — plus a TPU-specific device-trace codec and an orthogonal core-state telemetry schema. A profile is captured on-device as a stream of fixed-width hardware trace packets, encoded into a compressed riegeli blob, decoded per chip family into proto2 TraceEntry messages, and shaped into the device XPlanes of one XSpace that leaves the library as a serialized blob through the PJRT Profiler extension. This section reconstructs every stage of that pipeline from the binary; this page is the map.

The whole subsystem splits into two deliberately disjoint formats. xprof (the time-series event trace) answers "what happened over this interval?" — its root is XSpace, it is produced by a tsl::profiler::ProfilerCollection of host and device sub-profilers, and it is pulled through the PJRT C-ABI. tpu_telemetry (the point-in-time state snapshot) answers "what is every core doing right now?" — its root is AllCoreStateSummaries, it is produced by the on-host xdb (TPU debugger) state server, and it is pulled over gRPC by Cloud TPU monitoring and the Megascale hang-detector. The two never cross at the wire: PLUGIN_Profiler_CollectData serves only XSpace and the two state-summary RPCs never emit XSpace. They meet only through the TpuProfilerControlListener, which registers the compiler/debug metadata that both the profiler (to stamp XEvent source locations) and the xdb server (to resolve a PC to an HLO location) draw upon.

This page owns the capture → encode → decode → xplane orientation, the per-DeviceType profiler dispatch, and the trace-point / payload taxonomy — the shape of the pipeline and the map into its 21 sibling pages. It does not re-derive their internals: the PJRT extension struct and lifecycle live on PJRT Profiler Extension (written); the bit-level wire format, the per-family payload field maps, the metadata-id catalogs, and the telemetry schema each have their own deep-dive page, linked below and never duplicated here.

For orientation, the contract this section reconstructs is:

  • Two formats, one boundary: xprof XSpace (event trace, PJRT C-ABI) vs tpu_telemetry AllCoreStateSummaries (state snapshot, gRPC) — and the TpuProfilerControlListener that is their only shared dependency.
  • The five-stage xprof pipeline: on-device fixed-16-byte trace packets → riegeli+zlib transport → per-family TraceCodecInterface::DecodeEntry → proto2 TraceEntry → TpuXLineBuilder::AddEvent into device XPlanes of one XSpace.
  • The four-level xprof data model: XSpace → XPlane → XLine → XEvent → XStat, with interned XEventMetadata/XStatMetadata, identical to upstream TF and extended by TPU event/stat ids.
  • The per-DeviceType codec dispatch: xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) selects a chip-family codec (pxc/vfc/vlc/glc/gfc fixed-width; jxc legacy PerformanceTraceEntry), each with its own trace-point band layout.

Trace model: tensorflow.profiler.XSpace → XPlane → XLine → XEvent → XStat (4-level event tree, interned metadata)
Capture format: fixed 16-byte (128-bit) LSB-first hardware trace packet, per chip family
Transport: riegeli::ZlibReader<riegeli::StringReader> (zlib-inflated blob of packets)
Device decode: xprof::tpu::DecodeTraceBuffers<TraceEntry> @ 0xf59ffa0 (pxc); walks packets → TraceEntry
Codec selector: xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) @ 0xf5a2900
Device sub-profiler: xprof::tpu::TpuProfilerImpl::CollectData(XSpace*) @ 0xef34860
Backend: tsl::profiler::ProfilerCollection (built by CreateProfilers @ 0x1cf50860)
xprof export: serialized XSpace bytes via PJRT Profiler extension (type 1) + legacy TpuProfiler_* C-ABI
Telemetry schema: platforms/deepsea/jellyfish/xdb/tpu_telemetry/tpu_telemetry.proto (6 msgs, 2 enums) → AllCoreStateSummaries, gRPC-pulled
Run gate: xprof::tpu::TpuProfilerControlListener @ GetOrCreateTpuProfilerControlListener 0xf332800

The Capture → Encode → Decode → XPlane Pipeline

The xprof half of the subsystem is a five-stage flow from hardware ring buffer to a serialized XSpace. Each stage is owned by a different sibling page; this section names the stages, the symbol at each boundary, and the handoff between them.

[1] HW trace ring buffer (per core, per sequencer)
      │  emits fixed 16-byte packets, LSB-first:
      │  [valid:1][started:1][trace_point_id:8][block_id:3|6][timestamp:48|45][payload @ bit 61 …]
      ▼
[2] riegeli ZlibReader<StringReader> inflate          (DecodeTraceBuffers @ 0xf59ffa0)
      │  one compressed blob → decompressed stream of packets
      ▼
[3] TraceCodecInterface::DecodeEntry  (per chip family; pxc DecodeEntry @ 0xf5af3a0)
      │  GetBits(1)valid · GetBits(1)started · GetBits(8)id → 111-entry jump table @ 0xab85bc0
      │  per-event Decode<Name>(): header + [TraceIdHeader 21/3/12] + typed payload; total-bit CHECK
      ▼
[4] proto2 TraceEntry  (oneof variant = the event; oneof tag @ +0x28)
      │
      ▼
[5] xprof::tpu TpuXLineBuilder::AddEvent → device-plane XEvent + XStats
      │  trace_point_id → XEventMetadata.name ; timestamp → XEvent offset/duration_ps ; scalars → XStat
      ▼
   one tensorflow.profiler.XSpace  ── serialized ──▶ PLUGIN_Profiler_CollectData buffer

[1] Capture — on-device trace points

Each TPU core's hardware emits trace events into per-core ring buffers as constant-size 16-byte packets, LSB-first. The packet begins with a 2-bit framing prefix (valid, started), an 8-bit trace_point_id, a 3-or-6-bit trace_point_block_id, and a 48-or-45-bit cycle-counter timestamp — a fixed 59-bit header inside a 61-bit framing+header envelope, after which a per-event fixed-width payload begins at bit 61. The valid bit is the empty-slot sentinel that lets a buffer drain without a count; started catches torn hardware writes. The bit-level layout, the framing semantics, and the dual decode/encode dispatch are the subject of TraceEntriesCoder; the per-band payload field maps are split across the four payload pages below.
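
The header layout described above can be sketched as a small bit-field reader. This is a minimal illustration, not the library's decoder: it assumes "LSB-first" means bit 0 is the least significant bit of byte 0 (little-endian numbering), and the names HeaderLayout, PacketHeader, get_bits, and decode_header are hypothetical.

```python
from dataclasses import dataclass

PACKET_BYTES = 16  # fixed packet size stated in the text


@dataclass(frozen=True)
class HeaderLayout:
    block_id_bits: int   # 3 or 6, depending on chip family
    timestamp_bits: int  # 48 or 45, depending on chip family


@dataclass(frozen=True)
class PacketHeader:
    valid: int
    started: int
    trace_point_id: int
    block_id: int
    timestamp: int       # raw cycle count, not picoseconds
    payload_offset: int  # bit index where the payload begins


def get_bits(word: int, offset: int, width: int) -> int:
    """Extract `width` bits starting at bit `offset` (bit 0 = least significant)."""
    return (word >> offset) & ((1 << width) - 1)


def decode_header(packet: bytes, layout: HeaderLayout) -> PacketHeader:
    if len(packet) != PACKET_BYTES:
        raise ValueError("expected a 16-byte packet")
    # Assumption: byte 0 holds bits 0..7 of the 128-bit packet.
    word = int.from_bytes(packet, "little")
    pos = 0
    fields = []
    # Field order follows the text: valid, started, id, block_id, timestamp.
    for width in (1, 1, 8, layout.block_id_bits, layout.timestamp_bits):
        fields.append(get_bits(word, pos, width))
        pos += width
    return PacketHeader(*fields, payload_offset=pos)
```

With either header split (3/48 or 6/45) the reader ends at bit 61, which matches the stated payload start.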

[2] Encode / Transport — riegeli + zlib

The packed packets are never stored raw: the device trace blob is wrapped in a riegeli::StringReader, inflated through a riegeli::ZlibReader, read whole, then walked one 16-byte packet at a time. This is the container layer owned by riegeli Trace Container. The same riegeli/PutBits primitives encode packets on the producing side (the byte-exact inverse of decode).
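
As a rough sketch of the transport step, the following inflates a blob and walks it in 16-byte strides. It uses Python's standard zlib module as a stand-in for riegeli::ZlibReader over a riegeli::StringReader, and it assumes that the walk stops at the first packet whose valid bit is clear; both choices, and the name iter_packets, are assumptions of this example rather than confirmed behavior.

```python
import zlib
from typing import Iterator

PACKET_BYTES = 16  # fixed packet size stated in the text


def iter_packets(compressed_blob: bytes) -> Iterator[bytes]:
    """Inflate the whole blob, then yield 16-byte packets until an empty slot."""
    raw = zlib.decompress(compressed_blob)
    for start in range(0, len(raw) - PACKET_BYTES + 1, PACKET_BYTES):
        packet = raw[start:start + PACKET_BYTES]
        # Assumption: `valid` is bit 0 of byte 0 (LSB-first numbering).
        if packet[0] & 1 == 0:
            break  # empty-slot sentinel: treat the rest of the buffer as unwritten
        yield packet
```

Any trailing bytes shorter than one packet are ignored by this sketch.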

[3] Decode — per-family codec

DecodeTraceBuffers<TraceEntry> iterates the decompressed stream calling the codec's virtual DecodeEntry per packet. DecodeEntry peeks the 2 framing bits and the 8-bit trace_point_id, dispatches through a 111-entry rel32 jump table (pxc @ 0xab85bc0) into an anonymous-namespace Decode<Name>(), which re-decodes the header, an optional 36-bit TraceIdHeader, and a typed payload, then validates the total consumed bit count against a hardcoded CHECK constant. The decoder mechanics are owned by TraceEntriesCoder; the resulting proto2 TraceEntry (oneof tag at +0x28) is the decode product.
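
The dispatch shape described above (bound check, table lookup, per-event decode, total-bit check) might look like the following sketch. The constants 61, 36, and 0x6E come from the text; the single registered event, its 20-bit payload, and all function names are hypothetical placeholders, and the real handlers are native code reached through a rel32 jump table rather than a Python list.

```python
from typing import Callable, Dict, List, Optional

HEADER_BITS = 61           # framing + header envelope, from the text
TRACE_ID_HEADER_BITS = 36  # pxc TraceIdHeader width, from the text
MAX_WIRE_ID = 0x6E         # pxc bound: wire ids above 110 are rejected


class DecodeError(ValueError):
    """Raised for reserved or out-of-range ids and bit-count mismatches."""


def reject(word: int) -> Dict[str, object]:
    # Stand-in for the common error label that reserved slots route to.
    raise DecodeError("reserved trace_point_id")


def make_decoder(name: str, payload_bits: int, with_trace_id: bool,
                 expected_total_bits: int) -> Callable[[int], Dict[str, object]]:
    def decode(word: int) -> Dict[str, object]:
        consumed = HEADER_BITS
        if with_trace_id:
            consumed += TRACE_ID_HEADER_BITS
        payload = (word >> consumed) & ((1 << payload_bits) - 1)
        consumed += payload_bits
        # Mirrors the hardcoded total-bit CHECK at the end of each handler.
        if consumed != expected_total_bits:
            raise DecodeError(f"{name}: consumed {consumed} bits")
        return {"event": name, "payload": payload}
    return decode


# 111 slots (0..110); every slot rejects unless a handler is registered.
JUMP_TABLE: List[Callable[[int], Dict[str, object]]] = [reject] * (MAX_WIRE_ID + 1)
JUMP_TABLE[0] = make_decoder("HypotheticalEvent", 20, True, 117)  # made-up event


def decode_entry(packet: bytes) -> Optional[Dict[str, object]]:
    word = int.from_bytes(packet, "little")  # assumption: bit 0 = LSB of byte 0
    if word & 1 == 0:
        return None  # valid bit clear: empty slot
    wire_id = (word >> 2) & 0xFF
    if wire_id > MAX_WIRE_ID:
        raise DecodeError("trace_point_id out of range")
    return JUMP_TABLE[wire_id](word)
```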

[4] TraceEntry — the decoded proto

A TraceEntry is a proto2 message whose active oneof variant identifies the event. The wire trace_point_id (banded, gappy) is not the proto oneof field number (dense): decode indexes the jump table by the wire id, encode dispatches on the oneof field. This two-id-space pairing and the master id↔field registry are owned by TracePoints Master Registry.

[5] XPlane shaping — TraceEntry to XEvent

xprof::tpu turns each TraceEntry into a device-plane XEvent: TraceHeader.trace_point_id becomes the per-plane XEventMetadata.name (the enum string), TraceHeader.timestamp (a cycle counter) becomes the XEvent offset_ps/duration_ps, and the decoded variant scalars become XStats. The translation — what exactly lands in XEventMetadata.metadata and how scalars map to stats — is owned by TraceEntry → XEvent/XStat. Host events arrive on a separate path (see below) and fold into the same XSpace.

NOTE — the cycle-counter timestamp (48 or 45 bits) is the raw device clock, not picoseconds. The cycle→ps conversion factor (per-gen device clock rate) and the per-line timestamp_ns origin are applied downstream in TpuXLineBuilder, not in the codec. A reimplementation that treats the on-wire timestamp as picoseconds will be off by the clock period and will mis-compute the wrap interval. See TraceEntriesCoder.
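
To make the note concrete, here is a small sketch of the cycle-to-picosecond conversion and the counter wrap interval. The clock rate is a caller-supplied parameter because the per-generation device clock is not stated on this page; the function names are hypothetical, and integer floor division is a choice of this example.

```python
from fractions import Fraction

PS_PER_SECOND = 10**12


def cycles_to_ps(cycles: int, clock_hz: int) -> int:
    """Convert a raw cycle count to picoseconds (floor) at the given clock rate."""
    return cycles * PS_PER_SECOND // clock_hz


def wrap_interval_seconds(timestamp_bits: int, clock_hz: int) -> Fraction:
    """Exact time until a `timestamp_bits`-wide cycle counter wraps."""
    return Fraction(1 << timestamp_bits, clock_hz)
```

At a purely illustrative 1 GHz clock, a 48-bit counter wraps after about 281,475 seconds and a 45-bit counter after about 35,184 seconds, an eightfold difference that follows directly from the three-bit width gap.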


The xprof Data Model

The decoded events land in the canonical TensorFlow profiling tree — identical across TF/JAX/PyTorch-XLA, which is why a TPU profile opens in stock TensorBoard. libtpu ships its own copy of the tsl::profiler builders (XPlaneBuilder @ 0x1cf4d9a0 for GetOrCreateLine, XLineBuilder::AddEvent @ 0x1cf4dc40), confirming the model is not merely imported but instantiated in-binary.

| Level | Role | Key fields | Owned by |
|---|---|---|---|
| XSpace | root container for one profiling session | list of XPlane | this page (opaque blob) |
| XPlane | one device core or one host thread-group | id, name (/host:0, /device:TPU:0, …), XLine[], interned XEventMetadata[] + XStatMetadata[] | XPlane / XStat / TraceMe |
| XLine | one timeline within a plane (one core/stream) | id, name, timestamp_ns origin, duration_ns, XEvent[] | XPlane / XStat / TraceMe |
| XEvent | a point or duration event | metadata_id (FK → XEventMetadata), offset_ps, duration_ps, XStat[] | XEvent Metadata IDs |
| XStat | a key/value annotation on an event | metadata_id (FK → XStatMetadata), variant value (int64/uint64/double/bytes/str-ref) | XStat Metadata IDs |

The two metadata lists per XPlane are interned tables: an XEvent carries only a metadata_id index, and the name/category live once in XEventMetadata; likewise stats. This is why the metadata-id catalogs are their own pages — the ids are the dictionary every consumer must share with the producer.
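
The interning idea can be illustrated with a per-plane name table. This is a simplified sketch, not the tsl::profiler builder: MetadataTable is a hypothetical name, and starting ids at 1 is a choice made here for the example.

```python
from typing import Dict


class MetadataTable:
    """Per-plane interning table: each name is stored once, events hold only the id."""

    def __init__(self) -> None:
        self._id_by_name: Dict[str, int] = {}
        self.name_by_id: Dict[int, str] = {}

    def get_or_create(self, name: str) -> int:
        existing = self._id_by_name.get(name)
        if existing is not None:
            return existing
        new_id = len(self._id_by_name) + 1  # sketch choice: ids start at 1
        self._id_by_name[name] = new_id
        self.name_by_id[new_id] = name
        return new_id
```

Because each plane owns its own table, the same event name can receive different ids on different planes, so a consumer has to resolve ids through the metadata of the plane it is reading.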

Host events vs device events

Two sources feed one XSpace. Host events come from tsl::profiler::TraceMe scopes on TPU-runtime threads (TpuCompile, TpuExecute, queue submission, megascale transport) and land on the /host:0 plane via the HostTracer/ThreadpoolProfilerInterface sub-profilers. Device events come from the per-core hardware ring buffers drained and decoded as above, landing on /device:TPU:N planes via TpuProfilerImpl. The TraceMe macro mechanics and the host-plane shaping are owned by XPlane / XStat / TraceMe.

GOTCHA — the device-trace trace_point_id space (banded hardware enum, gappy; the pxc decode jump table at 0xab85bc0 bounds the 8-bit wire id at 0x6e=110 — cmp $0x6e,%rax; ja <error> @ 0xf5af451 — with 64 handled indices and a DummyTracePoint variant) is a different namespace from the XEventMetadata.metadata_id space (a dense per-plane interning index). The codec's id is mapped to an XEventMetadata name string during stage 5; the XEvent.metadata_id is then allocated fresh per plane. A reimplementation that conflates the two will mis-key every device event. See XEvent Metadata IDs and TracePoints Master Registry.


The Backend: ProfilerCollection and Sub-Profilers

The xprof export path is driven by a tsl::profiler::ProfilerCollection — a vector of ProfilerInterface sub-profilers assembled by CreateProfilers @ 0x1cf50860 walking the global factory registry that each sub-profiler joins via RegisterProfilerFactory @ 0x1cf50780. The ProfilerCollection is the object the PJRT PLUGIN_Profiler handle owns and delegates Start/Stop/CollectData to; that delegation, the C-ABI lifecycle, and the handle layout are fully documented on PJRT Profiler Extension — this page does not repeat them.

Each backend op (ProfilerCollection::Start @ 0xf6a1640, Stop @ 0xf6a16c0, CollectData(XSpace*) @ 0xf6a1740) iterates the inner vector and calls the matching slot on every sub-profiler. Confirmed members:

| Sub-profiler | Side | Key symbol | Addr |
|---|---|---|---|
| xprof::tpu::TpuProfilerImpl | device: drains per-chip ring buffers, decodes, builds device planes | TpuProfilerImpl::CollectData(XSpace*) | 0xef34860 |
| xla::profiler::HostTracer | host: TraceMe/CPU events into /host:0 (factory in xprof::cpu) | HostTracer::CollectData(XSpace*) | 0xf32fb40 |
| tsl::profiler::ThreadpoolProfilerInterface | host: threadpool dispatch events | ThreadpoolProfilerInterface::CollectData(XSpace*) | 0xf3326c0 |

CreateProfilers wraps each factory output in a ProfilerController for crash isolation; the full factory inventory is populated lazily across several static-init blocks and is not exhaustively enumerated (LOW confidence on completeness — same gap noted on the PJRT page).
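
The fan-out pattern can be sketched as follows. This is an illustration of the delegation shape only: the XSpace here is a plain dictionary standing in for the proto, StubProfiler is a hypothetical test double, and error handling and the ProfilerController wrapper are omitted.

```python
from typing import Dict, List, Protocol

# Stand-in for the XSpace proto: plane name -> list of event names.
XSpace = Dict[str, List[str]]


class Profiler(Protocol):
    def start(self) -> None: ...
    def stop(self) -> None: ...
    def collect_data(self, space: XSpace) -> None: ...


class StubProfiler:
    """Hypothetical sub-profiler that contributes fixed events to one plane."""

    def __init__(self, plane_name: str, events: List[str]) -> None:
        self.plane_name = plane_name
        self.events = events
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def collect_data(self, space: XSpace) -> None:
        space.setdefault(self.plane_name, []).extend(self.events)


class ProfilerCollection:
    """Forwards each operation to every sub-profiler in order."""

    def __init__(self, profilers: List[Profiler]) -> None:
        self._profilers = list(profilers)

    def start(self) -> None:
        for profiler in self._profilers:
            profiler.start()

    def stop(self) -> None:
        for profiler in self._profilers:
            profiler.stop()

    def collect_data(self, space: XSpace) -> None:
        for profiler in self._profilers:
            profiler.collect_data(space)
```

All sub-profilers write into the same XSpace object, which is how host and device planes end up in one serialized result.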

The run gate

Device profiling runs concurrently with execution, mediated by the xprof::tpu::TpuProfilerControlListener singleton (GetOrCreateTpuProfilerControlListener @ 0xf332800). Each chip driver queries CanStartProfiler(TpuChipLocation, TpuChipProfiler*, int) @ 0xf3328c0 before opening its trace ring buffer and polls MustStopProfiler(TpuChipLocation) @ 0xf332a00 for a mid-run stop. This listener is also the shared PC→HLO metadata source that lets the xdb state server fill SequencerInfo.hlo_location — the one point where the trace and telemetry halves touch.


Per-DeviceType Codec Dispatch

There is no single device-trace codec: each TPU chip family has its own TraceCodecInterface<TraceEntry> (vtable slots DecodeEntry/EncodeEntry/GetMaxEntrySize/GetEntryPacketSize), selected at runtime by xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) @ 0xf5a2900 from a util_registration::StaticMapBase factory keyed by the chip codename. The packet size (16 bytes) and the 61-bit framing+header envelope are universal; what differs per family is the header split (3-vs-6-bit block_id, 48-vs-45-bit timestamp) and the per-trace-point payload field maps. The selector and the producer/reader wiring are owned by Per-DeviceType Profiler Struct and kDeviceTypeInfo Producer / Readers.

| Family | CreateTraceCodec | TraceEntry type | block_id | timestamp | Codec page area |
|---|---|---|---|---|---|
| pxc | plc::driver::profiler::CreateTraceCodec @ 0xf5af2c0 (pxc family) | fixed-width TraceEntry | 3 | 48 | TraceEntriesCoder |
| vfc | vfc::driver::profiler::CreateTraceCodec @ 0xf5f5da0 | fixed-width TraceEntry | 6 | 45 | Payload: vfc/vlc/gfc |
| vlc | vlc::driver::profiler::CreateTraceCodec @ 0xf5d5180 | fixed-width TraceEntry | 3 | 48 | Payload: vfc/vlc/gfc |
| glc | glc::driver::profiler::CreateTraceCodec @ 0xf6282e0 | fixed-width TraceEntry | 6 | 45 | Payload: vfc/vlc/gfc |
| gfc | gfc::driver::profiler::CreateTraceCodec @ 0xf65ed00 | fixed-width TraceEntry | 6 | 45 | Payload: vfc/vlc/gfc |
| jxc | (legacy path) | jxc::PerformanceTraceEntry | | | Payload: jxc Legacy |

QUIRK — jxc uses a different TraceEntry type — asic_sw::driver::deepsea::jxc::PerformanceTraceEntry, decoded by its own DecodeTraceBuffers<PerformanceTraceEntry> instantiation — not the shared fixed-16-byte TraceEntry codec the five current families use. A reimplementation that assumes one packet schema across all gens will misparse jxc traces. The jxc DMA/HbmMux/brn_perf specifics are on jxc DMA / HbmMux / brn_perf and Payload: jxc Legacy.
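
A registry keyed by family codename, in the spirit of the selector described in this section, could be sketched like this. The header splits are taken from the family table; the CodecInfo type and function names are hypothetical, and the real selector takes asic_sw::DeviceIdentifiers plus an int rather than a string.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CodecInfo:
    family: str
    fixed_width: bool                     # False for the jxc legacy entry type
    block_id_bits: Optional[int] = None
    timestamp_bits: Optional[int] = None


# Header splits as listed in the family table; jxc has no fixed-width split.
CODECS: Dict[str, CodecInfo] = {
    "pxc": CodecInfo("pxc", True, 3, 48),
    "vfc": CodecInfo("vfc", True, 6, 45),
    "vlc": CodecInfo("vlc", True, 3, 48),
    "glc": CodecInfo("glc", True, 6, 45),
    "gfc": CodecInfo("gfc", True, 6, 45),
    "jxc": CodecInfo("jxc", False),
}


def get_trace_codec(codename: str) -> CodecInfo:
    try:
        return CODECS[codename]
    except KeyError:
        raise ValueError(f"no trace codec registered for {codename!r}") from None


def header_envelope_bits(codec: CodecInfo) -> int:
    """valid + started + trace_point_id + block_id + timestamp."""
    if not codec.fixed_width:
        raise ValueError(f"{codec.family} does not use the fixed-width packet")
    return 1 + 1 + 8 + codec.block_id_bits + codec.timestamp_bits
```

Every fixed-width family sums to the same 61-bit envelope, while a jxc lookup signals that a different entry schema applies.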


The Trace-Point / Payload Taxonomy

The trace points per family (per-gen cardinality differs — 78–128 oneof variants depending on gen; pxc=99) are not a flat enum — they are organized into subsystem bands of contiguous trace_point_id ranges, with reserved gaps between bands that all dispatch to a common error label. The decode jump-table index span is wider than the variant count (pxc bounds the wire id at 0x6e=110, vlc at 0x8f=143, vfc at 0x5f=95, gfc at 0x64=100), since gappy bands leave reserved slots routed to the error label. The bands carve the device-trace space by hardware unit; each band's payload field maps live on a dedicated page so this opener stays a map, not a dump.

| Band | trace_point_id range (pxc) | Subsystem | Payload page |
|---|---|---|---|
| UHI | 0–10 | host-DMA / address translation | Payload: UHI/OCI/ICI/DMA |
| OCI | 20–27 | on-chip interconnect engine | Payload: UHI/OCI/ICI/DMA |
| ICI | 40–55 | inter-chip-interconnect / collective fabric | Payload: UHI/OCI/ICI/DMA |
| TCS | 80–97 | TensorCore sequencer sync/control + throttle | Payload: SparseCore Band |
| BC | 100–110 | BcFsm / Bcs / BcOci (SparseCore/broadcast) controllers; pxc top-level indices BcFsmChannelController0..10 | Payload: SparseCore Band |
| CMQ | (sub-dispatch) | command-queue / VPU DMA; reached within BC case bodies, not a distinct top-level wire id | Payload: UHI/OCI/ICI/DMA |
| (reserved) | 11–19, 28–39, 56–79, 98–99 | unused; all → common error label 0xf5b032f | n/a (decode rejects) |
| (out of range) | ≥ 111 | rejected by the cmp $0x6e,%rax; ja bound @ 0xf5af451 | n/a |
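
The pxc band table can be restated as a lookup, which also makes the slot arithmetic checkable: the five bands cover 64 wire ids and the reserved gaps cover the remaining 47 of the 111 table slots. This is a worked example for pxc only, with hypothetical names; other families need their own ranges taken from their own jump tables.

```python
from typing import List, Tuple

# (band, first id, last id), inclusive, from the pxc band table.
PXC_BANDS: List[Tuple[str, int, int]] = [
    ("UHI", 0, 10),
    ("OCI", 20, 27),
    ("ICI", 40, 55),
    ("TCS", 80, 97),
    ("BC", 100, 110),
]
PXC_MAX_WIRE_ID = 0x6E  # 110


def classify_pxc(wire_id: int) -> str:
    """Return the band name, 'reserved', or 'out of range' for a pxc wire id."""
    if not 0 <= wire_id <= 0xFF:
        raise ValueError("trace_point_id is an 8-bit field")
    if wire_id > PXC_MAX_WIRE_ID:
        return "out of range"
    for name, first, last in PXC_BANDS:
        if first <= wire_id <= last:
            return name
    return "reserved"
```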

A small set of cross-cutting render/timeline concerns sit above the raw bands: the ICR DMA-timeline derivation (ICR DMA-Timeline Band), DMA endpoint rendering (DMA Endpoint Rendering), and the v7x performance-counter lines (v7x Perf-Counters). Most variant messages also carry a shared TraceIdHeader (transaction_id:21, core_id:3, chip_id:12│14) immediately after the header — 36 bits on pxc, 38 bits on vfc/vlc/glc/gfc where chip_id widens to 14 — the per-transaction identity that lets a multi-packet DMA be stitched back together; some events (OCI read/write) carry three of them.
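
A sketch of reading one TraceIdHeader follows. The field widths are from the text; the field order (transaction_id first, then core_id, then chip_id, each LSB-first) is an assumption based on the order in which the text lists them, and the names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

TRANSACTION_ID_BITS = 21
CORE_ID_BITS = 3


@dataclass(frozen=True)
class TraceIdHeader:
    transaction_id: int
    core_id: int
    chip_id: int


def decode_trace_id_header(word: int, offset: int,
                           chip_id_bits: int) -> Tuple[TraceIdHeader, int]:
    """Read one TraceIdHeader at bit `offset`; return it and the next bit offset."""
    if chip_id_bits not in (12, 14):
        raise ValueError("chip_id is 12 bits (pxc) or 14 bits (other families)")
    pos = offset
    # Assumed order: transaction_id, core_id, chip_id.
    transaction_id = (word >> pos) & ((1 << TRANSACTION_ID_BITS) - 1)
    pos += TRANSACTION_ID_BITS
    core_id = (word >> pos) & ((1 << CORE_ID_BITS) - 1)
    pos += CORE_ID_BITS
    chip_id = (word >> pos) & ((1 << chip_id_bits) - 1)
    pos += chip_id_bits
    return TraceIdHeader(transaction_id, core_id, chip_id), pos
```

Starting at bit 61, the 12-bit variant ends at bit 97 and the 14-bit variant at bit 99, which matches the stated 36-bit and 38-bit totals.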

GOTCHA — the band ranges and reserved gaps are per family. The pxc ranges above are the worked example; vfc/vlc/glc/gfc shift band boundaries as trace-point cardinality grows (≈99 → ≈144 events). Drive band detection off the per-family decode jump table, never off a hardcoded pxc range. See TracePoints Master Registry.


Telemetry: The Orthogonal State Snapshot

tpu_telemetry.proto (package platforms_deepsea.jellyfish.xdb.tpu_telemetry, 6 messages, 2 enums, no imports) is not part of the xprof pipeline — it is a separate, pull-on-demand core-state snapshot produced by the on-host xdb state server, confirmed in the binary by the platforms_deepsea::jellyfish::xdb::tpu_telemetry::* proto symbols (e.g. AllCoreStateSummaries, CurrentCoreStateSummary, SequencerInfo, QueuedProgramInfo, TpuCoreIdentifier, TpuCoreOnChipProto) and the descriptor path platforms/deepsea/jellyfish/xdb/tpu_telemetry/tpu_telemetry.proto. Its root, AllCoreStateSummaries, is a map<int32 global_core_id, CurrentCoreStateSummary>; each CurrentCoreStateSummary carries repeated SequencerInfo (per hardware sequencer: pc, tag, tracemark, program_id, run_id, hlo_location) and a repeated QueuedProgramInfo launch queue. The full field-by-field schema, the producer/consumer graph (xdb debugger, Cloud TPU RuntimeMetricService.GetTpuRuntimeStatus, Megascale hang-detector), and the boundary against the companion hardware-telemetry protos are owned by tpu_telemetry.proto.
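
The shape of the snapshot can be mirrored with plain dataclasses. Only the message and field names come from the text; the Python field types, the container attribute names (sequencers, queued_programs), and the helper hlo_locations_by_core are assumptions of this sketch, and QueuedProgramInfo is left opaque because its fields are not listed here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequencerInfo:
    # Field names from the text; the types are guesses for illustration.
    pc: int
    tag: int
    tracemark: int
    program_id: int
    run_id: int
    hlo_location: str


@dataclass
class CurrentCoreStateSummary:
    sequencers: List[SequencerInfo] = field(default_factory=list)
    queued_programs: List[object] = field(default_factory=list)  # opaque here


# Root of the schema: a map from global_core_id to that core's summary.
AllCoreStateSummaries = Dict[int, CurrentCoreStateSummary]


def hlo_locations_by_core(summaries: AllCoreStateSummaries) -> Dict[int, List[str]]:
    """List, per core id, the HLO location each sequencer currently reports."""
    return {
        core_id: [seq.hlo_location for seq in summary.sequencers]
        for core_id, summary in sorted(summaries.items())
    }
```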

It carries core execution state only — not HBM bytes, temperature, watts, ICI BER, or ECC counters; those live in purpose-built companion protos (utilization_metrics.proto, power_metrics.proto, error_report.proto, …) outside this schema. The one gen-aware surface is the TpuCoreTypeProto / TpuSequencerTypeProto enum, which encodes the SparseCore evolution: SPARSE_CORE_V0 (sequencer + address handler) vs the current SPARSE_CORE (scalar sequencer + Tile-Access-Core + Tile-Execute-Core sequencers, ids 4/5/6).

NOTE — telemetry and xprof are produced by different runtime subsystems (xdb state server vs tsl::profiler ProfilerCollection) and consumed over different channels (gRPC pull vs PJRT C-ABI). They share only a small semantic overlap (both reference run_id/launch_id and an HLO location) and one implementation dependency (the TpuProfilerControlListener's PC→HLO metadata). They are never serialized into the same blob.


Section Map

The 21 sibling pages divide as follows. This page is the orientation; each link is the deep dive that owns its internals.

Entry and ABI

Capture and transport (the wire)

Payload field maps (per band/family)

Shaping into XSpace

Telemetry (orthogonal)


Cross-References