TraceEntriesCoder

All addresses and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped — full C++ symbols are present, and .text VMA equals file offset. Other versions will differ.

Abstract

TraceEntriesCoder is the on-device profiler trace-entry codec: the fixed-width, LSB-first bit format that every TPU hardware trace event packs into, and the per-chip-family decoder/encoder that translates between that wire packet and a proto2 TraceEntry message. It is the bottom-most layer of the xprof device-trace pipeline — the stage between the compressed riegeli container and the TraceEntry → XEvent/XStat shaping. Where a route-cache record is a self-delimiting varint stream, a profiler trace event is the opposite: a constant 16-byte (128-bit) packet with a 2-bit framing prefix, a 56- or 59-bit header (59 for pxc/vfc/glc/gfc, 56 for vlc), an optional 36/38-bit transaction-identity sub-record (36 on pxc, 38 on vfc/vlc/glc/gfc), and a per-event fixed-width payload, all read with the shared GetBits64/SkipBits bit-codec primitives.

The codec has a deliberately asymmetric two-id-space dispatch, and getting it right is the whole reimplementation. Decode peeks the 2 framing bits and the 8-bit trace_point_id — the banded hardware enum value, gappy, max handled 0x95 (149) for pxc plus a 0xff dummy — and indexes a pair of contiguous rel32 jump tables (the first @ 0xab85bc0 bounded at 0x6e, the second @ 0xab85d7c for ids 0x6f..0xff) to reach an anonymous-namespace Decode<EventName>(). Encode dispatches on the dense proto oneof field number stored at TraceEntry+0x28 through a parallel jump table. The two id spaces are not interchangeable; the registry that pairs them is owned by TracePoints Master Registry, and a reimplementation that drives encode off the wire id (or decode off the oneof field) mis-keys every event.

This page owns the codec format and dispatch: the 16-byte packet, the framing/header/TraceIdHeader bit layout, the per-event total-bit CHECK validation, the per-family CreateTraceCodec/GetTraceCodec factory wiring, and how the header split shifts across silicon generations. It does not own the per-band payload field maps (UHI/OCI/ICI/DMA, SparseCore band, vfc/vlc/gfc, jxc legacy), the trace-point id registry (master registry), the compressed container (riegeli), or the XEvent translation (trace-entry-to-xevent).

For reimplementation, the contract is:

  • The fixed 16-byte packet and the bit-stream reader — LSB-first, GetBits64NoInline(n) for an n-bit field, SkipBitsNoInline(n) to advance, masked by mask_[k] = (1<<k)-1.
  • The 2-bit framing prefix + 56/59-bit TraceHeader + optional 36/38-bit TraceIdHeader — the envelope, with the per-family block_id/timestamp split (pxc/vfc/glc/gfc = 59-bit header; vlc = 56-bit) and the chip_id 12→14 widening at Viperfish.
  • The dual dispatch — decode by 8-bit wire trace_point_id (DecodeEntry jump table), encode by dense oneof field number (EncodeEntry jump table); the valid/started framing semantics; the per-event total-bit CHECK.
  • The per-family factory: GetTraceCodec(DeviceIdentifiers, int) selects one of the pxc/vfc/vlc/glc/gfc fixed-width codecs (or the jxc legacy PerformanceTraceEntry path) from a static type-factory keyed by chip codename.

Codec interface        asic_sw::driver::deepsea::profiler::TraceCodecInterface<TraceEntry> (abstract; vtable: DecodeEntry/EncodeEntry/GetMaxEntrySize/GetEntryPacketSize)
Packet size            fixed 16 bytes (128 bits): GetEntryPacketSize()==0x10, GetMaxEntrySize()==0x20 (decoded-proto upper bound), all 5 families
Bit order              LSB-first; read via BitDecoder::GetBits64NoInline @ 0x21073760, SkipBitsNoInline @ 0x21073580, mask table mask_ @ 0xbe79440
Header layout          valid:1 · started:1 · trace_point_id:8 · block_id:3│6 · timestamp:48│45; framing+header = 61 bits for pxc/vfc/glc/gfc (payload @ bit 61); vlc is 8/3/45 → 56-bit header, payload @ bit 58
Decode entry           pxc TraceEntriesCoder::DecodeEntry @ 0xf5af3a0 → two contiguous rel32 jump tables: 111-entry @ 0xab85bc0 (ids 0..0x6e) + 145-entry @ 0xab85d7c (ids 0x6f..0xff)
Encode entry           pxc TraceEntriesCoder::EncodeEntry @ 0xf5c5e60 → parallel jump table @ 0xab85fc0
Decode driver          xprof::tpu::DecodeTraceBuffers<TraceEntry> @ 0xf59ffa0 (pxc); inflates + walks 16-byte packets
Codec selector         xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) @ 0xf5a2900
Factory                pxc::driver::profiler::CreateTraceCodec (plc symbol @ 0xf5af2c0); vfc @ 0xf5f5da0, vlc @ 0xf5d5180, glc @ 0xf6282e0, gfc @ 0xf65ed00
Source paths (rodata)  platforms/asic_sw/driver/deepsea/<fam>/profiler/trace_entries.proto, …/trace_codec_factory.cc; third_party/gloop/util/coding/bitcoding.cc

The Fixed 16-Byte Packet

Every on-device profiler trace event is a constant-size 16-byte (128-bit) packet — not a varint, gamma, or any self-delimiting record. This is the single most important divergence from the rest of the gloop bit-codec toolkit, which elsewhere uses varint-framed records: the profiler path is pure fixed-width.

The constant size is proven three ways in the binary:

  • GetEntryPacketSize() returns 0x10 (16) in all five families — each is a single mov $0x10,%eax; ret at pxc 0xf5d4ec0, vfc 0xf628020, vlc 0xf5f5ae0, glc 0xf65ea40, gfc 0xf697aa0.
  • GetMaxEntrySize() returns 0x20 (32) at the matching -0x20 addresses — this is the decoded proto's in-RAM upper bound, not the wire size. A reimplementer must not confuse the two: 16 bytes on the wire, ≤32 bytes as a deserialized TraceEntry.
  • Each per-event decoder gates on input length before reading: cmp $0xf,%rdx; ja … requires the string_view to be longer than 15 bytes (i.e. at least one whole 16-byte packet), and on success records bytes-consumed = 0x10 (movq $0x10,0x8(%rbx)).

The packet is an LSB-first bit-stream read over the shared BitDecoder window ({cursor@+0x8, end@+0x10, buffer@+0x18, bits_avail@+0x20}, a 0x28-byte object). In the decompiled DecodeEntry, the window is stack-constructed inline from the input string_view:

// DecodeEntry @0xf5af3a0 — window init (decompiled v14[] = the BitDecoder)
v14[1] = data_ptr;          // +0x8  cursor   = start of the 16-byte packet
v14[0] = data_ptr;          // +0x0/+0x18 buffer base (LSB-first source)
v14[2] = data_ptr + length; // +0x10 end
v15    = 0;                  // +0x20 bits_avail

Fields are extracted with two primitives from bitcoding.cc:

Primitive                                   Address     Effect
──────────────────────────────────────────  ──────────  ──────
BitDecoder::GetBits64NoInline(dec, n, out)  0x21073760  read the next n bits (LSB-first) into *out, masked by mask_[n]; advance the cursor n bits
BitDecoder::SkipBitsNoInline(dec, n)        0x21073580  advance the cursor n bits without materializing them
mask_[k] = (1<<k)-1                         0xbe79440   65-qword .rodata mask table; +0x8 → mask_[1], +0x18 → mask_[3], +0x40 → mask_[8], +0xa8 → mask_[21], +0x100 → mask_[32], +0x180 → mask_[48] (offsets verified against the field widths)

NOTE — GetBits64NoInline faults (ud1) on n > 64 and handles the within-buffer straddle internally. Every width listed on this page stays within that limit: the widest header field is the 48-bit pxc timestamp, and the widest payload field tabulated below is 54 bits. The NoInline variants are the out-of-line copies; the same logic is inlined at thousands of call sites across the per-event decoders.
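
A minimal sketch of the reader contract described above, assuming that "LSB-first" means stream bit 0 is the least-significant bit of byte 0 and that each field is assembled low bit first. LsbBitReader and its method names are hypothetical stand-ins, not the bitcoding.cc class. Unlike the real GetBits64NoInline, which faults on n > 64, this version reports failure by returning false.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the BitDecoder window (not the library class).
class LsbBitReader {
 public:
  LsbBitReader(const uint8_t* data, size_t size_bytes)
      : data_(data), size_bits_(size_bytes * 8) {}

  // Reads the next n bits (n <= 64) into *out and advances the cursor.
  // ASSUMPTION: stream bit 0 is the LSB of byte 0.
  bool GetBits64(unsigned n, uint64_t* out) {
    if (n > 64 || n > size_bits_ - pos_) return false;
    uint64_t value = 0;
    for (unsigned i = 0; i < n; ++i) {
      const size_t bit = pos_ + i;
      const uint64_t b = (data_[bit / 8] >> (bit % 8)) & 1u;
      value |= b << i;  // bit i of the field comes from stream bit pos_+i
    }
    pos_ += n;
    *out = value;
    return true;
  }

  // Advances the cursor n bits without materializing them.
  bool SkipBits(size_t n) {
    if (n > size_bits_ - pos_) return false;
    pos_ += n;
    return true;
  }

  size_t BitsDecoded() const { return pos_; }

 private:
  const uint8_t* data_;
  size_t size_bits_;
  size_t pos_ = 0;
};
```

Reading 1, 1, and 8 bits from the start of a packet with this reader yields valid, started, and trace_point_id in that order.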


The Framing Prefix and the TraceHeader

Every packet opens with a 2-bit framing prefix and a TraceHeader, together a 61-bit envelope (58 bits on vlc), after which the typed payload begins.

 bit  field                 width                    proto
 ───  ────────────────────  ───────────────────────  ──────────────────────────────
  0   valid                 1                        (framing — 0 ⇒ end of buffer)
  1   started               1                        (framing — valid&&!started ⇒ error)
 2-9  trace_point_id        8                        TraceHeader.trace_point_id  (f1)
10-K  trace_point_block_id  3 (pxc/vlc) │ 6 (vfc/glc/gfc)   TraceHeader.f2
K+1   timestamp             48 (pxc) │ 45 (vfc/vlc/glc/gfc)  TraceHeader.f3
      (pxc/vfc/glc/gfc header ends at bit 60 ⇒ framing+header = 61, payload @ bit 61)
      (vlc header ends at bit 57 ⇒ framing+header = 58, payload @ bit 58)

The header sub-budget (id + block_id + timestamp) is 59 bits for pxc/vfc/glc/gfc but 56 bits for vlc (byte-confirmed: vlc DecodeTraceHeader @0xf5f5b40 reads 8 / 3 / 45, not 8 / 3 / 48). Among the 59-bit families, pxc uses a 3-bit block_id with a 48-bit timestamp, while vfc/glc/gfc widen block_id to 6 bits and narrow the timestamp to 45 bits, which keeps the header at 59 bits and the payload origin at bit 61 for all four. vlc is the outlier: it keeps the 3-bit block_id but takes the 45-bit timestamp, so its header is only 56 bits and its payload begins at bit 58 (vlc Encode<…> @0xf5f3700 confirms the inverse: block<<10, ts(45)<<13, payload<<58). A cross-gen reimplementation must derive the timestamp width and payload origin per family from that family's DecodeTraceHeader and never assume a fixed 59-bit header.
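
The per-family arithmetic can be written down as a small lookup. The sketch below is illustrative only: HeaderLayout and LayoutFor are hypothetical names, and the widths are the ones listed on this page rather than values read from a live DecodeTraceHeader.

```cpp
#include <cassert>
#include <string_view>

// Field widths of the TraceHeader for one chip family.
struct HeaderLayout {
  unsigned id_bits;
  unsigned block_bits;
  unsigned ts_bits;

  constexpr unsigned HeaderBits() const { return id_bits + block_bits + ts_bits; }
  // Bit position of the timestamp, counting the 2 framing bits.
  constexpr unsigned TimestampShift() const { return 2 + id_bits + block_bits; }
  // First payload bit, counting the 2 framing bits.
  constexpr unsigned PayloadOrigin() const { return 2 + HeaderBits(); }
};

// Hypothetical lookup keyed by family name; returns false for families that
// do not use this header (for example the jxc legacy path).
inline bool LayoutFor(std::string_view family, HeaderLayout* out) {
  if (family == "pxc") { *out = {8, 3, 48}; return true; }
  if (family == "vlc") { *out = {8, 3, 45}; return true; }
  if (family == "vfc" || family == "glc" || family == "gfc") {
    *out = {8, 6, 45};
    return true;
  }
  return false;
}
```

Deriving the shifts from the widths, instead of hardcoding 61, is what keeps vlc (payload at bit 58) from being misparsed.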

Framing semantics — valid and started

The two framing bits are not part of the proto; they are wire-level flow control:

  • valid (bit 0) is the empty-slot sentinel. A valid == 0 packet means graceful end of stream: the decoder stops and reports success with *valid_out = 0, so the buffer can be drained to its capacity without an explicit entry count. This is why a ring buffer can be over-allocated and read until the first cleared slot.
  • started (bit 1) catches torn/partial hardware writes. valid && !started returns an INVALID_ARGUMENT status: the decoder builds MakeErrorImpl<3>("Found a valid but not started packet.") (string @ 0x9ff8a9c; the <3> template arg is the INVALID_ARGUMENT status code, and 0x2e02 is the source-location line passed to AddSourceLocationImpl).
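
The two rules above reduce to a three-way classification of the first two bits. This is a sketch under the assumption that word0 holds the first 8 packet bytes loaded little-endian, so that stream bit 0 is the low bit of word0; the Framing enum and ClassifyFraming are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class Framing {
  kEndOfStream,  // valid == 0: empty slot, stop and report success
  kTornWrite,    // valid && !started: the INVALID_ARGUMENT case
  kEntry         // valid && started: dispatch on trace_point_id
};

// ASSUMPTION: word0 is bytes 0..7 of the packet as a little-endian qword.
inline Framing ClassifyFraming(uint64_t word0) {
  const bool valid = (word0 & 1u) != 0;
  const bool started = ((word0 >> 1) & 1u) != 0;
  if (!valid) return Framing::kEndOfStream;  // started is not consulted
  if (!started) return Framing::kTornWrite;
  return Framing::kEntry;
}
```

Note that valid is tested first, so a slot with valid == 0 ends the walk regardless of the started bit.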

The decompiled head of DecodeEntry shows the exact read order and branch logic:

// pxc TraceEntriesCoder::DecodeEntry @0xf5af3a0
GetBits64NoInline(dec, 1, &valid);     // slot -0x58 (bit 0)
GetBits64NoInline(dec, 1, &started);   // slot -0x50 (bit 1)
GetBits64NoInline(dec, 8, &id);        // slot -0x48 (bits 2..9) — peek, then re-decode in handler

if (valid) {
    if (started) {
        *valid_out   = 1;
        *started_out = 0;              // (sic — set then overwritten by handler success path)
        if (id <= 0x6e) {              // first jump table @0xab85bc0 (111 entries, ids 0..0x6e)
            switch (id) {
                case 0:  DecodeUhiHostDmaTransactionStartedAddressTranslation(...); break;
                case 1:  DecodeUhiHostPhysicalRequestRead(...);                     break;
                // … 0-6 UHI, 7-10/20-27 OCI, 40-48 ICI, 49-55 OCI, 80-90 TCS, 91-96 OCI-from-TCS, 97 throttle, 100-110 BC (low end of the band) …
                default: goto second_table;   // reserved slots fall to @0xf5b032f
            }
        } else {                       // second jump table @0xab85d7c (ids 0x6f..0xff)
        second_table:                  // @0xf5b032f: id -= 0x6f; if (id > 0x90) goto unhandled;
            switch (id) {
                // … 111-134 BcFsm/Bcs/BcOci (rest of the 100-134 SparseCore band), 140-149 CMQ, 0xff dummy; case index = id - 0x6f …
                default: goto unhandled;       // @0xf5b0b33: bytes_consumed=0, return OK
            }
        }
    } else {
        return MakeErrorImpl<3>("Found a valid but not started packet.");  // @0x9ff8a9c
    }
} else {
    *valid_out = 0;                    // EOS — graceful, status OK
    return OK;
}

GOTCHA — DecodeEntry reads the 8-bit id only to index the jump table. Each Decode<Name>() handler then re-decodes the packet from bit 0 — SkipBits(2) past the framing, a full DecodeTraceHeader, then the typed payload — because the handler needs the header fields materialized into the proto, not just the id peeked. A reimplementation that tries to share one header decode between the dispatcher and the handler will double-consume or mis-position the cursor.

Per-gen header widths

DecodeTraceHeader is an anonymous-namespace helper, one per family, that reads exactly three GetBits64 calls in order and stamps the TraceHeader proto. The pxc version is byte-confirmed in the decompile:

// pxc DecodeTraceHeader @0xf5d4f20
GetBits64NoInline(dec, 8,  &id);    th = Arena::DefaultConstruct<TraceHeader>();  th[+0x18]=id;    th[+0x10] |= 1;
GetBits64NoInline(dec, 3,  &block); th[+0x1c]=block;                              th[+0x10] |= 2;
GetBits64NoInline(dec, 48, &ts);    th[+0x20]=ts;                                 th[+0x10] |= 4;

The |= 1 / 2 / 4 are the proto2 has-bits (presence byte at TraceHeader+0x10, bit0=id, bit1=block_id, bit2=timestamp). The widths per family:

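Family  DecodeTraceHeader  id  block_id  timestamp  header bits  payload start
──────  ─────────────────  ──  ────────  ─────────  ───────────  ─────────────
pxc     0xf5d4f20          8   3         48         59           61
vfc     0xf628080          8   6         45         59           61
vlc     0xf5f5b40          8   3         45         56           58
glc     0xf65eaa0          8   6         45         59           61
gfc     0xf697b00          8   6         45         59           61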

QUIRK — the timestamp is the raw device cycle counter (48 or 45 bits), not picoseconds. The cycle→ps conversion (per-gen device clock rate) and the per-line timestamp_ns origin are applied downstream in TpuXLineBuilder, never in the codec. A reimplementation that treats the on-wire timestamp as picoseconds is off by a factor of the clock period and will mis-compute the counter wrap interval (2^48 cycles at 48 bits, 2^45 at 45). See TraceEntry → XEvent/XStat.
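
To make the unit distinction concrete, here is a hedged sketch of the downstream arithmetic. The clock_hz parameter is a placeholder: this page does not state any per-generation clock rate, and these helper names are hypothetical rather than TpuXLineBuilder functions. ElapsedCycles additionally assumes the counter wraps modulo 2^ts_bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Converts a raw cycle count to picoseconds for a given clock rate.
// clock_hz is an assumed input, not a value taken from the binary.
inline double CyclesToPicoseconds(uint64_t cycles, double clock_hz) {
  assert(clock_hz > 0.0);
  return static_cast<double>(cycles) * 1e12 / clock_hz;
}

// Seconds until a ts_bits-wide cycle counter wraps at the given clock rate.
inline double WrapIntervalSeconds(unsigned ts_bits, double clock_hz) {
  assert(clock_hz > 0.0);
  return std::ldexp(1.0, static_cast<int>(ts_bits)) / clock_hz;
}

// Cycles elapsed between two raw timestamps of the same width, tolerating
// one wrap. ASSUMPTION: the counter wraps modulo 2^ts_bits.
inline uint64_t ElapsedCycles(uint64_t earlier, uint64_t later, unsigned ts_bits) {
  assert(ts_bits >= 1 && ts_bits < 64);
  const uint64_t mask = (uint64_t{1} << ts_bits) - 1;
  return (later - earlier) & mask;
}
```

The 45-bit families wrap eight times sooner than pxc at the same clock rate, which is why the timestamp width has to travel with the family layout.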


The TraceIdHeader Sub-Record

Most variant messages carry a TraceIdHeader immediately after the header — the per-transaction identity that lets a multi-packet DMA be stitched back together. It is decoded inline (there is no standalone DecodeTraceIdHeader symbol) into a nested proto2 TraceIdHeader submessage. Its width is 36 bits on pxc (chip_id = 12) but 38 bits on vfc/vlc/glc/gfc (chip_id widens to 14 at Viperfish); transaction_id (21) and core_id (3) are constant across all families.

 rel bit  field            width        proto
 ───────  ───────────────  ───────────  ────────────────────────────────────────
  0-20    transaction_id   21           TraceIdHeader.transaction_id (f1)
 21-23    core_id          3            f2 — enum {TC0,TC1,BC0..BC3,NONCORE,…} (8 values ⇒ 3 bits)
 24-..    chip_id          12 (pxc) │ 14 (vfc/vlc/glc/gfc)   f3
 = 36 bits (pxc) │ 38 bits (vfc/vlc/glc/gfc), immediately after the header when present

Byte-confirmed in pxc DecodeUhiHostPhysicalRequestRead @ 0xf5b0f20, where the three opening GetBits64 calls are 21 / 3 / 12; and in the newer-gen DecodeScMessageOutboundInternalMessage (vfc) @ 0xf60e6c0, where they are 21 / 3 / 14:

// DecodeUhiHostPhysicalRequestRead @0xf5b0f20 — TraceIdHeader, then payload
GetBits64NoInline(dec, 21, &transaction_id);   // TraceIdHeader f1
GetBits64NoInline(dec,  3, &core_id);          // f2 (3-bit enum)
GetBits64NoInline(dec, 12, &chip_id);          // f3
// then the typed payload: 1, 30, 1, 1, 29, 26, 8, 20, 20 …

The 3-bit core_id exactly fits the 8-value CORE_ID enum (RESERVED/NONCORE/TC0/TC1/BC0..BC3). Some events carry multiple TraceIdHeaders: OCI read/write commands embed three (cmd0/cmd1/cmd2), which on pxc is 3 × 36 = 108 bits of identity before any other payload. The per-band detail of which events carry one, three, or zero is owned by the payload pages.
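
A sketch of the 21 / 3 / chip_id read, parameterized on the chip_id width so one routine covers both the 36-bit and 38-bit forms. It assumes the packet is held as two little-endian qwords (w[0] = stream bits 0..63), matching the encode shifts shown elsewhere on this page. ExtractBits, the TraceIdHeader struct, and DecodeTraceIdHeader are hypothetical names; the binary has no standalone symbol for this step.

```cpp
#include <cassert>
#include <cstdint>

// Reads n bits at stream position pos from a 128-bit packet.
// ASSUMPTION: w[0] holds stream bits 0..63, w[1] holds bits 64..127.
inline uint64_t ExtractBits(const uint64_t w[2], unsigned pos, unsigned n) {
  assert(n >= 1 && n <= 64 && pos + n <= 128);
  const unsigned idx = pos / 64, off = pos % 64;
  uint64_t v = w[idx] >> off;
  if (off != 0 && off + n > 64) v |= w[1] << (64 - off);  // field straddles the qwords
  if (n < 64) v &= (uint64_t{1} << n) - 1;
  return v;
}

struct TraceIdHeader {
  uint32_t transaction_id;  // 21 bits
  uint32_t core_id;         // 3 bits
  uint32_t chip_id;         // 12 bits (pxc) or 14 bits (vfc/vlc/glc/gfc)
  unsigned bits;            // total width consumed: 36 or 38
};

// origin is the first bit after the TraceHeader (61, or 58 on vlc).
inline TraceIdHeader DecodeTraceIdHeader(const uint64_t w[2], unsigned origin,
                                         unsigned chip_id_bits) {
  TraceIdHeader h{};
  h.transaction_id = static_cast<uint32_t>(ExtractBits(w, origin, 21));
  h.core_id = static_cast<uint32_t>(ExtractBits(w, origin + 21, 3));
  h.chip_id = static_cast<uint32_t>(ExtractBits(w, origin + 24, chip_id_bits));
  h.bits = 24 + chip_id_bits;
  return h;
}
```

On pxc the transaction_id starts at bit 61 and therefore straddles the two qwords, which is the case the straddle branch in ExtractBits exists for.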


The Dual Dispatch

The codec is deliberately asymmetric: decode and encode index different jump tables keyed by different id spaces.

Decode — by 8-bit wire trace_point_id

DecodeEntry dispatches on the 8-bit on-wire trace_point_id — the banded hardware enum, gappy, max handled 0x95 = 149 for pxc (plus a 0xff dummy) — through two contiguous rel32 jump tables: the first @ 0xab85bc0 (111 entries) covers ids 0..0x6e (cmp $0x6e; ja), and a default-path second table @ 0xab85d7c (145 entries, reached at 0xf5b032f where id -= 0x6f; cmp $0x90; ja) covers ids 0x6f..0xff. The tables are read byte-exact; their band structure mirrors the trace-point registry:

Band                       trace_point_id (pxc)                             Subsystem
─────────────────────────  ───────────────────────────────────────────────  ─────────
UHI                        0–6                                              host-DMA / address translation
OCI                        7–10, 20–27, 49–55, 91–96 (OCI-issued-from-TCS)  on-chip interconnect engine
ICI                        40–48                                            inter-chip interconnect / collective fabric
TCS                        80–90                                            TensorCore sequencer sync/control (sync/fence/instruction + interrupt)
Throttle                   97                                               thermal/electrical throttle state
BC (SparseCore/broadcast)  100–134                                          BcFsm / Bcs / BcOci controllers
CMQ                        140–149                                          command queue / VPU DMA
Dummy                      255 (0xff)                                       DummyTraceEntryDummyTracePoint
(reserved)                 11–19, 28–39, 56–79, 98–99, 135–139, 150–254     unhandled → graceful return

GOTCHA — the reserved id ranges do not fall through to a neighbour handler. Reserved slots in the first table (ids ≤ 0x6e) jump to 0xf5b032f, which is not an error label — it is the entry to the second dispatch table; both tables' out-of-range / unfilled slots ultimately land at 0xf5b0b33, which sets bytes_consumed = 0 and returns OK (an unhandled id is silently skipped, not a status error). This confirms the band gaps are deliberate reserved space, not a decode bug. Drive band detection off the per-family jump table contents, never off a hardcoded pxc range; the band boundaries shift per generation as trace-point cardinality grows (99 handled cases for pxc → 144 for gfc).
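
The two-table index math, including the fall-through from a reserved low-table slot, can be sketched as follows. The addresses and bounds are the pxc values quoted above; Table, Slot, SecondTable, and LocateSlot are hypothetical names, and the sketch models only the index arithmetic, not which slots are populated.

```cpp
#include <cassert>
#include <cstdint>

enum class Table { kLow, kHigh, kUnhandled };
struct Slot { Table table; uint32_t index; };

constexpr uint32_t kLowTable = 0xab85bc0;     // 111 rel32 entries
constexpr uint32_t kHighTable = 0xab85d7c;    // 145 rel32 entries
constexpr uint32_t kEncodeTable = 0xab85fc0;  // starts right after the high table
static_assert(kLowTable + 111 * 4 == kHighTable, "tables are contiguous");
static_assert(kHighTable + 145 * 4 == kEncodeTable, "encode table follows");

// Entry reached for id > 0x6e and also from reserved low-table slots.
inline Slot SecondTable(uint32_t id) {
  const uint32_t biased = id - 0x6fu;  // wraps to a huge value when id < 0x6f
  if (biased > 0x90) return {Table::kUnhandled, 0};  // cmp $0x90; ja
  return {Table::kHigh, biased};
}

inline Slot LocateSlot(uint32_t id) {
  if (id <= 0x6e) return {Table::kLow, id};  // cmp $0x6e; ja
  return SecondTable(id);
}
```

Because the subtraction is unsigned, a reserved low id such as 11 that is routed to the second-table entry wraps past 0x90 and lands on the unhandled path, consistent with the behaviour described above.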

Encode — by dense oneof field number

EncodeEntry @ 0xf5c5e60 dispatches on the dense proto oneof field number held at TraceEntry+0x28 (the proto2 oneof case), through a parallel rel32 table at 0xab85fc0:

// EncodeEntry @0xf5c5e60 — dispatch + inlined header pack
field = *(int*)(entry + 0x28);          // proto oneof case (dense)
idx   = field - 2;                       // table indexed by field-2
if (idx > 0x62 /*pxc bound*/) goto default_case;  // unsigned compare; out-of-range field
goto *jumptable_0xab85fc0[idx];          // → Encode<Name>()

// inlined header pack (Encode<Name>, word0):
word0  =  (mask_[1] & flag) * 3;         // flag*3 ⇒ bits 0 and 1 both set (valid=1, started=1)
word0 |=  id    << 2;                     // trace_point_id
word0 |=  block << 10;                    // block_id
word0 |=  ts    << 13;                    // timestamp (pxc/vlc) — vfc/glc/gfc shift <<16 (6-bit block)
word0 |=  payload_lo << 61;               // payload @ bit 61 (pxc/vfc/glc/gfc); vlc shifts <<58
// word1 = remaining payload; two qwords (16 B) written to the encoder SSO buffer

The encode shifts are the byte-exact inverse of the decode GetBits widths: <<2 (after the 2 framing bits), <<10 (after id), then a family-dependent block/timestamp split. pxc/vlc shift ts<<13 (after a 3-bit block) and vfc/glc/gfc shift ts<<16 (after a 6-bit block); the payload origin is <<61 for pxc/vfc/glc/gfc but <<58 for vlc (vlc's 45-bit timestamp + 3-bit block ⇒ 56-bit header; byte-confirmed at Encode<…> @0xf5f3700). The framing is written as flag*3 (mask_[1] & 1, then lea (%r8,%r8,2)), which sets both bit 0 and bit 1, i.e. valid=1, started=1 for every real entry.
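
A compact sketch of the header pack with the family-dependent shifts derived from the widths. EncodeLayout, PackHeaderWord0, and PayloadOrigin are hypothetical names; the sketch masks each field to its width, which is an added safety step and not something this page confirms the binary does.

```cpp
#include <cassert>
#include <cstdint>

struct EncodeLayout {
  unsigned block_bits;  // 3 (pxc/vlc) or 6 (vfc/glc/gfc)
  unsigned ts_bits;     // 48 (pxc) or 45 (vfc/vlc/glc/gfc)
};

// First payload bit: 2 framing + 8 id + block + timestamp.
inline unsigned PayloadOrigin(EncodeLayout l) { return 10 + l.block_bits + l.ts_bits; }

// Packs framing and TraceHeader into the low bits of word0. The payload is
// OR-ed in by the caller starting at PayloadOrigin().
inline uint64_t PackHeaderWord0(EncodeLayout l, uint64_t id, uint64_t block, uint64_t ts) {
  auto mask = [](unsigned k) { return (uint64_t{1} << k) - 1; };  // k < 64 here
  uint64_t word0 = 3;                                  // valid=1, started=1
  word0 |= (id & mask(8)) << 2;                        // trace_point_id
  word0 |= (block & mask(l.block_bits)) << 10;         // block_id
  word0 |= (ts & mask(l.ts_bits)) << (10 + l.block_bits);  // <<13 or <<16
  return word0;
}
```

For pxc this reproduces the <<2 / <<10 / <<13 sequence with the payload at bit 61; passing {3, 45} gives the vlc variant with the payload at bit 58.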

The two id spaces, paired

The wire id and the oneof field are distinct namespaces; the handler stamps the oneof field via movl $field,0x28(%entry). Worked, byte-confirmed pairs:

trace_point_id (wire)  event                                           oneof field (proto)  encode bound cmp
─────────────────────  ──────────────────────────────────────────────  ───────────────────  ────────────────
0                      UhiHostDmaTransactionStartedAddressTranslation  2
1                      UhiHostPhysicalRequestRead                      3
40                     IciPacketPacketReceivedOnLinkInput              21 (0x15)
81                     TcsInternalSetSyncFlag                          38 (0x26)
97                     ThrottleStateThermalAndElectrical               54 (0x36)

Per-family oneof-field encode bounds (the cmp $imm before the jump): pxc 0x62, vlc 0x4d, vfc 0x79, glc 0x7f, gfc 0x7f. The full id↔field registry is owned by TracePoints Master Registry.

QUIRK — decode max handled wire id (0x95 pxc, spread across two jump tables split at 0x6e/0x6f) and encode max oneof field (0x62 pxc, after the field-2 bias) differ because the wire id is banded with gaps while the oneof field is dense. Both count the same set of events (99 for pxc); only the indexing differs. A reimplementation that sizes one table to the other's bound will overrun or truncate.
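
The relationship between the gappy and dense spaces can be illustrated with the pxc bands. The sketch below rests on a loudly labeled assumption: that the oneof field equals 2 plus the rank of the wire id among handled ids in ascending order. That rule reproduces all five byte-confirmed pairs above and the 0x62 encode bound, but it is an observation, not a confirmed property of every id; the authoritative mapping is the master registry. Range, kPxcBands, CountHandled, and GuessOneofField are hypothetical names.

```cpp
#include <cassert>

struct Range { unsigned lo, hi; };

// Handled pxc wire-id bands from the decode table on this page, ascending.
constexpr Range kPxcBands[] = {
    {0, 6},   {7, 10},  {20, 27},   {40, 48},   {49, 55},  {80, 90},
    {91, 96}, {97, 97}, {100, 134}, {140, 149}, {255, 255}};

// Number of handled wire ids across all bands.
inline unsigned CountHandled() {
  unsigned n = 0;
  for (const Range& r : kPxcBands) n += r.hi - r.lo + 1;
  return n;
}

// ASSUMPTION (unconfirmed beyond the five listed pairs): the oneof field is
// 2 + the rank of the id among handled ids. Returns 0 for reserved ids.
inline unsigned GuessOneofField(unsigned id) {
  unsigned rank = 0;
  for (const Range& r : kPxcBands) {
    if (id < r.lo) return 0;  // falls in a gap before this band
    if (id <= r.hi) return 2 + rank + (id - r.lo);
    rank += r.hi - r.lo + 1;
  }
  return 0;
}
```

Under that assumption the dense table needs 99 slots indexed 0..0x62, while the wire side needs 256 slots split across the two decode tables.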


The Per-Event Decode Contract and the Total-Bit CHECK

Each Decode<Name>(string_view, bool* started_out, TraceEntry* out) follows the same shape:

// generic Decode<EventName> contract (e.g. DecodeUhiHostPhysicalRequestRead @0xf5b0f20)
function Decode<Name>(view, started_out, entry):
    if view.length <= 0xf: return error           // need a full 16-byte packet
    BitDecoder dec(view);
    SkipBits(dec, 2);                              // skip the 2 framing bits
    DecodeTraceHeader(entry, dec);                 // id/block/timestamp into TraceHeader
    [ DecodeTraceIdHeader inline: 21/3/12 ]        // when the event carries identity
    variant = entry.mutable_<Name>();              // construct the oneof submessage
    entry[+0x28] = <oneof field number>;           // stamp the dense proto case
    GetBits64(dec, w0, &variant.f0);               // typed fixed-width payload …
    GetBits64(dec, w1, &variant.f1);
    …
    consumed = (view.length * 8) - dec.bits_remaining();   // BitsDecoded()
    CHECK(consumed == <hardcoded constant>);       // MakeCheckOpString on mismatch ⇒ FATAL
    *bytes_consumed = 0x10;                         // 16 bytes, on success

The final CHECK validates that the handler consumed exactly the bit count the format demands — absl::log_internal::MakeCheckOpString<…>(consumed, K, "decoder.BitsDecoded() == K") fires a fatal log on mismatch. This is the codec's self-consistency guard: it pins each event's total wire width.

Byte-confirmed total-bit CHECK constants (read directly from each decoder's MakeCheckOpString sites) and their full width sequences. Each sequence begins with the 2 framing bits, then [8/3/48] DecodeTraceHeader, then (for events that carry identity) the [21/3/12] TraceIdHeader, then the typed payload:

Event (pxc)                                     id  oneof  decoder    CHECK constant(s)  width sequence (framing · header · [idhdr] · payload)
──────────────────────────────────────────────  ──  ─────  ─────────  ─────────────────  ─────────────────────────────────────────────────────
UhiHostDmaTransactionStartedAddressTranslation  0   2      0xf5b0b80  128, 216           2 · 8/3/48 · 21/3/12 · 5,16,10,1,1,54,32
UhiHostPhysicalRequestRead                      1   3      0xf5b0f20  128, 233           2 · 8/3/48 · 21/3/12 · 1,30,1,1,29,26,8,20,20
TcsInternalSetSyncFlag                          81  38     0xf5b97a0  121                2 · 8/3/48 · (no idhdr) · 32,1,9,16,1,1
IciPacketPacketReceivedOnLinkInput              40  21     0xf5b56c0  125                2 · 8/3/48 · 21/3/12 · 3,3,6,1,1,12,1,1
ThrottleStateThermalAndElectrical               97  54     0xf5bc620  120                2 · 8/3/48 · (no idhdr) · 4,5,5,10,4,21,5,5

NOTE — the per-event CHECK constant is not always a single value. Both DecodeUhiHostDmaTransactionStartedAddressTranslation and DecodeUhiHostPhysicalRequestRead contain two MakeCheckOpString sites — the first CHECKs the prefix path at 128 bits, the second CHECKs the full path (216 and 233 respectively) after an optional/conditional payload group. The ThrottleStateThermalAndElectrical decoder, despite its name, carries a single oneof case (field 54 = 0x36) and a single CHECK == 120 (there is no 0x36/0x37 A/B discrimination at decode). The mechanism — a hardcoded total-bit CHECK per consumed path — is byte-confirmed CERTAIN; the precise constant is branch-dependent for events with optional/repeated payload fields. Recover the exact set per branch from the event's Decode<Name>/Encode<Name> pair.
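
The tabulated width sequences can be cross-checked against their CHECK constants with plain addition. The sketch below only verifies that arithmetic; TotalBits is a hypothetical helper. Note that the 216 and 233 totals exceed the 128 bits of a single packet, and this sketch reproduces those figures as the table records them without explaining them.

```cpp
#include <cassert>
#include <initializer_list>

constexpr unsigned kFramingBits = 2;

// Sums the framing bits plus every listed field width (header, optional
// TraceIdHeader, payload) for one decode path.
inline unsigned TotalBits(std::initializer_list<unsigned> widths) {
  unsigned total = kFramingBits;
  for (unsigned w : widths) total += w;
  return total;
}
```

For example, TcsInternalSetSyncFlag is TotalBits({8, 3, 48, 32, 1, 9, 16, 1, 1}), which matches its CHECK constant of 121.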

The per-band semantic field maps — what each width means — are owned by the payload pages: UHI/OCI/ICI/DMA, SparseCore band, vfc/vlc/gfc. The exhaustive (offset, width, semantic) tuple for all 99–144 events per family is mechanically dumpable from the Decode<Name>/Encode<Name> pairs but is not tabulated here (LOW confidence on completeness — same gap as the payload pages note).


The Decode Driver and Per-Family Factory

DecodeTraceBuffers — the walk loop

xprof::tpu::DecodeTraceBuffers<TraceEntry> @ 0xf59ffa0 (pxc instantiation) is the driver that turns a compressed device-trace blob into a RepeatedPtrField<TraceEntry>:

// DecodeTraceBuffers<TraceEntry> @0xf59ffa0
function DecodeTraceBuffers(codec, out_entries, ..., scratch):
    StringReader src(blob);                         // @0xf59eac0
    ZlibReader  zin(&src);  zin.Initialize();       // ZlibReaderBase::Initialize @0xf69f9e0
    ReadAllImpl(&zin, &decompressed);               // read_all_internal::ReadAllImpl @0xf5acf40
    view = decompressed;
    while (view.length >= 0x10) {                   // one 16-byte packet at a time
        bool valid, started;
        TraceEntry* e = out_entries.Add();
        codec->vtable.DecodeEntry(view, e, &valid, &started);  // call *0x18(vtable)
        if (!valid) break;                          // graceful EOS
        view.remove_prefix(0x10);                   // advance 16 bytes
    }

The transport — the riegeli record framing around the ZlibReader, the zlib window/dictionary, and whether multiple per-core ring drains are separate riegeli records — is owned by riegeli Trace Container. DecodeTraceBuffers itself only sees one inflated StringReader stream.
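
Once the blob is inflated, the walk itself is a fixed-stride loop. The following sketch assumes the inflated bytes are already in memory and replaces the codec's DecodeEntry vtable slot with a hypothetical callback, DecodeFn; it is not the DecodeTraceBuffers signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr size_t kPacketBytes = 0x10;

// Hypothetical stand-in for the DecodeEntry vtable slot: decodes one packet,
// sets *valid, and returns false on a decode error.
using DecodeFn = std::function<bool(const uint8_t* packet, bool* valid)>;

// Walks whole 16-byte packets and returns how many valid entries were seen.
// Stops at the first valid == 0 slot, on a decode error, or when fewer than
// 16 bytes remain.
inline size_t WalkPackets(const std::vector<uint8_t>& inflated, const DecodeFn& decode) {
  size_t count = 0;
  for (size_t off = 0; inflated.size() - off >= kPacketBytes; off += kPacketBytes) {
    bool valid = false;
    if (!decode(inflated.data() + off, &valid)) break;
    if (!valid) break;  // graceful end of stream
    ++count;
  }
  return count;
}
```

A trailing fragment shorter than one packet is ignored, which mirrors the length >= 0x10 loop condition in the driver above.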

TraceCodecInterface and the factory

The per-chip-family codec is a concrete TraceCodecInterface<TraceEntry> (abstract base; the four vtable slots are DecodeEntry/EncodeEntry/GetMaxEntrySize/GetEntryPacketSize). It is constructed by CreateTraceCodec per family and registered into a static type-factory keyed by asic_sw::DeviceIdentifiers (the chip codename), via DeviceIdentifiersAsString:

Family  CreateTraceCodec        DecodeEntry   DecodeTraceHeader  block/ts  DecodeTraceBuffers template
──────  ──────────────────────  ────────────  ─────────────────  ────────  ───────────────────────────
pxc     0xf5af2c0 (plc symbol)  0xf5af3a0     0xf5d4f20          3/48      <pxc::…::TraceEntry>
vfc     0xf5f5da0               (per family)  0xf628080          6/45      <vxc::vfc::…::TraceEntry>
vlc     0xf5d5180               (per family)  0xf5f5b40          3/45      <vxc::vlc::…::TraceEntry>
glc     0xf6282e0               (per family)  0xf65eaa0          6/45      <gxc::glc::…::TraceEntry>
gfc     0xf65ed00               (per family)  0xf697b00          6/45      <gxc::gfc::…::TraceEntry>
jxc     (legacy path)                                                      <jxc::PerformanceTraceEntry>

GetTraceCodec — the selector

xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) @ 0xf5a2900 is the runtime selector. The decompile shows it walking a chain of asic_sw::internal::TypeFactoryBase<DeviceIdentifiers, &DeviceIdentifiersAsString, TraceCodecInterface<…::TraceEntry>, false>::Create<> attempts — the function body references the vlc, vfc, glc, gfc, and jxc family codecs — and falls through to the pxc factory as the default:

// GetTraceCodec @0xf5a2900 (decompiled skeleton)
function GetTraceCodec(out, device_ids, gen):
    if (try TypeFactoryBase<…vlc::TraceEntry>::Create(out, device_ids)) return out;
    if (try TypeFactoryBase<…vfc::TraceEntry>::Create(out, device_ids)) return out;
    if (try TypeFactoryBase<…glc::TraceEntry>::Create(out, device_ids)) return out;
    // … gfc via its own GetTraceCodec<gfc::TraceEntry> @0xf5a2b60; jxc PerformanceTraceEntry path …
    TypeFactoryBase<…pxc::TraceEntry>::Create(out, device_ids);     // default
    return out;

Each family also has a templated GetTraceCodec<…::TraceEntry> instantiation (e.g. gfc @ 0xf5a2b60) that wraps the family-specific unique_ptr<TraceCodecInterface<TraceEntry>> construction. The selector returns a unique_ptr the DecodeTraceBuffers driver then drives polymorphically.

QUIRK — jxc does not share this codec. It uses a different proto type — asic_sw::driver::deepsea::jxc::PerformanceTraceEntry — decoded by its own DecodeTraceBuffers<PerformanceTraceEntry> instantiation, not the fixed-16-byte TraceEntry path. A reimplementation that assumes one packet schema across all generations will misparse jxc traces. The jxc specifics are on Payload: jxc Legacy.


Relevant Struct and Table Offsets

Symbol                    Address / offset                                Role
────────────────────────  ──────────────────────────────────────────────  ────
TraceEntry (proto2)       +0x18 trace_header ptr;                         the decoded message
                          +0x20 active oneof variant ptr;
                          +0x28 oneof case (proto field number,
                          the encode dispatch key)
TraceHeader (proto2)      +0x10 presence (bit0=id, bit1=block, bit2=ts);  the universal header
                          +0x18 trace_point_id; +0x1c block_id;
                          +0x20 timestamp
TraceIdHeader (proto2)    +0x10 presence; +0x18 transaction_id;           the identity sub-record: 36 bits on pxc (chip_id 12), 38 bits on vfc/vlc/glc/gfc (chip_id 14)
                          +0x1c core_id; +0x20 chip_id
BitDecoder                +0x8 cursor, +0x10 end, +0x18 buffer            the bit-stream window
                          (LSB-first), +0x20 bits_avail (0x28 total)
mask_                     0xbe79440                                       65-qword mask table, mask_[k]=(1<<k)-1
Decode jump table (low)   0xab85bc0                                       111 rel32 entries, ids 0..0x6e, indexed by 8-bit trace_point_id (pxc)
Decode jump table (high)  0xab85d7c                                       145 rel32 entries, ids 0x6f..0xff (id-0x6f), contiguous after the low table
Encode jump table         0xab85fc0                                       rel32 entries, indexed by oneof_field - 2 (pxc)
MakeErrorImpl<3> string   0x9ff8a9c                                       "Found a valid but not started packet." (INVALID_ARGUMENT; 0x2e02 is the source line)

Component                    Relationship
───────────────────────────  ────────────
riegeli Trace Container      the compressed transport that DecodeTraceBuffers inflates before this codec walks 16-byte packets
TracePoints Master Registry  the wire-id ↔ oneof-field two-id-space table the dual dispatch realizes
Payload: UHI/OCI/ICI/DMA     the per-band payload field maps for the host-DMA/interconnect/fabric bands
Payload: SparseCore Band     the TCS/SparseCore band payload field maps
Payload: vfc/vlc/gfc         the newer-family payload deltas (6-bit block_id, 45-bit timestamp)
Payload: jxc Legacy          the separate PerformanceTraceEntry schema and its own codec
TraceEntry → XEvent/XStat    the downstream shaping of a decoded TraceEntry into a device-plane XEvent + XStats

Cross-References