TraceEntriesCoder
All addresses and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped: full C++ symbols are present, and .text VMA equals file offset. Other versions will differ.
Abstract
TraceEntriesCoder is the on-device profiler trace-entry codec: the fixed-width, LSB-first bit format that every TPU hardware trace event packs into, and the per-chip-family decoder/encoder that translates between that wire packet and a proto2 TraceEntry message. It is the bottom-most layer of the xprof device-trace pipeline — the stage between the compressed riegeli container and the TraceEntry → XEvent/XStat shaping. Where a route-cache record is a self-delimiting varint stream, a profiler trace event is the opposite: a constant 16-byte (128-bit) packet with a 2-bit framing prefix, a 56- or 59-bit header (59 for pxc/vfc/glc/gfc, 56 for vlc), an optional 36/38-bit transaction-identity sub-record (36 on pxc, 38 on vfc/vlc/glc/gfc), and a per-event fixed-width payload, all read with the shared GetBits64/SkipBits bit-codec primitives.
The codec has a deliberately asymmetric two-id-space dispatch, and getting it right is the whole reimplementation. Decode peeks the 2 framing bits and the 8-bit trace_point_id — the banded hardware enum value, gappy, max handled 0x95 (149) for pxc plus a 0xff dummy — and indexes a pair of contiguous rel32 jump tables (the first @ 0xab85bc0 bounded at 0x6e, the second @ 0xab85d7c for ids 0x6f..0xff) to reach an anonymous-namespace Decode<EventName>(). Encode dispatches on the dense proto oneof field number stored at TraceEntry+0x28 through a parallel jump table. The two id spaces are not interchangeable; the registry that pairs them is owned by TracePoints Master Registry, and a reimplementation that drives encode off the wire id (or decode off the oneof field) mis-keys every event.
This page owns the codec format and dispatch: the 16-byte packet, the framing/header/TraceIdHeader bit layout, the per-event total-bit CHECK validation, the per-family CreateTraceCodec/GetTraceCodec factory wiring, and how the header split shifts across silicon generations. It does not own the per-band payload field maps (UHI/OCI/ICI/DMA, SparseCore band, vfc/vlc/gfc, jxc legacy), the trace-point id registry (master registry), the compressed container (riegeli), or the XEvent translation (trace-entry-to-xevent).
For reimplementation, the contract is:
- The fixed 16-byte packet and the bit-stream reader: LSB-first, GetBits64NoInline(n) for an n-bit field, SkipBitsNoInline(n) to advance, masked by mask_[k] = (1<<k)-1.
- The 2-bit framing prefix + 56/59-bit TraceHeader + optional 36/38-bit TraceIdHeader: the envelope, with the per-family block_id/timestamp split (pxc/vfc/glc/gfc = 59-bit header; vlc = 56-bit) and the chip_id 12→14 widening at Viperfish.
- The dual dispatch: decode by 8-bit wire trace_point_id (DecodeEntry jump table), encode by dense oneof field number (EncodeEntry jump table); the valid/started framing semantics; the per-event total-bit CHECK.
- The per-family factory: GetTraceCodec(DeviceIdentifiers, int) selecting one of pxc/vfc/vlc/glc/gfc fixed-width codecs (or the jxc legacy PerformanceTraceEntry path) from a static type-factory keyed by chip codename.
| Item | Value |
|---|---|
| Codec interface | asic_sw::driver::deepsea::profiler::TraceCodecInterface<TraceEntry> (abstract; vtable: DecodeEntry/EncodeEntry/GetMaxEntrySize/GetEntryPacketSize) |
| Packet size | fixed 16 bytes (128 bits) — GetEntryPacketSize()==0x10, GetMaxEntrySize()==0x20 (decoded-proto upper bound), all 5 families |
| Bit order | LSB-first; read via BitDecoder::GetBits64NoInline @ 0x21073760, SkipBitsNoInline @ 0x21073580, mask table mask_ @ 0xbe79440 |
| Header layout | valid:1 · started:1 · trace_point_id:8 · block_id:3│6 · timestamp:48│45 — framing+header = 61 bits for pxc/vfc/glc/gfc (payload @ bit 61); vlc is 8/3/45 → 56-bit header, payload @ bit 58 |
| Decode entry | pxc TraceEntriesCoder::DecodeEntry @ 0xf5af3a0 → two contiguous rel32 jump tables: 111-entry @ 0xab85bc0 (ids 0..0x6e) + 145-entry @ 0xab85d7c (ids 0x6f..0xff) |
| Encode entry | pxc TraceEntriesCoder::EncodeEntry @ 0xf5c5e60 → parallel jump table @ 0xab85fc0 |
| Decode driver | xprof::tpu::DecodeTraceBuffers<TraceEntry> @ 0xf59ffa0 (pxc) — inflates + walks 16-byte packets |
| Codec selector | xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) @ 0xf5a2900 |
| Factory | pxc::driver::profiler::CreateTraceCodec (plc symbol @ 0xf5af2c0); vfc @ 0xf5f5da0, vlc @ 0xf5d5180, glc @ 0xf6282e0, gfc @ 0xf65ed00 |
| Source paths (rodata) | platforms/asic_sw/driver/deepsea/<fam>/profiler/trace_entries.proto, …/trace_codec_factory.cc; third_party/gloop/util/coding/bitcoding.cc |
The Fixed 16-Byte Packet
Every on-device profiler trace event is a constant-size 16-byte (128-bit) packet — not a varint, gamma, or any self-delimiting record. This is the single most important divergence from the rest of the gloop bit-codec toolkit, which elsewhere uses varint-framed records: the profiler path is pure fixed-width.
The constant size is proven three ways in the binary:
- GetEntryPacketSize() returns 0x10 (16) in all five families: each is a single mov $0x10,%eax; ret at pxc 0xf5d4ec0, vfc 0xf628020, vlc 0xf5f5ae0, glc 0xf65ea40, gfc 0xf697aa0. GetMaxEntrySize() returns 0x20 (32) at the matching -0x20 addresses; this is the decoded proto's in-RAM upper bound, not the wire size. A reimplementer must not confuse the two: 16 bytes on the wire, ≤32 bytes as a deserialized TraceEntry.
- Each per-event decoder gates on input length before reading: cmp $0xf,%rdx; ja … requires the string_view to be longer than 15 bytes (i.e. at least one whole 16-byte packet), and on success records bytes-consumed = 0x10 (movq $0x10,0x8(%rbx)).
The packet is an LSB-first bit-stream read over the shared BitDecoder window ({cursor@+0x8, end@+0x10, buffer@+0x18, bits_avail@+0x20}, a 0x28-byte object). In the decompiled DecodeEntry, the window is stack-constructed inline from the input string_view:
// DecodeEntry @0xf5af3a0 — window init (decompiled v14[] = the BitDecoder)
v14[1] = data_ptr; // +0x8 cursor = start of the 16-byte packet
v14[0] = data_ptr; // +0x0/+0x18 buffer base (LSB-first source)
v14[2] = data_ptr + length; // +0x10 end
v15 = 0; // +0x20 bits_avail
Fields are extracted with two primitives from bitcoding.cc:
| Primitive | Address | Effect |
|---|---|---|
| BitDecoder::GetBits64NoInline(dec, n, out) | 0x21073760 | read the next n bits (LSB-first) into *out, masked by mask_[n]; advance the cursor n bits |
| BitDecoder::SkipBitsNoInline(dec, n) | 0x21073580 | advance the cursor n bits without materializing them |
| mask_[k] = (1<<k)-1 | 0xbe79440 | 65-qword .rodata mask table; +0x8→mask_[1], +0x18→mask_[3], +0x40→mask_[8], +0xa8→mask_[21], +0x100→mask_[32], +0x180→mask_[48] (offsets verified against the field widths) |
NOTE: GetBits64NoInline faults (ud1) on n > 64 and handles the within-buffer straddle internally. No header field needs more than 48 bits in a single call, and the widest payload field tabulated on this page is 54 bits, so every read stays under the 64-bit limit. The NoInline variants are the out-of-line copies; the same logic is inlined at thousands of call sites across the per-event decoders.
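The reader contract above can be illustrated with a small Python sketch. It is not the binary's BitDecoder: the class name `BitReader` and its methods are hypothetical, and it assumes that stream bit i is bit (i % 8) of byte (i // 8), which matches the encode side packing little-endian qwords.

```python
class BitReader:
    """Minimal LSB-first reader over one packet (hypothetical helper)."""

    def __init__(self, data: bytes):
        # Assumption: reading the packet as one little-endian integer puts
        # stream bit 0 at the integer's least significant bit.
        self._value = int.from_bytes(data, "little")
        self._total_bits = len(data) * 8
        self._pos = 0

    def get_bits(self, n: int) -> int:
        """Read the next n bits and advance (the role of GetBits64NoInline)."""
        if not 0 <= n <= 64:
            raise ValueError("at most 64 bits per call")
        if self._pos + n > self._total_bits:
            raise EOFError("read past end of packet")
        mask = (1 << n) - 1  # same value as mask_[n]
        out = (self._value >> self._pos) & mask
        self._pos += n
        return out

    def skip_bits(self, n: int) -> None:
        """Advance n bits without returning them (the role of SkipBitsNoInline)."""
        if self._pos + n > self._total_bits:
            raise EOFError("skip past end of packet")
        self._pos += n

    def bits_decoded(self) -> int:
        """Number of bits consumed so far."""
        return self._pos
```

A handler written against this sketch would call `skip_bits(2)` for the framing and then `get_bits(8)`, `get_bits(3)`, `get_bits(48)` for a pxc header.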
The Framing Prefix and the TraceHeader
Every packet opens with a 2-bit framing prefix and a TraceHeader, together a 58-bit (vlc) or 61-bit (pxc/vfc/glc/gfc) envelope, after which the typed payload begins.
bit field width proto
─── ──────────────────── ─────────────────────── ──────────────────────────────
0 valid 1 (framing — 0 ⇒ end of buffer)
1 started 1 (framing — valid&&!started ⇒ error)
2-9 trace_point_id 8 TraceHeader.trace_point_id (f1)
10-K trace_point_block_id 3 (pxc/vlc) │ 6 (vfc/glc/gfc) TraceHeader.f2
K+1 timestamp 48 (pxc) │ 45 (vfc/vlc/glc/gfc) TraceHeader.f3
(pxc/vfc/glc/gfc header ends at bit 60 ⇒ framing+header = 61, payload @ bit 61)
(vlc header ends at bit 57 ⇒ framing+header = 58, payload @ bit 58)
The header sub-budget (id + block_id + timestamp) is 59 bits for pxc/vfc/glc/gfc but 56 bits for vlc (byte-confirmed: vlc DecodeTraceHeader @0xf5f5b40 reads 8 / 3 / 45, not 8 / 3 / 48). For the 59-bit families, newer silicon widens block_id from 3 to 6 bits and narrows the timestamp from 48 to 45 bits to keep the header width constant, so for those four the payload origin stays at bit 61. vlc is the outlier: it keeps the 3-bit block_id but takes the 45-bit timestamp, so its header is only 56 bits and its payload begins at bit 58 (vlc Encode<…> @0xf5f3700 confirms the inverse: block<<10, ts(45)<<13, payload<<58). A cross-gen reimplementation must derive the timestamp width and payload origin per family from the family's DecodeTraceHeader, and never assume a fixed 59-bit header or a fixed bit-61 payload origin.
Framing semantics — valid and started
The two framing bits are not part of the proto; they are wire-level flow control:
- valid (bit 0) is the empty-slot sentinel. A valid == 0 packet means graceful end of stream: the buffer can be drained to its capacity without an explicit entry count; the decoder stops and reports success with *started_out = 0. This is why a ring buffer can be over-allocated and read until the first cleared slot.
- started (bit 1) catches torn/partial hardware writes. valid && !started returns an INVALID_ARGUMENT status: the decoder builds MakeErrorImpl<3>("Found a valid but not started packet.") (string @0x9ff8a9c; the <3> template arg is the INVALID_ARGUMENT status code, and 0x2e02 is the source-location line passed to AddSourceLocationImpl).
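The three framing outcomes can be sketched as a small classifier. This is an illustration only: the `Framing` enum and `classify_framing` are hypothetical names, and the sketch assumes stream bits 0 and 1 are the two low bits of the packet's first byte.

```python
from enum import Enum


class Framing(Enum):
    END_OF_STREAM = "end_of_stream"  # valid == 0: cleared slot, stop reading
    TORN = "torn"                    # valid == 1, started == 0: error status
    ENTRY = "entry"                  # valid == 1, started == 1: real entry


def classify_framing(packet: bytes) -> Framing:
    """Classify one 16-byte packet by its two framing bits."""
    if len(packet) < 16:
        raise ValueError("need one whole 16-byte packet")
    valid = packet[0] & 1           # stream bit 0
    started = (packet[0] >> 1) & 1  # stream bit 1
    if not valid:
        return Framing.END_OF_STREAM
    if not started:
        return Framing.TORN
    return Framing.ENTRY
```

Note that `started` is only inspected once `valid` is set, so a packet with `valid == 0` is treated as end of stream regardless of bit 1.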
The decompiled head of DecodeEntry shows the exact read order and branch logic:
// pxc TraceEntriesCoder::DecodeEntry @0xf5af3a0
GetBits64NoInline(dec, 1, &valid); // slot -0x58 (bit 0)
GetBits64NoInline(dec, 1, &started); // slot -0x50 (bit 1)
GetBits64NoInline(dec, 8, &id); // slot -0x48 (bits 2..9) — peek, then re-decode in handler
if (valid) {
if (started) {
*valid_out = 1;
*started_out = 0; // (sic — set then overwritten by handler success path)
if (id <= 0x6e) { // first jump table @0xab85bc0 (111 entries, ids 0..0x6e)
switch (id) {
case 0: DecodeUhiHostDmaTransactionStartedAddressTranslation(...); break;
case 1: DecodeUhiHostPhysicalRequestRead(...); break;
// … 0-6 UHI, 7-10/20-27 OCI, 40-48 ICI, 49-55 OCI, 80-90 TCS, 91-96 OCI-from-TCS, 97 throttle, 100-110 BC (ids ≤ 0x6e) …
default: goto second_table; // reserved slots fall to @0xf5b032f
}
} else { // second jump table @0xab85d7c (ids 0x6f..0xff)
second_table: // @0xf5b032f: id -= 0x6f; if (id > 0x90) goto unhandled;
switch (id) {
// … 111-134 BcFsm/Bcs/BcOci (SparseCore; 100-110 are reached via the first table), 140-149 CMQ, 0xff dummy …
default: goto unhandled; // @0xf5b0b33: bytes_consumed=0, return OK
}
}
} else {
return MakeErrorImpl<3>("Found a valid but not started packet."); // @0x9ff8a9c
}
} else {
*valid_out = 0; // EOS — graceful, status OK
return OK;
}
GOTCHA: DecodeEntry reads the 8-bit id only to index the jump table. Each Decode<Name>() handler then re-decodes the packet from bit 0 (SkipBits(2) past the framing, a full DecodeTraceHeader, then the typed payload) because the handler needs the header fields materialized into the proto, not just the id peeked. A reimplementation that tries to share one header decode between the dispatcher and the handler will double-consume or mis-position the cursor.
Per-gen header widths
DecodeTraceHeader is an anonymous-namespace helper, one per family, that reads exactly three GetBits64 calls in order and stamps the TraceHeader proto. The pxc version is byte-confirmed in the decompile:
// pxc DecodeTraceHeader @0xf5d4f20
GetBits64NoInline(dec, 8, &id); th = Arena::DefaultConstruct<TraceHeader>(); th[+0x18]=id; th[+0x10] |= 1;
GetBits64NoInline(dec, 3, &block); th[+0x1c]=block; th[+0x10] |= 2;
GetBits64NoInline(dec, 48, &ts); th[+0x20]=ts; th[+0x10] |= 4;
The |= 1 / 2 / 4 are the proto2 has-bits (presence byte at TraceHeader+0x10, bit0=id, bit1=block_id, bit2=timestamp). The widths per family:
| Family | DecodeTraceHeader | id | block_id | timestamp | header bits | payload start |
|---|---|---|---|---|---|---|
| pxc | 0xf5d4f20 | 8 | 3 | 48 | 59 | 61 |
| vfc | 0xf628080 | 8 | 6 | 45 | 59 | 61 |
| vlc | 0xf5f5b40 | 8 | 3 | 45 | 56 | 58 |
| glc | 0xf65eaa0 | 8 | 6 | 45 | 59 | 61 |
| gfc | 0xf697b00 | 8 | 6 | 45 | 59 | 61 |
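The width table translates directly into a table-driven header decode. The sketch below is a hypothetical Python helper (`decode_header` and `HEADER_WIDTHS` are illustrative names), assuming the packet is read as a little-endian 128-bit integer with stream bit 0 as its least significant bit.

```python
HEADER_WIDTHS = {  # family -> (trace_point_id, block_id, timestamp) bit widths
    "pxc": (8, 3, 48),
    "vfc": (8, 6, 45),
    "vlc": (8, 3, 45),
    "glc": (8, 6, 45),
    "gfc": (8, 6, 45),
}


def decode_header(packet: bytes, family: str) -> dict:
    """Return the header fields plus the bit offset where the payload starts."""
    word = int.from_bytes(packet[:16], "little")
    pos = 2  # skip the valid and started framing bits
    fields = {}
    names = ("trace_point_id", "block_id", "timestamp")
    for name, width in zip(names, HEADER_WIDTHS[family]):
        fields[name] = (word >> pos) & ((1 << width) - 1)
        pos += width
    fields["payload_start"] = pos  # 61 for the 59-bit families, 58 for vlc
    return fields
```

Keeping the widths in one per-family table means the payload origin is always derived rather than hardcoded, which is the property the vlc outlier requires.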
QUIRK: the timestamp is the raw device cycle counter (48 or 45 bits), not picoseconds. The cycle→ps conversion (per-gen device clock rate) and the per-line timestamp_ns origin are applied downstream in TpuXLineBuilder, never in the codec. A reimplementation that treats the on-wire timestamp as picoseconds is off by a factor of the clock period and will mis-compute the counter wrap interval (2^48 cycles at 48 bits, 2^45 cycles at 45 bits). See TraceEntry → XEvent/XStat.
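To make the unit distinction concrete, here is a minimal sketch of the downstream arithmetic. The function names are hypothetical, and the clock rate is a caller-supplied assumption: this page does not state the per-generation device clock, so the 1 GHz used in the usage note is purely illustrative.

```python
def cycles_to_ps(cycles: int, clock_hz: int) -> int:
    """Convert a raw cycle count to picoseconds, rounding down."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return cycles * 10**12 // clock_hz


def wrap_interval_seconds(timestamp_bits: int, clock_hz: int) -> float:
    """Seconds until a free-running counter of this width wraps."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return (1 << timestamp_bits) / clock_hz
```

With a hypothetical 1 GHz clock, `cycles_to_ps(1000, 10**9)` is 1,000,000 ps, and a 45-bit counter wraps eight times sooner than a 48-bit one at the same rate.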
The TraceIdHeader Sub-Record
Most variant messages carry a TraceIdHeader immediately after the header — the per-transaction identity that lets a multi-packet DMA be stitched back together. It is decoded inline (there is no standalone DecodeTraceIdHeader symbol) into a nested proto2 TraceIdHeader submessage. Its width is 36 bits on pxc (chip_id = 12) but 38 bits on vfc/vlc/glc/gfc (chip_id widens to 14 at Viperfish); transaction_id (21) and core_id (3) are constant across all families.
rel bit field width proto
─────── ─────────────── ─────────── ────────────────────────────────────────
0-20 transaction_id 21 TraceIdHeader.transaction_id (f1)
21-23 core_id 3 f2 — enum {TC0,TC1,BC0..BC3,NONCORE,…} (8 values ⇒ 3 bits)
24-.. chip_id 12 (pxc) │ 14 (vfc/vlc/glc/gfc) f3
= 36 bits (pxc) │ 38 bits (vfc/vlc/glc/gfc), immediately after the header when present
Byte-confirmed in pxc DecodeUhiHostPhysicalRequestRead @ 0xf5b0f20, where the three opening GetBits64 calls are 21 / 3 / 12; and in the newer-gen DecodeScMessageOutboundInternalMessage (vfc) @ 0xf60e6c0, where they are 21 / 3 / 14:
// DecodeUhiHostPhysicalRequestRead @0xf5b0f20 — TraceIdHeader, then payload
GetBits64NoInline(dec, 21, &transaction_id); // TraceIdHeader f1
GetBits64NoInline(dec, 3, &core_id); // f2 (3-bit enum)
GetBits64NoInline(dec, 12, &chip_id); // f3
// then the typed payload: 1, 30, 1, 1, 29, 26, 8, 20, 20 …
The 3-bit core_id exactly fits the 8-value CORE_ID enum (RESERVED/NONCORE/TC0/TC1/BC0..BC3). Some events carry multiple TraceIdHeaders: OCI read/write commands embed three (cmd0/cmd1/cmd2), i.e. 3 × 36 = 108 bits of identity on pxc before any other payload. The per-band detail of which events carry one, three, or zero is owned by the payload pages.
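The sub-record layout can be sketched as a decode that takes its starting bit offset as a parameter, so the same helper serves one header or three back to back. `decode_trace_id_header` and `CHIP_ID_BITS` are hypothetical names, and the packet is assumed to be available as a Python integer with stream bit 0 as its least significant bit.

```python
CHIP_ID_BITS = {"pxc": 12, "vfc": 14, "vlc": 14, "glc": 14, "gfc": 14}


def decode_trace_id_header(word: int, bit_offset: int, family: str):
    """Decode one TraceIdHeader starting at bit_offset.

    Returns (fields, next_bit_offset) so calls can be chained.
    """
    layout = (
        ("transaction_id", 21),
        ("core_id", 3),
        ("chip_id", CHIP_ID_BITS[family]),
    )
    fields = {}
    pos = bit_offset
    for name, width in layout:
        fields[name] = (word >> pos) & ((1 << width) - 1)
        pos += width
    return fields, pos
```

On pxc, chaining three calls from bit 61 advances the offset by 108 bits, matching the 3 × 36 figure above.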
The Dual Dispatch
The codec is deliberately asymmetric: decode and encode index different jump tables keyed by different id spaces.
Decode — by 8-bit wire trace_point_id
DecodeEntry dispatches on the 8-bit on-wire trace_point_id — the banded hardware enum, gappy, max handled 0x95 = 149 for pxc (plus a 0xff dummy) — through two contiguous rel32 jump tables: the first @ 0xab85bc0 (111 entries) covers ids 0..0x6e (cmp $0x6e; ja), and a default-path second table @ 0xab85d7c (145 entries, reached at 0xf5b032f where id -= 0x6f; cmp $0x90; ja) covers ids 0x6f..0xff. The tables are read byte-exact; their band structure mirrors the trace-point registry:
| Band | trace_point_id (pxc) | Subsystem |
|---|---|---|
| UHI | 0–6 | host-DMA / address translation |
| OCI | 7–10, 20–27, 49–55, 91–96 (OCI-issued-from-TCS) | on-chip interconnect engine |
| ICI | 40–48 | inter-chip interconnect / collective fabric |
| TCS | 80–90 | TensorCore sequencer sync/control (sync/fence/instruction + interrupt) |
| Throttle | 97 | thermal/electrical throttle state |
| BC (SparseCore/broadcast) | 100–134 | BcFsm / Bcs / BcOci controllers |
| CMQ | 140–149 | command queue / VPU DMA |
| Dummy | 255 (0xff) | DummyTraceEntryDummyTracePoint |
| (reserved) | 11–19, 28–39, 56–79, 98–99, 135–139, 150–254 | unhandled → graceful return |
GOTCHA: the reserved id ranges do not fall through to a neighbour handler. Reserved slots in the first table (ids ≤ 0x6e) jump to 0xf5b032f, which is not an error label but the entry to the second dispatch table; both tables' out-of-range / unfilled slots ultimately land at 0xf5b0b33, which sets bytes_consumed = 0 and returns OK (an unhandled id is silently skipped, not a status error). This confirms the band gaps are deliberate reserved space, not a decode bug. Drive band detection off the per-family jump table contents, never off a hardcoded pxc range; the band boundaries shift per generation as trace-point cardinality grows (99 handled cases for pxc → 144 for gfc).
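As a pxc-only illustration of the band table and the two-table split, the sketch below encodes the ranges from the table above and the 0x6e/0x6f boundary. It deliberately hardcodes the pxc ranges, which is exactly what a cross-generation decoder should not do; `PXC_BANDS`, `pxc_band`, and `pxc_table_slot` are hypothetical names.

```python
PXC_BANDS = [  # (first_id, last_id, band), copied from the pxc band table
    (0, 6, "UHI"), (7, 10, "OCI"), (20, 27, "OCI"), (40, 48, "ICI"),
    (49, 55, "OCI"), (80, 90, "TCS"), (91, 96, "OCI"), (97, 97, "Throttle"),
    (100, 134, "BC"), (140, 149, "CMQ"), (255, 255, "Dummy"),
]


def pxc_band(trace_point_id: int):
    """Return the band name, or None for a reserved (silently skipped) id."""
    for first, last, band in PXC_BANDS:
        if first <= trace_point_id <= last:
            return band
    return None


def pxc_table_slot(trace_point_id: int):
    """Return (table, index) the way the two-table dispatch indexes an id."""
    if not 0 <= trace_point_id <= 0xFF:
        raise ValueError("trace_point_id is an 8-bit field")
    if trace_point_id <= 0x6E:
        return ("low", trace_point_id)
    return ("high", trace_point_id - 0x6F)
```

Returning `None` for a reserved id mirrors the binary's behaviour of skipping it with an OK status rather than raising.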
Encode — by dense oneof field number
EncodeEntry @ 0xf5c5e60 dispatches on the dense proto oneof field number held at TraceEntry+0x28 (the proto2 oneof case), through a parallel rel32 table at 0xab85fc0:
// EncodeEntry @0xf5c5e60 — dispatch + inlined header pack
field = *(int*)(entry + 0x28); // proto oneof case (dense)
idx = field - 2; // table indexed by field-2
if (idx > 0x62 /*pxc bound*/) goto default_case;
goto *jumptable_0xab85fc0[idx]; // → Encode<Name>()
// inlined header pack (Encode<Name>, word0):
word0 = (mask_[1] & flag) * 3; // flag*3 ⇒ bits 0 and 1 both set (valid=1, started=1)
word0 |= id << 2; // trace_point_id
word0 |= block << 10; // block_id
word0 |= ts << 13; // timestamp (pxc/vlc) — vfc/glc/gfc shift <<16 (6-bit block)
word0 |= payload_lo << 61; // payload @ bit 61 (pxc/vfc/glc/gfc); vlc shifts <<58
// word1 = remaining payload; two qwords (16 B) written to the encoder SSO buffer
The encode shifts are the byte-exact inverse of the decode GetBits widths: <<2 (after the 2 framing bits), <<10 (after id), then the block/timestamp split is family-dependent — pxc/vlc shift ts<<13 (after a 3-bit block), vfc/glc/gfc shift ts<<16 (after a 6-bit block) — and the payload origin is <<61 for pxc/vfc/glc/gfc but <<58 for vlc (vlc's 45-bit timestamp + 3-bit block ⇒ 56-bit header; byte-confirmed at Encode<…> @0xf5f3700). The framing is written as flag*3 — mask_[1] & 1 then lea(r8,r8,2) — which sets both bit 0 and bit 1, i.e. valid=1, started=1 for every real entry.
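The shift arithmetic can be checked with a small packer. This is a sketch under the same little-endian assumption as the decode side; `pack_packet` and `LAYOUT` are hypothetical names, and the payload is treated as one opaque integer rather than typed fields.

```python
LAYOUT = {  # family -> (block_id bits, timestamp bits)
    "pxc": (3, 48), "vfc": (6, 45), "vlc": (3, 45),
    "glc": (6, 45), "gfc": (6, 45),
}


def pack_packet(family: str, trace_point_id: int, block_id: int,
                timestamp: int, payload: int = 0) -> bytes:
    """Pack framing, header, and an opaque payload into 16 bytes."""
    block_bits, ts_bits = LAYOUT[family]
    ts_shift = 10 + block_bits           # 13 or 16
    payload_shift = ts_shift + ts_bits   # 61, or 58 for vlc
    if trace_point_id >> 8 or block_id >> block_bits or timestamp >> ts_bits:
        raise ValueError("field does not fit its wire width")
    if payload >> (128 - payload_shift):
        raise ValueError("payload does not fit in the 16-byte packet")
    word = 0b11                          # valid=1, started=1 (the flag*3 idiom)
    word |= trace_point_id << 2
    word |= block_id << 10
    word |= timestamp << ts_shift
    word |= payload << payload_shift
    return word.to_bytes(16, "little")
```

The range checks are an addition for safety in the sketch; the text above does not say whether the binary's encoders validate field widths.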
The two id spaces, paired
The wire id and the oneof field are distinct namespaces; the handler stamps the oneof field via movl $field,0x28(%entry). Worked, byte-confirmed pairs:
| trace_point_id (wire) | event | oneof field (proto) | encode bound cmp |
|---|---|---|---|
| 0 | UhiHostDmaTransactionStartedAddressTranslation | 2 | — |
| 1 | UhiHostPhysicalRequestRead | 3 | — |
| 40 | IciPacketPacketReceivedOnLinkInput | 21 (0x15) | — |
| 81 | TcsInternalSetSyncFlag | 38 (0x26) | — |
| 97 | ThrottleStateThermalAndElectrical | 54 (0x36) | — |
Per-family oneof-field encode bounds (the cmp $imm before the jump): pxc 0x62, vlc 0x4d, vfc 0x79, glc 0x7f, gfc 0x7f. The full id↔field registry is owned by TracePoints Master Registry.
QUIRK: decode max handled wire id (0x95 pxc, spread across two jump tables split at 0x6e/0x6f) and encode max dispatch index (0x62 pxc, measured after the field-2 bias, so the largest oneof field is 0x64) differ because the wire id is banded with gaps while the oneof field is dense. Both count the same set of events (99 for pxc); only the indexing differs. A reimplementation that sizes one table to the other's bound will overrun or truncate.
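One way to see the dense/banded relationship is to rank the handled pxc wire ids in ascending order and add 2. This is a hypothesis, not the registry: it reproduces the five byte-confirmed pairs and the 0x62 encode bound, but every other pairing it produces is an assumption, and the TracePoints Master Registry remains authoritative. The names below are hypothetical.

```python
PXC_HANDLED_RANGES = [  # handled pxc wire-id ranges from the band table
    (0, 6), (7, 10), (20, 27), (40, 48), (49, 55), (80, 90),
    (91, 96), (97, 97), (100, 134), (140, 149), (255, 255),
]


def build_pairing() -> dict:
    """Map wire id -> hypothesized oneof field (ascending rank + 2)."""
    handled = [i for lo, hi in PXC_HANDLED_RANGES for i in range(lo, hi + 1)]
    return {wire_id: rank + 2 for rank, wire_id in enumerate(handled)}


def invert(pairing: dict) -> dict:
    """Map oneof field -> wire id, for driving encode from the proto side."""
    return {field: wire_id for wire_id, field in pairing.items()}
```

Keeping both directions as explicit maps makes the asymmetry visible: decode looks up by wire id, encode by oneof field, and neither table can be indexed with the other's key.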
The Per-Event Decode Contract and the Total-Bit CHECK
Each Decode<Name>(string_view, bool* started_out, TraceEntry* out) follows the same shape:
// generic Decode<EventName> contract (e.g. DecodeUhiHostPhysicalRequestRead @0xf5b0f20)
function Decode<Name>(view, started_out, entry):
if view.length <= 0xf: return error // need a full 16-byte packet
BitDecoder dec(view);
SkipBits(dec, 2); // skip the 2 framing bits
DecodeTraceHeader(entry, dec); // id/block/timestamp into TraceHeader
[ DecodeTraceIdHeader inline: 21/3/12 ] // when the event carries identity
variant = entry.mutable_<Name>(); // construct the oneof submessage
entry[+0x28] = <oneof field number>; // stamp the dense proto case
GetBits64(dec, w0, &variant.f0); // typed fixed-width payload …
GetBits64(dec, w1, &variant.f1);
…
consumed = (view.length * 8) - dec.bits_remaining(); // BitsDecoded()
CHECK(consumed == <hardcoded constant>); // MakeCheckOpString on mismatch ⇒ FATAL
*bytes_consumed = 0x10; // 16 bytes, on success
The final CHECK validates that the handler consumed exactly the bit count the format demands — absl::log_internal::MakeCheckOpString<…>(consumed, K, "decoder.BitsDecoded() == K") fires a fatal log on mismatch. This is the codec's self-consistency guard: it pins each event's total wire width.
Byte-confirmed total-bit CHECK constants (read directly from each decoder's MakeCheckOpString sites) and their full width sequences. Each sequence begins with the 2 framing bits, then [8/3/48] DecodeTraceHeader, then (for events that carry identity) the [21/3/12] TraceIdHeader, then the typed payload:
| Event (pxc) | id | oneof | decoder | CHECK constant(s) | width sequence (framing · header · [idhdr] · payload) |
|---|---|---|---|---|---|
| UhiHostDmaTransactionStartedAddressTranslation | 0 | 2 | 0xf5b0b80 | 128, 216 | 2 · 8/3/48 · 21/3/12 · 5,16,10,1,1,54,32 |
| UhiHostPhysicalRequestRead | 1 | 3 | 0xf5b0f20 | 128, 233 | 2 · 8/3/48 · 21/3/12 · 1,30,1,1,29,26,8,20,20 |
| TcsInternalSetSyncFlag | 81 | 38 | 0xf5b97a0 | 121 | 2 · 8/3/48 · (no idhdr) · 32,1,9,16,1,1 |
| IciPacketPacketReceivedOnLinkInput | 40 | 21 | 0xf5b56c0 | 125 | 2 · 8/3/48 · 21/3/12 · 3,3,6,1,1,12,1,1 |
| ThrottleStateThermalAndElectrical | 97 | 54 | 0xf5bc620 | 120 | 2 · 8/3/48 · (no idhdr) · 4,5,5,10,4,21,5,5 |
NOTE: the per-event CHECK constant is not always a single value. Both DecodeUhiHostDmaTransactionStartedAddressTranslation and DecodeUhiHostPhysicalRequestRead contain two MakeCheckOpString sites: the first CHECKs the prefix path at 128 bits, the second CHECKs the full path (216 and 233 respectively) after an optional/conditional payload group. The ThrottleStateThermalAndElectrical decoder, despite its name, carries a single oneof case (field 54 = 0x36) and a single CHECK == 120 (there is no 0x36/0x37 A/B discrimination at decode). The mechanism, a hardcoded total-bit CHECK per consumed path, is byte-confirmed CERTAIN; the precise constant is branch-dependent for events with optional/repeated payload fields. Recover the exact set per branch from the event's Decode<Name>/Encode<Name> pair.
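The width sequences in the table above can be cross-checked against the full-path CHECK constants by simple summation. The sketch below copies the widths from the table; `EVENTS`, `total_bits`, and `mismatches` are hypothetical names, and only the larger (full-path) constant is checked for the two events that list two.

```python
FRAMING = [2]
PXC_HEADER = [8, 3, 48]
PXC_ID_HEADER = [21, 3, 12]

EVENTS = {  # name -> (has TraceIdHeader, payload widths, full-path CHECK)
    "UhiHostDmaTransactionStartedAddressTranslation":
        (True, [5, 16, 10, 1, 1, 54, 32], 216),
    "UhiHostPhysicalRequestRead":
        (True, [1, 30, 1, 1, 29, 26, 8, 20, 20], 233),
    "TcsInternalSetSyncFlag": (False, [32, 1, 9, 16, 1, 1], 121),
    "IciPacketPacketReceivedOnLinkInput":
        (True, [3, 3, 6, 1, 1, 12, 1, 1], 125),
    "ThrottleStateThermalAndElectrical":
        (False, [4, 5, 5, 10, 4, 21, 5, 5], 120),
}


def total_bits(has_id_header: bool, payload: list) -> int:
    """Sum framing, header, optional TraceIdHeader, and payload widths."""
    id_header = PXC_ID_HEADER if has_id_header else []
    return sum(FRAMING + PXC_HEADER + id_header + payload)


def mismatches() -> dict:
    """Return {event: computed_total} for rows whose sum disagrees."""
    return {name: total_bits(has_id, payload)
            for name, (has_id, payload, want) in EVENTS.items()
            if total_bits(has_id, payload) != want}
```

An empty result from `mismatches()` means every tabulated sequence adds up to its constant; the same check is a cheap regression guard when dumping further events.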
The per-band semantic field maps — what each width means — are owned by the payload pages: UHI/OCI/ICI/DMA, SparseCore band, vfc/vlc/gfc. The exhaustive (offset, width, semantic) tuple for all 99–144 events per family is mechanically dumpable from the Decode<Name>/Encode<Name> pairs but is not tabulated here (LOW confidence on completeness — same gap as the payload pages note).
The Decode Driver and Per-Family Factory
DecodeTraceBuffers — the walk loop
xprof::tpu::DecodeTraceBuffers<TraceEntry> @ 0xf59ffa0 (pxc instantiation) is the driver that turns a compressed device-trace blob into a RepeatedPtrField<TraceEntry>:
// DecodeTraceBuffers<TraceEntry> @0xf59ffa0
function DecodeTraceBuffers(codec, out_entries, ..., scratch):
StringReader src(blob); // @0xf59eac0
ZlibReader zin(&src); zin.Initialize(); // ZlibReaderBase::Initialize @0xf69f9e0
ReadAllImpl(&zin, &decompressed); // read_all_internal::ReadAllImpl @0xf5acf40
view = decompressed;
while (view.length >= 0x10) { // one 16-byte packet at a time
bool valid, started;
TraceEntry* e = out_entries.Add();
codec->vtable.DecodeEntry(view, e, &valid, &started); // call *0x18(vtable)
if (!valid) break; // graceful EOS
view.remove_prefix(0x10); // advance 16 bytes
}
The transport — the riegeli record framing around the ZlibReader, the zlib window/dictionary, and whether multiple per-core ring drains are separate riegeli records — is owned by riegeli Trace Container. DecodeTraceBuffers itself only sees one inflated StringReader stream.
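The walk loop can be sketched with the standard-library zlib module. This assumes the blob is a plain zlib stream with the riegeli record framing already removed, and it takes a hypothetical `decode_entry` callback that returns `(valid, entry)` for one packet; neither detail is confirmed by the decompile.

```python
import zlib

PACKET_SIZE = 16


def walk_packets(compressed: bytes, decode_entry) -> list:
    """Inflate a blob and decode whole 16-byte packets until end of stream.

    decode_entry(packet) is a hypothetical callback returning (valid, entry).
    """
    data = zlib.decompress(compressed)
    view = memoryview(data)
    entries = []
    while len(view) >= PACKET_SIZE:      # a trailing partial packet is ignored
        valid, entry = decode_entry(bytes(view[:PACKET_SIZE]))
        if not valid:
            break                        # first cleared slot: graceful stop
        entries.append(entry)
        view = view[PACKET_SIZE:]
    return entries
```

Unlike the decompiled driver, which adds the output entry before calling DecodeEntry, this sketch appends only after a valid decode, so the result never contains a trailing empty entry.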
TraceCodecInterface and the factory
The per-chip-family codec is a concrete TraceCodecInterface<TraceEntry> (abstract base; the four vtable slots are DecodeEntry/EncodeEntry/GetMaxEntrySize/GetEntryPacketSize). It is constructed by CreateTraceCodec per family and registered into a static type-factory keyed by asic_sw::DeviceIdentifiers (the chip codename), via DeviceIdentifiersAsString:
| Family | CreateTraceCodec | DecodeEntry | DecodeTraceHeader | block/ts | DecodeTraceBuffers template |
|---|---|---|---|---|---|
| pxc | 0xf5af2c0 (plc symbol) | 0xf5af3a0 | 0xf5d4f20 | 3/48 | <pxc::…::TraceEntry> |
| vfc | 0xf5f5da0 | (per family) | 0xf628080 | 6/45 | <vxc::vfc::…::TraceEntry> |
| vlc | 0xf5d5180 | (per family) | 0xf5f5b40 | 3/45 | <vxc::vlc::…::TraceEntry> |
| glc | 0xf6282e0 | (per family) | 0xf65eaa0 | 6/45 | <gxc::glc::…::TraceEntry> |
| gfc | 0xf65ed00 | (per family) | 0xf697b00 | 6/45 | <gxc::gfc::…::TraceEntry> |
| jxc | (legacy path) | — | — | — | <jxc::PerformanceTraceEntry> |
GetTraceCodec — the selector
xprof::tpu::GetTraceCodec(asic_sw::DeviceIdentifiers, int) @ 0xf5a2900 is the runtime selector. The decompile shows it walking a chain of asic_sw::internal::TypeFactoryBase<DeviceIdentifiers, &DeviceIdentifiersAsString, TraceCodecInterface<…::TraceEntry>, false>::Create<> attempts — the function body references the vlc, vfc, glc, gfc, and jxc family codecs — and falls through to the pxc factory as the default:
// GetTraceCodec @0xf5a2900 (decompiled skeleton)
function GetTraceCodec(out, device_ids, gen):
if (try TypeFactoryBase<…vlc::TraceEntry>::Create(out, device_ids)) return out;
if (try TypeFactoryBase<…vfc::TraceEntry>::Create(out, device_ids)) return out;
if (try TypeFactoryBase<…glc::TraceEntry>::Create(out, device_ids)) return out;
// … gfc via its own GetTraceCodec<gfc::TraceEntry> @0xf5a2b60; jxc PerformanceTraceEntry path …
TypeFactoryBase<…pxc::TraceEntry>::Create(out, device_ids); // default
return out;
Each family also has a templated GetTraceCodec<…::TraceEntry> instantiation (e.g. gfc @ 0xf5a2b60) that wraps the family-specific unique_ptr<TraceCodecInterface<TraceEntry>> construction. The selector returns a unique_ptr the DecodeTraceBuffers driver then drives polymorphically.
QUIRK: jxc does not share this codec. It uses a different proto type, asic_sw::driver::deepsea::jxc::PerformanceTraceEntry, decoded by its own DecodeTraceBuffers<PerformanceTraceEntry> instantiation, not the fixed-16-byte TraceEntry path. A reimplementation that assumes one packet schema across all generations will misparse jxc traces. The jxc specifics are on Payload: jxc Legacy.
Relevant Struct and Table Offsets
| Symbol | Address / offset | Role |
|---|---|---|
| TraceEntry (proto2) | +0x18 trace_header ptr; +0x20 active oneof variant ptr; +0x28 oneof case (proto field number, the encode dispatch key) | the decoded message |
| TraceHeader (proto2) | +0x10 presence (bit0=id, bit1=block, bit2=ts); +0x18 trace_point_id; +0x1c block_id; +0x20 timestamp | the universal header |
| TraceIdHeader (proto2) | +0x10 presence; +0x18 transaction_id; +0x1c core_id; +0x20 chip_id | the identity sub-record: 36 bits on pxc (chip_id 12), 38 bits on vfc/vlc/glc/gfc (chip_id 14) |
| BitDecoder | +0x8 cursor, +0x10 end, +0x18 buffer (LSB-first), +0x20 bits_avail (0x28 total) | the bit-stream window |
| mask_ | 0xbe79440 | 65-qword mask table, mask_[k]=(1<<k)-1 |
| Decode jump table (low) | 0xab85bc0 | 111 rel32 entries, ids 0..0x6e, indexed by 8-bit trace_point_id (pxc) |
| Decode jump table (high) | 0xab85d7c | 145 rel32 entries, ids 0x6f..0xff (id-0x6f), contiguous after the low table |
| Encode jump table | 0xab85fc0 | rel32 entries, indexed by oneof_field - 2 (pxc) |
| MakeErrorImpl<3> string | 0x9ff8a9c | "Found a valid but not started packet." (INVALID_ARGUMENT; 0x2e02 is the source line) |
Related Components
| Component | Relationship |
|---|---|
| riegeli Trace Container | the compressed transport that DecodeTraceBuffers inflates before this codec walks 16-byte packets |
| TracePoints Master Registry | the wire-id ↔ oneof-field two-id-space table the dual dispatch realizes |
| Payload: UHI/OCI/ICI/DMA | the per-band payload field maps for the host-DMA/interconnect/fabric bands |
| Payload: SparseCore Band | the TCS/SparseCore band payload field maps |
| Payload: vfc/vlc/gfc | the newer-family payload deltas (6-bit block_id, 45-bit timestamp) |
| Payload: jxc Legacy | the separate PerformanceTraceEntry schema and its own codec |
| TraceEntry → XEvent/XStat | the downstream shaping of a decoded TraceEntry into a device-plane XEvent + XStats |
Cross-References
- Profiling and Telemetry Overview: the five-stage capture→encode→decode→xplane pipeline this codec is stage 3 of
- riegeli Trace Container: stage 2, the zlib/riegeli compressed transport feeding DecodeTraceBuffers
- TracePoints Master Registry: the banded wire id and dense oneof field id spaces this page's dual dispatch keys on
- TraceEntry → XEvent/XStat: stage 5, where the decoded proto becomes an XEvent and the raw cycle timestamp is converted to picoseconds
- Payload: UHI/OCI/ICI/DMA · SparseCore Band · vfc/vlc/gfc · jxc Legacy: the per-band payload field maps this codec page deliberately does not duplicate