libtpu Internals — Reverse-Engineering Reference

Status: 426 pages across 18 parts · Primary binary: libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so — 781,691,048 B, x86-64 ELF64 DYN, not stripped, build-id 89edbbe81c5b328a958fe628a9f2207d · Secondary: sdk.so (94,732 functions)

What this reference is

A reimplementation-grade reverse-engineering reference for Google's libtpu.so — the PJRT plugin that exposes Cloud TPU hardware to JAX, PyTorch/XLA, and TensorFlow. It is the functional equivalent of NVIDIA's libcuda.so + libnvrtc.so + the device-specific half of nvcc/ptxas, compressed into a single 745 MB monolithic shared object that statically links the entire XLA compiler, every TPU MLIR dialect, the per-generation LLVM backends, oneDNN, tcmalloc, Abseil, gRPC, protobuf, Eigen, the TPU runtime, the device-driver shim, and the ICI/DCN fabric stack.

Everything here was reconstructed purely from static analysis of the binary: objdump, nm, readelf -rW, raw byte reads, and protoc --decode_raw of carved descriptors. The binary ships unstripped (884,832 disassembler-recovered functions, 881,784 of them, or 99.66 %, carrying a real symbol name), which is why reconstruction reached byte-exact / reimplementation grade across most of the surface.

Why it is hard

  • 884,832 functions in the analysis database (the per-function artifact directories hold a slightly higher 884,843 files, counting thunk/alias/data-stub entries; function counts cite 884,832, see Binary Forensics Overview); 1,249,324 strings; ~52 GB of extracted IDA sidecars.
  • 40,313 dispatch tables (≈100× ptxas's 409), classified into 19 taxonomy classes.
  • 160,351 RTTI records (_ZTI 60,457 · _ZTV 39,244 · _ZTS 60,650 · 2), the 60,457 typeinfos led by mlir:: (13,091), asic_sw:: (11,379), tensorflow:: (3,108), xla:: (3,036), llvm:: (2,940), with dnnl:: / std:: / grpc_core:: and a long vendored tail behind them.
  • ~2,900 static constructors in .init_array; 1,069,659 relocations (of which 1,069,006 are R_X86_64_RELATIVE).
  • The section-header table ends exactly at EOF — there is no trailing payload past it. A zstd magic immediate that appears inside .text is an inline constant, not a stored compression frame; see Trailing zstd Blob.
  • Custom ELF sections (google_malloc, malloc_hook, protodesc_cold, filewrapper_toc, __rseq_cs, __lcxx_override).
  • Six TPU silicon generations under a Google-internal codename ladder: jellyfish → dragonfish → pufferfish → viperfish → ghostlite → 6acc60406, each with its own ISA encoding, cost model, and HAL family.
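The "section-header table ends exactly at EOF" observation in the list above can be re-checked from the ELF header alone. The following is a minimal sketch, assuming a little-endian ELF64 image and the standard header layout; the helper names trailing_bytes and zstd_magic_offsets are illustrative and do not come from the reference.

```python
import struct

# Standard 64-byte little-endian ELF64 file header.
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
# zstd frame magic 0xFD2FB528 as it appears in a little-endian byte stream.
ZSTD_MAGIC = bytes.fromhex("28b52ffd")


def trailing_bytes(ehdr: bytes, file_size: int) -> int:
    """Bytes between the end of the section-header table and EOF."""
    (ident, _type, _machine, _version, _entry, _phoff, shoff, _flags,
     _ehsize, _phentsize, _phnum, shentsize, shnum, _shstrndx) = \
        ELF64_EHDR.unpack(ehdr[:64])
    if ident[:4] != b"\x7fELF" or ident[4] != 2:
        raise ValueError("not an ELF64 image")
    return file_size - (shoff + shentsize * shnum)


def zstd_magic_offsets(blob: bytes) -> list[int]:
    """Every offset of the 4-byte zstd magic; hits are candidates only."""
    hits = []
    pos = blob.find(ZSTD_MAGIC)
    while pos != -1:
        hits.append(pos)
        pos = blob.find(ZSTD_MAGIC, pos + 1)
    return hits
```

On the real file, ehdr would be the first 64 bytes and file_size the 781,691,048 B quoted in the status line; a result of 0 is what the bullet claims. A magic hit inside .text still needs a frame-header sanity check before it counts as a stored frame.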

Two-tier C ABI

                JAX / PyTorch-XLA / TensorFlow
                          │
                          ▼  PJRT C-API (v0.103)
              ┌──────────────────────────────┐
              │   PJRT layer (outer C-API)   │   ← outer ABI  (Part II)
              │   GetPjrtApi @ 0xe6a83a0     │
              │   140-slot PJRT_Api struct   │
              │   17 extensions chained      │
              └──────────────┬───────────────┘
                             │  Tpu* C shim (~200 symbols)
              ┌──────────────────────────────┐
              │   libtpu runtime + compiler  │   ← inner ABI  (Part III)
              │   xla::jellyfish::*          │
              │   asic_sw::deepsea::*        │
              │   platforms_deepsea::*       │
              └──────────────────────────────┘

How this reference is organized

The 18 parts follow the data's own dependency chain, not an alphabetical or importance order. Each part can be read assuming only the parts before it:

  silicon model ─► compiler passes ─► ISA encoding ─► cost model ─► scheduling
       (IV)             (V)              (VI)            (VII)         (VIII)
                                                                         │
   specialized engine (SparseCore, IX) ◄───────────────────────────────┘
       │
   on-chip memory & DMA (X) ─► runtime (XI) ─► distributed fabric (XII–XIV)
                                                  │
                              observability (XV) ─┴─► configuration (XVI)

The compiler back-end is deliberately factored along the canonical three-concern seam — what instructions exist (VI), what they cost (VII), how to order and pack them (VIII) — because in this binary the cost-model data is ~3× the volume of the scheduling algorithms, and conflating them produced a 50-page monster. SparseCore (IX) is kept whole rather than sliced across that seam: it is a self-contained engine a reader wants in one place.

Status and evidence grade

Each page below carries a grade reflecting how directly its claims are anchored in the binary:

  • C (Confirmed / reimplementation-grade): byte-anchored against objdump/nm/readelf or protoc --decode_raw of carved descriptors. The default for the byte-level deep-dive pages.
  • I (Inferred / synthesis): foundational, forensic-survey, per-gen-parametric, or connective overview pages.
  • O (Open): not yet recovered; tracked in the Open-Frontier Register.

The evidence grade above (C/I/O) is the per-page label that matters. An O (open) page flags a specific not-yet-recovered detail, tracked in the Open-Frontier Register.

Parts at a glance

The Open column counts the pages still carrying an O (not-yet-recovered detail) grade.

| Part | Title | Pages | Open | Depends on | Source domain |
|------|-------|-------|------|------------|---------------|
| 0 | Reference Apparatus | 9 | 0 | | |
| I | Binary Anatomy | 12 | 0 | 0 | forensics / dispatch / RTTI |
| II | Plugin Lifecycle & PJRT API | 23 | 0 | I | runtime / PJRT |
| III | Tpu C-Shim Layer | 10 | 0 | II | shim |
| IV | Silicon & Hardware Codename Model | 24 | 0 | | silicon |
| V | Compiler: Lowering & Optimization Passes | 36 | 0 | IV | compiler |
| VI | TensorCore ISA & LLO Encoding | 42 | 2 | IV, V | ISA |
| VII | Cost & Latency Model | 41 | 0 | IV, VI | cost |
| VIII | Instruction Scheduling & Bundle Packing | 14 | 0 | VI, VII | cost / scheduling |
| IX | SparseCore & BarnaCore | 45 | 0 | IV, VI, VII | sparsecore |
| X | On-Chip Memory & DMA | 20 | 0 | IV | memory / DMA |
| XI | Runtime & Execution | 11 | 0 | II, VI, X | runtime |
| XII | Interconnect & Routing | 30 | 0 | IV | collectives / routing |
| XIII | On-Pod Collectives & Barriers | 30 | 0 | IX, XII | collectives |
| XIV | Megascale (Multi-Host / DCN) | 21 | 0 | XII, XIII | collectives / DCN |
| XV | Profiling & Telemetry | 22 | 0 | XI, XII | profiler |
| XVI | Configuration & Compile Knobs | 16 | 1 | V, VII | config |
| XVII | Appendices | 20 | 0 | all | cross-cutting |
| Total | | 426 | 3 | | |
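
The claim that each part can be read assuming only the parts before it can be checked mechanically against the Depends-on column above. The sketch below transcribes that column by hand (Part XVII's "all" is left out because it is satisfied trivially); the function names are illustrative.

```python
# Parts in book order, and the "Depends on" column transcribed from the table.
ORDER = ["0", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX",
         "X", "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII"]
DEPENDS_ON = {
    "I": ["0"], "II": ["I"], "III": ["II"], "V": ["IV"],
    "VI": ["IV", "V"], "VII": ["IV", "VI"], "VIII": ["VI", "VII"],
    "IX": ["IV", "VI", "VII"], "X": ["IV"], "XI": ["II", "VI", "X"],
    "XII": ["IV"], "XIII": ["IX", "XII"], "XIV": ["XII", "XIII"],
    "XV": ["XI", "XII"], "XVI": ["V", "VII"],
}
RANK = {part: i for i, part in enumerate(ORDER)}


def forward_references() -> list[tuple[str, str]]:
    """(part, dependency) pairs where the dependency does not come earlier."""
    return [(p, d) for p, deps in DEPENDS_ON.items()
            for d in deps if RANK[d] >= RANK[p]]


def prerequisites(part: str) -> list[str]:
    """Transitive prerequisites of a part, listed in book order."""
    seen, stack = set(), list(DEPENDS_ON.get(part, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPENDS_ON.get(dep, []))
    return sorted(seen, key=RANK.__getitem__)
```

An empty forward_references() result means the table is consistent with the stated reading order; prerequisites("VIII") gives the full set of parts the scheduling material leans on.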

Per-generation navigation cross-index

The book is heavily per-generation. To trace one silicon generation end-to-end, follow its row:

| TpuVersion | Codename | Cloud / marketing | Family page | ISA bundle | MXU latency | Performance grid |
|------------|----------|-------------------|-------------|------------|-------------|------------------|
| 0 | Jellyfish | TPU v2 | targets/jxc-family.md | isa/bundle-jf-41b.md | cost/mxu-latency-jf-df.md | cost/performance-jf-df.md |
| 1 | Dragonfish | TPU v3 | targets/jxc-family.md | isa/bundle-df.md | cost/mxu-latency-jf-df.md | cost/performance-jf-df.md |
| 2 | Pufferfish | TPU v4 | targets/pxc-family.md | isa/bundle-pf-51b.md | cost/mxu-latency-pf.md | cost/performance-pf.md |
| 3 | Viperfish | TPU v5 / v5e | targets/vxc-family.md | isa/bundle-vf-64b.md | cost/mxu-latency-vf.md | cost/performance-vf.md |
| 4 | Ghostlite | TPU v6e (Trillium) | targets/gxc-family.md | isa/bundle-gl.md | cost/mxu-latency-gl.md | cost/performance-gl-ghperf.md |
| 5 | 6acc60406 | TPU7x | targets/gxc-family.md | isa/bundle-gf.md | cost/mxu-latency-gf.md | cost/performance-gf-ghperf.md |

The one-page consolidated constants table is Per-Gen Master Comparison Matrix.
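
For scripted navigation, the cross-index rows can be held as a small lookup keyed by TpuVersion. This is only a sketch of one possible representation: the page paths are copied from the rows above, while GenPages, trace, and generations_sharing are illustrative names.

```python
from typing import NamedTuple


class GenPages(NamedTuple):
    codename: str
    cloud_name: str
    family: str
    bundle: str
    mxu_latency: str
    performance: str


CROSS_INDEX = {
    0: GenPages("Jellyfish", "TPU v2", "targets/jxc-family.md",
                "isa/bundle-jf-41b.md", "cost/mxu-latency-jf-df.md",
                "cost/performance-jf-df.md"),
    1: GenPages("Dragonfish", "TPU v3", "targets/jxc-family.md",
                "isa/bundle-df.md", "cost/mxu-latency-jf-df.md",
                "cost/performance-jf-df.md"),
    2: GenPages("Pufferfish", "TPU v4", "targets/pxc-family.md",
                "isa/bundle-pf-51b.md", "cost/mxu-latency-pf.md",
                "cost/performance-pf.md"),
    3: GenPages("Viperfish", "TPU v5 / v5e", "targets/vxc-family.md",
                "isa/bundle-vf-64b.md", "cost/mxu-latency-vf.md",
                "cost/performance-vf.md"),
    4: GenPages("Ghostlite", "TPU v6e (Trillium)", "targets/gxc-family.md",
                "isa/bundle-gl.md", "cost/mxu-latency-gl.md",
                "cost/performance-gl-ghperf.md"),
    5: GenPages("6acc60406", "TPU7x", "targets/gxc-family.md",
                "isa/bundle-gf.md", "cost/mxu-latency-gf.md",
                "cost/performance-gf-ghperf.md"),
}


def trace(tpu_version: int) -> list[str]:
    """Pages to read, in order, for one generation: family, bundle, latency, grid."""
    g = CROSS_INDEX[tpu_version]
    return [g.family, g.bundle, g.mxu_latency, g.performance]


def generations_sharing(page: str) -> list[str]:
    """Codenames whose row points at the given page."""
    return [g.codename for g in CROSS_INDEX.values() if page in g[2:]]
```

generations_sharing makes the shared pages visible: the two oldest generations share their family, MXU-latency, and performance pages, and the two newest share one family page.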

Reading paths

  • Reimplement the cost model / scheduler — IV (silicon constants) → VI (ISA) → VII (cost data) → VIII (scheduling).
  • Understand TPU-to-TPU collectives — IV → XII (fabric + routing) → XIII (collective algorithms) → XIV (multi-host).
  • Parse a compiled program / bundle bytes — VI (ISA encoding) → X (memory & DMA) → XI (runtime load/exec).
  • Write or debug a PJRT consumer — II (PJRT API) → III (Tpu C-shim) → XI (execution) → XV (profiling).
  • Trace one TPU generation end-to-end — use the per-gen cross-index above: family (IV) → bundle (VI) → MXU latency + perf grid (VII).
  • Debug a hang / deadlock — XIII (barriers + SFLAG) → X (continuation-queue) → XII (VC-balance + routing).
  • Just get oriented — 0 (Reference Apparatus, esp. the Compile-Flow Walkthrough) → I (Binary Anatomy) → IV (codename model).

Conventions

  • Function addresses are virtual addresses (@0x…); for .text/.rodata/.lrodata, VA == file offset.
  • Each page carries a References block: the source binary and the function/symbol virtual addresses it cites.

NOTE — the VA == file-offset rule holds only for .text/.rodata/.lrodata. For .data the file offset is VA − 0x400000, and for .data.rel.ro it is VA − 0x200000; seeking with xxd/objdump at the raw VA for a struct that resides in those sections reads the wrong bytes. The full section map is in ELF Anatomy.
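
The rule in the note can be captured in a few lines. This is a minimal sketch that hard-codes only the five sections and two deltas named above; anything else is rejected rather than guessed, and the caller is assumed to already know which section a VA falls in (the ELF Anatomy page holds the full map).

```python
# VA - file_offset for the sections named in the note above.
SECTION_DELTA = {
    ".text": 0x0,
    ".rodata": 0x0,
    ".lrodata": 0x0,
    ".data": 0x400000,
    ".data.rel.ro": 0x200000,
}


def file_offset(va: int, section: str) -> int:
    """Translate a virtual address to a file offset for a known section."""
    try:
        delta = SECTION_DELTA[section]
    except KeyError:
        raise ValueError(f"no VA-to-offset rule recorded for {section!r}") from None
    if va < delta:
        raise ValueError(f"VA {va:#x} is below the {section} delta {delta:#x}")
    return va - delta
```

For example, file_offset(0xe6a83a0, ".text") returns the address unchanged (assuming GetPjrtApi, being a function, lies in .text), whereas any VA in .data must have 0x400000 subtracted before seeking.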

The source corpus

Every page in this book is derived from static analysis of libtpu.so — its symbol table, disassembly, and decompilation. The complete input set that the analysis ran against, down to the byte, is inventoried in the Source-Corpus Map; the methodology that produced and consumed it is described in Methodology (Deep).


Master Index

Part 0 — Reference Apparatus (9)

Orientation and connective tissue. Read the Compile-Flow Walkthrough first — it traces one matmul through every part and is the on-ramp to the whole book.

  • index.md · Landing / This Reference · I
    What libtpu is, binary provenance, organization, per-gen cross-index, reading paths.
  • front/how-to-read.md · How to Read This Book · I
    Evidence grades, the dependency-flow rationale, the reading-path personas.
  • front/compile-flow-walkthrough.md · Compile-Flow Walkthrough · I
    One dot op traced HLO → MHLO → tpu → LLO → bundle bytes → execution, cross-referencing every part. src: P-2-03, P-2-01
  • methodology.md · Methodology · I
    Extraction pipeline, IDA sidecars, FLIRT, protoc --decode_raw, naming conventions. src: W001
  • subsystem-map.md · Subsystem Map · I
    The 13-domain dependency web and how the 18 parts cover it. src: W001, G001–G003
  • front/codename-cheatsheet.md · Codename Cheat-Sheet · C
    TpuVersion 0–5 ↔ codename ↔ chip-DID ↔ Cloud name on one card.
  • glossary.md · Glossary · I
    LLO, MXU, XLU, EUP, SCS/TAC/TEC, MRB, SFLAG, ICI/DCN and the rest.
  • front/evidence-conventions.md · Evidence & Confidence Conventions · I
    The C/I/O grades and the anchor format used throughout. src: W001
  • bibliography.md · Bibliography · I
    External references; explicit note on what is not in the binary (Trillium/Ironwood are external names only).

Part I — Binary Anatomy (12)

How the 745 MB ELF is laid out and navigated. Analysis and orientation only; the large enumerated catalogs live in Part XVII.

  • forensics/overview.md · Overview · I
    The two binaries, the section model, why it is this large. src: W023, W027
  • forensics/elf-anatomy.md · ELF Anatomy · C
    52 sections, segments, the VA==offset rule, .lrodata/.rodata/.text extents. src: W023, W030
  • forensics/two-binary-split.md · libtpu.so + sdk.so · C
    The 884,832-fn main object and the 94,732-fn sdk; symbol-population shape. src: W001, W026
  • forensics/custom-sections.md · Custom Sections · C
    google_malloc, protodesc_cold, filewrapper_toc, __rseq_cs, __lcxx_override. src: P-2-07, W030
  • forensics/embedded-library-atlas.md · Embedded-Library Atlas · C
    Vendored Abseil/protobuf/Eigen/oneDNN/tcmalloc/LLVM byte-accounting (FLIRT).
  • forensics/llvm-mlir-manifest.md · LLVM/MLIR Version Manifest · C
    Embedded toolchain version + the component list.
  • forensics/static-init.md · Static-Init Pipeline · I
    ~2,900 ctors, init ordering, the plugin-discovery hooks.
  • forensics/trailing-zstd-blob.md · Trailing zstd Blob · C
    Why no trailing payload exists: the "4.1 MB dictionary blob" false-positive correction (ZSTD-01). src: P-2-30
  • forensics/dispatch-table-taxonomy.md · Dispatch-Table Taxonomy · C
    40,313 tables → 19 classes (MLIR Op-Model, UFB pools, libpfm4, dnnl/Xbyak…). src: P-2-06
  • forensics/rtti-vtable-census.md · RTTI ↔ Vtable Cross-Validation · C
    Every typeinfo mapped to its vtable; the namespace census.
  • forensics/per-gen-function-dispatcher.md · Per-Generation Function Dispatcher · C
    The util_registration::FunctionRegistry dispatch engine.
  • forensics/polymorphic-entry-points.md · Polymorphic Dispatch Entry Points · C
    The indirect-call sites + the thunk-table and top-vtable classes.

Part II — Plugin Lifecycle & PJRT API (23)

The outer ABI: how the plugin loads and the 140-slot PJRT_Api struct JAX/PyTorch consume.

Lifecycle

  • lifecycle/overview.md · Overview · I
    From dlopen to a usable client.
  • lifecycle/elf-entry-and-init-proc.md · ELF Entry & init_proc · C
    _init, .init_array, the GOT/PLT bring-up.
  • lifecycle/do-init-do-fini.md · do_init / do_fini · C
    Constructor/destructor ordering and global state.
  • lifecycle/get-pjrt-api-thunk.md · GetPjrtApi Thunk & tpu_plugin Object · C
    @0xe6a83a0 trampoline → GetTpuPjrtApi → 17 __cxa_guard blocks. src: P-2-05
  • lifecycle/tftpu-initialize-bootstrap.md · TfTpu_Initialize Bootstrap · I
    The initialize entry and option ingest.
  • lifecycle/module-init-plugin-discovery.md · Module-Init & Plugin Discovery · C
    How the PJRT plugin is registered and found.

PJRT_Api surface

  • pjrt/overview.md · Overview · I
    The C-API version, the extension-chain idea, .lbss storage.
  • pjrt/api-vtable-reconstruction.md · PJRT_Api 140-Slot Reconstruction · C
    Every slot → libtpu impl VA, v0.103 schema, the @0x227BA840 1120-B table. src: P-2-05
  • pjrt/client-and-device.md · Client, Device & Topology · C
    PJRT_Client_*, device enumeration, addressable devices.
  • pjrt/buffer-and-memory.md · Buffer ABI & Memory Layouts · C
    PJRT_Buffer_*, on-device layout, external refcounting.
  • pjrt/executable-execution.md · Executable Loading & Execution · C
    compile → load → execute, serialization. src: P-022
  • pjrt/events-and-async.md · Events & Async Tracking · I
    PJRT_Event Await/OnReady/IsReady. src: P-2-09
  • pjrt/collectives-communicator.md · Collectives Communicator · C
    CreateCommunicators, cross-host handles.
  • pjrt/dma-and-cross-host-recv.md · DMA Map & Cross-Host Receive · C
    The DMA-map slots + cross-host buffer receive.
  • pjrt/callbacks.md · Callbacks & Pre-Fatal Hook · I
    Host-callback registration, the pre-fatal hook. src: P-2-09
  • pjrt/extension-chain.md · Extension Chain (17) · C
    The linked extension list and how it is walked.
  • pjrt/ext-profiler.md · Extension: Profiler (type 1) · C
    PLUGIN_Profiler_Api 8 slots.
  • pjrt/ext-topology-description.md · Extension: TopologyDescription (type 16) · C
    The TPU topology query.
  • pjrt/ext-rawbuffer.md · Extension: RawBuffer (type 8) · C
    Raw device buffers.
  • pjrt/ext-compile-phasecompile.md · Extension: Compile / PhaseCompile (type 9) · C
    The compile-options flow.
  • pjrt/ext-remaining.md · Extensions: Layouts / Memories / Stream / FFI / … · C
    The remaining chain entries.
  • pjrt/stream-executor-host-interpreter.md · StreamExecutor Host Interpreter · C
    The HloEvaluator host fallback.
  • pjrt/stream-executor-pjrt-adapter.md · StreamExecutor → PJRT Adapter · C
    xla::TpuClient / the CommonPjRt framework.
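
The linked extension list mentioned above is a singly linked chain of headers, and walking it needs nothing more than following a next pointer until NULL. The sketch below builds such a chain in memory with ctypes instead of loading the plugin. The three-field header layout (struct_size, type, next) is assumed from the public PJRT C API header and was not taken from this reference; ExtensionBase, build_chain, and walk_chain are illustrative names.

```python
import ctypes


class ExtensionBase(ctypes.Structure):
    """Assumed extension header: struct_size, type id, pointer to next."""


ExtensionBase._fields_ = [
    ("struct_size", ctypes.c_size_t),
    ("type", ctypes.c_int),
    ("next", ctypes.POINTER(ExtensionBase)),
]

# 140 pointer-sized slots on x86-64 match the 1120-B table size quoted above.
PJRT_API_TABLE_BYTES = 140 * 8


def build_chain(type_ids):
    """Create in-memory nodes linked in order; keep the list alive while walking."""
    nodes = [ExtensionBase(ctypes.sizeof(ExtensionBase), t) for t in type_ids]
    for cur, nxt in zip(nodes, nodes[1:]):
        cur.next = ctypes.pointer(nxt)
    return nodes


def walk_chain(head):
    """Collect the type id of every node reachable from head (NULL ends the walk)."""
    types = []
    node = head
    while node:  # a NULL ctypes pointer is falsy
        types.append(node.contents.type)
        node = node.contents.next
    return types
```

With nodes = build_chain([1, 8, 9, 16]), walk_chain(ctypes.pointer(nodes[0])) yields the four extension type ids named in this list (Profiler, RawBuffer, Compile / PhaseCompile, TopologyDescription) in chain order. The order used here is arbitrary; the real order is documented on the Extension Chain page.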

Part III — Tpu C-Shim Layer (10)

The inner C ABI between PJRT and the runtime/compiler: the Tpu* C functions that wrap the C++ internals.

  • shim/overview.md · Overview · I
    The ~200-symbol Tpu* C surface and how PJRT calls it. src: P-2-05, P-2-09
  • shim/tpu-compiler-roster.md · TpuCompiler Roster · C
    The compile entry points.
  • shim/tpu-executable-roster.md · TpuExecutable Roster · C
    Executable handle ops. src: P-022
  • shim/tpu-executor-roster.md · TpuExecutor Roster · C
    Stream/executor C functions.
  • shim/tpu-transfer-manager.md · TpuTransferManager Roster · C
    The host↔device transfer C ABI.
  • shim/tpu-program-roster.md · TpuProgram Roster · C
    Program object + serialization C ABI. src: P-022
  • shim/tpu-platform-and-topology.md · TpuPlatform & TpuNodeContext · C
    Platform init, node context.
  • shim/tpu-topology.md · TpuTopology & TpuCoreLocation · C
    The topology/core-location C ABI.
  • shim/tpu-embedding-engine.md · TpuEmbeddingEngine ABI · C
    The embedding/SparseCore C surface.
  • shim/tpu-configuration-api.md · TpuConfigurationApi · C
    Runtime configuration entry points.

Part IV — Silicon & Hardware Codename Model (24)

The hardware the whole compiler is parameterized by. Read before V–VIII: the cost model, ISA, and MSA defaults all key off the per-codename constants defined here. Canonical for: per-gen hardware constants (referenced by VI, VII, IX).

Codename identity

  • targets/overview.md · Overview · I
    Six generations, three HAL families, the dual-enum trap. src: P-2-08
  • targets/tpu-version-codename-matrix.md · 6-Codename Authoritative Reconciliation · C
    The 6-axis (enum / CHECK / chip_parts / PCI / namespace / marketing) cross-check; settles glc=v4, gfc=v5.
  • targets/dual-enum-proto-vs-internal.md · Dual Enum (Proto vs Internal) · C
    TpuVersionProto = internal + 1; the "chip_parts v6" resolution. src: P-2-08
  • targets/pci-device-ids.md · PCI Device IDs · C
    Chip DIDs 0x00d1/0x00f2, header DIDs, rev-masks, IsGlc/IsGfc → device-type.
  • targets/marketing-cloud-naming.md · Marketing / Cloud Naming · C
    v2…v6e/tpu7x; Trillium=v6e; Trillium/Ironwood NOT in the binary.
  • targets/codename-superseded-labels.md · Superseded-Label Correction List · C
    The "v5p"/"Trillium" mislabels and the Ghostfish gloss.
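
The dual-enum trap noted in this list is an off-by-one between two numbering schemes. A minimal sketch, assuming the +1 offset applies uniformly to all six generations and that values outside the resulting ranges are simply invalid (both are assumptions here; the Dual Enum page is the authority). The function names are illustrative.

```python
# Internal TpuVersion ordinal -> codename, per the per-generation cross-index.
CODENAMES = ["Jellyfish", "Dragonfish", "Pufferfish",
             "Viperfish", "Ghostlite", "6acc60406"]


def internal_to_proto(internal: int) -> int:
    """Internal ordinal (0-5) to proto ordinal, using proto = internal + 1."""
    if not 0 <= internal < len(CODENAMES):
        raise ValueError(f"unknown internal TpuVersion {internal}")
    return internal + 1


def proto_to_internal(proto: int) -> int:
    """Proto ordinal back to the internal ordinal."""
    internal = proto - 1
    if not 0 <= internal < len(CODENAMES):
        raise ValueError(f"unknown TpuVersionProto {proto}")
    return internal


def codename_from_proto(proto: int) -> str:
    """Codename for a proto-side version number."""
    return CODENAMES[proto_to_internal(proto)]
```

Under these assumptions a proto-side "v6" resolves to internal ordinal 5, which is how a chip_parts label can appear to be one generation ahead of the internal enum.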

HAL families

  • targets/hal-families.md · HAL Families · C
    jxc/pxc/vxc factories; glc/gfc under gxc; per-family Register immediates. src: P-2-08
  • targets/hal-factory-override-matrix.md · HAL Factory Override Matrix · C
    Which methods each family overrides vs inherits. src: P-2-35
  • targets/tpuhal-class-hierarchy.md · TpuHal Class Hierarchy · C
    Only TpuHalHardware* exists; the {Jxc,Pxc,Vxc}HardwareImpl tree.
  • targets/jxc-family.md · JXC Family (Jellyfish, Dragonfish) · C
    Identity, factory, PCI, cores.
  • targets/pxc-family.md · PXC Family (Pufferfish) · C
    Identity, factory, BarnaCore binding.
  • targets/vxc-family.md · VXC Family (Viperfish) · C
    Identity, std/lite variants, factory.
  • targets/gxc-family.md · GXC Family (Ghostlite, 6acc60406) · C
    glc/gfc namespaces, the anonymous v5 codec, chip DIDs.
  • targets/sub-core-taxonomy.md · Sub-Core Taxonomy (GFC/GLC/JXC/PXC/VFC/VLC) · C
    The sub-family encoder split. src: P-2-08

Per-codename hardware constants

  • targets/chip-parts-binarypb.md · chip_parts.binarypb Decode · C
    TpuChipPartsProto, the embedded v5/v7 blobs, the variant allow-list.
  • targets/per-codename-hw-constants.md · Per-Codename Constant Table · C
    The master integer source consumed by the cost model and ISA.
  • targets/tpu-topology-struct.md · TpuTopology Struct (Target+0x3b8) · C
    Per-codename chip geometry.
  • targets/tpu-chip-config.md · TpuChipConfig · C
    LaneCount/SublaneCount/ChunksPerTile, VexMatrixWidth.
  • targets/sparsecore-target-descriptor.md · SparseCoreTarget (Target+0x948) · C
    The per-codename MXU-contracting-depth map.
  • targets/target-capability-bitfield.md · Target Capability Bitfield (Target+0x628) · C
    The 2 live capability bits across the whole struct.
  • targets/kdevicetypeinfo-spec-constants.md · kDeviceTypeInfo Spec-Constants · C
    ~40 IEEE-754 doubles + the two DVFS frequency ladders.
  • targets/accuracy-tables.md · Per-Gen Accuracy Tables · C
    Transcendental approximation accuracy driving precision decisions.

Memory model primer (detailed allocators in Part X)

  • targets/memory-hierarchy.md · Memory Hierarchy · I
    HBM/VMEM/SMEM/CMEM/SFLAG tier model + the 17 MemorySpace values. src: P-2-01
  • targets/address-space-ids.md · Address-Space ID Table (AS0–AS9) · C
    Incl. the SparseCore fat-pointer AS7/8/9.

Part V — Compiler: Lowering & Optimization Passes (36)

The IR descent and the optimization passes. Silicon-parameterized (uses IV) but ISA-light: it lowers to LLO ops by name, not bits. Output is LLO IR; encoding is VI, cost/scheduling are VII/VIII.

Front-end and pipeline

  • compiler/overview.md · Overview · I
    DeepseaCompilerBase::RunHloPasses, compile phases 0–3. src: P-2-03
  • compiler/hlo-ingestion.md · HLO Ingestion · C
    StableHLO → HLO, the entry module. src: P-002
  • compiler/compile-phases.md · Compile Phases 0–3 · C
    Phase0Stablehlo … Phase3Linking. src: P-2-03
  • compiler/compilation-cache.md · Compilation Cache · C
    Keys, fingerprints, the hit path. src: P-022
  • compiler/hlo-pass-registry.md · HLO Pass Registry · C
    The three pipeline containers + the xla_* flag-atlas binding.
  • compiler/hlo-pre-passes.md · HLO Pre-Passes · C
    The passes that run before TPU lowering.
  • compiler/sharding-propagation.md · Sharding Propagation · C
    GSPMD sharding inference.
  • compiler/auto-sharding-spmd.md · Auto-Sharding / SPMD · C
    The partitioning passes.
  • compiler/algebraic-simplifier.md · Algebraic Simplifier · I
    TPU-specific algebraic rewrites.
  • compiler/dynamic-shape-support.md · Dynamic-Shape Support · C
    Bounded-dynamic model, dimension-size ops.
  • compiler/optimization-barrier.md · Optimization Barrier · C
    Insertion, honouring, erasure.
  • compiler/custom-call-lowering.md · Custom-Call Lowering & Registry · C
    The target catalog + the registration side.

MLIR lowering chain

  • compiler/mhlo-xtile-tpu-lowering.md · MHLO → XTile → tpu · C
    The dialect-chain conversion.
  • compiler/mosaic-overview.md · Mosaic Overview · C
    The kernel → tpu → LLO path.
  • compiler/mosaic-layout-inference.md · Mosaic Layout Inference · C
    VectorLayoutInferer per-op rules.
  • compiler/mosaic-vectorlayout.md · Mosaic VectorLayout · C
    The (sublane, lane) layout algebra.
  • compiler/tpu-dialect-and-ops.md · The tpu MLIR Dialect · C
    Ops, attributes, the 157-op surface.
  • compiler/tpu-to-llo-ods.md · tpu → LLO Lowering · C
    Per-op ODS operand/result/attr signatures, gain/staging registers.
  • compiler/dialect-conversion-legalizer.md · DialectConversion Legalizer · C
    The depth-aware legalization cost.
  • compiler/conversion-pattern-rewriter.md · ConversionPatternRewriter · C
    The rollback/rewrite-log engine, 1:N patterns.
  • compiler/lower-to-mlo-dma-bridge.md · LowerToMlo DMA Bridge-Cast · C
    The two-stage DMA lowering.
  • compiler/lower-to-sparsecore-llvm.md · LowerToSparseCoreLlvm · C
    Per-class rewrite bodies.
  • compiler/sc-type-converter.md · SCTypeConverter · C
    The addr-space → !llvm.ptr<addrspace> map.
  • compiler/mlir-op-model-contract.md · MLIR Op-Model Contract · C
    The 23-slot Model<Op> (6,050 instances).
  • compiler/llvmtpu-intrinsic-catalog.md · LlvmTpu Intrinsic Catalog · C
    The 1,356 tpu_* backend intrinsics.

Memory & layout optimization

  • compiler/msa-overview.md · MSA Overview · C
    The memory-space-assignment ILP pass.
  • compiler/msa-allocate-segment.md · MSA AllocateSegment · C
    The allocation body + config proto.
  • compiler/msa-per-version-defaults.md · MSA Per-Version Defaults · C
    Overlap ratios / outstanding-copy caps per gen.
  • compiler/msa-reservation-hbm-policy.md · MSA Reservation & HBM Policy · C
    MsaReservationPolicy / HbmPolicy field dicts.
  • compiler/layout-assignment.md · Layout Assignment · C
    FindMemoryMinimizingLayout weights + AddBackendConstraints.

Fusion, dot/conv, tiling

  • compiler/fusion-patterns.md · Fusion Patterns · C
    The TPU-specific fusion class roster.
  • compiler/fusion-cost-model.md · Fusion Cost Model · C
    Priority coefficients + the ShouldFuseImpl lambda set.
  • compiler/dot-conv-mxu-lowering.md · Dot / Conv → MXU Lowering · C
    Tile-cost comparator + EmitFunctorEnum.
  • compiler/raggeddot-convolution.md · RaggedDot → Windowed Convolution · C
    FromRaggedDot / DynamicSliceMaskedConv geometry.
  • compiler/loop-tiling-unrolling.md · Loop Tiling & Unrolling · C
    TileKind rules + LoopConfig proto + the pipeline unroller.
  • compiler/tpu-program-serialization.md · TpuProgram Serialization · C
    The final compiled-program container. src: P-022, P-2-09

Part VI — TensorCore ISA & LLO Encoding (42)

The target representation: LLO IR and the per-generation VLIW bundle bit-layouts. Self-contained — read independently of the cost model. Bundle packing (LLO→bytes) is in VIII.

Foundations

  • isa/overview.md · Overview · I
    LLO IR: 462 opcodes, 17 memory spaces, the proto-descriptor source. src: P-2-01, P-2-25
  • isa/llo-opcode-enum.md · LloOpcode Enum (462) · C
    Categories: scalar / vector / EUP / reduction / MXU / transpose / DMA / sync / BarnaCore. src: P-2-25
  • isa/memory-space-enum.md · MemorySpace Enum (17) · C
    HBM/VMEM/SMEM/SFLAG/IMEM/CMEM/SC/HOST/PINNED. src: P-2-01
  • isa/bundle-model-overview.md · Bundle Model · I
    Per-gen sizes (41/51/64 B), slot counts, bundles-per-DMA-chunk. src: P-2-04, P-2-34
  • isa/instbits-master-db.md · InstBits Master DB · C
    The LLVM-MC per-opcode base bits; the default-all-zero / no-RELA finding.
  • isa/instr-name-data.md · TPUInstrNameData / Descs / RegEncoding · C
    opcode→mnemonic, MCInstrDesc, the reg-encoding table.
  • isa/llo-opcode-to-proto.md · LloOpcode ↔ Proto · C
    The 462-entry map + the inverse ProtoToLloOpcode.
  • isa/mc-emitter.md · MC-Emitter (getBinaryCodeForInstr) · C
    The insertBits operand path + the HwMode select.
  • isa/record-format.md · 239-Bit Record Format · C
    The APInt record + per-operand insertBits(value, pos, width).
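
The insertBits(value, pos, width) operation named above overwrites a width-bit field at bit position pos inside a fixed-width record. The following sketch models the 239-bit record as a Python integer. The field positions in the usage note are made up for illustration, and the little-endian serialization is an assumption, not a statement about the binary's byte order.

```python
RECORD_BITS = 239  # record width quoted in the index entry above


def insert_bits(record: int, value: int, pos: int, width: int) -> int:
    """Return record with bits [pos, pos + width) replaced by value."""
    if width <= 0 or pos < 0 or pos + width > RECORD_BITS:
        raise ValueError("field does not fit inside the record")
    if value < 0 or value >> width:
        raise ValueError("value does not fit in the field width")
    mask = ((1 << width) - 1) << pos
    return (record & ~mask) | (value << pos)


def record_to_bytes(record: int) -> bytes:
    """Serialize to the minimum whole number of bytes (little-endian, assumed)."""
    return record.to_bytes((RECORD_BITS + 7) // 8, "little")
```

For instance, insert_bits(0, 0b101, 4, 3) sets bits 4 and 6, and a second call on the same positions overwrites the old field rather than OR-ing into it, which is the property an operand-patching path relies on.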

Per-generation VLIW bundle

  • isa/bundle-jf-41b.md · Jellyfish 41-Byte Bundle · C
    The full slot map (EncodeBundleInternal).
  • isa/bundle-df.md · Dragonfish Bundle · C
    The JF/DF shared 41-B layout deltas.
  • isa/bundle-pf-51b.md · Pufferfish 51-Byte Bundle · C
    EncoderPf + the 5 shared load/store sub-encoders.
  • isa/bundle-vf-64b.md · Viperfish 64-Byte Bundle · C
    Quad VALU, triple vload, ScalarSubBundle.
  • isa/bundle-gl.md · Ghostlite Bundle · C
    The vector_misc slot; the glc encoder.
  • isa/bundle-gf.md · 6acc60406 Bundle · C
    The dedicated predicates slot; the gfc encoder.

Per-slot encoding

  • isa/slot-mxu.md · MXU Slot · C
    matmul/matpush issue, latch fields, the per-gen MXU1 twin.
  • isa/slot-vpu.md · VPU (Vector-ALU) Slot · C
    All generations.
  • isa/slot-spu-scalar.md · SPU / Scalar Slot · C
    All generations.
  • isa/slot-sequencer.md · Sequencer Slot · C
    branch/call/halt; the proto-bundle emitter path.
  • isa/slot-memory-load.md · Memory-Load Slot · C
    All generations.
  • isa/slot-memory-store.md · Memory-Store Slot · C
    All generations.
  • isa/slot-predicate.md · Predicate-Register File · C
    The 7-bit field (4+1+2), count per gen.
  • isa/slot-loop.md · Hardware Loop-Counter · C
    Encoding + count per gen.
  • isa/slot-immediate.md · Immediate Slot · C
    Per-gen encoding-id → imm-slot bit position.
  • isa/slot-eup-transcendental.md · EUP / Transcendental Slot · C
    VectorResult + VALU3 bit positions.
  • isa/slot-matprep-iar-latch.md · Matprep / IAR / Latch · C
    Per-gen matprep WORD tables, the IAR bit-layout.
  • isa/slot-vcreate-mask-mregister.md · vcreate_mask / M-Register · C
    End-inclusive range, per-gen field offsets, M0–M31.
  • isa/slot-cmem-load-pf.md · cmem_load Slot (Pufferfish) · C
    The v4 constant-memory load path.
  • isa/slot-sparsity-v5plus.md · Sparsity Slot (v5+) · O
    The structured-sparsity slot encoding. src: #1092 (open)

Encode / decode support

  • isa/v5plus-emitx-bit-positions.md · V5+ EmitX Absolute Bit Positions · C
    isa_emitter EmitX → BitCopy offsets (closes the InstBits gap).
  • isa/isa-emitter-registry.md · IsaEmitter Registry · C
    The (TpuVersion, SequencerType) pair-key cell census.
  • isa/decode-side-jf-pf.md · Decode-Side: JF / PF · C
    The disassembler inverse.
  • isa/decode-side-vf-gxc.md · Decode-Side: VF / GXC · C
    The −20-bit twin decode.
  • isa/nop-canonical.md · NOP / Unused-Slot Canonical Encoding · O
    Per-gen NOP templates. src: #1096 (open)
  • isa/tpumcimm-syimm32.md · TPUMCImm / SyImm32 Operand · C
    The MC immediate operand encoding + PatchOverlay.
  • isa/archregno-numbering.md · ArchRegno Runtime Numbering · C
    ToArchRegno / InitRegisterNumbering per gen.
  • isa/kisatable-data-sections.md · kIsaTable Data Sections · C
    The per-gen ISA-encoding split (no literal kIsaTable symbol).
  • isa/sequencer-ops-per-gen.md · Sequencer Ops Per Gen × Type · C
    Control-flow op rosters.
  • isa/resultfifo-archregister.md · ResultFifo & ArchRegister Enums · C
    25 result FIFOs + the 0x32-entry arch-register enum.
  • isa/bias-quantization-helpers.md · Bias-Add & Quant/Dequant Helpers · C
    The TPU bias/quantization helper functions.
  • isa/xlu-op-roster.md · XLU Op Roster · C
    Vsetperm/Vxpose/Vpermute/… opcode→factory table.
  • isa/pack-unpack-precision.md · Pack/Unpack Precision · C
    VpackBf16 / VunpackCF32 bf16↔f32 conversion + segmented-reduce RPU.

Part VII — Cost & Latency Model (41)

What every instruction costs. The largest data surface in the binary (51 source files). Consumed by the schedulers in VIII; depends on the ISA (VI) and silicon constants (IV).

Core model

  • cost/overview.md · Overview · I
    The Performance / CycleTable / LatencyTable family architecture. src: P-2-04
  • cost/resource-enum.md · Resource Enum (23-slot) · C
    Names + SubsetOptions partition + TC-frequency wiring.
  • cost/per-opcode-cycle-constants.md · Per-Opcode Cycle Constants · C
    Per-gen cycle-table dispatch.
  • cost/normalized-computation-cost.md · NormalizedComputationCost · C
    opcode→weight switch + GetCyclesIfFused.
  • cost/gethloresources-routing.md · GetHloResources Routing · C
    Per-op → ResourceVector sub-emitter routing.
  • cost/tpu-hlo-cost-analysis.md · TpuHloCostAnalysis · C
    The flop-override surface.
  • cost/bundle-aware-cost.md · Bundle-Aware Cost · C
    VLIW bundle-issue cost.
  • cost/memory-bandwidth-latency-model.md · Memory Bandwidth & Latency Model · C
    The full cross-tier matrix per gen.
  • cost/local-dma-bandwidth.md · LocalDmaBandwidth · C
    Per-gen matrix + the MemXfer-latency consumer.

MXU latency (per-gen reservation matrices)

  • cost/mxu-latency-overview.md · MXU Latency Overview · C
    MxuResource enum + the reservation-matrix concept.
  • cost/mxu-latency-jf-df.md · MXU Latency: JF / DF · C
    Oldest-gen reservation rows.
  • cost/mxu-latency-pf.md · MXU Latency: PF · C
    Pufferfish reservation rows.
  • cost/mxu-latency-vf.md · MXU Latency: VF · C
    Full Viperfish reservation matrix value-by-value.
  • cost/mxu-latency-gl.md · MXU Latency: GL (Ghostlite) · C
    The GLM reservation rows.
  • cost/mxu-latency-gf.md · MXU Latency: GF (6acc60406) · C
    res-remap 3/8 + fp8-fnuz.
  • cost/matmul-mode-modifiers.md · MatmulMode & Modifiers · C
    16-ordinal naming, Matmul/MatpushModifier array<19> values.
  • cost/mxu-opholdissues-stall.md · MxuOpHoldIssues Stall Recurrence · C
    The stall formula + the balancing gate.

Performance grids (per-gen Instruction × Resource)

  • cost/performance-overview.md · Performance Family Overview · I
    The per-gen Performance variant model.
  • cost/performance-jf-df.md · Performance: JF / DF · C
    Full latency array + I×R grid.
  • cost/performance-pf.md · Performance: PF · C
    20-resource grid + the BarnaCore variant1.
  • cost/performance-vf.md · Performance: VF · C
    The Viperfish grid.
  • cost/performance-gl-ghperf.md · Performance: GL (GhPerf 476×31) · C
    The Ghostlite occupancy grid.
  • cost/performance-gf-ghperf.md · Performance: GF (GhPerf 465×31) · C
    The 6acc60406 occupancy grid.

CycleTable

  • cost/cycletable-family.md · CycleTable Family · C
    LatencyTable::Create(TpuVersion) factory dispatch.
  • cost/jf-cycletable.md · JfCycleTable · C
    offsetLUT transcription + 7-column Resource naming.
  • cost/vf-cycletable.md · VfCycleTable · C
    The 32-entry CT→(instr, res) dump + throughput bridge.

EUP / transcendental latency

  • cost/eup-latency-overview.md · EUP Latency Overview · C
    The push→pop software-pipelining model.
  • cost/eup-per-gen-integers.md · EUP Per-Gen Latency Integers · C
    PF/VF/GL push→pop integers.
  • cost/eup-paynehanek.md · EUP Payne-Hanek Range Reduction · C
    The 2/π table.
  • cost/eup-correction-coeffs.md · EUP Correction Coefficients · C
    Newton / VfastTwoSum per-function polynomials.
  • cost/eup-lane-width-unpack.md · EUP Lane-Width / Unpack · C
    AluEpOpLowering unpack → compute → pack.

XLU cost

  • cost/xlu-conflict-penalty.mdXLU Conflict-Penalty Table · C
    The non-MXU hazard table.
  • cost/xlu-combine-sourcebus.mdXLU Combine / Source-Bus · C
    ComputeCombinablePairs + AssignSourceBus.
  • cost/xlu-reemit-cost.mdXLU Reemit Cost · C
    Closed-form CyclesAddedByXluOperation + PerXluOperations.
  • cost/xpose-reservation-latency.mdTranspose-Reservation Latency · C
    XposeXLUReservationLatency + VxposeMode.

Conv / window cost

  • cost/window-description-cost.mdWindowDescription Byte-Cost · C
    The conv/DMA byte+throughput primitive.
  • cost/convolution-cost-state.mdConvolutionCostState · C
    Field map + VfCycleTable bridge.
  • cost/reduce-window-pooling-cost.mdReduce-Window / Pooling Cost · C
    RecordReduceWindowCycles.

Misc cost

  • cost/learned-cost-model-client.mdLearned Cost-Model Client · C
    EmitterLearnedCostModelOptions + the wiring status.
  • cost/cost-model-logging.mdCost-Model Logging · C
    The impure AutoOr consumer + the float grammar.
  • cost/iars-per-tensorcore.mdConsolidated Per-Gen Counts · C
    IarsPerTensorCore / mxu / xlu counts in one table.

Part VIII — Instruction Scheduling & Bundle Packing (14)

The algorithms that consume the cost model (VII) and emit ordered, packed bundles (VI). Smaller than VII by design — in this corpus the scheduling algorithms are a fraction of the cost data.

  • sched/overview.mdOverview · I
    Where scheduling sits between lowering and encoding.
  • sched/latency-hiding-scheduler-core.mdLatencyHidingScheduler Core · C
    ScheduleComputation candidate loop + the async tracker.
  • sched/lhs-post-layout-pre-fusion.mdLHS: post_layout_pre_fusion Variant · C
    The early scheduling variant.
  • sched/lhs-post-layout.mdLHS: post_layout / final Variant · C
    The final scheduling variant.
  • sched/lhs-ilp-variant.mdLHS: ILP Variant · C
    The two flag-gated code paths.
  • sched/scheduler-resourcetype-model.mdResourceType Taxonomy · C
    Per-resource model + AsyncTracker → core registry.
  • sched/bundle-modulo-scheduling.mdBundle Modulo Scheduling · C
    The II-search + software pipelining.
  • sched/llo-bundle-packing.mdLLO → Bundle Packing · C
    The final-stage slot-assignment algorithm.
  • sched/mxu-assignment-binpacker.mdMXU Assignment Bin-Packer · C
    AssignMxusForSequenceGroup.
  • sched/latch-assignment-overrun.mdLatch Assignment & Overrun · C
    SetLatchIndices + the per-gen overrun handshake.
  • sched/mxu-sequence-struct.mdMxuSequence / SequenceInfo · C
    The full record + set_mxu commit.
  • sched/mrb-chain-allocator.mdMRB Chain Allocator · C
    The reservation-timeline algorithm + jitter model.
  • sched/mrb-fifo-msr-placement.mdMRB FIFO / MSR Placement · C
    AllocateMrbEntriesAsFifo + BounceBetweenMsrs.
  • sched/encoder-latch-serialization.mdPer-Gen Encoder Latch Serialization · C
    How latch fields serialize into the per-gen bundle.

Part IX — SparseCore & BarnaCore (45)

The embedding/sparse engine (SparseCore, v5+) and its retired predecessor (BarnaCore, v2–v4). Kept whole rather than sliced across the ISA/cost/scheduling axis. The collective-offload story lives in Part XIII.

SparseCore engines

  • sparsecore/overview.mdOverview · I
    SCS/TAC/TEC, the 2-sequencer (SCS+TEC) model. src: P-2-02
  • sparsecore/architecture.mdArchitecture · C
    Engine roles + the embedding datapath.
  • sparsecore/scs-engine.mdSCS (Scalar) Engine · C
    The scalar sequencer engine.
  • sparsecore/tac-engine.mdTAC Engine · C
    The codec-only role.
  • sparsecore/tec-engine.mdTEC (Vector) Engine · C
    The vector execution engine.
  • sparsecore/bundle-slot-base-map.mdPer-Engine Bundle Slot-Base Map · C
    SCS/TAC/TEC byte offsets.
  • sparsecore/region-to-sequencer-outliner.mdRegion → Sequencer Outliner · C
    Partitions an SC computation into per-engine bundles.
  • sparsecore/getsequencertype.mdgetSequencerType · C
    Engine selection (SCS/TAC/TEC).

SparseCore ISA

  • sparsecore/scalar-opcode-enum.mdScalar Opcode Enum · C
    ScsScalarMisc / ScalarAlu0 / ScalarAlu1.
  • sparsecore/vector-opcode-enum.mdVector Opcode Enum · C
    VF 148-op / GF 257-op VectorAlu.
  • sparsecore/oneslot-router.mdOneSlot Scalar Router · C
    ConsumeOneSlotInstruction jump table.
  • sparsecore/vectorload-slot.mdVectorLoad Slot · C
    5-op field layout + the SourceOne seed enum.
  • sparsecore/vectorstore-slot.mdVectorStore Slot · C
    The 33-entry type×mode scatter matrix.
  • sparsecore/vectorextended-vex.mdVectorExtended / VEX · C
    The 53-op scan/sort/dedup family.
  • sparsecore/vex-operand-port.mdVEX Operand-Port Binding · C
    FindAndEmitToUnusedPort (generation-specific).
  • sparsecore/vex-mask-destport-subopcode.mdVEX Mask / Dest-Port / Sub-Opcode · C
    The bit0x104 mask field + the sub-opcode map.
  • sparsecore/m-register-predicate.mdM-Register Predicate Word (M0–M31) · C
    Masked-scan inactive semantics.
  • sparsecore/cbreg.mdCBREG Circular-Buffer Register · C
    Bit layout, addressing, wrap.

SparseCore datapath (embeddings)

  • sparsecore/scan-datapath.mdScan Datapath · C
    Mask consumption + ScanOp lowering.
  • sparsecore/segmented-scan.mdSegmented Scan · C
    SegmentedScanOpLowering reduction_op switch.
  • sparsecore/segmented-add-scan.mdSegmented-Add-Scan · C
    The newer-gen segment-reduce family.
  • sparsecore/embedding-minibatching.mdEmbedding Minibatching Decomposition · C
    The HLO layer above scan lowering.
  • sparsecore/sample-combiner-emitter.mdSampleCombiner Emitter · C
    The inner-loop combiner emit.
  • sparsecore/emit-valency-loop.mdEmitValencyLoop · C
    The per-sample valency loop.
  • sparsecore/rank-and-permute-radixsort.mdRankAndPermute / RadixSort · C
    The sort/permute compute function.
  • sparsecore/dedup-multiplicity.mdDedup Multiplicity · C
    DuplicateCount→multiplicity + Uniquify inverse-permutation.

SparseCore pointers & DMA

  • sparsecore/fat-pointers-as789.mdFat Pointers (AS7/8/9) · C
    160/128/192-bit structured-pointer constructors.
  • sparsecore/addrspacecast-isel.mdaddrspacecast ISel · C
    The 16-cast from→to AS map.
  • sparsecore/tile-id-cast.mdTile-ID Cast · C
    On-tile 2-operand cast lowering.
  • sparsecore/stream-gather-scatter.mdStream Gather/Scatter · C
    The indirect-DMA descriptor format.
  • sparsecore/indirect-vreg-stream.mdIndirectVregStream · C
    The VREG-loop form.

SparseCore back-end

  • sparsecore/sc-backend-pipeline.mdSC Backend Pipeline · C
    RunPasses, all 12 passes, the MEGACORE barrier.
  • sparsecore/sc-emitx-dispatcher.mdSC EmitX Dispatcher · C
    seq3/seq4/seq5 → EmitX jump tables.
  • sparsecore/sc-core-selection.mdSC Core Selection · C
    SelectCores / GetAllowedCores policy.
  • sparsecore/sc-queue-assignment-reservation.mdSC Queue Assignment & Reservation · C
    The resource→limit btree_map.
  • sparsecore/getsparsecoreconfig.mdGetSparseCoreConfig · C
    The offload op-type enum source.

SparseCore cross-cutting

  • sparsecore/sc-mxu-handshake.mdSC ↔ MXU Handshake · C
    The integration handshake.
  • sparsecore/sparsecore-vs-neuron-matmultsparse.mdSparseCore vs Neuron MatmultSparse · I
    Cross-vendor comparison.

BarnaCore (legacy v2–v4)

  • barnacore/overview.mdOverview · I
    The legacy embedding accelerator.
  • barnacore/retirement.mdRetirement Evidence · C
    The BarnaCore → SparseCore transition.
  • barnacore/bcs-scalar-isa.mdBCS Scalar0/Scalar1 ISA · C
    The 122-op control+memory ISA.
  • barnacore/bcs-32byte-bundle.mdBCS 32-Byte Bundle · C
    InstBits_BarnaCorePxcHwMode + BcsMetadataAccessor.
  • barnacore/merged-alu.mdMerged-ALU Bit Layout · C
    VectorResultDestination / BaseAddressEncoding.
  • barnacore/jf-df-address-handler-bundle.mdJF/DF 16-Byte Address-Handler Bundle · C
    EncoderJf::EncodeBarnaCoreAddressHandler.
  • barnacore/per-gen-perf-grids.mdPer-Gen BarnaCore Perf Grids · C
    PufferfishBarnaCorePerformance variant1.

Part X — On-Chip Memory & DMA (20)

The memory tiers' allocators and the DMA wire formats. The tier model is primed in IV; here are the allocator algorithms and descriptor byte layouts.

Memory tiers

  • memory/overview.mdOverview · I
    The five on-chip tiers + host memory. src: P-2-01
  • memory/hbm-allocator.mdHBM BestFit Allocator · C
    Coalescing rule + split/fragmentation policy.
  • memory/hbm-dma-alignment.mdHBM DMA Alignment Contract · C
    The minimum-alignment rule.
  • memory/vmem-allocator.mdVMEM Allocator · C
    Per-codename Config, alignment, MSA integration.
  • memory/smem-scalar-memory.mdSMEM Scalar Memory · C
    Allocator, addressing, placement.
  • memory/smem-register-window.mdSMEM Register-Window · C
    The mechanism + reconciliation with the SPU slot.
  • memory/cmem-pool.mdCMEM Constant-Memory Pool · C
    Layout, allocator, placement (Pufferfish+).
  • memory/sflag-protocol.mdSFLAG Sync-Flag Tier · C
    Allocator, Config, atomics, ordering.
  • memory/tpu-buffer-layout.mdTpuBuffer Layout · C
    On-device buffer structure.
  • memory/buffer-donation-aliasing.mdBuffer Donation & Aliasing · I
    DonateWithControlDependency. src: P-2-09
  • memory/on-device-compaction.mdOn-Device Compaction · I
    The defrag path.
  • memory/embedded-tcmalloc.mdEmbedded tcmalloc · C
    Host-CPU allocator integration + sizing.

DMA

  • dma/intra-chip-descriptor.mdIntra-Chip DMA Descriptor · C
    Format, tiling, tier-pair encoding.
  • dma/tile-index-expansion.mdTile-Index Expansion · C
    ExpandTiledMemRefs / expandTiledIndices algebra.
  • dma/rolled-strided-general.mdRolled / Strided / General Emitters · C
    issueRolled/Strided/General transfer bodies.
  • dma/dma-parameters-selector.mdDmaParameters Selector · C
    Simple vs SingleStrided + dim-coalescing.
  • dma/host-device-dma.mdHost↔Device DMA · C
    DeriveHostDmaTransfers + tags 6/7.
  • dma/uhi-host-interface.mdUHI Host-Interface DMA · C
    The wire format + QueueId semantics.
  • dma/oci-command-dma-id.mdOCI Command DMA-ID · C
    The 6 CmdDmaIdFromEntry helpers + the 3-header bands.
  • dma/continuation-queue.mdContinuation Queue · C
    Memory model + runtime SFLAG protocol + the halt model.

Part XI — Runtime & Execution (11)

How a compiled program runs on a stream. Consumes the ISA (VI) and memory (X).

  • runtime/overview.mdOverview · I
    The execute path from PJRT down to the stream. src: P-2-09, P-022
  • runtime/execute-async-on-stream.mdExecuteAsyncOnStream · C
    The core execution entry. src: P-022, P-2-09
  • runtime/load-program-enqueue.mdLoadProgramAndEnqueueToStream · C
    Program load + enqueue. src: P-022
  • runtime/stream-semantics.mdStream Semantics & Dependencies · I
    Ordering, dependencies. src: P-2-09
  • runtime/infeed-outfeed.mdInfeed / Outfeed Queues · I
    The host-feed queues. src: P-2-09
  • runtime/host-callbacks.mdHost Callbacks · I
    Callback dispatch during execution. src: P-2-09
  • runtime/completion-loop.mdCompletion Loop & AsyncTrackingEvent · I
    Completion tracking. src: P-2-09
  • runtime/allocator-integration.mdPJRT Client Allocator Integration · C
    Device-memory allocation flow.
  • runtime/error-templates.mdError/Status String Templates · C
    The printf-format + StrFormat catalog.
  • runtime/hint-strings.mdUser-Facing Hint Strings · C
    Actionable diagnostics (flag-suggestion / doc-link / capacity).
  • runtime/internal-pass-names.mdInternal Pass-Name Catalog · C
    HLO + MLIR + pipeline phase names.

Part XII — Interconnect & Routing (30)

The physical fabric and how packets route across it. The geometric substrate (twisted torus) that on-pod collectives (XIII) build on.

ICI fabric

  • ici/overview.mdOverview · I
    The inter-chip interconnect model.
  • ici/link-bringup.mdLink Bring-Up Sequence · C
    The link initialization sequence.
  • ici/topology-discovery.mdTopology Discovery · C
    Master::DiscoverTopology end-to-end.
  • ici/dma-descriptor.mdCross-Chip DMA Descriptor · C
    The ICI DMA wire format.
  • ici/all-reduce-primitive.mdICI All-Reduce Primitive · C
    The step-generation primitive.
  • ici/failure-recovery.mdFailure Modes & Recovery · C
    The recovery flow.
  • ici/vc-balance-allocation.mdVC-Balance Allocation · C
    Deadlock-free virtual-channel allocation.

Routing

  • routing/overview.mdOverview · I
    The route-generation → route-cache → emission pipeline.
  • routing/randomized-toroidal-wildfirst.mdRandomizedToroidalWildFirstPaths · C
    The path generator.
  • routing/route-table-generation.mdRoute-Table Generation · C
    physmap + GetPhysicalToLogicalMapping3D.
  • routing/get-static-path.mdGetStaticPath & Multipod · C
    Inter-pod route emission.
  • routing/toroidal-route-cache.mdToroidalRouteCache · C
    The 85-file binarypb decode + per-codename split.
  • routing/route-cache-decompress.mdRoute-Cache Decompress · C
    CompressedToroidalRouteCache proto→map.
  • routing/route-cache-dedup.mdRoute-Cache Dedup · C
    RouteCacheDeduplicator key + type dispatch.
  • routing/route-cache-codec.mdRoute-Cache Codec · C
    BitEncoder / DecodePathFromBits / TopologyRotationHelper.
  • routing/create-routing-schedule.mdCreateRoutingSchedule Solver · C
    The priority-queue hop-assignment + PointerType enum.
  • routing/net-router-pipeline.mdnet_router Pipeline · C
    The software-pipeline callbacks + Transfer construction.
  • routing/unicast-route-emission.mdUnicast Route Emission · C
    The layer above DmaDestinationRoutingTableEntryMapper.
  • routing/get-distances.mdGetDistances · C
    The nK twisted-torus distance metric.

Twisted torus geometry

  • twist/overview.mdOverview · I
    The twisted-torus topology and why it exists.
  • twist/buildstrategy.mdTwistedTorusND::BuildStrategy · C
    Phase order + RingLocation construction.
  • twist/twist-predicate-orientation.mdTwist Predicate & Orientation · C
    Orientation enum 4/5/6 negative-axis folding.
  • twist/replica-group-2phase.md2-Phase Replica-Group Construction · C
    The reduce-scatter / all-gather group construction.
  • twist/shape-folds.mdShape Folds · C
    K_K_2K / K_2K_2K / K_2K_NK twist-shape cases.
  • twist/get-replica-pair-3d.mdGetReplicaPair3DOnTwistedTorus · C
    The coordinate fold.
  • twist/megacore-even-odd.mdMegacore Even/Odd Split · C
    The split rationale.
  • twist/get-tiebreak.mdGetTiebreak · C
    The literal-nK routing tiebreak.
  • twist/sc-side-twist.mdSC-Side Twist · C
    GetPhase0/1Cores + EstimatePhysicalLinksUsed.

ICR node-fabric

  • routing/icr-node-fabric-dma.mdICR Node-Fabric DMA Bands · C
    trace_point_ids 48/50/51/91 timeline source.
  • routing/nf-descriptor.mdnf_descriptor (27-field) · C
    The Node-Fabric DMA descriptor record.

Part XIII — On-Pod Collectives & Barriers (30)

How a collective is decomposed, offloaded, and synchronized over the fabric (XII). The SparseCore-offload path bridges to IX.

Collective algorithms

  • collectives/overview.mdOverview · I
    The strategy picker and the algorithm family.
  • collectives/strategy-nd-picker.mdSelectNDStrategy · C
    The collective-algorithm picker + degraded-axis handling.
  • collectives/binomial-recursive-doubling.mdBinomial / Recursive-Doubling · C
    The per-rank partner schedule.
  • collectives/allreduce-hierarchical-pincer.mdAllReduce Hierarchical / Pincer · C
    The multi-phase 0x101 path + pincer fusion.
  • collectives/allgather-nd-ring.mdAllGather ND-Ring · C
    GetShardIndex/GetOffset + the 2D/3D selector.
  • collectives/alltoall-tables.mdAllToAll Tables · C
    GenerateAllToAllTables → ConstantMapper.
  • collectives/reduce-scatter.mdReduceScatter · C
    The reduce-scatter decomposition.
  • collectives/constant-mapper.mdConstantMapper · C
    Compile-time collective constant-pool tags + SMEM reads.
  • collectives/degraded-axis.mdDegraded-Axis Ingest · C
    TpuDegradedAxesProto fault-tolerant path.

SparseCore-offload collectives

  • collectives/sc-offload-config-builder.mdSC-Offload Config Builder · C
    ConstructConfigForCollectiveUniDirNDGroups.
  • collectives/hierarchical-kind.mdHierarchicalKind · C
    AllGather/AllReduce/ReduceScatter OffloadConfig structs.
  • collectives/tensor-split-ndplane.mdTensor-Split / ND-Plane · C
    tensor_split_factor / NumScOffloadDevices + NDPlaneInfo.
  • collectives/physical-core-placement.mdPhysical-Core Placement · C
    physical_core_indices per-color mapping.
  • collectives/sc-core-selection-offload.mdSC Core-Selection (Offload) · C
    The assignment cost + resource model.
  • collectives/get-remote-memref.mdget_remote_memref · C
    Cross-chip address composition.
  • collectives/start-remote-dma.mdStartRemoteDma · C
    The all-to-all producer + SubsliceToFullSliceGlobalCoreId.

SFLAG & barriers

  • barrier/overview.mdOverview · I
    The sync-flag-based barrier model.
  • barrier/special-purpose-sync-flags.mdSpecialPurposeSyncFlags · C
    The FromProto runtime sink + overlay semantics.
  • barrier/per-codename-compiler-reserved.mdPer-Codename compiler_reserved SFLAG · C
    The literal {base, count} integers.
  • barrier/barrier-coloring.mdBarrierColoring · C
    The greedy graph-coloring engine.
  • barrier/barrier-to-sflag-binding.mdBarrier → SFLAG Number Binding · C
    The compiler-barrier → hardware-SFLAG number map.
  • barrier/global-barrier-window.mdGlobal-Barrier SFLAG Window · C
    GetGlobalBarrierSyncFlagNumber consumers.
  • barrier/replica-barrier.mdReplica (type-2) Barrier · C
    The REPLICA barrier lowering.
  • barrier/tensorcore-barrier.mdTensorCore Barrier · C
    InitializeOnScs lookup-callback.
  • barrier/tree-barrier-vsync.mdTree-Barrier Vsync · C
    net_util actuation + InfoTable indexing.
  • barrier/infer-barrier-config.mdInferBarrierConfig · C
    The per-gen SFLAG map source.
  • barrier/remote-sflag-encoders.mdPer-Gen Remote-SFLAG Encoders · C
    GetRemoteSyncFlagEncoderRegistry + chip-id map.

Higher-level

  • collectives/megacore-fusion.mdMegacore Fusion · I
    The megacore collective fusion.
  • collectives/fp8-quantized-collective.mdFP8 Quantized Collective · C
    The quantized-collective dispatch path. src: #1339
  • collectives/spmd-link-count-cost.mdSPMD Link-Count Cost · C
    The link-count divisor + full collective cost-formula set.

Part XIV — Megascale (Multi-Host / DCN) (21)

The data-center-network layer above on-pod ICI: cross-host rendezvous, fleet metadata, and error aggregation.

  • megascale/overview.mdOverview · I
    DCN vs ICI; what Megascale orchestrates.
  • megascale/bootstrap/overview.mdBootstrap: Overview · C
    The rendezvous overview.
  • megascale/bootstrap/coordinator-election.mdBootstrap: Coordinator Election · C
    The coordinator-election logic.
  • megascale/bootstrap/worker-registration.mdBootstrap: Worker Registration · C
    Worker registration with the coordinator.
  • megascale/bootstrap/topology-exchange.mdBootstrap: Topology Exchange · C
    The cross-host topology exchange.
  • megascale/bootstrap/ici-handoff.mdBootstrap: ICI Handoff · C
    Handoff to the ICI fabric.
  • megascale/bootstrap/convergence.mdBootstrap: Convergence · C
    Convergence detection.
  • megascale/bootstrap/failure-handling.mdBootstrap: Failure Handling · C
    Bootstrap failure handling.
  • megascale/bootstrap/tpunetd-relationship.mdBootstrap: tpunetd Relationship · C
    Relationship to the tpunetd daemon.
  • megascale/fleet-metadata/overview.mdFleet Metadata: Overview · C
    The fleet-metadata schema overview.
  • megascale/fleet-metadata/topology-model.mdFleet: Topology Model · C
    The fleet topology model.
  • megascale/fleet-metadata/host-identity.mdFleet: Host Identity · C
    Host identity fields.
  • megascale/fleet-metadata/global-addressing.mdFleet: Global Addressing · C
    Global addressing scheme.
  • megascale/fleet-metadata/ici-vs-dcn.mdFleet: ICI vs DCN · C
    The ICI/DCN distinction.
  • megascale/fleet-metadata/slice-shape.mdFleet: Slice Shape · C
    Slice-shape encoding.
  • megascale/fleet-metadata/bootstrap-exchange.mdFleet: Bootstrap Exchange · C
    The bootstrap data exchange.
  • megascale/fleet-metadata/barrier-error-usage.mdFleet: Barrier & Error Usage · C
    How fleet metadata feeds barriers/errors.
  • megascale/fleet-metadata/field-decode.mdFleet: Field Decode · C
    Field-by-field decode.
  • megascale/cross-host-barrier.mdCross-Host Barrier · C
    The Megascale barrier primitive.
  • megascale/error-aggregator.mdErrorAggregator · C
    Wire format, scope, retention, dedup.
  • megascale/tpunetd-protocol.mdtpunetd Protocol · C
    The daemon protocol.

Part XV — Profiling & Telemetry (22)

How libtpu emits XPlane traces and hardware telemetry. Per-generation trace payloads have distinct on-wire formats.

  • profiling/overview.mdOverview · I
    XPlane, the trace pipeline, the codec families. src: P-2-32
  • profiling/tpu-profiler-abi.mdTpuProfiler ABI · C
    The profiler C surface.
  • profiling/pjrt-profiler-extension.mdPJRT_Profiler Extension · C
    PLUGIN_Profiler_Api.
  • profiling/xplane-xstat-traceme.mdXPlane / XStat / TraceMe Emission · C
    The emit path. src: P-2-32
  • profiling/tpu-telemetry-proto.mdtpu_telemetry.proto · C
    Field-by-field decode.
  • profiling/xevent-metadata-ids.mdXEvent Metadata IDs · C
    The profiler event catalog.
  • profiling/xstat-metadata-ids.mdXStat Metadata IDs · C
    The stat/attribute catalog.
  • profiling/trace-entries-coder.mdTraceEntriesCoder · C
    The fixed-width device-trace codec.
  • profiling/riegeli-trace-container.mdriegeli Trace Container · C
    Framing + timebase clock-domain conversion.
  • profiling/per-devicetype-struct.mdPer-DeviceType Profiler Struct · C
    The 0x448-byte master device table.
  • profiling/kdevicetypeinfo-producer-readers.mdkDeviceTypeInfo Producer / Readers · C
    The roofline readers.
  • profiling/tracepoints-master-registry.mdTracePoints Master Registry · C
    trace_point_id → {family, subscriber}.
  • profiling/trace-entry-to-xevent.mdTraceEntry → XEvent/XStat · C
    The TpuXLineBuilder last hop.
  • profiling/task-proto.mdTask Proto · C
    Device clock-rates + chip/host identity + GtcSpan offset.
  • profiling/payload-jxc-legacy.mdPayload: jxc Legacy · C
    The 16-bit trace_point_id namespace.
  • profiling/payload-vfc-vlc-gfc.mdPayload: vfc / vlc / gfc · C
    Per-gen payload field maps.
  • profiling/payload-sc-band.mdPayload: SparseCore Band · C
    SCS/TEC/TAC profiler payloads.
  • profiling/payload-uhi-oci-ici-dma.mdPayload: UHI/OCI/ICI/DMA · C
    The high-value trace-point bit-decodes.
  • profiling/icr-dma-timeline-band.mdICR DMA-Timeline Band · C
    The 48/50/51/91 rendering.
  • profiling/jxc-dma-hbmmux-brnperf.mdjxc DMA / HbmMux / brn_perf · C
    The jellyfish DMA bands.
  • profiling/v7x-perf-counters.mdv7x Perf-Counters · C
    The hardware-counter name resolver + firmware/DVFS telemetry.
  • profiling/dma-endpoint-rendering.mdDMA Endpoint Rendering · C
    SrcMem/DstMem/Opcode enums + XEvent rendering.

Part XVI — Configuration & Compile Knobs (16)

Every flag, env var, and compile knob, and how they resolve. The TpuCompilationEnvironment is the 1,121-field master config object.

  • config/overview.mdOverview · I
    The flag/knob/env taxonomy.
  • config/xla-flag-atlas.mdxla_* Flag Atlas · C
    The full option-name catalog.
  • config/flag-families.mdFlag Families · C
    jf/pf/vf/gf/sc/msa/lhs prefixes.
  • config/env-vars.mdEnvironment Variables · I
    The env-var catalog. src: W005
  • config/tpu-compilation-environment.mdTpuCompilationEnvironment (1121 fields) · C
    Overview + DefaultDebugOptions.
  • config/tce-field-dictionary-a.mdTCE Field Dictionary (A) · C
    Fields part 1.
  • config/tce-field-dictionary-b.mdTCE Field Dictionary (B) · C
    Fields part 2.
  • config/tce-field-offsets-defaults.mdTCE Field-Offsets & Flag Defaults · C
    field#→offset + ABSL-flag defaults.
  • config/debugoptions-proto.mdxla.DebugOptions (290 fields) · C
    The complete proto table.
  • config/default-debugoptions.mdDefault DebugOptions · C
    The effective defaults.
  • config/autoproto-autoor-resolution.mdAutoProto / AutoOr Resolution · C
    The per-knob AUTO resolver bodies.
  • config/autoor-parse-grammar.mdAutoOr Parse Grammar · C
    ParseAutoOrFromString (the XLA_FLAGS ingest).
  • config/autoor-unparse.mdAutoOr Unparse · C
    AbslUnparseFlag reverse-text.
  • config/autoproto-message-arms.mdAutoProto Message-Arms · C
    The 12 message-arm sub-message defaults + SET path.
  • config/registry-mediated-flags.mdRegistry-Mediated Flags · C
    enable_lem_scheduler / explicit_evict_memory_limit_kib.
  • config/flag-prefix-dispatch.mdTpuVersion-Aware Flag-Prefix Dispatch · O
    The per-gen flag-prefix routing. src: #1171 (open)

Part XVII — Appendices (20)

Reference tables, the source-traceability index, and the open-frontier register.

  • appendix/llo-opcode-table.mdLloOpcode Table (462) · C
    The full enum with categories. src: P-2-25
  • appendix/llvmtpu-intrinsic-table.mdLlvmTpu Intrinsic Table (1356) · C
    The full tpu_* intrinsic list.
  • appendix/memory-space-table.mdMemorySpace Table (17) · C
    The full enumeration. src: P-2-01
  • appendix/dispatch-table-taxonomy-full.mdDispatch-Table Taxonomy (full) · C
    All 19 classes + the 40,313-table TSV. src: P-2-06
  • appendix/filewrapper-toc-catalog.mdfilewrapper_toc Catalog (61) · C
    Every embedded runtime resource. src: P-2-07
  • appendix/protodesc-cold-catalog.mdprotodesc_cold Catalog (760) · C
    Every embedded FileDescriptorProto. src: P-2-07
  • appendix/rtti-namespace-census.mdRTTI Namespace Census · C
    The full 160,351-entry breakdown.
  • appendix/reconstructed-proto-index.mdReconstructed-Proto Index · C
    Every proto recovered from the descriptor pool.
  • appendix/error-status-codes.mdError / Status Codes · C
    The status-code catalog.
  • appendix/flag-catalog-full.mdFlag Catalog (full TSV) · C
    The machine-readable flag list.
  • appendix/symbol-namespace-index.mdSymbol Namespace Index · I
    The namespace population map. src: W014, W003
  • appendix/per-gen-comparison-matrix.mdPer-Gen Master Comparison Matrix · C
    bundle/lanes/MXU/XLU/IAR/SFLAG/EUP/DID, six gens, one page.
  • appendix/evidence-anchor-index.mdEvidence-Anchor Index · I
    page → source-findings file → binary VA, the full traceability map. src: (this corpus)
  • appendix/source-corpus-map.mdSource-Corpus Map · I
    The P-2/P-3/W raw-findings file → part assignment. src: W001, W034
  • appendix/open-frontier-register.mdOpen-Frontier Register · I
    What is NOT yet recovered (cmem-load/sparsity edges; sparsity slot #1092; NOP canonical #1096; flag-prefix #1171).
  • appendix/cross-reference-graph.mdCross-Reference Dependency Graph · I
    The inter-page dependency web. src: G001–G003
  • appendix/binary-layout.mdBinary Layout Reference · C
    Segments, anchor symbols, scale vs ptxas/nvlink/cicc. src: W023, W027
  • appendix/methodology-deep.mdMethodology (Deep) · I
    Extraction pipeline, FLIRT, sidecar inventory. src: W001, W031
  • appendix/glossary-extended.mdExtended Glossary · I
    Every acronym + internal class name. src: W014
  • appendix/changelog.mdChangelog · I
    Book revision history vs binary version.

Appendix highlights

Two appendix pages are the connective tissue that makes this book auditable and keeps the per-generation story coherent:

  • Evidence-Anchor Index (appendix/evidence-anchor-index.md) — the full page → source-findings file (P-2-*/P-3-*/W*) → binary VA mapping. Every claim in the book is traceable back through this index to a specific function address and a specific raw-findings file. This is what separates a reimplementation reference from a blog post.
  • Per-Gen Master Comparison Matrix (appendix/per-gen-comparison-matrix.md) — one page, six generations (Jellyfish/Dragonfish/Pufferfish/Viperfish/Ghostlite/6acc60406) × every per-generation constant: bundle size, lane/sublane count, MXU dimensions, XLU/IAR counts, per-tier memory (VMEM/SMEM/SFLAG/CMEM/HBM), accelerator-core type, and the cost-model class trio (LatencyTable/CycleTable/Performance) that carries the per-gen EUP push→pop latency. Ties together the per-gen material otherwise distributed across Parts IV, VI, and VII.
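The page → source-findings file → binary VA chain described above can be sketched as a small lookup structure. This is a minimal illustration in Python, not the index's actual format: the `Anchor` record and the `build_indexes` helper are hypothetical names, the (page, findings) pairs are copied from `src:` tags on this contents page, and the VA field is left empty because no addresses are listed here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Anchor:
    """One traceability link (hypothetical record layout)."""
    page: str                 # book page path, as listed on this contents page
    findings: str             # raw-findings file ID taken from the page's "src:" tag
    va: Optional[int] = None  # binary virtual address; not stated on this page


# (page, findings) pairs copied from "src:" tags on this contents page.
ANCHORS = [
    Anchor("memory/overview.md", "P-2-01"),
    Anchor("appendix/memory-space-table.md", "P-2-01"),
    Anchor("runtime/execute-async-on-stream.md", "P-022"),
    Anchor("runtime/execute-async-on-stream.md", "P-2-09"),
    Anchor("runtime/load-program-enqueue.md", "P-022"),
]


def build_indexes(anchors):
    """Index the same links by page and by findings file."""
    by_page = defaultdict(list)
    by_findings = defaultdict(list)
    for a in anchors:
        by_page[a.page].append(a)
        by_findings[a.findings].append(a)
    return dict(by_page), dict(by_findings)


by_page, by_findings = build_indexes(ANCHORS)

# Which pages rely on findings file P-2-01?
pages_for_p201 = sorted(a.page for a in by_findings["P-2-01"])
print(pages_for_p201)
```

Indexing in both directions lets an audit start from either end: from a page to its evidence, or from a findings file to every page that relies on it.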

Open frontier

Pages graded O are not yet backed by a completed raw-findings file: the sparsity slot encoding (task #1092), per-gen NOP canonical encoding (#1096), TpuVersion-aware flag-prefix dispatch (#1171), and the cmem-load/sparsity edges. The Open-Frontier Register tracks these. Every other page is graded C (confirmed) or I (inferred) and is backed by a raw-findings file.
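The grade letters can be checked mechanically against the page listing. The sketch below, in Python, parses listing entries as they appear on this page (path fused with title, then ` · ` and a grade letter, with an optional `src:` tag on the description line) and picks out the O-graded pages. The `parse_entry` helper and the two regular expressions are hypothetical, written for illustration; the sample entries are copied from the listing.

```python
import re

# Listing header as rendered on this page: "<path>.md<Title> · <grade>".
ENTRY = re.compile(r"^(?P<path>\S+?\.md)(?P<title>.+?) · (?P<grade>[CIO])$")
# Optional trailing source tag on the description line.
SRC = re.compile(r"\s*src: (?P<src>.+)$")


def parse_entry(header, description):
    """Split one two-line listing entry into path, title, grade, summary, src."""
    m = ENTRY.match(header.strip(" •\t"))
    if m is None:
        raise ValueError(f"unrecognized listing header: {header!r}")
    entry = m.groupdict()
    s = SRC.search(description)
    entry["summary"] = (description[: s.start()] if s else description).strip()
    entry["src"] = s.group("src") if s else None
    return entry


# Entries copied from the listing on this page.
LISTING = [
    ("  • cost/performance-overview.mdPerformance Family Overview · I",
     "    The per-gen Performance variant model."),
    ("  • runtime/overview.mdOverview · I",
     "    The execute path from PJRT down to the stream. src: P-2-09, P-022"),
    ("  • config/flag-prefix-dispatch.mdTpuVersion-Aware Flag-Prefix Dispatch · O",
     "    The per-gen flag-prefix routing. src: #1171 (open)"),
]

entries = [parse_entry(h, d) for h, d in LISTING]

# Pages still on the open frontier.
open_pages = [e["path"] for e in entries if e["grade"] == "O"]
print(open_pages)
```

Run over the full listing, the same filter would give the set of pages the Open-Frontier Register is expected to cover.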