Overview
Every offset, value, and address on this page was read byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
Part VI documents the TPU TensorCore instruction set and how the compiler encodes it. The TensorCore is a statically-scheduled VLIW machine: its on-chip sequencer fetches one fixed-width bundle per cycle and issues every slot in that bundle to a distinct execution unit — the matrix unit, the vector ALUs, the scalar pipe, the memory ports, the sequencer's own control logic — all in the same cycle, with no runtime dependency tracking. Above the hardware bundle sits LLO (the late, TPU-specific compiler IR): a stream of typed instructions, each one opcode + operands + memory-space + predicate, produced by the HLO → MHLO → TLP → tpu → LLO descent and consumed by the bundle packer and the per-generation encoders. The LloOpcodeProto enum carries 462 opcodes and MemorySpaceProto carries 17 memory spaces; together they are the vocabulary the rest of this Part decodes into bits.
The encoding problem has two faces, and this Part keeps them separate. There is the LLVM-MC layer — TPUMCCodeEmitter::getBinaryCodeForInstr (0x13c74da0) lowering one MCInst into a 239-bit APInt record via a per-opcode base-bits table and insertBits holes — which actually carries bits only for the BarnaCore-Pxc (Pufferfish) HwMode. And there is the proto-bundle layer — the per-generation Encoder<gen>::EncodeBundleInternal and the IsaEmitter EmitX / <Slot>Encoder::Encode path — which encodes every TensorCore and V5+ opcode the MC layer returns all-zero for. A reimplementer needs to know which path owns which opcodes; the MC-Emitter page draws that line precisely (4956 of 5667 MC opcodes route to the zero-base default).
This page is navigational. It orients the reader in the VLIW model, names the per-generation bundle widths (41 B Jellyfish → 51 B Pufferfish → 64 B Viperfish / Ghostlite / 6acc60406), sketches the slot taxonomy, and then routes to the page that owns each piece. The deep mechanics live elsewhere: the bundle model for the slot/issue semantics, the per-slot pages for bit layouts, the per-gen bundle pages for full slot maps, and the record format / MC-emitter / InstBits DB trio for the LLVM-MC wire path.
For reimplementation, the contract is:
- The two-level encoding split: LLO IR (LloOpcodeProto, 462 opcodes) above and the fixed-width per-gen VLIW bundle below, plus the two distinct encoders (MC insertBits record vs proto-bundle EmitX).
- The VLIW slot model: which engine each slot drives and the simultaneous-issue / no-forwarding rule the compiler must respect.
- The six-generation TpuVersion axis (0..5) and that the bundle width and slot set are selected per version, not byte-extended.
| Item | Value and owning page |
|---|---|
| LLO IR enum | LloOpcodeProto, 462 opcodes (see LloOpcode Enum) |
| Memory spaces | MemorySpaceProto, 17 values (see MemorySpace Enum) |
| MC record | llvm::APInt, 239 bits, 4 words (see Record Format) |
| MC emitter | TPUMCCodeEmitter::getBinaryCodeForInstr @ 0x13c74da0 (see MC-Emitter) |
| Bundle widths | 41 B (Jellyfish) / 51 B (Pufferfish) / 64 B (Viperfish, Ghostlite, 6acc60406) (see Bundle Model) |
| TpuVersion | 6 values (TpuVersionToString @ 0x20b3a480 traps on ≥ 6) |
| Proto-bundle encode | Encoder<gen>::EncodeBundleInternal (Jellyfish @ 0x1e86c7c0) |
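The "239 bits, 4 words" row can be checked with a small model. The sketch below assumes 64-bit storage words (which is how llvm::APInt stores its value) and models insertBits as "overwrite a bit range of the record". The field position and value in the example are hypothetical and were not read from libtpu.so.

```python
RECORD_BITS = 239   # MC record width stated on this page
WORD_BITS = 64      # APInt storage word size


def words_needed(bits: int, word_bits: int = WORD_BITS) -> int:
    """Ceiling division: how many storage words hold `bits` bits."""
    return -(-bits // word_bits)


def insert_bits(record: int, value: int, lo: int, width: int) -> int:
    """Overwrite record[lo : lo + width] with value (simplified insertBits model)."""
    if lo < 0 or width <= 0 or lo + width > RECORD_BITS:
        raise ValueError("field falls outside the 239-bit record")
    if value < 0 or value >> width:
        raise ValueError("value does not fit in the field")
    mask = ((1 << width) - 1) << lo
    return (record & ~mask) | (value << lo)


def to_words(record: int) -> list[int]:
    """Split the record into little-endian 64-bit words."""
    word_mask = (1 << WORD_BITS) - 1
    return [(record >> (WORD_BITS * i)) & word_mask
            for i in range(words_needed(RECORD_BITS))]


# Hypothetical 4-bit field that straddles the word 0 / word 1 boundary.
rec = insert_bits(0, 0b0101, lo=62, width=4)
words = to_words(rec)
```

A field that crosses a 64-bit boundary, as in the example, is the case a hand-rolled word-array implementation is most likely to get wrong; modelling the record as a single integer sidesteps it.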
The Two Encoding Levels
LLO is the compiler IR; the bundle is the wire form. The split is real in the binary and a reimplementer must not conflate them.
- Level 1: LLO instruction stream. A program is a sequence of LloInstructions, each (opcode, operand-list, memory-space, predicate). The opcode is one of the 462 LloOpcodeProto values; the memory space is one of the 17 MemorySpaceProto values. This is what the optimizer and scheduler manipulate. See LloOpcode Enum and LloOpcode ↔ Proto.
- Level 2: VLIW bundle word. The bundle packer groups mutually independent LLO ops into a Bundle whose typed sub-fields map onto the hardware slots, and the per-generation encoder serializes that into a fixed-width byte buffer. See Bundle Model.
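As an illustration of the Level 1 shape, here is a minimal Python sketch of the (opcode, operand-list, memory-space, predicate) tuple. The class layout, the use of plain integers for operands, and the treatment of both enums as dense 0..N-1 ranges are assumptions made for the example; the real numbering is on the LloOpcode Enum and MemorySpace Enum pages.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NUM_LLO_OPCODES = 462    # size of LloOpcodeProto, per this page
NUM_MEMORY_SPACES = 17   # size of MemorySpaceProto, per this page


@dataclass(frozen=True)
class LloInstruction:
    """Hypothetical stand-in for one Level 1 instruction."""
    opcode: int
    operands: Tuple[int, ...]
    memory_space: int
    predicate: Optional[int] = None  # assumption: None means unpredicated

    def __post_init__(self) -> None:
        # Assumption: enum values are dense, starting at 0.
        if not 0 <= self.opcode < NUM_LLO_OPCODES:
            raise ValueError(f"opcode {self.opcode} outside 0..{NUM_LLO_OPCODES - 1}")
        if not 0 <= self.memory_space < NUM_MEMORY_SPACES:
            raise ValueError(f"memory space {self.memory_space} outside 0..{NUM_MEMORY_SPACES - 1}")


# A Level 1 program is just an ordered sequence of such instructions.
program = [
    LloInstruction(opcode=7, operands=(1, 2), memory_space=3),
    LloInstruction(opcode=42, operands=(2,), memory_space=0, predicate=1),
]
```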
Within Level 2 there are two encoders, and which one carries an opcode's bits depends on the generation:
| Encoder | Owns | Page |
|---|---|---|
| LLVM-MC insertBits record | BarnaCore-Pxc (Pufferfish) lanes + native ops only | MC-Emitter, Record Format |
| proto-bundle EmitX / <Slot>Encoder::Encode | every TensorCore + Viperfish / Ghostlite / 6acc60406 opcode | IsaEmitter Registry, V5+ EmitX Bit Positions |
NOTE: the MC getBinaryCodeForInstr returns an all-zero 239-bit record for the overwhelming majority of opcodes (4956 of 5667). That is not a bug or a stub; those opcodes are encoded by the proto-bundle path. A reimplementation that treats the MC layer as the sole encoder will emit all-zero bundles for every V5+ instruction. See MC-Emitter.
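The ownership rule can be sketched as a routing function. This is a simplified model that assumes "the opcode has a non-zero entry in the base-bits table" is a sufficient test for MC ownership; the table contents below are hypothetical, and the function and label names are illustrative, not symbols from the binary. The counts come from this page: 5667 − 4956 leaves 711 MC opcodes that carry bits.

```python
from typing import Mapping

TOTAL_MC_OPCODES = 5667     # per this page
ZERO_BASE_OPCODES = 4956    # route to the all-zero default, per this page
MC_OWNED_OPCODES = TOTAL_MC_OPCODES - ZERO_BASE_OPCODES


def owning_encoder(mc_opcode: int, base_bits: Mapping[int, int]) -> str:
    """Return which encoder is expected to carry the bits for mc_opcode.

    Assumption: an opcode absent from the table, or present with a zero
    base, belongs to the proto-bundle path.
    """
    if base_bits.get(mc_opcode, 0) != 0:
        return "mc-insertBits"
    return "proto-bundle"


# Hypothetical base-bits table: opcode 10 has a base pattern, opcode 11 does not.
toy_base_bits = {10: 1 << 200, 11: 0}

routes = {op: owning_encoder(op, toy_base_bits) for op in (10, 11, 999)}
zero_fraction = ZERO_BASE_OPCODES / TOTAL_MC_OPCODES
```

A reimplementation would typically run a check like this once per opcode and fail loudly when an instruction routed to "proto-bundle" has no proto-bundle encoder registered, rather than silently emitting zeros.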
The VLIW Bundle and Its Slots
The bundle is a fixed-width VLIW word issued in one cycle; the compiler proves slot independence because the hardware does not. The width is fixed per generation and selected by a (TpuVersion, TpuSequencerType) codec-metadata lookup. The codename ↔ external-name mapping below is the one the Codename Matrix pins from the TpuVersionToString / TpuVersionToExternalName pair; 6acc60406 (TpuVersion 5) is the binary's literal codename, not the marketing name (Trillium/Ironwood appear zero times in libtpu.so).
| TpuVersion | Codename | External name | Bundle bytes | Bundle bits |
|---|---|---|---|---|
| 0 | Jellyfish | TPU v2 | 41 | 328 |
| 1 | Dragonfish | TPU v3 | 41 | 328 |
| 2 | Pufferfish | TPU v4 | 51 | 408 |
| 3 | Viperfish | TPU v5e | 64 | 512 |
| 4 | Ghostlite | TPU v6 lite | 64 | 512 |
| 5 | 6acc60406 | TPU7x | 64 | 512 |
The 41-byte Jellyfish width is the hardest-pinned: it is the literal operator new(0x29) (= 41) allocation inside EncoderJf::EncodeBundleInternal (0x1e86c7c0), not a metadata read. The 51-byte and 64-byte widths are computed at runtime — EncoderPfTensorCore::BundleSizeBytes returns 51 inline, while the v5+ codecs reach 64 through a (TpuVersion, TpuSequencerType) vtable call (and the SparseCore overlayer's GetTileInstructionBundleSizeInBytes derives a per-tile size as field[32] / field[31]). The full byte-source accounting per generation is on the Bundle Model page.
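The width table reduces to a lookup keyed by TpuVersion. The sketch below is a simplification: the page describes the real lookup as keyed by (TpuVersion, TpuSequencerType), and this model ignores the sequencer type and the SparseCore per-tile size. The function names are illustrative, and raising ValueError for versions ≥ 6 merely echoes the trap behaviour noted for TpuVersionToString.

```python
# TensorCore bundle widths in bytes, copied from the table above.
BUNDLE_BYTES = {0: 41, 1: 41, 2: 51, 3: 64, 4: 64, 5: 64}

CODENAMES = {
    0: "Jellyfish", 1: "Dragonfish", 2: "Pufferfish",
    3: "Viperfish", 4: "Ghostlite", 5: "6acc60406",
}


def bundle_size_bytes(tpu_version: int) -> int:
    """Bundle width in bytes for a TpuVersion in 0..5."""
    if tpu_version not in BUNDLE_BYTES:
        raise ValueError(f"unknown TpuVersion {tpu_version}")
    return BUNDLE_BYTES[tpu_version]


def bundle_size_bits(tpu_version: int) -> int:
    """Bundle width in bits: eight times the byte width."""
    return 8 * bundle_size_bytes(tpu_version)


# Reproduce the "Bundle bits" column of the table.
bits_by_codename = {CODENAMES[v]: bundle_size_bits(v) for v in BUNDLE_BYTES}
```

Because the width is selected rather than extended, a decoder should pick the width from the version first and only then interpret slot offsets.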
The slots partition across the execution units. Each slot class is a typed sub-instruction in the compiler-side Bundle object and has its own page:
| Engine | Slot page(s) |
|---|---|
| Matrix unit (systolic MXU) | MXU Slot, Matprep / IAR / Latch, ResultFifo & ArchRegister |
| Vector ALU (VPU) | VPU Slot |
| Scalar pipe (SPU) | SPU / Scalar Slot |
| Sequencer (control flow / sync) | Sequencer Slot, Sequencer Ops Per Gen |
| Memory ports | Memory-Load, Memory-Store, cmem_load (Pufferfish) |
| Predicate / loop / immediate | Predicate, Hardware Loop-Counter, Immediate |
| Extended unit (transcendentals) | EUP / Transcendental Slot |
| Mask / M-register | vcreate_mask / M-Register |
| SparseCore (v5+) | Sparsity Slot |
The full per-generation slot maps — which slots exist and at what byte offsets — are on the per-gen pages: Jellyfish 41-B, Dragonfish, Pufferfish 51-B, Viperfish 64-B, Ghostlite, 6acc60406. The simultaneous-issue and empty-slot (kNeverExecute) semantics that bind them are on the Bundle Model page.
How LLO Packs Into Bundles
The path from an LLO op to its bits has a fixed shape:
    LloInstruction  (opcode + operands + memory-space + predicate)
          │  bundle packer: group independent ops into typed Bundle slots
          ▼
    Bundle  (ScalarInstruction, VectorAluInstruction, VectorExtendedInstruction,
             VectorLoadInstruction, VectorStoreInstruction, VectorResultInstruction,
             MiscInstruction, + HardwareBundleBits header)
          │  Encoder<gen>::EncodeBundleInternal: write each present slot at its byte offset
          ▼
    N-byte bundle word  (41 / 51 / 64, per TpuVersion)
The packer is the scheduler's responsibility (LLO Bundle Packing); the serialization is the encoder's. For the BarnaCore-Pxc path the per-instruction bits additionally pass through the 239-bit MC record; for everything else the IsaEmitter writes the bytes directly. The MXU is a two-phase exception: matrix pushes enter the systolic array via the matprep / latch path and results are read back cycles later through the result-FIFO slot (see MXU Slot).
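The two stages can be sketched in a few lines of Python. Everything here is a toy model under stated assumptions: one slot per slot class, a conservative dependence test (any RAW, WAW, or WAR overlap forces a new bundle), no operation latencies, an in-order greedy packer rather than the real scheduler, and made-up byte offsets. Absent slots are left as zero bytes, which is not necessarily the canonical empty-slot encoding (see the NOP / Unused-Slot page).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Op:
    name: str
    slot: str                          # e.g. "scalar", "vector_alu", "vector_load"
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()


def conflicts(earlier: Op, later: Op) -> bool:
    """True if the two ops may not share a bundle (RAW, WAW, or WAR overlap)."""
    return bool(earlier.writes & (later.reads | later.writes)) or \
        bool(earlier.reads & later.writes)


def pack(ops: List[Op]) -> List[Dict[str, Op]]:
    """Greedy in-order packing: start a new bundle when the slot is taken
    or the op depends on something already in the current bundle."""
    bundles: List[Dict[str, Op]] = []
    current: Dict[str, Op] = {}
    for op in ops:
        if op.slot in current or any(conflicts(p, op) for p in current.values()):
            bundles.append(current)
            current = {}
        current[op.slot] = op
    if current:
        bundles.append(current)
    return bundles


def encode(payloads: Dict[str, bytes], offsets: Dict[str, int], width: int) -> bytes:
    """Write each present slot's bytes at its offset in a fixed-width buffer."""
    buf = bytearray(width)
    for slot, payload in payloads.items():
        start = offsets[slot]
        if start + len(payload) > width:
            raise ValueError(f"slot {slot!r} overruns the {width}-byte bundle")
        buf[start:start + len(payload)] = payload
    return bytes(buf)


ops = [
    Op("vld", "vector_load", frozenset({"s0"}), frozenset({"v0"})),
    Op("sadd", "scalar", frozenset({"s1"}), frozenset({"s2"})),
    Op("vadd", "vector_alu", frozenset({"v0"}), frozenset({"v1"})),  # needs vld's result
]
bundles = pack(ops)

# Hypothetical offset and payload; 41 is the Jellyfish width from this page.
word = encode({"scalar": b"\x01\x02"}, {"scalar": 3}, 41)
```

In this example the load and the scalar add share the first bundle, while the vector add is pushed to a second bundle because it reads the register the load writes. A real scheduler would additionally account for multi-cycle latencies before letting a consumer issue.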
How This Part Is Organized
The pages group into five bands:
- Foundations — this page, the LloOpcode Enum, the MemorySpace Enum, and the Bundle Model.
- MC wire path — InstBits Master DB, TPUInstrNameData / Descs / RegEncoding, LloOpcode ↔ Proto, MC-Emitter, 239-Bit Record Format, TPUMCImm / SyImm32, ArchRegno Numbering.
- Per-generation bundles — Jellyfish, Dragonfish, Pufferfish, Viperfish, Ghostlite, 6acc60406.
- Per-slot encodings — the MXU / VPU / SPU / sequencer / memory / predicate / loop / immediate / EUP / matprep / mask / cmem / sparsity slot pages linked above, plus V5+ EmitX Bit Positions.
- Encode / decode support — IsaEmitter Registry, Decode-Side JF/PF, Decode-Side VF/GXC, NOP / Unused-Slot Canonical Encoding, kIsaTable Data Sections, ResultFifo & ArchRegister, Bias-Add & Quant/Dequant, XLU Op Roster, Pack/Unpack Precision.
The per-generation silicon families themselves (cost models, sub-core taxonomy, address-space IDs) live in the targets Part — start at Targets Overview.
Cross-References
- Bundle Model: the VLIW bundle, slot taxonomy, and simultaneous-issue semantics this page summarizes.
- LloOpcode Enum: the 462-value LloOpcodeProto instruction vocabulary.
- InstBits Master DB: the base-bits, descriptor, name, and register-encoding tables the MC emitter reads.
- MC-Emitter: getBinaryCodeForInstr, the per-opcode dispatch, and the MC-vs-proto-bundle ownership line.
- Targets Overview: the per-generation silicon families, cost models, and sub-core taxonomy.