Overview

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

Part VI documents the TPU TensorCore instruction set and how the compiler encodes it. The TensorCore is a statically-scheduled VLIW machine: its on-chip sequencer fetches one fixed-width bundle per cycle and issues every slot in that bundle to a distinct execution unit — the matrix unit, the vector ALUs, the scalar pipe, the memory ports, the sequencer's own control logic — all in the same cycle, with no runtime dependency tracking. Above the hardware bundle sits LLO (the late, TPU-specific compiler IR): a stream of typed instructions, each one opcode + operands + memory-space + predicate, produced by the HLO → MHLO → TLP → tpu → LLO descent and consumed by the bundle packer and the per-generation encoders. The LloOpcodeProto enum carries 462 opcodes and MemorySpaceProto carries 17 memory spaces; together they are the vocabulary the rest of this Part decodes into bits.

The encoding problem has two faces, and this Part keeps them separate. The first is the LLVM-MC layer: TPUMCCodeEmitter::getBinaryCodeForInstr (0x13c74da0) lowers one MCInst into a 239-bit APInt record using a per-opcode base-bits table and insertBits holes, but it carries bits only for the BarnaCore-Pxc (Pufferfish) HwMode. The second is the proto-bundle layer: the per-generation Encoder<gen>::EncodeBundleInternal and the IsaEmitter EmitX / <Slot>Encoder::Encode path, which encode every TensorCore and V5+ opcode for which the MC layer returns an all-zero record. A reimplementer needs to know which path owns which opcodes; the MC-Emitter page draws that line precisely (4956 of 5667 MC opcodes route to the zero-base default).

This page is navigational. It orients the reader in the VLIW model, names the per-generation bundle widths (41 B Jellyfish → 51 B Pufferfish → 64 B Viperfish / Ghostlite / 6acc60406), sketches the slot taxonomy, and then routes to the page that owns each piece. The deep mechanics live elsewhere: the bundle model for the slot/issue semantics, the per-slot pages for bit layouts, the per-gen bundle pages for full slot maps, and the record format / MC-emitter / InstBits DB trio for the LLVM-MC wire path.

For reimplementation, the contract is:

The two-level encoding split: LLO IR (LloOpcodeProto, 462 opcodes) above, and the fixed-width per-gen VLIW bundle below — and the two distinct encoders (MC insertBits record vs proto-bundle EmitX).
The VLIW slot model: which engine each slot drives and the simultaneous-issue / no-forwarding rule the compiler must respect.
The six-generation TpuVersion axis (0..5), and the fact that the bundle width and slot set are selected per version rather than byte-extended.


LLO IR enum	`LloOpcodeProto` — 462 opcodes — LloOpcode Enum
Memory spaces	`MemorySpaceProto` — 17 values — MemorySpace Enum
MC record	`llvm::APInt`, 239 bits, 4 words — Record Format
MC emitter	`TPUMCCodeEmitter::getBinaryCodeForInstr` @ `0x13c74da0` — MC-Emitter
Bundle widths	41 B (Jellyfish) / 51 B (Pufferfish) / 64 B (Viperfish, Ghostlite, 6acc60406) — Bundle Model
`TpuVersion`	6 values (`TpuVersionToString` @ `0x20b3a480` traps on `≥ 6`)
Proto-bundle encode	`Encoder<gen>::EncodeBundleInternal` (Jellyfish @ `0x1e86c7c0`)

The Two Encoding Levels

LLO is the compiler IR; the bundle is the wire form. The split is real in the binary and a reimplementer must not conflate them.

Level 1 — LLO instruction stream. A program is a sequence of LloInstructions, each (opcode, operand-list, memory-space, predicate). The opcode is one of the 462 LloOpcodeProto values; the memory space is one of the 17 MemorySpaceProto values. This is what the optimizer and scheduler manipulate. See LloOpcode Enum and LloOpcode ↔ Proto.
Level 2 — VLIW bundle word. The bundle packer groups mutually independent LLO ops into a Bundle whose typed sub-fields map onto the hardware slots, and the per-generation encoder serializes that into a fixed-width byte buffer. See Bundle Model.
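The two levels can be pictured as two small data types. The sketch below is illustrative only: the class names mirror the prose, but the field types, the integer opcode and memory-space values, and the `place` helper are assumptions made for the example, not structures read from the binary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LloInstruction:
    """Level 1: one typed LLO op (opcode, operands, memory space, predicate)."""
    opcode: int                      # a LloOpcodeProto value (462 exist)
    operands: Tuple[int, ...]        # hypothetical: operands as plain ids
    memory_space: int                # a MemorySpaceProto value (17 exist)
    predicate: Optional[int] = None  # assumption: None means unpredicated


@dataclass
class Bundle:
    """Level 2, compiler side: at most one instruction per typed slot."""
    slots: Dict[str, LloInstruction] = field(default_factory=dict)

    def place(self, slot: str, inst: LloInstruction) -> bool:
        """Put inst into slot; return False if that slot is already taken."""
        if slot in self.slots:
            return False
        self.slots[slot] = inst
        return True


if __name__ == "__main__":
    bundle = Bundle()
    op = LloInstruction(opcode=1, operands=(2, 3), memory_space=0)
    print(bundle.place("VectorAluInstruction", op))  # first placement succeeds
    print(bundle.place("VectorAluInstruction", op))  # slot is now occupied
```

Keeping the instruction immutable and the bundle mutable reflects the division of labor described here: the optimizer and scheduler work on the instruction stream, and the packer fills slots. Proving that the ops sharing a bundle are independent is deliberately left out of the sketch.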

Within Level 2 there are two encoders, and which one carries an opcode's bits depends on the generation:

Encoder	Owns	Page
LLVM-MC `insertBits` record	BarnaCore-Pxc (Pufferfish) lanes + native ops only	MC-Emitter, Record Format
proto-bundle `EmitX` / `<Slot>Encoder::Encode`	every TensorCore + Viperfish / Ghostlite / 6acc60406 opcode	IsaEmitter Registry, V5+ EmitX Bit Positions

NOTE — the MC getBinaryCodeForInstr returns an all-zero 239-bit record for the overwhelming majority of opcodes (4956 of 5667). That is not a bug or a stub — those opcodes are encoded by the proto-bundle path. A reimplementation that treats the MC layer as the sole encoder will emit all-zero bundles for every V5+ instruction. See MC-Emitter.
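The record arithmetic behind this note can be sketched in a few lines. This is a minimal model, not the emitter: `insert_bits` imitates the effect of LLVM's `APInt::insertBits` on a Python integer, the field position used in the demo is made up, and `owning_encoder` reduces the ownership rule to "an all-zero record means the proto-bundle path carries the bits", which is a simplification of the per-opcode dispatch documented on the MC-Emitter page.

```python
RECORD_BITS = 239                        # MC record width stated on this page
RECORD_WORDS = (RECORD_BITS + 63) // 64  # 64-bit words needed to hold it
RECORD_MASK = (1 << RECORD_BITS) - 1

MC_OPCODES = 5667                        # MC opcodes, per this page
ZERO_BASE_OPCODES = 4956                 # opcodes routed to the zero-base default


def insert_bits(record: int, value: int, lo: int, width: int) -> int:
    """Return record with bits [lo, lo + width) replaced by value."""
    if lo < 0 or width <= 0 or lo + width > RECORD_BITS:
        raise ValueError("field does not fit in the record")
    if value < 0 or value >> width:
        raise ValueError("value does not fit in the field")
    field_mask = ((1 << width) - 1) << lo
    return (record & RECORD_MASK & ~field_mask) | (value << lo)


def owning_encoder(mc_record: int) -> str:
    """Simplified rule: an all-zero MC record means the bits live elsewhere."""
    return "proto-bundle" if mc_record == 0 else "mc"


if __name__ == "__main__":
    rec = insert_bits(0, 0b101, lo=4, width=3)  # hypothetical field position
    print(hex(rec), owning_encoder(rec), owning_encoder(0))
    print(RECORD_WORDS, "words;", MC_OPCODES - ZERO_BASE_OPCODES, "opcodes carry MC bits")
```

The counts make the scale of the split concrete: 5667 - 4956 leaves 711 opcodes whose bits the MC layer actually carries, and a 239-bit record occupies four 64-bit words.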

The VLIW Bundle and Its Slots

The bundle is a fixed-width VLIW word issued in one cycle; the compiler must prove slot independence because the hardware does not check it. The width is fixed per generation and selected by a (TpuVersion, TpuSequencerType) codec-metadata lookup. The codename ↔ external-name mapping below is the one the Codename Matrix pins from the TpuVersionToString / TpuVersionToExternalName pair; 6acc60406 (TpuVersion 5) is the binary's literal codename, not the marketing name (Trillium/Ironwood appear zero times in libtpu.so).

`TpuVersion`	Codename	External name	Bundle bytes	Bundle bits
0	Jellyfish	TPU v2	41	328
1	Dragonfish	TPU v3	41	328
2	Pufferfish	TPU v4	51	408
3	Viperfish	TPU v5e	64	512
4	Ghostlite	TPU v6 lite	64	512
5	6acc60406	TPU7x	64	512

The 41-byte Jellyfish width is the most firmly pinned: it is the literal operator new(0x29) (= 41) allocation inside EncoderJf::EncodeBundleInternal (0x1e86c7c0), not a metadata read. The 51-byte and 64-byte widths are instead obtained at runtime: EncoderPfTensorCore::BundleSizeBytes returns the constant 51 inline, while the v5+ codecs reach 64 through a (TpuVersion, TpuSequencerType) vtable call. (The SparseCore overlayer's GetTileInstructionBundleSizeInBytes separately derives a per-tile size as field[32] / field[31].) The full byte-source accounting per generation is on the Bundle Model page.
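As a quick cross-check of the width table, the lookup can be written as a small table-driven helper. This is a sketch under stated assumptions: the rows are copied from the table on this page, the function names are hypothetical, the real lookup is keyed by (TpuVersion, TpuSequencerType) whereas this one ignores the sequencer type, and the `ValueError` merely stands in for the trap that TpuVersionToString takes on values of 6 or more.

```python
# (codename, external name, bundle bytes), indexed by TpuVersion.
TPU_VERSIONS = (
    ("Jellyfish", "TPU v2", 41),
    ("Dragonfish", "TPU v3", 41),
    ("Pufferfish", "TPU v4", 51),
    ("Viperfish", "TPU v5e", 64),
    ("Ghostlite", "TPU v6 lite", 64),
    ("6acc60406", "TPU7x", 64),
)


def _row(version: int):
    """Fetch the table row, rejecting anything outside TpuVersion 0..5."""
    if not 0 <= version < len(TPU_VERSIONS):
        raise ValueError(f"unknown TpuVersion {version}")
    return TPU_VERSIONS[version]


def codename(version: int) -> str:
    return _row(version)[0]


def bundle_bytes(version: int) -> int:
    return _row(version)[2]


def bundle_bits(version: int) -> int:
    return bundle_bytes(version) * 8


if __name__ == "__main__":
    for v in range(len(TPU_VERSIONS)):
        print(v, codename(v), bundle_bytes(v), bundle_bits(v))
```

The bits column is simply bytes times eight, which is why it is derived here rather than stored.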

The slots partition across the execution units. Each slot class is a typed sub-instruction in the compiler-side Bundle object and has its own page:

Engine	Slot page(s)
Matrix unit (systolic MXU)	MXU Slot, Matprep / IAR / Latch, ResultFifo & ArchRegister
Vector ALU (VPU)	VPU Slot
Scalar pipe (SPU)	SPU / Scalar Slot
Sequencer (control flow / sync)	Sequencer Slot, Sequencer Ops Per Gen
Memory ports	Memory-Load, Memory-Store, cmem_load (Pufferfish)
Predicate / loop / immediate	Predicate, Hardware Loop-Counter, Immediate
Extended unit (transcendentals)	EUP / Transcendental Slot
Mask / M-register	vcreate_mask / M-Register
SparseCore (v5+)	Sparsity Slot

The full per-generation slot maps — which slots exist and at what byte offsets — are on the per-gen pages: Jellyfish 41-B, Dragonfish, Pufferfish 51-B, Viperfish 64-B, Ghostlite, 6acc60406. The simultaneous-issue and empty-slot (kNeverExecute) semantics that bind them are on the Bundle Model page.

How LLO Packs Into Bundles

The path from an LLO op to its bits has a fixed shape:

LloInstruction  (opcode + operands + memory-space + predicate)
   │  bundle packer — group independent ops into typed Bundle slots
   ▼
Bundle  (ScalarInstruction, VectorAluInstruction, VectorExtendedInstruction,
         VectorLoadInstruction, VectorStoreInstruction, VectorResultInstruction,
         MiscInstruction, + HardwareBundleBits header)
   │  Encoder<gen>::EncodeBundleInternal — write each present slot at its byte offset
   ▼
N-byte bundle word   (41 / 51 / 64, per TpuVersion)

The packer is the scheduler's responsibility (LLO Bundle Packing); the serialization is the encoder's. For the BarnaCore-Pxc path the per-instruction bits additionally pass through the 239-bit MC record; for everything else the IsaEmitter writes the bytes directly. The MXU is a two-phase exception: matrix pushes enter the systolic array via the matprep / latch path, and results are read back cycles later through the result-FIFO slot (see MXU Slot).
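The serialization step in the diagram above, writing each present slot at its offset into an N-byte word, can be sketched as follows. Everything specific here is an assumption for illustration: the function name, the byte-aligned layout, the offsets and lengths in the demo, and the choice to leave absent slots zero-filled. The real slot maps are on the per-generation pages, and the real empty-slot encoding is on the NOP / Unused-Slot Canonical Encoding page.

```python
from typing import Dict, Tuple

# slot name -> (byte offset, byte length); values are supplied by the caller.
Layout = Dict[str, Tuple[int, int]]


def encode_bundle(width: int, layout: Layout, slots: Dict[str, bytes]) -> bytes:
    """Serialize the present slots into a fixed-width bundle word."""
    buf = bytearray(width)  # assumption: absent slots are left as zero bytes
    for name, payload in slots.items():
        offset, length = layout[name]
        if len(payload) != length:
            raise ValueError(f"{name}: expected {length} bytes, got {len(payload)}")
        if offset < 0 or offset + length > width:
            raise ValueError(f"{name}: field falls outside the {width}-byte bundle")
        buf[offset:offset + length] = payload
    return bytes(buf)


if __name__ == "__main__":
    # Hypothetical layout for a 41-byte word; not the Jellyfish slot map.
    layout = {"ScalarInstruction": (0, 4), "VectorAluInstruction": (4, 8)}
    word = encode_bundle(41, layout, {"VectorAluInstruction": b"\x01" * 8})
    print(len(word), word.hex())
```

The output length is always the bundle width, whichever slots are present; that is the fixed-width property the per-generation encoders preserve.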

How This Part Is Organized

The pages group into five bands:

Foundations — this page, the LloOpcode Enum, the MemorySpace Enum, and the Bundle Model.
MC wire path — InstBits Master DB, TPUInstrNameData / Descs / RegEncoding, LloOpcode ↔ Proto, MC-Emitter, 239-Bit Record Format, TPUMCImm / SyImm32, ArchRegno Numbering.
Per-generation bundles — Jellyfish, Dragonfish, Pufferfish, Viperfish, Ghostlite, 6acc60406.
Per-slot encodings — the MXU / VPU / SPU / sequencer / memory / predicate / loop / immediate / EUP / matprep / mask / cmem / sparsity slot pages linked above, plus V5+ EmitX Bit Positions.
Encode / decode support — IsaEmitter Registry, Decode-Side JF/PF, Decode-Side VF/GXC, NOP / Unused-Slot Canonical Encoding, kIsaTable Data Sections, ResultFifo & ArchRegister, Bias-Add & Quant/Dequant, XLU Op Roster, Pack/Unpack Precision.

The per-generation silicon families themselves (cost models, sub-core taxonomy, address-space IDs) live in the targets Part — start at Targets Overview.

Cross-References

Bundle Model — the VLIW bundle, slot taxonomy, and simultaneous-issue semantics this page summarizes.
LloOpcode Enum — the 462-value LloOpcodeProto instruction vocabulary.
InstBits Master DB — the base-bits, descriptor, name, and register-encoding tables the MC emitter reads.
MC-Emitter — getBinaryCodeForInstr, the per-opcode dispatch, and the MC-vs-proto-bundle ownership line.
Targets Overview — the per-generation silicon families, cost models, and sub-core taxonomy.
