Overview

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

Part VI documents the TPU TensorCore instruction set and how the compiler encodes it. The TensorCore is a statically-scheduled VLIW machine: its on-chip sequencer fetches one fixed-width bundle per cycle and issues every slot in that bundle to a distinct execution unit — the matrix unit, the vector ALUs, the scalar pipe, the memory ports, the sequencer's own control logic — all in the same cycle, with no runtime dependency tracking. Above the hardware bundle sits LLO (the late, TPU-specific compiler IR): a stream of typed instructions, each one opcode + operands + memory-space + predicate, produced by the HLO → MHLO → TLP → tpu → LLO descent and consumed by the bundle packer and the per-generation encoders. The LloOpcodeProto enum carries 462 opcodes and MemorySpaceProto carries 17 memory spaces; together they are the vocabulary the rest of this Part decodes into bits.

The encoding problem has two faces, and this Part keeps them separate. There is the LLVM-MC layer: TPUMCCodeEmitter::getBinaryCodeForInstr (0x13c74da0) lowers one MCInst into a 239-bit APInt record via a per-opcode base-bits table and insertBits holes, but that record actually carries bits only for the BarnaCore-Pxc (Pufferfish) HwMode. And there is the proto-bundle layer (the per-generation Encoder<gen>::EncodeBundleInternal and the IsaEmitter EmitX / <Slot>Encoder::Encode path), which encodes every TensorCore and V5+ opcode for which the MC layer returns all-zero. A reimplementer needs to know which path owns which opcodes; the MC-Emitter page draws that line precisely (4956 of 5667 MC opcodes route to the zero-base default).
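As a minimal sketch of the base-bits-plus-holes idea, the snippet below models the 239-bit record as a Python integer and overwrites one field in it. The helper names (`insert_bits`, `to_words`), the little-endian word order, and the example opcode layout are illustrative assumptions, not values read from the binary; only the 239-bit width and the 4-word count come from this page.

```python
RECORD_BITS = 239                         # record width stated on this page
RECORD_WORDS = (RECORD_BITS + 63) // 64   # 64-bit words needed to hold it
RECORD_MASK = (1 << RECORD_BITS) - 1


def insert_bits(record: int, value: int, lo: int, width: int) -> int:
    """Overwrite `width` bits of `record`, starting at bit `lo`, with `value`."""
    if width <= 0 or lo < 0 or lo + width > RECORD_BITS:
        raise ValueError("field does not fit in the record")
    if value < 0 or value >> width:
        raise ValueError("value is wider than the field")
    field_mask = ((1 << width) - 1) << lo
    return ((record & ~field_mask) | (value << lo)) & RECORD_MASK


def to_words(record: int) -> list:
    """Split the record into 64-bit words (little-endian order is an assumption)."""
    return [(record >> (64 * i)) & 0xFFFFFFFFFFFFFFFF for i in range(RECORD_WORDS)]


# HYPOTHETICAL opcode: base bits 0b101 at bit 0, one operand hole at bits [8, 13).
base = 0b101
encoded = insert_bits(base, 0b10011, lo=8, width=5)
print(hex(encoded), to_words(encoded))
```

Modeling the record as one arbitrary-precision integer keeps the field arithmetic independent of word boundaries; a field that straddles two 64-bit words needs no special case until the final split.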

This page is navigational. It orients the reader in the VLIW model, names the per-generation bundle widths (41 B Jellyfish → 51 B Pufferfish → 64 B Viperfish / Ghostlite / 6acc60406), sketches the slot taxonomy, and then routes to the page that owns each piece. The deep mechanics live elsewhere: the bundle model for the slot/issue semantics, the per-slot pages for bit layouts, the per-gen bundle pages for full slot maps, and the record format / MC-emitter / InstBits DB trio for the LLVM-MC wire path.

For reimplementation, the contract is:

  • The two-level encoding split: LLO IR (LloOpcodeProto, 462 opcodes) above, and the fixed-width per-gen VLIW bundle below — and the two distinct encoders (MC insertBits record vs proto-bundle EmitX).
  • The VLIW slot model: which engine each slot drives and the simultaneous-issue / no-forwarding rule the compiler must respect.
  • The six-generation TpuVersion axis (0..5) and that the bundle width and slot set are selected per version, not byte-extended.

Item | Value | Page
LLO IR enum | LloOpcodeProto, 462 opcodes | LloOpcode Enum
Memory spaces | MemorySpaceProto, 17 values | MemorySpace Enum
MC record | llvm::APInt, 239 bits, 4 words | Record Format
MC emitter | TPUMCCodeEmitter::getBinaryCodeForInstr @ 0x13c74da0 | MC-Emitter
Bundle widths | 41 B (Jellyfish) / 51 B (Pufferfish) / 64 B (Viperfish, Ghostlite, 6acc60406) | Bundle Model
TpuVersion | 6 values (TpuVersionToString @ 0x20b3a480 traps on ≥ 6) | -
Proto-bundle encode | Encoder<gen>::EncodeBundleInternal (Jellyfish @ 0x1e86c7c0) | -

The Two Encoding Levels

LLO is the compiler IR; the bundle is the wire form. The split is real in the binary and a reimplementer must not conflate them.

  • Level 1 — LLO instruction stream. A program is a sequence of LloInstructions, each (opcode, operand-list, memory-space, predicate). The opcode is one of the 462 LloOpcodeProto values; the memory space is one of the 17 MemorySpaceProto values. This is what the optimizer and scheduler manipulate. See LloOpcode Enum and LloOpcode ↔ Proto.
  • Level 2 — VLIW bundle word. The bundle packer groups mutually independent LLO ops into a Bundle whose typed sub-fields map onto the hardware slots, and the per-generation encoder serializes that into a fixed-width byte buffer. See Bundle Model.

Within Level 2 there are two encoders, and which one carries an opcode's bits depends on the generation:

Encoder | Owns | Page
LLVM-MC insertBits record | BarnaCore-Pxc (Pufferfish) lanes + native ops only | MC-Emitter, Record Format
proto-bundle EmitX / <Slot>Encoder::Encode | every TensorCore + Viperfish / Ghostlite / 6acc60406 opcode | IsaEmitter Registry, V5+ EmitX Bit Positions

NOTE — the MC getBinaryCodeForInstr returns an all-zero 239-bit record for the overwhelming majority of opcodes (4956 of 5667). That is not a bug or a stub — those opcodes are encoded by the proto-bundle path. A reimplementation that treats the MC layer as the sole encoder will emit all-zero bundles for every V5+ instruction. See MC-Emitter.
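The ownership split in the note can be sketched as a fallback dispatcher. This is an illustration only: the routing rule (treat an all-zero MC record as "not owned by the MC layer"), the opcode names, the 30-byte little-endian serialization of the record, and the stand-in proto-bundle encoder are all assumptions made for the example; the real ownership line is documented on the MC-Emitter page. Note that the rule would misroute an MC-owned opcode whose genuine encoding is all zeros, if one exists.

```python
from typing import Callable, Dict

MC_TOTAL_OPCODES = 5667       # MC opcodes, from this page
MC_ZERO_BASE_OPCODES = 4956   # of those, routed to the zero-base default


def mc_owned_count() -> int:
    """Opcodes for which the MC layer returns a non-zero base."""
    return MC_TOTAL_OPCODES - MC_ZERO_BASE_OPCODES


def choose_encoder(mc_record: int) -> str:
    """HYPOTHETICAL rule: an all-zero MC record means the proto-bundle path owns it."""
    return "mc" if mc_record != 0 else "proto-bundle"


def encode(
    opcode_name: str,
    mc_records: Dict[str, int],
    proto_encoders: Dict[str, Callable[[], bytes]],
) -> bytes:
    record = mc_records.get(opcode_name, 0)
    if choose_encoder(record) == "mc":
        # 239 bits fit in 30 bytes; the byte order here is an assumption.
        return record.to_bytes(30, "little")
    return proto_encoders[opcode_name]()


# Toy tables with made-up opcode names and made-up bit patterns.
mc_records = {"pxc_native_op": 0x1305, "v5_vector_op": 0}
proto_encoders = {"v5_vector_op": lambda: b"\x01" + bytes(63)}

mc_bytes = encode("pxc_native_op", mc_records, proto_encoders)
proto_bytes = encode("v5_vector_op", mc_records, proto_encoders)
```

The counts on this page imply that only 711 of the 5667 MC opcodes (about 12.5%) carry MC bits, which is why a dispatcher without the fallback branch fails for most of the instruction set.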


The VLIW Bundle and Its Slots

The bundle is a fixed-width VLIW word issued in one cycle; the compiler proves slot independence because the hardware does not. The width is fixed per generation and selected by a (TpuVersion, TpuSequencerType) codec-metadata lookup. The codename ↔ external-name mapping below is the one the Codename Matrix pins from the TpuVersionToString / TpuVersionToExternalName pair; 6acc60406 (TpuVersion 5) is the binary's literal codename, not the marketing name (Trillium/Ironwood appear zero times in libtpu.so).

TpuVersion | Codename | External name | Bundle bytes | Bundle bits
0 | Jellyfish | TPU v2 | 41 | 328
1 | Dragonfish | TPU v3 | 41 | 328
2 | Pufferfish | TPU v4 | 51 | 408
3 | Viperfish | TPU v5e | 64 | 512
4 | Ghostlite | TPU v6 lite | 64 | 512
5 | 6acc60406 | TPU7x | 64 | 512

The 41-byte Jellyfish width is the hardest-pinned: it is the literal operator new(0x29) (= 41) allocation inside EncoderJf::EncodeBundleInternal (0x1e86c7c0), not a metadata read. The 51-byte and 64-byte widths are computed at runtime — EncoderPfTensorCore::BundleSizeBytes returns 51 inline, while the v5+ codecs reach 64 through a (TpuVersion, TpuSequencerType) vtable call (and the SparseCore overlayer's GetTileInstructionBundleSizeInBytes derives a per-tile size as field[32] / field[31]). The full byte-source accounting per generation is on the Bundle Model page.
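For a reimplementation, the per-generation widths can be held in a small lookup table. The sketch below is keyed on TpuVersion alone, which is a simplification: the binary selects the width per (TpuVersion, TpuSequencerType), and SparseCore tile sizes are derived separately. The type and function names are illustrative, not symbols from libtpu.so.

```python
from typing import Dict, NamedTuple


class Generation(NamedTuple):
    codename: str
    external_name: str
    bundle_bytes: int


# TensorCore bundle widths per TpuVersion, as tabulated on this page.
GENERATIONS: Dict[int, Generation] = {
    0: Generation("Jellyfish", "TPU v2", 41),
    1: Generation("Dragonfish", "TPU v3", 41),
    2: Generation("Pufferfish", "TPU v4", 51),
    3: Generation("Viperfish", "TPU v5e", 64),
    4: Generation("Ghostlite", "TPU v6 lite", 64),
    5: Generation("6acc60406", "TPU7x", 64),
}


def bundle_bytes(tpu_version: int) -> int:
    """Return the bundle width in bytes; reject versions outside 0..5."""
    if tpu_version not in GENERATIONS:
        raise ValueError(f"unknown TpuVersion {tpu_version}")
    return GENERATIONS[tpu_version].bundle_bytes


def bundle_bits(tpu_version: int) -> int:
    return bundle_bytes(tpu_version) * 8


for version, gen in GENERATIONS.items():
    print(version, gen.codename, gen.bundle_bytes, bundle_bits(version))
```

Rejecting out-of-range versions mirrors the page's observation that TpuVersionToString traps on values of 6 or more, and the table makes explicit that a new width replaces the old one rather than extending it.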

The slots partition across the execution units. Each slot class is a typed sub-instruction in the compiler-side Bundle object and has its own page:

Engine | Slot page(s)
Matrix unit (systolic MXU) | MXU Slot, Matprep / IAR / Latch, ResultFifo & ArchRegister
Vector ALU (VPU) | VPU Slot
Scalar pipe (SPU) | SPU / Scalar Slot
Sequencer (control flow / sync) | Sequencer Slot, Sequencer Ops Per Gen
Memory ports | Memory-Load, Memory-Store, cmem_load (Pufferfish)
Predicate / loop / immediate | Predicate, Hardware Loop-Counter, Immediate
Extended unit (transcendentals) | EUP / Transcendental Slot
Mask / M-register | vcreate_mask / M-Register
SparseCore (v5+) | Sparsity Slot

The full per-generation slot maps — which slots exist and at what byte offsets — are on the per-gen pages: Jellyfish 41-B, Dragonfish, Pufferfish 51-B, Viperfish 64-B, Ghostlite, 6acc60406. The simultaneous-issue and empty-slot (kNeverExecute) semantics that bind them are on the Bundle Model page.


How LLO Packs Into Bundles

The path from an LLO op to its bits has a fixed shape:

LloInstruction  (opcode + operands + memory-space + predicate)
   │  bundle packer — group independent ops into typed Bundle slots
   ▼
Bundle  (ScalarInstruction, VectorAluInstruction, VectorExtendedInstruction,
         VectorLoadInstruction, VectorStoreInstruction, VectorResultInstruction,
         MiscInstruction, + HardwareBundleBits header)
   │  Encoder<gen>::EncodeBundleInternal — write each present slot at its byte offset
   ▼
N-byte bundle word   (41 / 51 / 64, per TpuVersion)

The packer is the scheduler's responsibility (LLO Bundle Packing); the serialization is the encoder's. For the BarnaCore-Pxc path the per-instruction bits additionally pass through the 239-bit MC record; for everything else the IsaEmitter writes the bytes directly. The MXU is a two-phase exception: matrix pushes enter the systolic array via the matprep / latch path and results are read back cycles later through the result-FIFO slot (see MXU Slot).
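The two stages of the diagram can be sketched end to end as follows. Everything specific in this example is hypothetical: the slot names, the byte offsets and sizes in `LAYOUT`, the register names, the conservative hazard rule, and the choice to leave absent slots as zero bytes (the real empty-slot encoding, kNeverExecute, is on the Bundle Model page). Only the 41-byte width comes from this page.

```python
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    slot: str                 # which bundle slot the op occupies
    dst: str                  # register written
    srcs: Tuple[str, ...]     # registers read
    payload: bytes            # already-encoded slot bits


# HYPOTHETICAL layout: slot name -> (byte offset, byte size).
LAYOUT: Dict[str, Tuple[int, int]] = {
    "scalar": (0, 4),
    "vector_alu": (4, 8),
    "vector_load": (12, 6),
}
BUNDLE_BYTES = 41  # Jellyfish width from this page


def pack(ops: List[Op]) -> List[Dict[str, Op]]:
    """Greedy in-order packer: close the bundle on a slot or register hazard."""
    bundles: List[Dict[str, Op]] = []
    current: Dict[str, Op] = {}
    written, read = set(), set()
    for op in ops:
        hazard = (
            op.slot in current                       # slot already taken
            or op.dst in written                     # write-after-write
            or op.dst in read                        # write-after-read (conservative)
            or any(s in written for s in op.srcs)    # read-after-write
        )
        if hazard:
            bundles.append(current)
            current, written, read = {}, set(), set()
        current[op.slot] = op
        written.add(op.dst)
        read.update(op.srcs)
    if current:
        bundles.append(current)
    return bundles


def encode_bundle(bundle: Dict[str, Op]) -> bytes:
    """Write each present slot at its byte offset; absent slots stay zero (assumption)."""
    buf = bytearray(BUNDLE_BYTES)
    for slot, op in bundle.items():
        offset, size = LAYOUT[slot]
        if len(op.payload) != size:
            raise ValueError(f"payload for {slot} must be {size} bytes")
        buf[offset:offset + size] = op.payload
    return bytes(buf)


ops = [
    Op("scalar", "r1", (), b"\x01" * 4),
    Op("vector_alu", "v1", ("r1",), b"\x02" * 8),   # reads r1, so it starts a new bundle
    Op("vector_load", "v2", (), b"\x03" * 6),       # independent, joins the second bundle
]
words = [encode_bundle(b) for b in pack(ops)]
```

Splitting on any same-bundle read-after-write reflects the rule that the hardware issues every slot in the same cycle without dependency tracking; a production scheduler would reorder ops to fill slots rather than pack strictly in program order.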


How This Part Is Organized

The pages group into five bands:

The per-generation silicon families themselves (cost models, sub-core taxonomy, address-space IDs) live in the targets Part — start at Targets Overview.


Cross-References

  • Bundle Model — the VLIW bundle, slot taxonomy, and simultaneous-issue semantics this page summarizes.
  • LloOpcode Enum — the 462-value LloOpcodeProto instruction vocabulary.
  • InstBits Master DB — the base-bits, descriptor, name, and register-encoding tables the MC emitter reads.
  • MC-Emitter — getBinaryCodeForInstr, the per-opcode dispatch, and the MC-vs-proto-bundle ownership line.
  • Targets Overview — the per-generation silicon families, cost models, and sub-core taxonomy.