LLO Opcode Master Table

Every opcode value, mnemonic, slot name, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID 89edbbe81c5b328a958fe628a9f2207d, not stripped). Other versions differ.

Abstract

This appendix is the single consolidated reference table for the LLO opcode space — the one table a reimplementer keeps open while building a TPU back end. LLO (Low-Level Optimizer IR) is the TPU-specific late compiler IR below MHLO/TLP and above the per-generation TensorCore bundle encoders; its in-memory opcode enum xla::jellyfish::LloOpcode is a dense, zero-based, 461-value numbering (0x000..0x1CC), and every value is named in the relocated opcode_name table at 0x21ccfef0. The authoritative per-family facts live on the isa/ deep pages; this page aggregates them into one master opcode→mnemonic→slot→semantics→per-gen table, grouped by the bundle slot (functional unit) each opcode is encoded into.

The table is organized by slot family, not by numeric run, because the value space interleaves: LloOpcodeIsVector is a dense switch, not a >= threshold test, and (for example) kVectorReadIar (1) is vector while kLog (5) is not. Each slot section below enumerates the opcodes that the per-slot encoder serializes into that slot, with the numeric LloOpcode value, a one-line semantics, and per-generation availability where it varies (the five codec lineages: jxc Jellyfish/Dragonfish v2/v3, pxc Pufferfish v4, vfc/vxc Viperfish v5p, glc Ghostlite v6e, gfc 6acc60406 TPU7x). The bit-level field encodings — where each opcode's operands land in the 41/51/64-byte bundle — are not repeated here; they live on the per-slot deep pages, which every section cross-links.

Three index spaces must not be conflated with LloOpcode, and a reimplementer who confuses them mis-decodes every program: the LloOpcodeProto wire enum (1-based, max 499, 38 reserved gaps, see LloOpcode↔Proto); the MC MCInst opcode space the LLVM-MC emitter dispatches on (LloOpcode + 499, gated <= 0x1F2); and the GhPerf::Instruction cost-grid enum. LloOpcode is the one all the others map from. A representative sample of the values in this table was re-verified against the binary classifiers and converters (see Verification); where a deep page and the binary agreed they were taken as CONFIRMED, and no disagreements were found.

| Field | Value |
|---|---|
| Enum | xla::jellyfish::LloOpcode: 461 enumerators, dense 0x000..0x1CC (0..460) |
| Name accessor | LloOpcodeName(LloOpcode) @ 0x1d631280; bound >= 0x1CD → ud1 |
| Name table | opcode_name @ 0x21ccfef0 (461 × char*, R_X86_64_RELATIVE) |
| Property word | opcode_info @ 0x223a1320 (461 × uint16: Push/Pop/Remat/Fold/Cse + reg-file class) |
| Descriptor | opcode_info_big @ 0x227b5570 (461 × 28 B: ResultFifo + ArchRegister lists) |
| Family classifier | LloOpcodeIsVector @ 0x1d60c1c0; LloOpcodeIsScalar = !IsVector @ 0x1d60c7e0 |
| Wire converter | LloOpcodeToProto @ 0x14420020 (table @ 0x344cb4c); ProtoToLloOpcode @ 0x14420040 |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |

For reimplementation, the contract this table supports is:

  • The opcode numbering is gen-invariant — the same LloOpcode value means the same opcode on every generation; what changes per gen is the valid subset (sparsity is v5+, cmem-load is PF-only) and the bundle encoding, not the number.
  • The slot a value occupies (which bundle field the per-gen encoder serializes it into) is recovered from the per-slot classifier switches and encoder dispatch, listed per section.
  • The per-gen availability column flags opcodes that exist only from a given generation (F8 converts PF+, stochastic-round VF+, S4/U4 matmul PF+, BarnaCore vs SparseCore split).
  • The bit encodings are deferred to the per-slot deep pages; this table is value↔mnemonic↔slot↔semantics only.

Slot Families at a Glance

Eleven functional-unit families, together with a 33-opcode BarnaCore/SparseCore block (0x1AC..0x1CC), partition the 461 opcodes. The count column is the number of distinct LloOpcode values the family owns; the owning page is the deep page that documents that slot's bit encoding. Most families occupy mostly contiguous value bands; the exact members are in the per-family sections below.

| Slot family | Value band(s) | Opcodes | Owning deep page |
|---|---|---|---|
| Sequencer / sync / FIFO transfer | 0x000..0x030 (interleaved) | ~50 | Sequencer Slot · Seq-Ops-Per-Gen |
| SPU / scalar ALU & control | 0x085..0x089, 0x16B..0x1AA | 63 | SPU / Scalar Slot |
| VPU / vector ALU & logic | 0x048..0x05A, 0x11B..0x1A2 | (subset of 295 vector) | VPU Slot |
| Convert / pack / unpack | 0x05B..0x076, 0x107..0x11A, 0x125..0x127 | ~45 | VPU Slot · EUP |
| EUP transcendental | 0x128..0x14D | 38 | EUP / Transcendental Slot |
| Cross-lane / reduce / XLU result | 0x036..0x03B, 0x0F5..0x101, 0x14E..0x155 | ~28 | ResultFifo / ArchRegister |
| MXU matmul / latch / matprep / matres | 0x08D..0x0AB, 0x152..0x153 | ~33 | MXU Slot · Matprep/IAR/Latch |
| Memory load / store / IAR / RNG | 0x001..0x004, 0x030..0x046, 0x077..0x078 | ~30 | Memory-Load · Memory-Store · CMEM-Load |
| DMA | 0x0B3..0x0DA (contiguous) | 40 | Sequencer Slot (DMA-issue) |
| Predicate / mask | 0x0E1..0x0F1, 0x167..0x16A, 0x193..0x199 | ~18 | Predicate Slot · vcreate_mask |
| Constants / pseudo / call | 0x02C..0x02E, 0x0DB..0x0F4, 0x17C | ~25 | Immediate Slot |
| BarnaCore (SparseCore) | 0x1AC..0x1CC (contiguous) | 33 | LLO Opcode Enum |
| Sparsity (v5+) | structured-sparse | (open) | Sparsity Slot (v5+) |

NOTE — the sparsity slot is not yet recovered. The structured-sparsity slot (Sparsity Slot (v5+)) is a v5+/Viperfish-and-later feature whose backing analysis is still open; no distinct LloOpcode values for it have been pinned. It is listed for completeness but contributes no rows to the tables below. A reimplementer targeting only v3/v4 can ignore it entirely.

QUIRK — 461, not 462. The "462" figure that floats around LLO documentation is the LloOpcodeProto wire enum (1-based, value-0 sentinel, max 499, 38 gaps). The in-memory LloOpcode this table indexes is exactly 461 dense values (0x000..0x1CC); LloOpcodeName traps on >= 0x1CD. Drive a switch off 462 and the last index reads past the metadata tables. See LloOpcode↔Proto.
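
The bound described above can be made explicit in a reimplementation. The following is a minimal sketch, not the library's code: `check_llo_opcode` and `lookup` are hypothetical helper names, and raising `ValueError` stands in for the `ud1` trap that LloOpcodeName takes on values >= 0x1CD.

```python
LLO_OPCODE_COUNT = 0x1CD  # 461 dense in-memory values, 0x000..0x1CC


def check_llo_opcode(value: int) -> int:
    """Return value if it is a valid in-memory LloOpcode, else raise."""
    if not 0 <= value < LLO_OPCODE_COUNT:
        # Stand-in for the ud1 trap; the exception type is an assumption.
        raise ValueError(f"LloOpcode out of range: {value:#x}")
    return value


def lookup(table, value: int):
    """Index a per-opcode metadata table (hypothetical stand-in) safely."""
    if len(table) != LLO_OPCODE_COUNT:
        raise ValueError("per-opcode tables must have exactly 461 entries")
    return table[check_llo_opcode(value)]
```

Sizing every per-opcode table from the single constant 0x1CD, rather than from the proto-side figure 462, keeps the last valid index at 0x1CC.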


Sequencer, Sync & FIFO-Transfer Slot (0x000..0x030)

The control-and-handshake layer: program boundaries, scheduling barriers, fences, cycle-counter reads, halt/yield, and the scalar/vector/sync-flag result-FIFO push/pop primitives that move values between the SPU, VPU, and the cross-FIFO staging registers. PC mutation (branch/call/halt) is lane-0-only; sync ops live in a separate ScalarMisc lane on v5+. The per-(gen × sequencer-type) control-flow roster is on Seq-Ops-Per-Gen.

| LloOpcode | Mnemonic | Slot / lane | Semantics | Per-gen |
|---|---|---|---|---|
| 0x000 | kEvent | sequencer | program / trace event marker | all |
| 0x005 | kLog | sequencer | debug-log marker (non-vector) | all |
| 0x006/0x007 | kHloStart / kHloEnd | sequencer | source-HLO span markers (debug-info) | all |
| 0x008 | kSchedulingBarrier | sequencer | hard reorder barrier (vetoes CSE across it) | all |
| 0x009 | kScalarToVector | V↔S bridge | scalar→vector broadcast push | all |
| 0x00A | kVectorToScalarPush | V→S FIFO | vector→scalar FIFO push (Push bit) | all (v6e GhPerf 0x1DA) |
| 0x00B | kSyncFlagToScalarPush | sync→S FIFO | sync-flag→scalar FIFO push | all |
| 0x00A..0x012 | kVectorToScalarPush .. kDrfPop | V↔S / sync FIFO | V→S / sync-flag→scalar push/pop quartet | all |
| 0x011/0x012 | kSfrfPop / kDrfPop | FIFO pop | sync-flag-result / divrem-result FIFO pop (Pop bit) | all |
| 0x013/0x014 | kScalarFence / kVectorStoreFence | sequencer / misc | ordering fences (store-fence → misc slot on gfc) | all |
| 0x015..0x01C | kScalarCcfPush .. kVectorCcfPopAsymmetrical | CCF FIFO | cross-core FIFO push/pop (symmetric + asymmetric) | all |
| 0x01D | kMegacoreSwapCoresPseudo | sequencer | megacore core-swap pseudo | all |
| 0x01E | kCmemFence | sequencer | CMEM ordering fence | PF+ |
| 0x01F..0x024 | kScalarReadCycleStart .. kScalarReadCycleLow | sequencer | cycle-counter reads / 64-bit lo/hi splits | all |
| 0x025..0x027 | kScalarHalt / …YieldConditional / …OnError | sequencer (lane 0) | sequencer halt variants (GhPerf row 0x000) | all |
| 0x029 | kProgramLaunchSc | sequencer | SparseCore program launch | all |
| 0x02B | kVectorInterrupt | sequencer | vector-side interrupt | all |

NOTE — Push/Pop are property bits, not opcode ranges. opcode_info bit0 (Push) is set on the *Push members, bit1 (Pop) on the *Pop members. A scheduler models the V2S / sync FIFOs by reading these bits, not by a numeric range test. kScalarHalt/kScalarHaltOnError share GhPerf cost row 0x000.
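
A scheduler that follows this note would decode the property word rather than test opcode ranges. The sketch below is illustrative only: `OpcodeInfo` and `decode_opcode_info` are hypothetical names, and it decodes just the fields this page states (bit0 Push, bit1 Pop, register-file class in the high byte), leaving the other property bits untouched.

```python
from dataclasses import dataclass

PUSH_BIT = 1 << 0  # opcode_info bit0: result-FIFO push
POP_BIT = 1 << 1   # opcode_info bit1: result-FIFO pop


@dataclass(frozen=True)
class OpcodeInfo:
    pushes: bool
    pops: bool
    reg_file_class: int  # high byte: 1 = scalar/mask, 2 = vector


def decode_opcode_info(word: int) -> OpcodeInfo:
    """Split one 16-bit opcode_info entry into the fields named on this page."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("opcode_info entries are 16-bit")
    return OpcodeInfo(
        pushes=bool(word & PUSH_BIT),
        pops=bool(word & POP_BIT),
        reg_file_class=(word >> 8) & 0xFF,
    )
```

For example, the word 0x0202 recorded for kVectorEupResult decodes to a vector-class opcode with the Pop bit set and the Push bit clear.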


SPU / Scalar-ALU Slot

The scalar control CPU lane — 32-entry SREG file, address arithmetic, branch/call/select, the dense arithmetic-and-shift tail, and the V↔S FIFO pops. The register-file-class byte (opcode_info high byte) is 1 (scalar/mask). Lane 0 owns the full set incl. branch/call; lane 1 is an ALU+halt mirror. Bit layout on SPU / Scalar Slot.

| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x077/0x078 | kScalarLoad / kScalarStore | scalar lane | SMEM↔SREG memory access (lane-1 bound on JF/PF) |
| 0x085 | kScalarComposeU64 | scalar | pack two u32 into a u64 |
| 0x086 | kScalarAddressCalculation | scalar | address arithmetic (shares GhPerf 0x00C with kScalarAddS32) |
| 0x087/0x088 | kScalarBranchRel / kScalarBranchInd | sequencer (lane 0) | relative / indirect branch |
| 0x089 | kScalarSelect | scalar | scalar conditional select |
| 0x16B/0x16C | kScalarCompare / kScalarAddCarryU32 | scalar | compare + add-with-carry |
| 0x16D..0x172 | kScalarMultiplyWordAddr .. kScalarAddS32 | scalar | multiplies + adds (u24/u32/f32/s32) |
| 0x173..0x177 | kScalarSubtractS32 .. kScalarBitwiseXor | scalar | subtract + bitwise and/or/xor |
| 0x178..0x17A | kScalarDivRemU32 / …U32AndPop / …RemU32AndPop | scalar | divide/remainder (data-format special-cased) |
| 0x17B | kScalarMove | scalar | scalar copy; Move-exclusion (never CSE'd/rematted) |
| 0x17D..0x17F | kScalarFloorF32 / kScalarCeilF32 / kScalarCountLeadingZeros | scalar | scalar rounding + CLZ |
| 0x1A3..0x1A6 | kScalarShrl .. kScalarShllOnes | scalar | logical / arithmetic shifts |
| 0x1A7..0x1AA | kScalarMinimumF32 .. kScalarMaximumU32 | scalar | min / max (f32/u32) |

The SPU is the complement of the vector set (LloOpcodeIsScalar = !LloOpcodeIsVector), 63 opcodes total; EncodingToScalarRegister (0x1e871e40) bounds the SREG selector at > 0x1F (32 registers, 5-bit field) — verified byte-exact. The per-gen jump-table size is the per-gen scalar ISA size (JF 0..0x3E, PF ≤0x33, VF ≤0x4C, GL/GF ≤0x49).


VPU / Vector-ALU Slot

The per-lane vector ALU over the 8 sublane × 128 lane = 1024-element vreg. Register-file-class byte 2 (vector); the foldable/CSE bits (opcode_info bit5/bit6) are set on the pure-functional members. LloOpcodeIsVectorUnop (0x1d60c200) / LloOpcodeIsVectorBinop (0x1d60c680) split unary vs binary. Bit layout (6/7/8-bit opcode, 5/4/2-bit predicate per gen) on VPU Slot.

| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x048..0x050 | kVectorClampGezF32 .. kVectorRemapBf16 | VALU | clamp / remap (gez, symmetric, asymmetric; F32/Bf16/S4) |
| 0x051..0x05A | kVectorMultiplyAccumulate .. kVectorMoveEvenAccLow | VALU | MAC + accumulator moves (FOLD bit on MAC family) |
| 0x11B..0x124 | kVectorAddS32 .. kVectorSubtractS16 | VALU | add / subtract (s32/f32/bf16/s16, Bf16-hi/lo) |
| 0x156..0x15B | kVectorPowF32 .. kVectorMultiplyBf16 | VALU | pow + multiply (F32/U32/U16/Bf16) |
| 0x15C..0x15F | kVectorAndU32 .. kVectorXorU32 | VALU | bitwise and / and-negated / or / xor |
| 0x162..0x166 | kVectorMultiplyComposeU64 .. kVectorExtractHigh32 | VALU | 64-bit multiply compose + word extracts |
| 0x180..0x184 | kVectorCountLeadingZeros .. kVectorExtractSignificand | VALU | CLZ / move / popcount / FP field extracts |
| 0x181 | kVectorMove | VALU | vector copy; Move-exclusion (never CSE'd/rematted) |
| 0x19A..0x19C | kVectorShiftRightLogical .. kVectorShiftLeftLogical | VALU | vector shifts |
| 0x19D..0x1A2 | kVectorMaximumF32 .. kVectorMinimumU32 | VALU | min / max (F32/Bf16/U32) |

The VALU is the largest family; only the representative bands are shown. vmul.u32.u64 is a PF+ slot-pair wide multiply. The VectorAluYEncoding (0..31) source-B selector — vreg / VS0-2 ports / hardwired float constants / immediate slots — is documented on VPU Slot §Y-operand.


Convert / Pack / Unpack

Type conversions across three bands: f32→{s32,f8,hf16} and {s8,s4,u8,u4}↔bf16, the unpack/round/truncate block, and pack/compose. These gained the most across generations.

| LloOpcode | Mnemonic | Slot | Semantics | Per-gen |
|---|---|---|---|---|
| 0x05B..0x060 | kScalarConvertF32ToS32WithProbRounding .. …TowardsZeroPseudo | VALU/scalar | F32→S32 (prob-round / towards-zero) | all |
| 0x061..0x063 | kVectorConvertF32ToF8E5M2 / …E4M3Fn / …E4M3B11 | VALU | F32→F8 | PF+ |
| 0x064 | kVectorConvertF32ToHf16 | VALU | F32→Hf16 | all |
| 0x066..0x06D | kVectorConvertS8ToBf16 .. kVectorConvertBf16ToU4 | VALU | int↔Bf16 (s8/s4/u8/u4) | S4/U4 = PF+ |
| 0x06E/0x06F | kVectorConvertEXMYToE4M3 / …ToE5M2 | VALU | generic FP8 reformat | PF+ |
| 0x070..0x074 | kVectorConvertF32ToE5M2Stochastic .. …ToHf16Stochastic | VALU | stochastic-rounding converts | VF+ |
| 0x107..0x10F | kScalarConvertS32ToF32 .. kVectorDynamicUnpack | VALU/scalar | S32→F32 convert + unpack (kVectorUnpack 0x109) + B2→B4 / B4→B8 join + EXMY/dynamic | all |
| 0x110..0x11A | kVectorCeilF32 .. kVectorTruncateBf16 | VALU | round-to-int / RTNA / RTNE / truncate | all |
| 0x125..0x127 | kVectorComposeF32 / kVectorPack / kVectorPackEXMY | VALU | compose + pack | all |

EUP / Transcendental Slot (0x128..0x14D)

The Extended Unary Pipeline computes nine transcendentals (tanh, pow2, reciprocal, log2, rsqrt, shifted-sigmoid, sinq, cosq, erf) as a deferred-result pipeline: a push (VALU slot 3 only on v5+) and a later pop from the VectorResult slot. Each function appears in four LLO forms — F32 issue-only, Bf16 issue-only, F32 issue+pop (AndPop), Bf16 issue+pop. Selector map and latency on EUP / Transcendental Slot.

| Band | LloOpcode | Members | Form |
|---|---|---|---|
| Bare push F32 | 0x128..0x131 | Tanh/Pow2/Recip/Log2/Rsqrt/SigShft/Sinq/Cosq/Erf + kVectorPushErf | F32 issue-only (Push bit) |
| Bare push Bf16 | 0x132..0x13A | same set | Bf16 issue-only |
| Fused F32 | 0x13B..0x144 | same set + kVectorPushErfAndPop (0x144) | F32 issue+pop (*AndPop) |
| Fused Bf16 | 0x145..0x14D | same set | Bf16 issue+pop |
| Deferred pop | 0x14E | kVectorEupResult | pop EUP FIFO (Pop bit, opcode_info=0x0202) |

Representative function-selector values (the 5-bit field, gfc/glc @ bit 183): F32 tanh=0x13, rsqrt=0x10, pow2=0x11, log2=0x12, sigmoid=0x14, reciprocal=0x15, sin=0x17, cos=0x18, erf=0x0e; Bf16 selectors differ (e.g. tanh=0x1b). The 18 *AndPop pseudo-ops (excluding 0x144) are split by LloLateDecomposer into bare push + deferred pop; LloOpcodeIsPseudoEupInstruction (0x1d60c880) classifies them via the 0x7FDFF mask — verified byte-exact.

NOTE — the EUP family is the cleanest Push/Pop illustration. Every issue-only EUP opcode sets opcode_info bit0 (Push); kVectorEupResult (0x14E) sets bit1 (Pop). CreateVectorEupResult (0x1d4d9820) asserts the push operand is in [0x128, 0x13A] before building the 0x14E pop — verified byte-exact.
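
The two range tests quoted in this section can be restated as plain predicates. This is a sketch under stated assumptions: the function and constant names are hypothetical, and the mask is indexed from the first fused opcode (op - 0x13B), which is how the 19-bit value 0x7FDFF lines up with the band 0x13B..0x14D and leaves bit 9 (0x144) clear.

```python
EUP_PUSH_FIRST = 0x128     # first bare-push opcode (296)
EUP_FUSED_FIRST = 0x13B    # first *AndPop opcode (315)
EUP_RESULT = 0x14E         # kVectorEupResult, the deferred pop
PSEUDO_EUP_MASK = 0x7FDFF  # 19 bits, bit 9 cleared


def is_eup_push(op: int) -> bool:
    """Bare-push range [0x128, 0x13A], i.e. the (op - 296) < 0x13 test."""
    return 0 <= op - EUP_PUSH_FIRST < 0x13


def is_pseudo_eup(op: int) -> bool:
    """Fused *AndPop pseudo-ops: band 0x13B..0x14D minus 0x144."""
    idx = op - EUP_FUSED_FIRST
    return 0 <= idx < 0x13 and bool((PSEUDO_EUP_MASK >> idx) & 1)


# Enumerate the pseudo-ops a late decomposer would split into push + pop.
PSEUDO_EUP_OPS = [op for op in range(0x1CD) if is_pseudo_eup(op)]
```

Enumerating the predicate over the whole opcode space is a cheap way to confirm that a port selects exactly 18 opcodes.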


Cross-Lane, Reduce & XLU-Result

Cross-lane permute/rotate/broadcast, whole/segmented reductions (min/max/add/argmin/argmax in F32 and Bf16), and the deferred-result pops (EUP, cross-lane, permute, CMEM, transpose, matmul). The 21 XLU opcodes are walked by ComputeXluOperations (0x126d9780) and emit {TransposeTile, RpuOperation, XluControlOperation} variants — see ResultFifo / ArchRegister.

| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x036..0x03B | kVectorPermute .. kVectorBroadcastLane | XLU (load slot) | cross-lane permute / rotate / combine / broadcast |
| 0x08B/0x08C | kVectorSetPermutePattern / kVectorSetSegmentPattern | XLU control | permute / segment pattern setup |
| 0x0F5..0x0FC | kVectorMinReduceF32 .. kVectorAddSegmentReduceF32 | XLU (reduce) | F32 reductions (whole + segmented + index) |
| 0x0FD..0x101 | kVectorMinReduceBf16 .. kVectorMinIndexReduceBf16 | XLU (reduce) | Bf16 reductions |
| 0x102..0x106 | kVectorSublaneId .. kVectorLaneSequenceInterleavedB16 | VALU | lane/sublane identity sequences (remat-able) |
| 0x14E | kVectorEupResult | VectorResult | pop EUP result FIFO |
| 0x14F..0x151 | kVectorXlaneResult / kVectorPermuteResult / kVectorCmemResult | VectorResult | pop XLU / permute / CMEM result FIFOs |
| 0x152/0x153 | kVectorMatres / kVectorMatresAdd | VectorResult | pop MXU result (plain / accumulate) |
| 0x154/0x155 | kVectorTransposeResult / kVectorTransposeClear | VectorResult | pop transpose result / clear transpose FIFO |

kVectorXlaneResult/kVectorPermuteResult/kVectorTransposeResult share GhPerf cost row 0x1C7. kVectorTransposeClear (0x155) is the single opcode whose opcode_info word is 0x000E.


MXU — Matmul, Latch, Matprep, Result

The systolic-array op family: latch the stationary weights, matprep the moving operand, issue the matmul, drain via matres. Latch/matprep/matmul share the VectorExtended slot; matres uses VectorResult. The matmul band (0x8D..0xA5) is data-format special-cased before the property-word read. Field encoding (per-gen 6→7→8-bit opcode, the −20/−21/−25 inter-MXU twin) on MXU Slot; latch/matprep field maps on Matprep/IAR/Latch.

| LloOpcode | Mnemonic | Slot | Semantics | Builder / classifier |
|---|---|---|---|---|
| 0x08D | kVectorLatchLsf | VectorExtended | latch stationary weights (load-stationary-from-FIFO) | CreateVectorLatchLsf @ 0x1d4d7aa0 → New(141) |
| 0x08E | kVectorLatchLsfMsk | VectorExtended | masked LSF latch | |
| 0x08F..0x096 | kVectorLatch .. kVectorLatch3Msk | VectorExtended | gain-matrix latch (plain / 1..3 sub-bank, masked) | CreateVectorLatchHelper |
| 0x097/0x098 | kVectorMatprepSubr (+Msk) | VectorExtended | push moving operand, sub-row form | (op-151)<2 @ 0x1d60c400 |
| 0x099/0x09A | kVectorMatprepMubr (+Msk) | VectorExtended | push moving operand, block-row form | (op-153)<2 @ 0x1d60c3e0 |
| 0x09B | kVectorMatmul | VectorExtended | one systolic step | EmitVectorMatmul @ 0x140b92c0 |
| 0x09C/0x0A0 | kVectorMatmulMubr (+Msk) | VectorExtended | conv block-row matmul | ((op-156)&0xFFFB)==0 @ 0x1d60c3c0 |
| 0x09D/0x09E | kVectorMatmulHigh / kVectorMatmulLow | VectorExtended | high-half / low-half accumulator step | EmitVectorMatmul |
| 0x09F..0x0A5 | kVectorMatmulMsk .. kVectorMatmulLmr | VectorExtended | masked / packed (kVectorMatmulPacked 0x0A3) / LMR-fused matmul | |
| 0x0A6/0x0A7 | kVectorTranspose / kVectorTransposeBinary | VectorExtended | XLU transpose (vxpose-mode dispatched) | |
| 0x0A8..0x0AB | kVectorDoneWithGains .. kVectorLoadLmrWithBf16Conversion | VectorExtended | gain handshake + GMR/LMR loads | |
| 0x152/0x153 | kVectorMatres / kVectorMatresAdd | VectorResult | matmul result collection (plain / accumulate) | a3 != 338 @ EmitVectorMatres |
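
The three classifier expressions in the last column rely on unsigned wraparound, which a port to a language with unbounded integers has to emulate. The sketch below assumes 16-bit unsigned arithmetic (suggested by the 0xFFFB mask, but an assumption nonetheless); for opcodes in 0..460 a 32-bit emulation would select the same values. The helper names are hypothetical.

```python
def _u16(x: int) -> int:
    """Emulate 16-bit unsigned wraparound (assumed operand width)."""
    return x & 0xFFFF


def is_matprep_subr(op: int) -> bool:
    return _u16(op - 151) < 2  # 0x097, 0x098


def is_matprep_mubr(op: int) -> bool:
    return _u16(op - 153) < 2  # 0x099, 0x09A


def is_matmul_mubr(op: int) -> bool:
    # Clearing bit 2 of (op - 156) accepts offsets 0 and 4 only.
    return (_u16(op - 156) & 0xFFFB) == 0  # 0x09C, 0x0A0


def members(pred) -> set:
    """Collect every in-range LloOpcode value a predicate accepts."""
    return {op for op in range(0x1CD) if pred(op)}
```

Without the wraparound, `op - 151 < 2` would wrongly accept every opcode below 151, which is the usual pitfall when transcribing these tests from a decompiler listing.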

Per-gen dtype set: PF/VF/GL carry 8 formats {F32, If8, Bf16, Bf8} × float + {U8, S8, U4, S4} × int; gfc (TPU7x) is float-only 4-format {F32, E4m3, Bf16, E5m2}. MXUs per TensorCore: 1 (JF) / 4 (PF, VF) / 2 (GL, gfc). MSR banks: 1 (JF/PF) / 2 (VF/GL/gfc).

GOTCHA — the matmul push bit is data-format-dependent, not static. LloInstructionPushesToResultFifo tests the matmul band (0x8D..0xA5) via a bitmask and routes to a matmul_data_format vtable call; reading the static Push bit from opcode_info[op] for a matmul opcode gives the wrong FIFO behavior. The push depends on the data format decided at the instruction instance.


Memory Load / Store / IAR / RNG

Per-lane index-address-register (IAR) setup, the per-lane PRNG, and the load/store slots. The tier (VMEM/SMEM/CMEM/SPMEM) is selected by which slot the op occupies, not a tier bit. CMEM-load is a dedicated PF-only slot. Bit layout on Memory-Load, Memory-Store, CMEM-Load (PF).

| LloOpcode | Mnemonic | Slot | Semantics | Per-gen |
|---|---|---|---|---|
| 0x001 | kVectorReadIar | load slot | read index-address register into a vreg | all |
| 0x002..0x004 | kVectorSetIarLane / …Raw / …Sublane | store slot | set IAR (lane/raw/sublane; LLO≠slot opcode) | all |
| 0x030..0x035 | kVectorLoadSublaneShuffle .. kVectorCmemLoadAndPop | load slot | vector + CMEM loads | all (CMEM = PF) |
| 0x03C..0x03E | kVectorPrng / kVectorSetRngSeed / kVectorGetRngSeed | VALU/store | per-lane PRNG (seeds stochastic converts) | all |
| 0x03F..0x046 | kVectorStore .. kVectorStoreEvenOddSublanes | store slot | vector + CMEM stores (indexed/masked/shuffle/even-odd) | all |
| 0x047 | kVectorNop | VALU | vector no-op (shares GhPerf 0x1B4 with kVectorMaskMove) | all |
| 0x077/0x078 | kScalarLoad / kScalarStore | scalar lane | SMEM scalar access | all |

The store window classified by LloOpcodeIsVectorStore (0x14024920) is the internal (LloOpcode) set {63,64,65,68,69,70} ∪ {460}, verified byte-exact ((op-63)<=7 && _bittest(0xE7, op-63) || op==460). The load window classified by LloOpcodeIsVectorLoad (0x14024900) is {48..51} ∪ {457,458}, verified byte-exact. IarsPerTensorCore = 2 on every gen.
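
The two window tests can be transcribed directly. This is a minimal sketch with hypothetical function names; `_bittest` here is a local stand-in for the compiler intrinsic of the same name, and the explicit lower bounds replace the unsigned comparisons of the original expressions.

```python
def _bittest(mask: int, bit: int) -> bool:
    """Local stand-in for the _bittest intrinsic: is `bit` set in `mask`?"""
    return bool((mask >> bit) & 1)


def is_vector_store(op: int) -> bool:
    d = op - 63
    # 0xE7 = 0b11100111 selects offsets 0, 1, 2, 5, 6, 7 from opcode 63.
    return (0 <= d <= 7 and _bittest(0xE7, d)) or op == 460


def is_vector_load(op: int) -> bool:
    return 0 <= op - 48 < 4 or 0 <= op - 457 < 2


STORE_WINDOW = {op for op in range(0x1CD) if is_vector_store(op)}
LOAD_WINDOW = {op for op in range(0x1CD) if is_vector_load(op)}
```

The gaps at 66 and 67 in the store window come from the two cleared bits in 0xE7, so a plain range test on 63..70 would over-select.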

The Ghostlite perf classifier (GetGhostliteInstruction @ 0x1c8b1740) keys the indexed-memory ops on an IAR-present sentinel; the seven IAR-class opcodes are byte-exact (used as the IAR op-id ground truth):

| LloOpcode | Mnemonic | Role |
|---|---|---|
| 0x001 | kVectorReadIar | drain IAR into a vreg |
| 0x002 | kVectorSetIarLane | set IAR (lane mode) |
| 0x003 | kVectorSetIarRaw | set IAR (raw mode) |
| 0x004 | kVectorSetIarSublane | set IAR (sublane mode) |
| 0x032 | kVectorLoadIndexed | gather VMEM[base + IAR] |
| 0x040 | kVectorStoreIndexed | scatter to VMEM[base + IAR] |
| 0x044 | kVectorStoreIndexedMasked | masked scatter |

QUIRK — the LLO IAR opcode number and the bundle-slot IAR opcode do not line up. LLO kVectorSetIarRaw (0x03) maps to ISA slot-opcode 4, and LLO kVectorSetIarSublane (0x04) maps to slot-opcode 3 — the Sublane/Raw encodings cross. A reimplementation that assumes LLO-opcode == slot-opcode swaps them. See Matprep/IAR/Latch §IAR.
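
One way to keep the crossing from slipping into an encoder is to route every IAR opcode through an explicit table. The sketch below is illustrative: the names are hypothetical, and it records only the two mappings stated above, leaving the slot opcodes of the other IAR operations out rather than guessing them.

```python
K_VECTOR_SET_IAR_RAW = 0x03
K_VECTOR_SET_IAR_SUBLANE = 0x04

# Only the two crossed entries given in the text; nothing else is assumed.
LLO_TO_IAR_SLOT_OPCODE = {
    K_VECTOR_SET_IAR_RAW: 4,
    K_VECTOR_SET_IAR_SUBLANE: 3,
}


def iar_slot_opcode(llo_op: int) -> int:
    """Translate an LLO IAR opcode to its bundle-slot opcode, or fail loudly."""
    if llo_op not in LLO_TO_IAR_SLOT_OPCODE:
        raise KeyError(f"no slot-opcode mapping recorded for LloOpcode {llo_op:#x}")
    return LLO_TO_IAR_SLOT_OPCODE[llo_op]
```

Failing on unknown inputs, instead of falling back to the identity mapping, is the design choice that makes the swap visible during bring-up.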


DMA (0x0B3..0x0DA)

The 40 contiguous DMA opcodes — the largest single contiguous block — enumerate direction (HBM/VMEM/SMEM/CMEM/HIB/IMEM/Host + IOVA host) × source/destination-register variants + a WithHibUpdate family + two terminators. DMA is issued from the sequencer/scalar lane via descriptors; none set a result-FIFO Push/Pop bit (opcode_info low byte 0x00).

| LloOpcode | Mnemonic | Semantics |
|---|---|---|
| 0x0B3/0x0B4 | kDmaGeneral / kDma | generic DMA forms |
| 0x0B5..0x0CC | kDmaHbmToVmem .. kDmaSmemToVmem | direction matrix (HBM/VMEM/SMEM/CMEM/HIB/IMEM) |
| 0x0CD..0x0D0 | kDmaHbmToVmemWithHibUpdate .. …VdstWithHibUpdate | HBM→VMEM with HIB update |
| 0x0D1..0x0D8 | kDmaHbmToHost .. kDmaSmemToHostIova | host DMA (direct + IOVA) |
| 0x0D9/0x0DA | kDmaDone / kDmaDoneWait | DMA completion / wait |

NOTE — the cost model prices DMA by direction-class. The 14 HBM/Host-DMA opcodes collapse to GhPerf row 0x40 and the 14 VMEM/SMEM-DMA opcodes to row 0x43; the 40 distinct LloOpcode values all exist but the latency model treats each direction family as one.


Predicate / Mask Slot

Scalar predicate (1-bit Preg, P0..P14/15) and per-lane vector mask (Vmreg, M0..M31) ops — two physically distinct register files. Compare ops produce a mask; select consumes one. There is no native predicate-AND (synthesized via De Morgan). Bit layout on Predicate Slot; the M-register rectangle on vcreate_mask.

| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x0E1 | kPredicateConstant | predicate | remat-able predicate constant (opcode_info=0x00D0) |
| 0x0E2/0x0E3 | kVectorMaskConstant / …Packed | mask | mask constants |
| 0x0E5..0x0E8 | kPredicateNegate .. kPredicateOr | predicate | predicate logic (share GhPerf row 0x032) |
| 0x0E6 | kPredicateMove | predicate | predicate copy; Move-exclusion |
| 0x167..0x16A | kVectorCompare .. kVectorAddCarryU16 | mask | mask-producing compares + add-carry |
| 0x193..0x199 | kVectorMaskXor .. kVectorMaskMove | mask | mask logic / pack-compressed / negate / move |
| 0x195 | kVectorMaskAnd | mask | mask AND (synthesized-rectangle combine) |
| 0x198 | kVectorMaskNegate | mask | mask complement |
| 0x199 | kVectorMaskMove | mask | mask copy; Move-exclusion |

The four Move-exclusion opcodes — kScalarMove (0x17B), kVectorMove (0x181), kVectorMaskMove (0x199), kPredicateMove (0xE6) — are skipped by the CSE/remat/fold passes. Native vcreate_mask exists only on VF/GL (synthesized via iota-compare on JF/DF/PF).
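
A pass that honors the exclusion could gate on a small fixed set before consulting the property bits. This is a sketch only: `may_optimize` is a hypothetical helper, and how the real CSE/remat/fold passes combine the exclusion with the opcode_info bits is an assumption made for illustration.

```python
# The four Move-exclusion opcodes named in the text.
MOVE_EXCLUSION = frozenset({
    0x17B,  # kScalarMove
    0x181,  # kVectorMove
    0x199,  # kVectorMaskMove
    0x0E6,  # kPredicateMove
})


def may_optimize(op: int, property_bit_set: bool) -> bool:
    """True if a CSE/remat/fold pass may touch `op`.

    Assumption: the pass needs its property bit set AND the opcode must not
    be one of the four moves; the real passes may order these checks differently.
    """
    return property_bit_set and op not in MOVE_EXCLUSION
```

Keeping the exclusion as data rather than scattering opcode comparisons through each pass makes it easy to audit against the four values listed here.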


Constants, Pseudo & Call

Constants, Phi/pseudo SSA nodes, and the call/tuple group. Constants set the remat bit (opcode_info 0x50); Phi nodes short-circuit to cost 0. Address materialization is remat-able. Literals route through immediate slots — see Immediate Slot.

| LloOpcode | Mnemonic | Semantics |
|---|---|---|
| 0x02C..0x02E | kAllocationAddress / kParameterAddress / kIntToPtr | address materialization (remat-able) |
| 0x0DB..0x0E0 | kScalarConstantU32 .. kVectorConstantF32 | scalar / vector constants (U32/PackedBf16/F32) |
| 0x0E4 | kVectorConstantU64 | 64-bit vector constant |
| 0x0E9..0x0EC | kPredicatePhi / kScalarPhi / kVectorPhi / kVectorMaskPhi | SSA phi nodes (cost 0) |
| 0x0ED..0x0F0 | kTuple / kInlinedCall / kCall / kInlinedCallOperand | call / tuple structure |
| 0x0F1..0x0F4 | kPredicatePseudo .. kVectorMaskPseudo | pseudo placeholders per register class |
| 0x17C | kRelocatableConstant | link-time relocated constant |

BarnaCore (SparseCore) Block (0x1AC..0x1CC)

The 33 highest opcodes are the BarnaCore (SparseCore) instruction set: embedding scatter/gather, sparse reduce, remote scalar writes, and the BarnaCore-local vector load/store/move. These are the opcodes the LLVM-MC emitter actually encodes (its populated InstBits_BarnaCorePxcHwMode table covers this range), unlike the TensorCore opcodes, for which it returns all-zero. The SparseCore scalar/vector opcode enums (the per-engine SCS/TAC/TEC scalar-ALU and vector-ALU rosters) are separate engine-family enums with their own deep pages; see SPU / Scalar Slot §SparseCore and VPU Slot.

| LloOpcode | Mnemonic | Semantics |
|---|---|---|
| 0x1AC..0x1B3 | kBarnaCoreScalarWaitDone .. kBarnaCoreScalarWaitNe | scalar wait / sync primitives |
| 0x1B4..0x1B7 | kBarnaCoreScalarSyncDoneRead .. …SyncPublicAccessWrite | sync-flag read / write |
| 0x1B8..0x1BB | kBarnaCoreScalarPop .. kBarnaCoreDma | pop / FSM issue / fence (kBarnaCoreScalarFence 0x1BA) / DMA |
| 0x1BC..0x1C8 | kBarnaCoreRemoteScalarWrite .. kBarnaCorePfLocalScatterGradients | remote write / scatter-gather / sparse-reduce |
| 0x1C9..0x1CC | kBarnaCoreVectorLoad .. kBarnaCoreVectorStore | BarnaCore vector load / store / move |

QUIRK — the BarnaCore vector load/store opcodes (457..460) test as vector; the scalar/sync ones (428..456) test as non-vector. This is the only place the BarnaCore block straddles the vector/scalar partition, and it matters for register-class selection. The TpuSequencerType enum that selects the SparseCore engine codec is kTensorCoreSequencer(0)..kSparseCoreTileExecuteSequencer(5); TpuSequencerTypeFromProto (0x20b36300) maps proto 1..6 → internal 0..5 — verified byte-exact. BarnaCore is a pre-v5 construct (JF address-handler, PF sequencer); from v5 onward SparseCore (SCS/TAC/TEC) replaces it, and 6acc60406 (v7x) drops the TAC engine.
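
Both facts in this quirk reduce to small range checks. The sketch below uses hypothetical function names and makes two labeled assumptions: the proto→internal switch is taken to be order-preserving (proto n → internal n-1), which the text implies but does not spell out arm by arm, and out-of-range input raises `ValueError` because the binary's behavior there is not described.

```python
K_TENSOR_CORE_SEQUENCER = 0
K_SPARSE_CORE_TILE_EXECUTE_SEQUENCER = 5


def tpu_sequencer_type_from_proto(proto_value: int) -> int:
    """Map proto 1..6 to internal 0..5 (assumed order-preserving)."""
    if 1 <= proto_value <= 6:
        return proto_value - 1
    # Assumption: the real function's out-of-range behavior is unknown.
    raise ValueError(f"unknown sequencer-type proto value: {proto_value}")


def barnacore_tests_as_vector(op: int) -> bool:
    """Within the BarnaCore block, only 457..460 land on the vector side."""
    if not 428 <= op <= 460:
        raise ValueError("not a BarnaCore opcode")
    return op >= 457
```

A register-class selector for the block can key on the second predicate instead of assuming the whole BarnaCore range is scalar.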


Per-Generation Availability Summary

LloOpcode numbering is gen-invariant; the valid subset grows per silicon. The codename → namespace map: jxc = Jellyfish (jellyfish::isa) + Dragonfish (alias); pxc = Pufferfish; vxc/vfc = Viperfish; glc = Ghostlite; gfc = 6acc60406.

| Generation | Codename | LloOpcode additions over prior gen |
|---|---|---|
| TPU v2 / v3 | Jellyfish / Dragonfish | base set (proto-direct encoding, no Compact encoder) |
| TPU v4 | Pufferfish | F8 converts (0x061..0x063), S4/U4 int↔Bf16 (0x067/0x069/0x06B/0x06D), kCmemFence (0x01E), CMEM DMA/load/store |
| TPU v5p | Viperfish | stochastic-rounding converts (0x070..0x074); SparseCore SCS/TAC/TEC engines; native vcreate_mask |
| TPU v6e | Ghostlite | vector_misc slot ops; consolidated VectorSelect; 2 vector-store write ports |
| TPU7x | 6acc60406 | dual matrix staging (MSRA/MSRB); float-only FP8 matmul remap; dual-predicate slot; drops TAC engine, yield machinery |
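
Because the numbering is gen-invariant, per-generation validity can be modeled as a first-generation lookup over one shared opcode space. The sketch below is deliberately partial: it encodes only the additions for which this table gives explicit opcode values, treats every unlisted opcode as part of the base set (a simplifying assumption that ignores removals and PF-only ops such as CMEM-load), and uses hypothetical names throughout.

```python
GENERATIONS = ("jxc", "pxc", "vfc", "glc", "gfc")  # oldest to newest lineage

# First lineage in which an opcode is valid; only values listed in the table.
FIRST_GENERATION = {0x01E: "pxc"}  # kCmemFence
FIRST_GENERATION.update({op: "pxc" for op in range(0x061, 0x064)})  # F8 converts
FIRST_GENERATION.update({op: "pxc" for op in (0x067, 0x069, 0x06B, 0x06D)})  # S4/U4
FIRST_GENERATION.update({op: "vfc" for op in range(0x070, 0x075)})  # stochastic


def is_available(op: int, gen: str) -> bool:
    """True if `op` is valid on `gen` under the assumptions stated above."""
    first = FIRST_GENERATION.get(op, "jxc")  # assumption: unlisted = base set
    return GENERATIONS.index(gen) >= GENERATIONS.index(first)
```

For instance, `is_available(0x061, "jxc")` is false while `is_available(0x061, "pxc")` is true, with the opcode value itself unchanged between the two.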

NOTE — append-and-insert, not append-only. New opcodes are inserted into the in-memory LloOpcode at their family's natural position (keeping families contiguous) but appended to the end of the LloOpcodeProto wire enum (preserving wire compatibility). This is why proto 498/499 map to low in-memory opcodes (0x084/0x197) — verified byte-exact. Port the actual LloOpcodeToProto table (0x344cb4c), not a formula. See LloOpcode↔Proto.
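
The three verified tail entries are enough to show why a formula cannot replace the table. The sketch below uses hypothetical names; `naive_proto_to_llo` is the tempting "1-based wire enum, so subtract one" guess, and `build_reverse` shows one way to derive the reverse map from a forward table once that table has been ported (the tiny table in the usage note is made up purely to exercise the function).

```python
# Verified reverse-map entries quoted on this page: proto -> LloOpcode.
KNOWN_PROTO_TO_LLO = {497: 460, 498: 132, 499: 407}


def naive_proto_to_llo(proto: int) -> int:
    """The formula a port should NOT use: assumes proto = LloOpcode + 1."""
    return proto - 1


def build_reverse(forward):
    """Invert a forward LloOpcode->proto table; gaps simply never appear."""
    reverse = {}
    for llo, proto in enumerate(forward):
        if proto in reverse:
            raise ValueError(f"proto value {proto} mapped twice")
        reverse[proto] = llo
    return reverse


# Every known tail entry disagrees with the naive formula.
MISMATCHES = {
    proto: (naive_proto_to_llo(proto), llo)
    for proto, llo in KNOWN_PROTO_TO_LLO.items()
    if naive_proto_to_llo(proto) != llo
}
```

With a made-up three-entry forward table, `build_reverse([5, 1, 3])` yields `{5: 0, 1: 1, 3: 2}`; proto values absent from the forward table stay absent from the result, which is how the reserved gaps fall out naturally.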


Verification Provenance

A representative sample of this table's value↔mnemonic↔slot bindings is grounded directly in the binary (classifiers, builders, converters). Probes (all CONFIRMED byte-exact):

| Probe | Binary fact | Confirms |
|---|---|---|
| LloOpcodeName @ 0x1d631280 | bound >= 0x1CD → ud1; indexes opcode_name @ 0x21ccfef0 | 461 dense enumerators |
| opcode_info / opcode_info_big symbols | 0x223a1320 / 0x227b5570 resolved by name | metadata table addresses |
| LloOpcodeIsVectorStore @ 0x14024920 | (op-63)<=7 && _bittest(0xE7,op-63) \|\| op==460 | store window {63,64,65,68,69,70,460} |
| LloOpcodeIsVectorLoad @ 0x14024900 | (op-48)<4 \|\| (op-457)<2 | load window {48..51,457,458} |
| LloOpcodeIsVectorMatprepSubr/Mubr/MatmulMubr | (op-151)<2 / (op-153)<2 / ((op-156)&0xFFFB)==0 | matprep 0x97/0x98, 0x99/0x9a; matmul-mubr 0x9c/0xa0 |
| EmitVectorMatres | a3 != 338 guard | kVectorMatres = 0x152 (338) |
| CreateVectorLatchLsf @ 0x1d4d7aa0 | New(141, …) | kVectorLatchLsf = 0x8D (141) |
| LloOpcodeIsPseudoEupInstruction @ 0x1d60c880 | (op-315)<0x13 & (0x7FDFF >> (op-59)) | fused-EUP band 0x13B..0x14D minus 0x144 |
| CreateVectorEupResult @ 0x1d4d9820 | asserts push opcode (op-296) < 0x13 | EUP push range [0x128,0x13A] → pop 0x14E |
| EncodeTensorCoreVectorAlu3F32Tanh (gfc) | op=0 @194/w8, sel=19 @183/w5, src @188/w6 | F32 Tanh selector 0x13 |
| EncodingToScalarRegister @ 0x1e871e40 | bound > 0x1F → error | 32-deep SREG file (5-bit) |
| TpuSequencerTypeFromProto @ 0x20b36300 | proto 1..6 → internal 0..5 switch | 6-value sequencer-type enum |
| LloOpcodeToProto @ 0x14420020 | GOT-relative table @ 0x344cb4c | forward map is a flat lookup |
| ProtoToLloOpcode @ 0x14420040 | 499-arm switch; proto 497→460, 498→132, 499→407; 38 __debugbreak arms | reverse map + 38 reserved gaps |

Cross-References