LLO Opcode Master Table
Every opcode value, mnemonic, slot name, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID 89edbbe81c5b328a958fe628a9f2207d, not stripped). Other versions differ.
Abstract
This appendix is the single consolidated reference table for the LLO opcode space — the one table a reimplementer keeps open while building a TPU back end. LLO (Low-Level Optimizer IR) is the TPU-specific late compiler IR below MHLO/TLP and above the per-generation TensorCore bundle encoders; its in-memory opcode enum xla::jellyfish::LloOpcode is a dense, zero-based, 461-value numbering (0x000..0x1CC), and every value is named in the relocated opcode_name table at 0x21ccfef0. The authoritative per-family facts live on the isa/ deep pages; this page aggregates them into one master opcode→mnemonic→slot→semantics→per-gen table, grouped by the bundle slot (functional unit) each opcode is encoded into.
The table is organized by slot family, not by numeric run, because the value space interleaves: LloOpcodeIsVector is a dense switch, not a >= threshold test, and (for example) kVectorReadIar (1) is vector while kLog (5) is not. Each slot section below enumerates the opcodes that the per-slot encoder serializes into that slot, with the numeric LloOpcode value, a one-line semantics, and per-generation availability where it varies (the five codec lineages: jxc Jellyfish/Dragonfish v2/v3, pxc Pufferfish v4, vfc/vxc Viperfish v5p, glc Ghostlite v6e, gfc 6acc60406 TPU7x). The bit-level field encodings — where each opcode's operands land in the 41/51/64-byte bundle — are not repeated here; they live on the per-slot deep pages, which every section cross-links.
Three index spaces must not be conflated with LloOpcode, and a reimplementer who confuses them mis-decodes every program: the LloOpcodeProto wire enum (1-based, max 499, 38 reserved gaps, see LloOpcode↔Proto); the MC MCInst opcode space the LLVM-MC emitter dispatches on (LloOpcode + 499, gated <= 0x1F2); and the GhPerf::Instruction cost-grid enum. LloOpcode is the one all the others map from. A representative sample of the values in this table was re-verified against the binary classifiers and converters (see Verification); where a deep page and the binary agreed they were taken as CONFIRMED, and no disagreements were found.
| Enum | xla::jellyfish::LloOpcode — 461 enumerators, dense 0x000..0x1CC (0..460) |
| Name accessor | LloOpcodeName(LloOpcode) @ 0x1d631280 — bound >= 0x1CD → ud1 |
| Name table | opcode_name @ 0x21ccfef0 (461 × char*, R_X86_64_RELATIVE) |
| Property word | opcode_info @ 0x223a1320 (461 × uint16: Push/Pop/Remat/Fold/Cse + reg-file class) |
| Descriptor | opcode_info_big @ 0x227b5570 (461 × 28 B: ResultFifo + ArchRegister lists) |
| Family classifier | LloOpcodeIsVector @ 0x1d60c1c0; LloOpcodeIsScalar = !IsVector @ 0x1d60c7e0 |
| Wire converter | LloOpcodeToProto @ 0x14420020 (table @ 0x344cb4c); ProtoToLloOpcode @ 0x14420040 |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |
For reimplementation, the contract this table supports is:
- The opcode numbering is gen-invariant — the same LloOpcode value means the same opcode on every generation; what changes per gen is the valid subset (sparsity is v5+, cmem-load is PF-only) and the bundle encoding, not the number.
- The slot a value occupies (which bundle field the per-gen encoder serializes it into) is recovered from the per-slot classifier switches and encoder dispatch, listed per section.
- The per-gen availability column flags opcodes that exist only from a given generation (F8 converts PF+, stochastic-round VF+, S4/U4 matmul PF+, BarnaCore vs SparseCore split).
- The bit encodings are deferred to the per-slot deep pages; this table is value↔mnemonic↔slot↔semantics only.
Slot Families at a Glance
Eleven functional-unit families, together with a 33-opcode BarnaCore (SparseCore) block, cover the 461 opcodes. The count column is the number of distinct LloOpcode values the family owns; the owning page is the deep page that documents that slot's bit encoding. Values fall in mostly contiguous bands; the exact members are in the per-family sections below.
| Slot family | Value band(s) | Opcodes | Owning deep page |
|---|---|---|---|
| Sequencer / sync / FIFO transfer | 0x000..0x030 (interleaved) | ~50 | Sequencer Slot · Seq-Ops-Per-Gen |
| SPU / scalar ALU & control | 0x085..0x089, 0x16B..0x1AA | 63 | SPU / Scalar Slot |
| VPU / vector ALU & logic | 0x048..0x05A, 0x11B..0x1A2 | (subset of 295 vector) | VPU Slot |
| Convert / pack / unpack | 0x05B..0x076, 0x107..0x11A, 0x125..0x127 | ~45 | VPU Slot · EUP |
| EUP transcendental | 0x128..0x14D | 38 | EUP / Transcendental Slot |
| Cross-lane / reduce / XLU result | 0x036..0x03B, 0x0F5..0x101, 0x14E..0x155 | ~28 | ResultFifo / ArchRegister |
| MXU matmul / latch / matprep / matres | 0x08D..0x0AB, 0x152..0x153 | ~33 | MXU Slot · Matprep/IAR/Latch |
| Memory load / store / IAR / RNG | 0x001..0x004, 0x030..0x046, 0x077..0x078 | ~30 | Memory-Load · Memory-Store · CMEM-Load |
| DMA | 0x0B3..0x0DA (contiguous) | 40 | Sequencer Slot (DMA-issue) |
| Predicate / mask | 0x0E1..0x0F1, 0x167..0x16A, 0x193..0x199 | ~18 | Predicate Slot · vcreate_mask |
| Constants / pseudo / call | 0x02C..0x02E, 0x0DB..0x0F4, 0x17C | ~25 | Immediate Slot |
| BarnaCore (SparseCore) | 0x1AC..0x1CC (contiguous) | 33 | LLO Opcode Enum |
| Sparsity (v5+) structured-sparse | — | (open) | Sparsity Slot (v5+) |
NOTE — the sparsity slot is not yet recovered. The structured-sparsity slot (Sparsity Slot (v5+)) is a v5+/Viperfish-and-later feature whose backing analysis is still open; no distinct LloOpcode values for it have been pinned. It is listed for completeness but contributes no rows to the tables below. A reimplementer targeting only v3/v4 can ignore it entirely.
QUIRK — 461, not 462. The "462" figure that floats around LLO documentation is the LloOpcodeProto wire enum (1-based, value-0 sentinel, max 499, 38 gaps). The in-memory LloOpcode this table indexes is exactly 461 dense values (0x000..0x1CC); LloOpcodeName traps on >= 0x1CD. Drive a switch off 462 and the last index reads past the metadata tables. See LloOpcode↔Proto.
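The bound described in this quirk can be captured as a small guard. This is a minimal sketch, not the library's code: `check_llo_opcode` and `LLO_OPCODE_COUNT` are hypothetical names, and only the 461-value limit (trap at `>= 0x1CD`) comes from this page.

```python
# Hypothetical guard mirroring the LloOpcodeName bound (traps on >= 0x1CD).
LLO_OPCODE_COUNT = 0x1CD  # 461 dense values, 0x000..0x1CC


def check_llo_opcode(op: int) -> int:
    """Return op if it is a valid in-memory LloOpcode index, else raise."""
    if not 0 <= op < LLO_OPCODE_COUNT:
        raise ValueError(f"LloOpcode {op:#x} out of range (valid: 0x000..0x1CC)")
    return op
```

Sizing metadata arrays and switch bounds from one such constant avoids the off-by-one that a 462-entry assumption would introduce.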
Sequencer, Sync & FIFO-Transfer Slot (0x000..0x030)
The control-and-handshake layer: program boundaries, scheduling barriers, fences, cycle-counter reads, halt/yield, and the scalar/vector/sync-flag result-FIFO push/pop primitives that move values between the SPU, VPU, and the cross-FIFO staging registers. PC mutation (branch/call/halt) is lane-0-only; sync ops live in a separate ScalarMisc lane on v5+. The per-(gen × sequencer-type) control-flow roster is on Seq-Ops-Per-Gen.
| LloOpcode | Mnemonic | Slot / lane | Semantics | Per-gen |
|---|---|---|---|---|
| 0x000 | kEvent | sequencer | program / trace event marker | all |
| 0x005 | kLog | sequencer | debug-log marker (non-vector) | all |
| 0x006/0x007 | kHloStart / kHloEnd | sequencer | source-HLO span markers (debug-info) | all |
| 0x008 | kSchedulingBarrier | sequencer | hard reorder barrier (vetoes CSE across it) | all |
| 0x009 | kScalarToVector | V↔S bridge | scalar→vector broadcast push | all |
| 0x00A | kVectorToScalarPush | V→S FIFO | vector→scalar FIFO push (Push bit) | all (v6e GhPerf 0x1DA) |
| 0x00B | kSyncFlagToScalarPush | sync→S FIFO | sync-flag→scalar FIFO push | all |
| 0x00A..0x012 | kVectorToScalarPush … kDrfPop | V↔S / sync FIFO | V→S / sync-flag→scalar push/pop quartet | all |
| 0x011/0x012 | kSfrfPop / kDrfPop | FIFO pop | sync-flag-result / divrem-result FIFO pop (Pop bit) | all |
| 0x013/0x014 | kScalarFence / kVectorStoreFence | sequencer / misc | ordering fences (store-fence → misc slot on gfc) | all |
| 0x015..0x01C | kScalarCcfPush … kVectorCcfPopAsymmetrical | CCF FIFO | cross-core FIFO push/pop (symmetric + asymmetric) | all |
| 0x01D | kMegacoreSwapCoresPseudo | sequencer | megacore core-swap pseudo | all |
| 0x01E | kCmemFence | sequencer | CMEM ordering fence | PF+ |
| 0x01F..0x024 | kScalarReadCycleStart … kScalarReadCycleLow | sequencer | cycle-counter reads / 64-bit lo/hi splits | all |
| 0x025..0x027 | kScalarHalt / …YieldConditional / …OnError | sequencer (lane 0) | sequencer halt variants (GhPerf row 0x000) | all |
| 0x029 | kProgramLaunchSc | sequencer | SparseCore program launch | all |
| 0x02B | kVectorInterrupt | sequencer | vector-side interrupt | all |
NOTE — Push/Pop are property bits, not opcode ranges. opcode_info bit0 (Push) is set on the *Push members, bit1 (Pop) on the *Pop members. A scheduler models the V2S / sync FIFOs by reading these bits, not by a numeric range test. kScalarHalt/kScalarHaltOnError share GhPerf cost row 0x000.
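Reading those property bits can be sketched as follows. The bit positions for Push and Pop and the register-file-class high byte are taken from this page; the `OpcodeInfo` type and `decode_opcode_info` function are hypothetical names, and the remaining bits (Remat/Fold/Cse) are deliberately left undecoded.

```python
from typing import NamedTuple

PUSH_BIT = 1 << 0  # opcode_info bit0
POP_BIT = 1 << 1   # opcode_info bit1


class OpcodeInfo(NamedTuple):
    pushes: bool
    pops: bool
    reg_class: int  # high byte; this page uses 1 = scalar/mask, 2 = vector


def decode_opcode_info(word: int) -> OpcodeInfo:
    """Decode the Push/Pop bits and class byte of one 16-bit opcode_info word."""
    word &= 0xFFFF
    return OpcodeInfo(bool(word & PUSH_BIT), bool(word & POP_BIT), word >> 8)
```

For example, the word 0x0202 that this page quotes for kVectorEupResult decodes to a vector-class pop with no push.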
SPU / Scalar-ALU Slot
The scalar control CPU lane — 32-entry SREG file, address arithmetic, branch/call/select, the dense arithmetic-and-shift tail, and the V↔S FIFO pops. The register-file-class byte (opcode_info high byte) is 1 (scalar/mask). Lane 0 owns the full set incl. branch/call; lane 1 is an ALU+halt mirror. Bit layout on SPU / Scalar Slot.
| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x077/0x078 | kScalarLoad / kScalarStore | scalar lane | SMEM↔SREG memory access (lane-1 bound on JF/PF) |
| 0x085 | kScalarComposeU64 | scalar | pack two u32 into a u64 |
| 0x086 | kScalarAddressCalculation | scalar | address arithmetic (shares GhPerf 0x00C with kScalarAddS32) |
| 0x087/0x088 | kScalarBranchRel / kScalarBranchInd | sequencer (lane 0) | relative / indirect branch |
| 0x089 | kScalarSelect | scalar | scalar conditional select |
| 0x16B/0x16C | kScalarCompare / kScalarAddCarryU32 | scalar | compare + add-with-carry |
| 0x16D..0x172 | kScalarMultiplyWordAddr … kScalarAddS32 | scalar | multiplies + adds (u24/u32/f32/s32) |
| 0x173..0x177 | kScalarSubtractS32 … kScalarBitwiseXor | scalar | subtract + bitwise and/or/xor |
| 0x178..0x17A | kScalarDivRemU32 / …U32AndPop / …RemU32AndPop | scalar | divide/remainder (data-format special-cased) |
| 0x17B | kScalarMove | scalar | scalar copy — Move-exclusion (never CSE'd/rematted) |
| 0x17D..0x17F | kScalarFloorF32 / kScalarCeilF32 / kScalarCountLeadingZeros | scalar | scalar rounding + CLZ |
| 0x1A3..0x1A6 | kScalarShrl … kScalarShllOnes | scalar | logical / arithmetic shifts |
| 0x1A7..0x1AA | kScalarMinimumF32 … kScalarMaximumU32 | scalar | min / max (f32/u32) |
The SPU is the complement of the vector set (LloOpcodeIsScalar = !LloOpcodeIsVector), 63 opcodes total; EncodingToScalarRegister (0x1e871e40) bounds the SREG selector at > 0x1F (32 registers, 5-bit field) — verified byte-exact. The per-gen jump-table size is the per-gen scalar ISA size (JF 0..0x3E, PF ≤0x33, VF ≤0x4C, GL/GF ≤0x49).
VPU / Vector-ALU Slot
The per-lane vector ALU over the 8 sublane × 128 lane = 1024-element vreg. Register-file-class byte 2 (vector); the foldable/CSE bits (opcode_info bit5/bit6) are set on the pure-functional members. LloOpcodeIsVectorUnop (0x1d60c200) / LloOpcodeIsVectorBinop (0x1d60c680) split unary vs binary. Bit layout (6/7/8-bit opcode, 5/4/2-bit predicate per gen) on VPU Slot.
| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x048..0x050 | kVectorClampGezF32 … kVectorRemapBf16 | VALU | clamp / remap (gez, symmetric, asymmetric; F32/Bf16/S4) |
| 0x051..0x05A | kVectorMultiplyAccumulate … kVectorMoveEvenAccLow | VALU | MAC + accumulator moves (FOLD bit on MAC family) |
| 0x11B..0x124 | kVectorAddS32 … kVectorSubtractS16 | VALU | add / subtract (s32/f32/bf16/s16, Bf16-hi/lo) |
| 0x156..0x15B | kVectorPowF32 … kVectorMultiplyBf16 | VALU | pow + multiply (F32/U32/U16/Bf16) |
| 0x15C..0x15F | kVectorAndU32 … kVectorXorU32 | VALU | bitwise and / and-negated / or / xor |
| 0x162..0x166 | kVectorMultiplyComposeU64 … kVectorExtractHigh32 | VALU | 64-bit multiply compose + word extracts |
| 0x180..0x184 | kVectorCountLeadingZeros … kVectorExtractSignificand | VALU | CLZ / move / popcount / FP field extracts |
| 0x181 | kVectorMove | VALU | vector copy — Move-exclusion (never CSE'd/rematted) |
| 0x19A..0x19C | kVectorShiftRightLogical … kVectorShiftLeftLogical | VALU | vector shifts |
| 0x19D..0x1A2 | kVectorMaximumF32 … kVectorMinimumU32 | VALU | min / max (F32/Bf16/U32) |
The VALU is the largest family; only the representative bands are shown. vmul.u32.u64 is a PF+ slot-pair wide multiply. The VectorAluYEncoding (0..31) source-B selector — vreg / VS0-2 ports / hardwired float constants / immediate slots — is documented on VPU Slot §Y-operand.
Convert / Pack / Unpack
Type conversions across three bands: f32→{s32,f8,hf16} and {s8,s4,u8,u4}↔bf16, the unpack/round/truncate block, and pack/compose. These gained the most across generations.
| LloOpcode | Mnemonic | Slot | Semantics | Per-gen |
|---|---|---|---|---|
| 0x05B..0x060 | kScalarConvertF32ToS32WithProbRounding … …TowardsZeroPseudo | VALU/scalar | F32→S32 (prob-round / towards-zero) | all |
| 0x061..0x063 | kVectorConvertF32ToF8E5M2 / …E4M3Fn / …E4M3B11 | VALU | F32→F8 | PF+ |
| 0x064 | kVectorConvertF32ToHf16 | VALU | F32→Hf16 | all |
| 0x066..0x06D | kVectorConvertS8ToBf16 … kVectorConvertBf16ToU4 | VALU | int↔Bf16 (s8/s4/u8/u4) | S4/U4 = PF+ |
| 0x06E/0x06F | kVectorConvertEXMYToE4M3 / …ToE5M2 | VALU | generic FP8 reformat | PF+ |
| 0x070..0x074 | kVectorConvertF32ToE5M2Stochastic … …ToHf16Stochastic | VALU | stochastic-rounding converts | VF+ |
| 0x107..0x10F | kScalarConvertS32ToF32 … kVectorDynamicUnpack | VALU/scalar | S32→F32 convert + unpack (kVectorUnpack 0x109) + B2→B4 / B4→B8 join + EXMY/dynamic | all |
| 0x110..0x11A | kVectorCeilF32 … kVectorTruncateBf16 | VALU | round-to-int / RTNA / RTNE / truncate | all |
| 0x125..0x127 | kVectorComposeF32 / kVectorPack / kVectorPackEXMY | VALU | compose + pack | all |
EUP / Transcendental Slot (0x128..0x14D)
The Extended Unary Pipeline computes nine transcendentals (tanh, pow2, reciprocal, log2, rsqrt, shifted-sigmoid, sinq, cosq, erf) as a deferred-result pipeline: a push (VALU slot 3 only on v5+) and a later pop from the VectorResult slot. Each function appears in four LLO forms — F32 issue-only, Bf16 issue-only, F32 issue+pop (AndPop), Bf16 issue+pop. Selector map and latency on EUP / Transcendental Slot.
| Band | LloOpcode | Members | Form |
|---|---|---|---|
| Bare push F32 | 0x128..0x131 | Tanh/Pow2/Recip/Log2/Rsqrt/SigShft/Sinq/Cosq/Erf + kVectorPushErf | F32 issue-only (Push bit) |
| Bare push Bf16 | 0x132..0x13A | same set | Bf16 issue-only |
| Fused F32 | 0x13B..0x144 | same set + kVectorPushErfAndPop (0x144) | F32 issue+pop (*AndPop) |
| Fused Bf16 | 0x145..0x14D | same set | Bf16 issue+pop |
| Deferred pop | 0x14E | kVectorEupResult | pop EUP FIFO (Pop bit, opcode_info=0x0202) |
Representative function-selector values (the 5-bit field, gfc/glc @ bit 183): F32 tanh=0x13, rsqrt=0x10, pow2=0x11, log2=0x12, sigmoid=0x14, reciprocal=0x15, sin=0x17, cos=0x18, erf=0x0e; Bf16 selectors differ (e.g. tanh=0x1b). The 18 *AndPop pseudo-ops (excluding 0x144) are split by LloLateDecomposer into bare push + deferred pop; LloOpcodeIsPseudoEupInstruction (0x1d60c880) classifies them via the 0x7FDFF mask — verified byte-exact.
NOTE — the EUP family is the cleanest Push/Pop illustration. Every issue-only EUP opcode sets opcode_info bit0 (Push); kVectorEupResult (0x14E) sets bit1 (Pop). CreateVectorEupResult (0x1d4d9820) asserts the push operand is in [0x128, 0x13A] before building the 0x14E pop — verified byte-exact.
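The two EUP range tests quoted in this section (the issue-only push range and the fused pseudo-op mask) can be replayed in a few lines. This is a sketch under the constants stated on this page; the function names are hypothetical, and the explicit lower-bound check stands in for the unsigned comparison in the binary.

```python
EUP_PUSH_FIRST, EUP_PUSH_LAST = 0x128, 0x13A   # issue-only band (F32 + Bf16)
EUP_FUSED_FIRST = 0x13B                        # 315, first *AndPop opcode
EUP_FUSED_MASK = 0x7FDFF                       # 19 bits with bit 9 cleared
K_VECTOR_EUP_RESULT = 0x14E


def is_eup_push(op: int) -> bool:
    """Range asserted before a kVectorEupResult pop is built."""
    return EUP_PUSH_FIRST <= op <= EUP_PUSH_LAST


def is_pseudo_eup(op: int) -> bool:
    """Fused issue+pop pseudo-ops that the late decomposer splits."""
    idx = op - EUP_FUSED_FIRST
    return 0 <= idx < 0x13 and bool((EUP_FUSED_MASK >> idx) & 1)


pseudo_ops = [op for op in range(0x1CD) if is_pseudo_eup(op)]
```

Bit 9 of the mask corresponds to 0x13B + 9 = 0x144, which is why that opcode is the one fused-band member excluded from the pseudo set.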
Cross-Lane, Reduce & XLU-Result
Cross-lane permute/rotate/broadcast, whole/segmented reductions (min/max/add/argmin/argmax in F32 and Bf16), and the deferred-result pops (EUP, cross-lane, permute, CMEM, transpose, matmul). The 21 XLU opcodes are walked by ComputeXluOperations (0x126d9780) and emit {TransposeTile, RpuOperation, XluControlOperation} variants — see ResultFifo / ArchRegister.
| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x036..0x03B | kVectorPermute … kVectorBroadcastLane | XLU (load slot) | cross-lane permute / rotate / combine / broadcast |
| 0x08B/0x08C | kVectorSetPermutePattern / kVectorSetSegmentPattern | XLU control | permute / segment pattern setup |
| 0x0F5..0x0FC | kVectorMinReduceF32 … kVectorAddSegmentReduceF32 | XLU (reduce) | F32 reductions (whole + segmented + index) |
| 0x0FD..0x101 | kVectorMinReduceBf16 … kVectorMinIndexReduceBf16 | XLU (reduce) | Bf16 reductions |
| 0x102..0x106 | kVectorSublaneId … kVectorLaneSequenceInterleavedB16 | VALU | lane/sublane identity sequences (remat-able) |
| 0x14E | kVectorEupResult | VectorResult | pop EUP result FIFO |
| 0x14F..0x151 | kVectorXlaneResult / kVectorPermuteResult / kVectorCmemResult | VectorResult | pop XLU / permute / CMEM result FIFOs |
| 0x152/0x153 | kVectorMatres / kVectorMatresAdd | VectorResult | pop MXU result (plain / accumulate) |
| 0x154/0x155 | kVectorTransposeResult / kVectorTransposeClear | VectorResult | pop transpose result / clear transpose FIFO |
kVectorXlaneResult/kVectorPermuteResult/kVectorTransposeResult share GhPerf cost row 0x1C7. kVectorTransposeClear (0x155) is the single opcode whose opcode_info word is 0x000E.
MXU — Matmul, Latch, Matprep, Result
The systolic-array op family: latch the stationary weights, matprep the moving operand, issue the matmul, drain via matres. Latch/matprep/matmul share the VectorExtended slot; matres uses VectorResult. The matmul band (0x8D..0xA5) is data-format special-cased before the property-word read. Field encoding (per-gen 6→7→8-bit opcode, the −20/−21/−25 inter-MXU twin) on MXU Slot; latch/matprep field maps on Matprep/IAR/Latch.
| LloOpcode | Mnemonic | Slot | Semantics | Builder / classifier |
|---|---|---|---|---|
| 0x08D | kVectorLatchLsf | VectorExtended | latch stationary weights (load-stationary-from-FIFO) | CreateVectorLatchLsf @ 0x1d4d7aa0 → New(141) |
| 0x08E | kVectorLatchLsfMsk | VectorExtended | masked LSF latch | — |
| 0x08F..0x096 | kVectorLatch … kVectorLatch3Msk | VectorExtended | gain-matrix latch (plain / 1..3 sub-bank, masked) | CreateVectorLatchHelper |
| 0x097/0x098 | kVectorMatprepSubr (+Msk) | VectorExtended | push moving operand, sub-row form | (op-151)<2 @ 0x1d60c400 |
| 0x099/0x09A | kVectorMatprepMubr (+Msk) | VectorExtended | push moving operand, block-row form | (op-153)<2 @ 0x1d60c3e0 |
| 0x09B | kVectorMatmul | VectorExtended | one systolic step | EmitVectorMatmul @ 0x140b92c0 |
| 0x09C/0x0A0 | kVectorMatmulMubr (+Msk) | VectorExtended | conv block-row matmul | ((op-156)&0xFFFB)==0 @ 0x1d60c3c0 |
| 0x09D/0x09E | kVectorMatmulHigh / kVectorMatmulLow | VectorExtended | high-half / low-half accumulator step | EmitVectorMatmul |
| 0x09F..0x0A5 | kVectorMatmulMsk … kVectorMatmulLmr | VectorExtended | masked / packed (kVectorMatmulPacked 0x0A3) / LMR-fused matmul | — |
| 0x0A6/0x0A7 | kVectorTranspose / kVectorTransposeBinary | VectorExtended | XLU transpose (vxpose-mode dispatched) | — |
| 0x0A8..0x0AB | kVectorDoneWithGains … kVectorLoadLmrWithBf16Conversion | VectorExtended | gain handshake + GMR/LMR loads | — |
| 0x152/0x153 | kVectorMatres / kVectorMatresAdd | VectorResult | matmul result collection (plain / accumulate) | a3 != 338 @ EmitVectorMatres |
Per-gen dtype set: PF/VF/GL carry 8 formats {F32, If8, Bf16, Bf8} × float + {U8, S8, U4, S4} × int; gfc (TPU7x) is float-only 4-format {F32, E4m3, Bf16, E5m2}. MXUs per TensorCore: 1 (JF) / 4 (PF, VF) / 2 (GL, gfc). MSR banks: 1 (JF/PF) / 2 (VF/GL/gfc).
GOTCHA — the matmul push bit is data-format-dependent, not static. LloInstructionPushesToResultFifo tests the matmul band (0x8D..0xA5) via a bitmask and routes to a matmul_data_format vtable call; reading the static Push bit from opcode_info[op] for a matmul opcode gives the wrong FIFO behavior. The push depends on the data format decided at the instruction instance.
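A scheduler model that respects this gotcha might look like the sketch below. It is illustrative only: `format_pushes` is a hypothetical callable standing in for the matmul_data_format vtable call (whose real signature is not documented here), the band is tested with a plain range rather than the binary's bitmask, and any opcode_info words passed in are supplied by the caller.

```python
from typing import Callable, Mapping

MATMUL_BAND_FIRST, MATMUL_BAND_LAST = 0x8D, 0xA5
PUSH_BIT = 1 << 0  # opcode_info bit0


def pushes_to_result_fifo(
    op: int,
    opcode_info: Mapping[int, int],
    format_pushes: Callable[[int], bool],
) -> bool:
    """Decide whether an instruction pushes to a result FIFO.

    format_pushes is a hypothetical stand-in for the per-instance
    data-format query; it is consulted only for the matmul band.
    """
    if MATMUL_BAND_FIRST <= op <= MATMUL_BAND_LAST:
        # Matmul band: ignore the static bit, ask the data format.
        return bool(format_pushes(op))
    return bool(opcode_info[op] & PUSH_BIT)
```

Keeping the data-format query as an injected callable makes the static-table path and the per-instance path easy to test separately.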
Memory Load / Store / IAR / RNG
Per-lane index-address-register (IAR) setup, the per-lane PRNG, and the load/store slots. The tier (VMEM/SMEM/CMEM/SPMEM) is selected by which slot the op occupies, not a tier bit. CMEM-load is a dedicated PF-only slot. Bit layout on Memory-Load, Memory-Store, CMEM-Load (PF).
| LloOpcode | Mnemonic | Slot | Semantics | Per-gen |
|---|---|---|---|---|
| 0x001 | kVectorReadIar | load slot | read index-address register into a vreg | all |
| 0x002..0x004 | kVectorSetIarLane / …Raw / …Sublane | store slot | set IAR (lane/raw/sublane; LLO≠slot opcode) | all |
| 0x030..0x035 | kVectorLoadSublaneShuffle … kVectorCmemLoadAndPop | load slot | vector + CMEM loads | all (CMEM = PF) |
| 0x03C..0x03E | kVectorPrng / kVectorSetRngSeed / kVectorGetRngSeed | VALU/store | per-lane PRNG (seeds stochastic converts) | all |
| 0x03F..0x046 | kVectorStore … kVectorStoreEvenOddSublanes | store slot | vector + CMEM stores (indexed/masked/shuffle/even-odd) | all |
| 0x047 | kVectorNop | VALU | vector no-op (shares GhPerf 0x1B4 with kVectorMaskMove) | all |
| 0x077/0x078 | kScalarLoad / kScalarStore | scalar lane | SMEM scalar access | all |
The store window classified by LloOpcodeIsVectorStore (0x14024920) is internal {63,64,65,68,69,70} ∪ {460} — verified byte-exact ((op-63)<=7 && _bittest(0xE7, op-63) || op==460). The load window LloOpcodeIsVectorLoad (0x14024900) is {48..51} ∪ {457,458} — verified byte-exact. IarsPerTensorCore = 2 on every gen.
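The two window tests above translate directly into Python. This is a sketch of the quoted expressions, not decompiled code: the function names are hypothetical, and the explicit `0 <=` checks emulate the unsigned comparisons that the C expressions rely on.

```python
def is_vector_store(op: int) -> bool:
    """(op-63)<=7 && bittest(0xE7, op-63) || op==460, with unsigned semantics."""
    idx = op - 63
    return (0 <= idx <= 7 and bool((0xE7 >> idx) & 1)) or op == 460


def is_vector_load(op: int) -> bool:
    """(op-48)<4 || (op-457)<2, with unsigned semantics."""
    return 0 <= op - 48 < 4 or 0 <= op - 457 < 2


store_window = [op for op in range(0x1CD) if is_vector_store(op)]
load_window = [op for op in range(0x1CD) if is_vector_load(op)]
```

The mask 0xE7 is binary 11100111, so offsets 3 and 4 (opcodes 66 and 67) are the two members of the 63..70 band that the store classifier rejects.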
The Ghostlite perf classifier (GetGhostliteInstruction @ 0x1c8b1740) keys the indexed-memory ops on an IAR-present sentinel; the seven IAR-class opcodes are byte-exact (used as the IAR op-id ground truth):
| LloOpcode | Mnemonic | Role |
|---|---|---|
| 0x001 | kVectorReadIar | drain IAR into a vreg |
| 0x002 | kVectorSetIarLane | set IAR (lane mode) |
| 0x003 | kVectorSetIarRaw | set IAR (raw mode) |
| 0x004 | kVectorSetIarSublane | set IAR (sublane mode) |
| 0x032 | kVectorLoadIndexed | gather VMEM[base + IAR] |
| 0x040 | kVectorStoreIndexed | scatter to VMEM[base + IAR] |
| 0x044 | kVectorStoreIndexedMasked | masked scatter |
QUIRK — the LLO IAR opcode number and the bundle-slot IAR opcode do not line up. LLO kVectorSetIarRaw (0x03) maps to ISA slot-opcode 4, and LLO kVectorSetIarSublane (0x04) maps to slot-opcode 3 — the Sublane/Raw encodings cross. A reimplementation that assumes LLO-opcode == slot-opcode swaps them. See Matprep/IAR/Latch §IAR.
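One way to avoid the swap is to route every IAR opcode through an explicit table rather than reusing the LLO number. The sketch below holds only the two crossed entries stated on this page; the table name and helper are hypothetical, and slot opcodes for the other IAR ops are intentionally left out because they are not given here.

```python
# Only the two crossed entries quoted on this page; not a complete map.
LLO_TO_SLOT_IAR = {
    0x03: 4,  # kVectorSetIarRaw     -> slot-opcode 4
    0x04: 3,  # kVectorSetIarSublane -> slot-opcode 3
}


def iar_slot_opcode(llo_op: int) -> int:
    """Look up the bundle-slot opcode; refuse to guess for unlisted ops."""
    if llo_op not in LLO_TO_SLOT_IAR:
        raise KeyError(f"no slot opcode recorded for LloOpcode {llo_op:#x}")
    return LLO_TO_SLOT_IAR[llo_op]
```

Raising on unlisted opcodes, instead of falling back to the identity mapping, keeps the identity assumption from creeping back in.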
DMA (0x0B3..0x0DA)
The 40 contiguous DMA opcodes — the largest single contiguous block — enumerate direction (HBM/VMEM/SMEM/CMEM/HIB/IMEM/Host + IOVA host) × source/destination-register variants + a WithHibUpdate family + two terminators. DMA is issued from the sequencer/scalar lane via descriptors; none set a result-FIFO Push/Pop bit (opcode_info low byte 0x00).
| LloOpcode | Mnemonic | Semantics |
|---|---|---|
| 0x0B3/0x0B4 | kDmaGeneral / kDma | generic DMA forms |
| 0x0B5..0x0CC | kDmaHbmToVmem … kDmaSmemToVmem | direction matrix (HBM/VMEM/SMEM/CMEM/HIB/IMEM) |
| 0x0CD..0x0D0 | kDmaHbmToVmemWithHibUpdate … …VdstWithHibUpdate | HBM→VMEM with HIB update |
| 0x0D1..0x0D8 | kDmaHbmToHost … kDmaSmemToHostIova | host DMA (direct + IOVA) |
| 0x0D9/0x0DA | kDmaDone / kDmaDoneWait | DMA completion / wait |
NOTE — the cost model prices DMA by direction-class. The 14 HBM/Host-DMA opcodes collapse to GhPerf row 0x40 and the 14 VMEM/SMEM-DMA opcodes to row 0x43; the 40 distinct LloOpcode values all exist but the latency model treats each direction family as one.
Predicate / Mask Slot
Scalar predicate (1-bit Preg, P0..P14/15) and per-lane vector mask (Vmreg, M0..M31) ops — two physically distinct register files. Compare ops produce a mask; select consumes one. There is no native predicate-AND (synthesized via De Morgan). Bit layout on Predicate Slot; the M-register rectangle on vcreate_mask.
| LloOpcode | Mnemonic | Slot | Semantics |
|---|---|---|---|
| 0x0E1 | kPredicateConstant | predicate | remat-able predicate constant (opcode_info=0x00D0) |
| 0x0E2/0x0E3 | kVectorMaskConstant / …Packed | mask | mask constants |
| 0x0E5..0x0E8 | kPredicateNegate … kPredicateOr | predicate | predicate logic (share GhPerf row 0x032) |
| 0x0E6 | kPredicateMove | predicate | predicate copy — Move-exclusion |
| 0x167..0x16A | kVectorCompare … kVectorAddCarryU16 | mask | mask-producing compares + add-carry |
| 0x193..0x199 | kVectorMaskXor … kVectorMaskMove | mask | mask logic / pack-compressed / negate / move |
| 0x195 | kVectorMaskAnd | mask | mask AND (synthesized-rectangle combine) |
| 0x198 | kVectorMaskNegate | mask | mask complement |
| 0x199 | kVectorMaskMove | mask | mask copy — Move-exclusion |
The four Move-exclusion opcodes — kScalarMove (0x17B), kVectorMove (0x181), kVectorMaskMove (0x199), kPredicateMove (0xE6) — are skipped by the CSE/remat/fold passes. Native vcreate_mask exists only on VF/GL (synthesized via iota-compare on JF/DF/PF).
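The De Morgan synthesis of predicate-AND mentioned at the top of this section can be illustrated on plain booleans. This is a sketch of the identity only, assuming negate and OR are the available primitives; the helper names are hypothetical and say nothing about how the compiler actually sequences the ops.

```python
def pred_negate(a: bool) -> bool:
    return not a


def pred_or(a: bool, b: bool) -> bool:
    return a or b


def pred_and(a: bool, b: bool) -> bool:
    """a AND b == NOT(NOT a OR NOT b), built only from negate and OR."""
    return pred_negate(pred_or(pred_negate(a), pred_negate(b)))


truth_table = [(a, b, pred_and(a, b)) for a in (False, True) for b in (False, True)]
```

Under this identity the synthesized AND costs three negates and one OR, which a cost model would need to account for where a native AND would have been a single op.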
Constants, Pseudo & Call
Constants, Phi/pseudo SSA nodes, and the call/tuple group. Constants set the remat bit (opcode_info 0x50); Phi nodes short-circuit to cost 0. Address materialization is remat-able. Literals route through immediate slots — see Immediate Slot.
| LloOpcode | Mnemonic | Semantics |
|---|---|---|
| 0x02C..0x02E | kAllocationAddress / kParameterAddress / kIntToPtr | address materialization (remat-able) |
| 0x0DB..0x0E0 | kScalarConstantU32 … kVectorConstantF32 | scalar / vector constants (U32/PackedBf16/F32) |
| 0x0E4 | kVectorConstantU64 | 64-bit vector constant |
| 0x0E9..0x0EC | kPredicatePhi / kScalarPhi / kVectorPhi / kVectorMaskPhi | SSA phi nodes (cost 0) |
| 0x0ED..0x0F0 | kTuple / kInlinedCall / kCall / kInlinedCallOperand | call / tuple structure |
| 0x0F1..0x0F4 | kPredicatePseudo … kVectorMaskPseudo | pseudo placeholders per register class |
| 0x17C | kRelocatableConstant | link-time relocated constant |
BarnaCore (SparseCore) Block (0x1AC..0x1CC)
The 33 highest opcodes are the BarnaCore (SparseCore) instruction set — embedding scatter/gather, sparse reduce, remote scalar writes, and the BarnaCore-local vector load/store/move. These are the opcodes the LLVM-MC emitter actually encodes (its populated InstBits_BarnaCorePxcHwMode table covers this range), unlike the TensorCore opcodes which it returns all-zero. The SparseCore scalar/vector opcode enums (the per-engine SCS/TAC/TEC scalar-ALU and vector-ALU rosters) are separate engine-family enums with their own deep pages — see SPU / Scalar Slot §SparseCore and VPU Slot.
| LloOpcode | Mnemonic | Semantics |
|---|---|---|
| 0x1AC..0x1B3 | kBarnaCoreScalarWaitDone … kBarnaCoreScalarWaitNe | scalar wait / sync primitives |
| 0x1B4..0x1B7 | kBarnaCoreScalarSyncDoneRead … …SyncPublicAccessWrite | sync-flag read / write |
| 0x1B8..0x1BB | kBarnaCoreScalarPop … kBarnaCoreDma | pop / FSM issue / fence (kBarnaCoreScalarFence 0x1BA) / DMA |
| 0x1BC..0x1C8 | kBarnaCoreRemoteScalarWrite … kBarnaCorePfLocalScatterGradients | remote write / scatter-gather / sparse-reduce |
| 0x1C9..0x1CC | kBarnaCoreVectorLoad … kBarnaCoreVectorStore | BarnaCore vector load / store / move |
QUIRK — the BarnaCore vector load/store opcodes (457..460) test as vector; the scalar/sync ones (428..456) test as non-vector. This is the only place the BarnaCore block straddles the vector/scalar partition, and it matters for register-class selection. The TpuSequencerType enum that selects the SparseCore engine codec is kTensorCoreSequencer (0)..kSparseCoreTileExecuteSequencer (5); TpuSequencerTypeFromProto (0x20b36300) maps proto 1..6 → internal 0..5 — verified byte-exact. BarnaCore is a pre-v5 construct (JF address-handler, PF sequencer); from v5 onward SparseCore (SCS/TAC/TEC) replaces it, and 6acc60406 (v7x) drops the TAC engine.
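The vector/non-vector split inside the BarnaCore block reduces to a single threshold within the band. The sketch below encodes only the ranges quoted in this quirk; the helper names are hypothetical, and the threshold applies to the BarnaCore block alone, not to LloOpcodeIsVector in general (which is a dense switch).

```python
BARNACORE_FIRST, BARNACORE_LAST = 0x1AC, 0x1CC   # 428..460
BARNACORE_VECTOR_FIRST = 0x1C9                   # 457


def is_barnacore(op: int) -> bool:
    return BARNACORE_FIRST <= op <= BARNACORE_LAST


def barnacore_is_vector(op: int) -> bool:
    """Vector test restricted to the BarnaCore block (457..460 are vector)."""
    if not is_barnacore(op):
        raise ValueError(f"{op:#x} is not a BarnaCore opcode")
    return op >= BARNACORE_VECTOR_FIRST


vector_members = [op for op in range(BARNACORE_FIRST, BARNACORE_LAST + 1)
                  if barnacore_is_vector(op)]
```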
Per-Generation Availability Summary
LloOpcode numbering is gen-invariant; the valid subset grows per silicon. The codename → namespace map: jxc = Jellyfish (jellyfish::isa) + Dragonfish (alias); pxc = Pufferfish; vxc/vfc = Viperfish; glc = Ghostlite; gfc = 6acc60406.
| Generation | Codename | LloOpcode additions over prior gen |
|---|---|---|
| TPU v2 / v3 | Jellyfish / Dragonfish | base set (proto-direct encoding, no Compact encoder) |
| TPU v4 | Pufferfish | F8 converts (0x061..0x063), S4/U4 int↔Bf16 (0x067/0x069/0x06B/0x06D), kCmemFence (0x01E), CMEM DMA/load/store |
| TPU v5p | Viperfish | stochastic-rounding converts (0x070..0x074); SparseCore SCS/TAC/TEC engines; native vcreate_mask |
| TPU v6e | Ghostlite | vector_misc slot ops; consolidated VectorSelect; 2 vector-store write ports |
| TPU7x | 6acc60406 | dual matrix staging (MSRA/MSRB); float-only FP8 matmul remap; dual-predicate slot; drops TAC engine, yield machinery |
NOTE — append-and-insert, not append-only. New opcodes are inserted into the in-memory LloOpcode at their family's natural position (keeping families contiguous) but appended to the end of the LloOpcodeProto wire enum (preserving wire compatibility). This is why proto 498/499 map to low in-memory opcodes (0x084/0x197) — verified byte-exact. Port the actual LloOpcodeToProto table (0x344cb4c), not a formula. See LloOpcode↔Proto.
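To see why a formula fails, compare a naive `proto - 1` guess with a table lookup on the three reverse-map entries this page quotes (497→460, 498→132, 499→407). The dictionary is a three-entry sample, not the real table, and both helper names are hypothetical.

```python
# Sample of the proto -> in-memory map; only entries quoted on this page.
PROTO_TO_LLO_SAMPLE = {497: 460, 498: 132, 499: 407}


def proto_to_llo(proto: int, table: dict) -> int:
    """Table-driven lookup; unknown or reserved proto values are rejected."""
    try:
        return table[proto]
    except KeyError:
        raise ValueError(f"proto opcode {proto} is a gap or not in the table") from None


def naive_formula(proto: int) -> int:
    """The tempting 1-based-to-0-based guess; wrong for appended opcodes."""
    return proto - 1


mismatches = {p: (naive_formula(p), v) for p, v in PROTO_TO_LLO_SAMPLE.items()
              if naive_formula(p) != v}
```

All three sampled entries disagree with the formula, and two of the formula's results (497 and 498) are not even valid in-memory opcodes, since the in-memory enum ends at 460.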
Verification Provenance
A representative sample of this table's value↔mnemonic↔slot bindings is grounded directly in the binary (classifiers, builders, converters). Probes (all CONFIRMED byte-exact):
| Probe | Binary fact | Confirms |
|---|---|---|
| LloOpcodeName @ 0x1d631280 | bound >= 0x1CD → ud1; indexes opcode_name @ 0x21ccfef0 | 461 dense enumerators |
| opcode_info / opcode_info_big symbols | 0x223a1320 / 0x227b5570 resolved by name | metadata table addresses |
| LloOpcodeIsVectorStore @ 0x14024920 | (op-63)<=7 && _bittest(0xE7,op-63) \|\| op==460 | store window {63,64,65,68,69,70,460} |
| LloOpcodeIsVectorLoad @ 0x14024900 | (op-48)<4 \|\| (op-457)<2 | load window {48..51,457,458} |
| LloOpcodeIsVectorMatprepSubr/Mubr/MatmulMubr | (op-151)<2 / (op-153)<2 / ((op-156)&0xFFFB)==0 | matprep 0x97/0x98, 0x99/0x9a; matmul-mubr 0x9c/0xa0 |
| EmitVectorMatres | a3 != 338 guard | kVectorMatres = 0x152 (338) |
| CreateVectorLatchLsf @ 0x1d4d7aa0 | New(141, …) | kVectorLatchLsf = 0x8D (141) |
| LloOpcodeIsPseudoEupInstruction @ 0x1d60c880 | (op-315)<0x13 & (0x7FDFF >> (op-59)) | fused-EUP band 0x13B..0x14D minus 0x144 |
| CreateVectorEupResult @ 0x1d4d9820 | asserts push opcode (op-296) < 0x13 | EUP push range [0x128,0x13A] → pop 0x14E |
| EncodeTensorCoreVectorAlu3F32Tanh (gfc) | op=0 @194/w8, sel=19 @183/w5, src @188/w6 | F32 Tanh selector 0x13 |
| EncodingToScalarRegister @ 0x1e871e40 | bound > 0x1F → error | 32-deep SREG file (5-bit) |
| TpuSequencerTypeFromProto @ 0x20b36300 | proto 1..6 → internal 0..5 switch | 6-value sequencer-type enum |
| LloOpcodeToProto @ 0x14420020 | GOT-relative table @ 0x344cb4c | forward map is a flat lookup |
| ProtoToLloOpcode @ 0x14420040 | 499-arm switch; proto 497→460, 498→132, 499→407; 38 __debugbreak arms | reverse map + 38 reserved gaps |
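The matprep and matmul-mubr probes in the table rely on unsigned wraparound, which is easy to get wrong when porting. A minimal sketch, assuming 16-bit unsigned arithmetic (suggested by the 0xFFFB mask but not otherwise confirmed here) and using hypothetical function names:

```python
def u16(x: int) -> int:
    """Emulate 16-bit unsigned wraparound for the subtract-and-compare idiom."""
    return x & 0xFFFF


def is_matprep_subr(op: int) -> bool:
    return u16(op - 151) < 2           # 0x97, 0x98


def is_matprep_mubr(op: int) -> bool:
    return u16(op - 153) < 2           # 0x99, 0x9A


def is_matmul_mubr(op: int) -> bool:
    # Clearing bit 2 of the offset accepts exactly offsets 0 and 4.
    return (u16(op - 156) & 0xFFFB) == 0   # 0x9C, 0xA0


ops = range(0x1CD)
subr = [op for op in ops if is_matprep_subr(op)]
mubr = [op for op in ops if is_matprep_mubr(op)]
matmul_mubr = [op for op in ops if is_matmul_mubr(op)]
```

Without the wraparound step, a signed subtraction would make every opcode below the base pass the `< 2` test, which is the usual porting bug with this idiom.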
Cross-References
- LLO Opcode Enum — the master family taxonomy, the opcode_info / opcode_info_big metadata tables, and the per-family value ranges this table consolidates.
- Sequencer Ops Per Gen — the per-(generation × sequencer-type) control-flow op inventory and the TpuSequencerType enum.
- InstBits Master DB — the LLVM-MC base-bits table (opcode − 499 index) that encodes only the BarnaCore subset.
- LloOpcode↔Proto — the wire-value converters and the 38 reserved proto gaps.
- ResultFifo / ArchRegister — the FIFO and physical-register enums the XLU/result-pop opcodes reference.
- MXU Slot · Matprep/IAR/Latch — matmul/latch/matprep/matres bit encoding and the IAR.
- VPU Slot · EUP / Transcendental Slot — vector-ALU and transcendental push/pop encoding.
- SPU / Scalar Slot · Sequencer Slot — scalar-ALU and PC-mutation encoding.
- Memory-Load · Memory-Store · CMEM-Load (PF) — the load/store/IAR slot encodings and the tier-by-slot model.
- Predicate Slot · vcreate_mask / M-Register — the scalar-predicate and vector-mask register files.
- Immediate Slot · Loop / LCC · Sparsity Slot (v5+) — immediate pool, loop counter, and the (open) sparsity slot.
- LLVM-TPU Intrinsic Table — the sibling appendix of llvm.tpu.* intrinsics that lower onto these opcodes.
- Per-Gen Comparison Matrix — the sibling appendix consolidating per-generation ISA deltas.