LloOpcode Enum

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

xla::jellyfish::LloOpcode is the in-memory opcode enum of LLO (Low-Level Optimizer IR), the TPU-specific late compiler IR that sits below MHLO/TLP and above the per-generation TensorCore bundle encoders. It is a dense, zero-based enumeration: the binary spells exactly 461 enumerators, values 0x000..0x1CC (0..460), with no gaps. Every value carries a k-prefixed name in a relocated char* table; the names cleanly sort into functional families — sequencer/sync, scalar SPU arithmetic, vector VPU arithmetic, EUP transcendentals, cross-lane/reduction/XLU, the MXU matmul/latch/transpose/result group, load/store, a 40-opcode DMA block, predicate/mask, pseudo nodes, and a 33-opcode BarnaCore (SparseCore) block.

LloOpcode is distinct from three sibling index spaces a reimplementer must not conflate. It is not the LloOpcodeProto wire enum (1-based, max value 499, 38 reserved gaps — see LloOpcode↔Proto). It is not the GhPerf::Instruction cost-grid enum (a denser 0..0x1DB space where 19 rows are shared across opcodes). And it is not the MC MCInst opcode space the MC-Emitter dispatches on (offset by +499, gating opcode <= 0x1F2 and indexing 4*opc-1996). LloOpcode is the one all the others map from.
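The MC-side arithmetic quoted above reduces to one bias constant. The following is a minimal C++ sketch, not code from libtpu.so: the function names are hypothetical, the 4-byte table stride is inferred from the 4*opc-1996 expression, and reading the gate as "MC opcodes at or below 0x1F2 are outside the LloOpcode-derived range" is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Bias between the in-memory enum and the MC opcode space (from this page).
constexpr uint32_t kMcOpcodeBias = 499;

// Map an in-memory LloOpcode to the MC opcode the MC-Emitter dispatches on.
constexpr uint32_t LloToMcOpcode(uint32_t llo) { return llo + kMcOpcodeBias; }

// Assumption: the "opcode <= 0x1F2" gate rejects MC opcodes below the bias.
constexpr bool McOpcodeIsLloDerived(uint32_t mc) { return mc > 0x1F2; }

// Byte offset into the dispatch table: 4*opc - 1996 (assumed 4-byte stride).
constexpr uint32_t McTableByteOffset(uint32_t mc) { return 4 * mc - 1996; }

static_assert(4 * kMcOpcodeBias == 1996, "1996 is 4 * 499");
static_assert(McTableByteOffset(LloToMcOpcode(0)) == 0,
              "LloOpcode 0 lands at table offset 0");
```

Under these assumptions the table offset for any LloOpcode is simply four times its value, which is why the three index spaces are easy to confuse and must be kept apart by type.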

This page is a structured reference catalog, not an algorithm trace. It groups the 461 values by family, gives each family its value range and representative members, names the per-family classifier the binary uses, and flags the per-generation additions (F8 conversions, stochastic rounding, S4/U4 matmul, dual matrix staging). The exhaustive 461-row dump lives in the appendix; here the goal is the shape of the space.

| Field | Value |
|---|---|
| Enum | xla::jellyfish::LloOpcode: 461 enumerators, dense, 0x000..0x1CC |
| Name accessor | LloOpcodeName(LloOpcode) @ 0x1d631280; bound >= 0x1CD → ud1 |
| Name table | opcode_name @ 0x21ccfef0 (.data.rel.ro, 461 × char*, R_X86_64_RELATIVE) |
| Property word | opcode_info @ 0x223a1320 (461 × uint16): Push/Pop/Remat/Fold/Cse + reg-file class |
| Descriptor | opcode_info_big @ 0x227b5570 (461 × 28 B): result-FIFO + arch-register lists |
| Family classifiers | LloOpcodeIsVector @ 0x1d60c1c0, LloOpcodeIsScalar @ 0x1d60c7e0 (= !IsVector), LloOpcodeIsVectorUnop/Binop/Load/Store, … |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |

How The Enum Is Stored And Named

LloOpcode is an ordinary C++ scoped enum; nothing in the binary stores it as a class. The single piece of reflection is LloOpcodeName, which converts a value to its string:

// xla::jellyfish::LloOpcodeName @ 0x1d631280 (decompiled, exact)
std::string LloOpcodeName(uint32_t opcode) {
    if (opcode >= 0x1CD)                              // bound = 461 enumerators
        __builtin_trap();                             // ud1 — out of enum range
    const char *s = opcode_name[(uint16_t)opcode];    // opcode_name @ 0x21ccfef0
    return std::string(s, strlen(s));                 // SSO or heap copy
}

The bound 0x1CD is the contract: the valid domain is 0x000..0x1CC inclusive, 461 values. opcode_name is a char* table in .data.rel.ro; each slot is stored as 0 in the file and filled by an R_X86_64_RELATIVE relocation pointing into .rodata. All 461 slots are relocated and all 461 strings are non-empty. The same 461 bound appears in every consumer of the two per-opcode metadata tables (cmp opcode, 0x1CD; jae <fatal>, i.e. the < 0x1CD form), and the metadata tables (opcode_info, opcode_info_big) are each sized 461 × stride.

QUIRK — the enum has 461 members, not 462. The "462" figure the ISA overview quotes is the nominal member count of the LloOpcodeProto wire enum: 1-based with a value-0 sentinel, so 461 live (mappable) wire values + 1 sentinel = 462 nominal. Its addressable range is wider still (max 499, with 38 reserved gaps; 499 − 38 = 461 live — see LloOpcode↔Proto). The in-memory LloOpcode a reimplementer's compiler manipulates is exactly 461 dense values (LloOpcodeName bound 0x1CD, verified at 0x1d631280). Drive a switch off 462 and the last index reads past the table; off the proto's 499 and you index garbage.
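The counts in this quirk can be checked mechanically. The sketch below restates the page's figures as compile-time constants; the constant names and the IsValidLloOpcode guard are illustrative, not symbols recovered from the binary.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kProtoMaxValue     = 499;  // highest LloOpcodeProto wire value
constexpr uint32_t kProtoReservedGaps = 38;   // reserved (unmapped) wire values
constexpr uint32_t kProtoLive         = kProtoMaxValue - kProtoReservedGaps;
constexpr uint32_t kProtoNominal      = kProtoLive + 1;  // plus the value-0 sentinel
constexpr uint32_t kLloOpcodeCount    = 0x1CD;           // LloOpcodeName bound

static_assert(kProtoLive == 461, "499 - 38 live wire values");
static_assert(kProtoLive == kLloOpcodeCount, "one live wire value per enumerator");
static_assert(kProtoNominal == 462, "the figure the ISA overview quotes");

// Hypothetical guard for any table keyed by the in-memory enum.
constexpr bool IsValidLloOpcode(uint32_t op) { return op < kLloOpcodeCount; }
```

A table sized from kProtoNominal or kProtoMaxValue would accept indices that this guard rejects, which is the failure mode described above.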

Two metadata tables ride alongside the enum

Every opcode is indexed in lock-step into two parallel per-opcode tables the LLO scheduler and optimizer read:

| Table | Address | Stride | Carries |
|---|---|---|---|
| opcode_info | 0x223a1320 | 2 B (uint16) | LOW byte = property bitfield (bit0 Push / bit1 Pop / bit4 Remat / bit5 Fold-const / bit6 Cse / bit7 pred-mask tag); HIGH byte = register-file class (0 none/pred, 1 scalar/mask, 2 vector) |
| opcode_info_big | 0x227b5570 | 28 B | int8 result_fifos[8] @+0x00 (neg-terminated, ResultFifo 0..0x18), int8 arch_registers_read[12] @+0x08 (neg-terminated, ArchRegister 1..0x32), int8 arch_registers_written[8] @+0x14 (neg-terminated, ArchRegister 1..0x32) |

Both are indexed by the raw LloOpcode value with the same < 0x1CD bound. They are documented in their own pages; this page references them only to anchor each family's scheduler behavior (which opcodes push/pop result FIFOs, which are CSE-able, which write registers).
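The two layouts described above can be written down as plain C++ types. This is a sketch of the documented bit and byte positions only; the struct and function names are made up for illustration, and the high byte is returned raw rather than validated against the 0/1/2 classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decoded view of one opcode_info word; field names are illustrative.
struct OpcodeInfo {
    bool push, pop, remat, fold_const, cse, pred_mask_tag;
    uint8_t reg_file_class;   // raw HIGH byte of the word
};

constexpr OpcodeInfo DecodeOpcodeInfo(uint16_t w) {
    return OpcodeInfo{
        (w & 0x01) != 0,   // bit0 Push
        (w & 0x02) != 0,   // bit1 Pop
        (w & 0x10) != 0,   // bit4 Remat
        (w & 0x20) != 0,   // bit5 Fold-const
        (w & 0x40) != 0,   // bit6 Cse
        (w & 0x80) != 0,   // bit7 pred-mask tag
        static_cast<uint8_t>(w >> 8),
    };
}

// Address of opcode_info[op]: table base plus the 2-byte stride.
constexpr uint64_t OpcodeInfoAddress(uint32_t op) {
    return 0x223a1320ull + 2ull * op;
}

// 28-byte opcode_info_big record; each list ends at the first negative byte.
struct OpcodeInfoBig {
    int8_t result_fifos[8];
    int8_t arch_registers_read[12];
    int8_t arch_registers_written[8];
};
static_assert(sizeof(OpcodeInfoBig) == 28, "record stride");
static_assert(offsetof(OpcodeInfoBig, arch_registers_read) == 0x08, "read list offset");
static_assert(offsetof(OpcodeInfoBig, arch_registers_written) == 0x14, "written list offset");

// Number of valid entries in a negative-terminated list of capacity n.
inline int ListLength(const int8_t* list, int n) {
    int len = 0;
    while (len < n && list[len] >= 0) ++len;
    return len;
}
```

OpcodeInfoAddress reproduces the per-opcode offsets quoted elsewhere on this page, for example 0x223a15bc for opcode 0x14E.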


Family Taxonomy

The binary does not store a family tag per opcode; family membership is recovered from (a) the k-prefix of each name, (b) the dense classifier switches (LloOpcodeIsVector, LloOpcodeIsVectorUnop, LloOpcodeIsVectorLoad, …), and (c) the opcode_info register-file-class byte. The table below is the at-a-glance map of the twelve families; the per-family sections that follow give ranges and representatives.

| Family | Value range (mostly contiguous) | Count | Reg-file (typical) | Primary classifier |
|---|---|---|---|---|
| Sequencer / sync / FIFO transfer | 0x000..0x030 (interleaved) | ~50 | scalar/none | |
| Scalar SPU arithmetic & control | scattered, dense tail 0x16B..0x1AA | 63 | scalar | !LloOpcodeIsVector |
| Vector VPU arithmetic & logic | dense 0x11B..0x1A2 core | (subset of 295) | vector | LloOpcodeIsVectorUnop/Binop |
| Vector convert / pack / unpack | 0x05B..0x076, 0x107..0x10F, 0x126..0x127 | ~45 | vector | (convert-prefix) |
| EUP transcendentals | 0x128..0x14D | 38 | vector (EUP FIFO) | (Tanh/Pow2/Recip/Log2/Rsqrt/Sig/Sin/Cos/Erf × F32/Bf16 × {,AndPop}) |
| Cross-lane / reduce / XLU result | 0x0F5..0x101, 0x14E..0x155 | ~28 | vector | (Reduce/Result prefix) |
| MXU matmul / latch / matprep / matres | 0x08D..0x0AB, 0x152..0x153 | ~33 | vector | LloOpcodeIsVectorMatprep*, matmul band 0x8D..0xA5 |
| Load / store / IAR / RNG | 0x001..0x004, 0x030..0x046, 0x077..0x078 | ~30 | mixed | LloOpcodeIsVectorLoad @ 0x14024900, …IsVectorStore @ 0x14024920 |
| DMA | 0x0B3..0x0DA (contiguous) | 40 | none | (Dma prefix) |
| Predicate / mask | 0x0E1..0x0F1, 0x193..0x199 | ~18 | predicate/mask | (Predicate/Mask prefix) |
| Constants / pseudo / call | 0x0DB..0x0F4, 0x02C..0x02E, 0x17C | ~25 | mixed | (Constant/Phi/Pseudo/Call prefix) |
| BarnaCore (SparseCore) | 0x1AC..0x1CC (contiguous) | 33 | mixed | (BarnaCore prefix) |
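Only three families in the table above have an exact count equal to the size of a single value range, so only those can be classified by a range test. A hedged sketch follows; the enum and function names are hypothetical, and every other family still needs the name prefix or a classifier switch.

```cpp
#include <cassert>
#include <cstdint>

enum class ContiguousFamily { kNone, kEup, kDma, kBarnaCore };

// Range tests are valid only for the single-range families in the table.
constexpr ContiguousFamily ClassifyContiguous(uint32_t op) {
    if (op >= 0x128 && op <= 0x14D) return ContiguousFamily::kEup;
    if (op >= 0x0B3 && op <= 0x0DA) return ContiguousFamily::kDma;
    if (op >= 0x1AC && op <= 0x1CC) return ContiguousFamily::kBarnaCore;
    return ContiguousFamily::kNone;   // interleaved family: range test not valid
}

// Each range size matches the Count column.
static_assert(0x14D - 0x128 + 1 == 38, "EUP count");
static_assert(0x0DA - 0x0B3 + 1 == 40, "DMA count");
static_assert(0x1CC - 0x1AC + 1 == 33, "BarnaCore count");
```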

NOTE — vector-vs-scalar is a per-opcode partition, not a range split. LloOpcodeIsScalar is literally LloOpcodeIsVector(op) ^ 1 (@ 0x1d60c7e0), and LloOpcodeIsVector (@ 0x1d60c1c0) is a dense switch that returns 1 for vector opcodes and 0 for the rest — but the two sets interleave across the value space. For example kEvent (0), kLog (5), kHloStart/kHloEnd (6/7) are non-vector; kVectorReadIar (1) is vector; the BarnaCore vector load/store ops (457..460) are vector while the BarnaCore scalar ops (428..456) are not. A reimplementer must port the switch, not a >= threshold test.
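To make the interleaving concrete, the sketch below encodes only the data points stated in this note and leaves every other opcode unknown. It is not a port of LloOpcodeIsVector; the Documented* names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Only the data points stated in the note; everything else is left unknown.
inline std::optional<bool> DocumentedIsVector(uint32_t op) {
    switch (op) {
        case 0: case 5: case 6: case 7: return false;  // kEvent, kLog, kHloStart, kHloEnd
        case 1: return true;                            // kVectorReadIar
        default: break;
    }
    if (op >= 428 && op <= 456) return false;           // BarnaCore scalar/sync
    if (op >= 457 && op <= 460) return true;            // BarnaCore vector load/store
    return std::nullopt;                                // not stated on this page
}

// LloOpcodeIsScalar is the complement of LloOpcodeIsVector.
inline std::optional<bool> DocumentedIsScalar(uint32_t op) {
    std::optional<bool> v = DocumentedIsVector(op);
    if (!v) return std::nullopt;
    return !*v;
}
```

Vector opcode 1 sits between non-vector opcodes 0 and 5, so no single threshold separates the two sets even within the first eight values.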


Sequencer, Sync, and FIFO-Transfer Family (0x000..0x030)

The low opcodes are the control-and-handshake layer: program boundaries, scheduling barriers, fences, and the scalar/vector/sync-flag result-FIFO push/pop primitives that move values between the SPU, VPU, and the cross-core FIFO (CCF) staging registers.

| Value | Name | Role |
|---|---|---|
| 0x000 | kEvent | program/trace event marker |
| 0x006 / 0x007 | kHloStart / kHloEnd | source-HLO span markers (debug-info) |
| 0x008 | kSchedulingBarrier | hard reorder barrier (vetoes CSE across it) |
| 0x009 | kScalarToVector | scalar→vector broadcast push |
| 0x00A..0x012 | kVectorToScalarPush .. kDrfPop | V→S / sync-flag→scalar FIFO push/pop quartet |
| 0x013..0x014 | kScalarFence / kVectorStoreFence | ordering fences |
| 0x015..0x01C | kScalarCcfPush .. kVectorCcfPopAsymmetrical | cross-core FIFO (CCF) push/pop, symmetric + asymmetric |
| 0x01D | kMegacoreSwapCoresPseudo | megacore core-swap pseudo |
| 0x01E | kCmemFence | CMEM ordering fence (Pufferfish+) |
| 0x01F..0x024 | kScalarReadCycleStart .. kScalarReadCycleLow | cycle-counter reads / 64-bit splits |
| 0x025..0x027 | kScalarHalt / …YieldConditional / …OnError | sequencer halt variants |
| 0x029 | kProgramLaunchSc | SparseCore program launch |
| 0x02B | kVectorInterrupt | vector-side interrupt |

The static opcode_info low-byte push/pop bits do not line up with the *Push/*Pop opcode names the way one would expect, so FIFO behavior must be taken from the helper functions rather than read off the table by name. As decoded from the on-disk opcode_info (base 0x223a1320, uint16 stride): kVectorToScalarPush (0x0A), kSyncFlagToScalarPush (0x0B), and kSyncFlagToSfrfPush (0x0F) all read 0x0000 (no bit0); whereas kSfrfPop (0x11) reads 0x0073 and kDrfPop (0x12) reads 0x0002 (bit1 set on both). LloInstructionPushesToResultFifo (opcode_info[op] & 1) and LloInstructionPopsFromResultFifo (matres special-case + bit-field extract) are the authoritative push/pop tests. kScalarHalt/kScalarHaltOnError share GhPerf cost row 0x000.


Scalar SPU Family

Scalar opcodes are the complement of the vector set. They cluster in two places: a few address/compose/branch/select members in the 0x085..0x089 band, and a dense arithmetic-and-shift tail 0x16B..0x1AA. The register-file-class byte for these is 1 (scalar/mask).

| Value | Name | Role |
|---|---|---|
| 0x085 | kScalarComposeU64 | pack two u32 into a u64 |
| 0x086 | kScalarAddressCalculation | address arithmetic (shares GhPerf 0x00C with kScalarAddS32) |
| 0x087/0x088 | kScalarBranchRel / kScalarBranchInd | relative / indirect branch |
| 0x089 | kScalarSelect | scalar conditional select |
| 0x16B..0x16C | kScalarCompare / kScalarAddCarryU32 | compare + add-with-carry |
| 0x16D..0x172 | kScalarMultiplyWordAddr .. kScalarAddS32 | multiplies + adds (u24/u32/f32/s32) |
| 0x173..0x177 | kScalarSubtractS32 .. kScalarBitwiseXor | sub + bitwise and/or/xor |
| 0x178..0x17A | kScalarDivRemU32 / kScalarDivU32AndPop / kScalarRemU32AndPop | divide/remainder (data-format-special-cased in FIFO analyses) |
| 0x17B | kScalarMove | scalar copy: Move-exclusion opcode (never CSE'd/rematted) |
| 0x17D..0x17F | kScalarFloorF32 / kScalarCeilF32 / kScalarCountLeadingZeros | scalar rounding + CLZ |
| 0x1A3..0x1A6 | kScalarShrl .. kScalarShllOnes | logical/arith shifts |
| 0x1A7..0x1AA | kScalarMinimumF32 .. kScalarMaximumU32 | min/max (f32/u32) |

Vector VPU Arithmetic & Logic Family

The vector ALU is the largest family. Its arithmetic core is the contiguous 0x11B..0x1A2 block; clamp/accumulate helpers sit earlier at 0x048..0x05A. LloOpcodeIsVectorUnop (@ 0x1d60c200) and LloOpcodeIsVectorBinop (@ 0x1d60c680) partition unary vs binary forms. The register-file byte is 2 (vector); the foldable/CSE-able bits (opcode_info bit5/bit6) are set on the pure-functional members.

| Value | Name | Role |
|---|---|---|
| 0x048..0x050 | kVectorClampGezF32 .. kVectorRemapBf16 | clamp / remap (gez, symmetric, asymmetric; F32/Bf16/S4) |
| 0x051..0x05A | kVectorMultiplyAccumulate .. kVectorMoveEvenAccLow | MAC + accumulator moves (FOLD bit set on MAC family) |
| 0x11B..0x124 | kVectorAddS32 .. kVectorSubtractS16 | add/subtract (s32/f32/bf16/s16, plus Bf16-high/low) |
| 0x156..0x15B | kVectorPowF32 .. kVectorMultiplyBf16 | pow + multiply (F32/U32/U16/Bf16) |
| 0x15C..0x15F | kVectorAndU32 .. kVectorXorU32 | bitwise and / and-negated / or / xor |
| 0x162..0x166 | kVectorMultiplyComposeU64 .. kVectorExtractHigh32 | 64-bit multiply compose + word extracts |
| 0x180..0x184 | kVectorCountLeadingZeros .. kVectorExtractSignificand | CLZ / move / popcount / FP field extracts |
| 0x19A..0x19C | kVectorShiftRightLogical .. kVectorShiftLeftLogical | vector shifts |
| 0x19D..0x1A2 | kVectorMaximumF32 .. kVectorMinimumU32 | min/max (F32/Bf16/U32) |

kVectorMove (0x181) carries the CSE/remat bits but is an unconditional Move-exclusion opcode like kScalarMove — the LLO CSE/remat passes never fire on it.


Convert / Pack / Unpack Family

Type conversions are spread across three bands: the f32→{s32,f8,hf16} and {s8,s4,u8,u4}↔bf16 block at 0x05B..0x076, the unpack/round/truncate block at 0x107..0x11A, and pack/compose at 0x125..0x127. These are the opcodes that gained the most across generations.

| Value | Name | Role |
|---|---|---|
| 0x05B..0x060 | kScalarConvertF32ToS32WithProbRounding .. kVectorConvertF32ToS32TowardsZeroPseudo | F32→S32 (prob-round / towards-zero) |
| 0x061..0x064 | kVectorConvertF32ToF8E5M2 / …E4M3Fn / …E4M3B11 / …ToHf16 | F32→F8 / Hf16 (F8 added Pufferfish+) |
| 0x066..0x06D | kVectorConvertS8ToBf16 .. kVectorConvertBf16ToU4 | int↔Bf16 (s8/s4/u8/u4; S4/U4 Pufferfish+) |
| 0x06E..0x06F | kVectorConvertEXMYToE4M3 / …ToE5M2 | generic FP8 reformat |
| 0x070..0x074 | kVectorConvertF32ToE5M2Stochastic .. kVectorConvertF32ToHf16Stochastic | stochastic-rounding converts (Viperfish+) |
| 0x109..0x10F | kVectorUnpack .. kVectorDynamicUnpack | unpack + B2→B4 / B4→B8 join + EXMY/dynamic |
| 0x110..0x11A | kVectorCeilF32 .. kVectorTruncateBf16 | round-to-int / RTNA / RTNE / truncate |
| 0x125..0x127 | kVectorComposeF32 / kVectorPack / kVectorPackEXMY | compose + pack |

EUP Transcendental Family (0x128..0x14D)

The Extended Unit Pipeline computes transcendentals as a deferred-result pipeline. Nine functions (Tanh, Pow2, Reciprocal, Log2, Rsqrt, SigShft, Sinq, Cosq, Erf) each appear in four forms: F32, Bf16, F32AndPop, Bf16AndPop. The standalone PushErf appears only in the two F32 forms, so the family totals 9 × 4 + 2 = 38 opcodes. The *AndPop variants both issue the transcendental and pop its result from the EUP FIFO in one opcode; the bare forms only issue (bit0 Push set), with the result collected later by kVectorEupResult (0x14E).

| Value band | Members | Form |
|---|---|---|
| 0x128..0x131 | Tanh/Pow2/Reciprocal/Log2/Rsqrt/SigShft/Sinq/Cosq/Erf + kVectorPushErf | F32, issue-only |
| 0x132..0x13A | same set | Bf16, issue-only |
| 0x13B..0x144 | same set + kVectorPushErfAndPop | F32, issue+pop |
| 0x145..0x14D | same set | Bf16, issue+pop |
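The four value bands above are enough to recover an opcode's element type and whether it fuses the pop, without knowing the order of functions inside a band. A minimal sketch with hypothetical type and function names:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct EupForm {
    bool is_bf16;   // false = F32
    bool and_pop;   // true = issue and pop fused in one opcode
};

// Classify an opcode by the four value bands in the table above.
inline std::optional<EupForm> ClassifyEup(uint32_t op) {
    if (op >= 0x128 && op <= 0x131) return EupForm{false, false};
    if (op >= 0x132 && op <= 0x13A) return EupForm{true,  false};
    if (op >= 0x13B && op <= 0x144) return EupForm{false, true};
    if (op >= 0x145 && op <= 0x14D) return EupForm{true,  true};
    return std::nullopt;   // not an EUP issue opcode (0x14E is the result pop)
}

// F32 bands hold ten opcodes (nine functions + PushErf); Bf16 bands hold nine.
static_assert((0x131 - 0x128 + 1) + (0x13A - 0x132 + 1) +
              (0x144 - 0x13B + 1) + (0x14D - 0x145 + 1) == 38, "EUP total");
```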

NOTE — EUP FIFO push/pop is not read from a static opcode_info bit. LloInstructionPushesToResultFifo (@ 0x1d4f3600) tests opcode_info[op] & 1, and LloInstructionPopsFromResultFifo (@ 0x1d4f3720) special-cases the matres band (0x152/0x153) and otherwise extracts a sign-bit field (shl 0xC; cwde; sar 0xD) from the property word — it does not test a fixed Pop bit. The static opcode_info slot for kVectorEupResult (0x14E) reads 0x0000 on disk (file offset 0x223a15bc), so a reimplementer must drive the EUP FIFO from the push/pop helpers (and their matres/EUP special-cases), not from a literal property-word constant. The *AndPop opcodes fuse issue+pop into one instruction.
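The shl 0xC; cwde; sar 0xD sequence can be restated portably. Because cwde sign-extends the low 16 bits, the sequence yields bits 3..1 of the property word as a signed 3-bit value. The sketch below shows only that extraction; the function name is hypothetical, and how LloInstructionPopsFromResultFifo interprets the resulting field is not stated on this page.

```cpp
#include <cassert>
#include <cstdint>

// Portable equivalent of: shl 0xC ; cwde ; sar 0xD applied to a 16-bit word.
// Returns bits 3..1 of the property word, sign-extended from bit 3.
constexpr int32_t ExtractPopField(uint16_t word) {
    int32_t field = (word >> 1) & 0x7;          // bits 3..1 of the word
    return (field & 0x4) ? field - 8 : field;   // bit 3 of the word is the sign
}
```

Both on-disk words quoted earlier for kSfrfPop (0x0073) and kDrfPop (0x0002) extract to 1 under this reading, while a zero word extracts to 0.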


Cross-Lane, Reduction, and XLU-Result Family

Reductions (0x0F5..0x101) compute min/max/add/argmin/argmax across lanes or segments in F32 and Bf16; the result-collection opcodes (0x14E..0x155) pop the various deferred-result FIFOs (EUP, cross-lane, permute, CMEM, transpose). The cross-lane permute/rotate/broadcast primitives live earlier at 0x036..0x03B.

| Value | Name | Role |
|---|---|---|
| 0x0F5..0x0FC | kVectorMinReduceF32 .. kVectorAddSegmentReduceF32 | F32 reductions (whole + segmented + index) |
| 0x0FD..0x101 | kVectorMinReduceBf16 .. kVectorMinIndexReduceBf16 | Bf16 reductions |
| 0x102..0x106 | kVectorSublaneId .. kVectorLaneSequenceInterleavedB16 | lane/sublane identity sequences (remat-able constants) |
| 0x14E | kVectorEupResult | pop EUP result FIFO (Pop bit) |
| 0x14F..0x151 | kVectorXlaneResult / kVectorPermuteResult / kVectorCmemResult | pop XLU / permute / CMEM result FIFOs |
| 0x152..0x153 | kVectorMatres / kVectorMatresAdd | pop MXU result (plain / accumulate) |
| 0x154..0x155 | kVectorTransposeResult / kVectorTransposeClear | pop transpose result / clear transpose FIFO |

kVectorXlaneResult/kVectorPermuteResult/kVectorTransposeResult share GhPerf cost row 0x1C7. kVectorTransposeClear (0x155) has opcode_info word 0x0911 (bytes 11 09 at file offset 0x223a15ca).


MXU Family — Matmul, Latch, Matprep, Result

The matrix unit pipeline stages a stationary operand (latch), prepares the moving operand (matprep), issues the matmul, and collects the result (matres). The latch/matmul opcodes (0x08D..0x0A5) are the data-format-dependent band that the FIFO and cost analyses special-case: their result-FIFO behavior and GhPerf cost row are computed from the runtime matmul_data_format / latch_mode, not from a static table read.

| Value | Name | Role |
|---|---|---|
| 0x08D..0x096 | kVectorLatchLsf .. kVectorLatch3Msk | stationary-operand latch (Lsf / 0..3, masked variants) |
| 0x097..0x09A | kVectorMatprepSubr .. kVectorMatprepMubrMsk | moving-operand prep (single/multi broadcast, masked) |
| 0x09B..0x0A5 | kVectorMatmul .. kVectorMatmulLmr | matmul (Mubr / High / Low / Msk / Packed / Lmr) |
| 0x0A6..0x0A7 | kVectorTranspose / kVectorTransposeBinary | XLU transpose (vxpose-mode dispatched) |
| 0x0A8..0x0AB | kVectorDoneWithGains .. kVectorLoadLmrWithBf16Conversion | gain handshake + GMR/LMR loads |
| 0x152..0x153 | kVectorMatres / kVectorMatresAdd | matmul result collection (also in result family) |

GOTCHA — the matmul band is special-cased before the property-word table read. LloInstructionPushesToResultFifo tests the matmul band (0x8D..0xA5) via a bitmask and routes to a matmul_data_format vtable call; only opcodes outside the band fall through to opcode_info[op] & 1. A reimplementer who reads the static Push bit for a matmul opcode gets the wrong FIFO behavior — the matmul's push depends on its data format, decided at the instruction instance, not the opcode.
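The control flow described in this gotcha can be sketched as follows. This is an illustration of the dispatch order only: the callback stands in for the matmul_data_format vtable call, the band test uses a range compare where the binary uses a bitmask, and the signature is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Returns whether an instruction pushes to a result FIFO.
// `format_pushes` stands in for the per-instruction data-format query.
inline bool PushesToResultFifo(uint32_t op,
                               const std::vector<uint16_t>& opcode_info,
                               const std::function<bool()>& format_pushes) {
    const bool in_matmul_band = op >= 0x8D && op <= 0xA5;
    if (in_matmul_band) return format_pushes();   // decided per instruction instance
    return (opcode_info.at(op) & 1) != 0;         // static Push bit, bounds-checked
}
```

With this ordering, a static Push bit stored for a matmul-band opcode is never consulted, which is the behavior a table-only reimplementation would get wrong.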


Load / Store / IAR / RNG Family

Memory access and the per-lane index-address-register (IAR) setup. LloOpcodeIsVectorLoad (@ 0x14024900) and LloOpcodeIsVectorStore (@ 0x14024920) classify the vector forms; the RNG opcodes (0x03C..0x03E) seed and read the per-lane PRNG used by stochastic conversions.

| Value | Name | Role |
|---|---|---|
| 0x001..0x004 | kVectorReadIar .. kVectorSetIarSublane | read / set index-address register |
| 0x030..0x035 | kVectorLoadSublaneShuffle .. kVectorCmemLoadAndPop | vector + CMEM loads |
| 0x036..0x03B | kVectorPermute .. kVectorBroadcastLane | cross-lane permute / rotate / combine / broadcast |
| 0x03C..0x03E | kVectorPrng / kVectorSetRngSeed / kVectorGetRngSeed | per-lane PRNG |
| 0x03F..0x046 | kVectorStore .. kVectorStoreEvenOddSublanes | vector + CMEM stores (indexed, masked, shuffle) |
| 0x077..0x078 | kScalarLoad / kScalarStore | scalar memory access |
| 0x047 | kVectorNop | vector no-op (shares GhPerf 0x1B4 with kVectorMaskMove) |

DMA Family (0x0B3..0x0DA)

The 40 contiguous DMA opcodes are the largest single contiguous block. They enumerate direction (HBM/VMEM/SMEM/CMEM/HIB/IMEM/Host, plus IOVA-addressed host) crossed with the source/destination-register variants (Vsrc/Vdst/VsrcVdst) and a WithHibUpdate family. The two terminators kDmaDone/kDmaDoneWait close a DMA.

| Value | Name | Role |
|---|---|---|
| 0x0B3..0x0B4 | kDmaGeneral / kDma | generic DMA forms |
| 0x0B5..0x0CC | kDmaHbmToVmem .. kDmaSmemToVmem | direction matrix (HBM/VMEM/SMEM/CMEM/HIB/IMEM) |
| 0x0CD..0x0D0 | kDmaHbmToVmemWithHibUpdate .. …VdstWithHibUpdate | HBM→VMEM with HIB update |
| 0x0D1..0x0D8 | kDmaHbmToHost .. kDmaSmemToHostIova | host DMA (direct + IOVA) |
| 0x0D9..0x0DA | kDmaDone / kDmaDoneWait | DMA completion / wait |

NOTE — the cost model prices DMA by direction-class, not per-opcode. The 14 HBM/Host-DMA opcodes all collapse to GhPerf cost row 0x40 and the 14 VMEM/SMEM-DMA opcodes to row 0x43. The 40 distinct LloOpcode values are real and must all exist, but the latency model treats each direction family as one. None of the DMA opcodes set a result-FIFO Push/Pop bit (opcode_info LOW byte 0x00); they are side-effecting via memory, not via the FIFOs.


Predicate / Mask Family

Predicate (single-bit, P-register) and vector-mask (lane-mask, VM-register) ops. The register-file class is 0/1 (none/predicate or scalar/mask). The CSE-able + predicate-tag bits (opcode_info 0xC0/0xD0/0xE0) mark this family.

| Value | Name | Role |
|---|---|---|
| 0x0E1 | kPredicateConstant | remat-able predicate constant (opcode_info 0x222C, file offset 0x223a14e2) |
| 0x0E5..0x0E8 | kPredicateNegate .. kPredicateOr | predicate logic (share GhPerf row 0x032) |
| 0x0E6 | kPredicateMove | predicate copy: Move-exclusion opcode |
| 0x0E2..0x0E3 | kVectorMaskConstant / …Packed | mask constants |
| 0x167..0x16A | kVectorCompare .. kVectorAddCarryU16 | mask-producing compares + add-carry |
| 0x193..0x199 | kVectorMaskXor .. kVectorMaskMove | mask logic / pack-compressed / negate / move |
| 0x199 | kVectorMaskMove | mask copy: Move-exclusion opcode |

The four Move-exclusion opcodes — kScalarMove (0x17B), kVectorMove (0x181), kVectorMaskMove (0x199), kPredicateMove (0xE6) — form the bitmask 0x40000041 (over base 0x17B) ∪ {0xE6} that the CSE/remat/fold passes use to skip moves. See opcode property word for the gate detail.
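The bitmask membership test reads naturally as a shift-and-mask over the base 0x17B plus one out-of-band comparison. A minimal sketch, with a hypothetical function name; the actual gate in the passes may be structured differently.

```cpp
#include <cassert>
#include <cstdint>

// True for the four Move-exclusion opcodes named above.
constexpr bool IsMoveExclusion(uint32_t op) {
    if (op == 0xE6) return true;        // kPredicateMove, outside the mask window
    const uint32_t rel = op - 0x17B;    // wraps to a large value when op < 0x17B
    return rel < 32 && ((0x40000041u >> rel) & 1u) != 0;
}

// Bits 0, 6 and 30 correspond to 0x17B, 0x181 and 0x199.
static_assert((1u << (0x17B - 0x17B)) + (1u << (0x181 - 0x17B)) +
              (1u << (0x199 - 0x17B)) == 0x40000041u, "mask bits 0, 6, 30");
```

The unsigned wraparound on `op - 0x17B` lets a single `rel < 32` comparison reject opcodes on both sides of the 32-value window.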


Constants, Pseudo, and Call Family

Constants (0x0DB..0x0E4), the Phi/pseudo SSA nodes (0x0E9..0x0F4), and the call/tuple group. Constants set the remat bit (opcode_info 0x50); Phi nodes (0x0E9..0x0EC) short-circuit to cost 0 in the latency model.

| Value | Name | Role |
|---|---|---|
| 0x02C..0x02E | kAllocationAddress / kParameterAddress / kIntToPtr | address materialization (remat-able) |
| 0x0DB..0x0E0 | kScalarConstantU32 .. kVectorConstantF32 | scalar/vector constants (U32/PackedBf16/F32) |
| 0x0E4 | kVectorConstantU64 | 64-bit vector constant |
| 0x0E9..0x0EC | kPredicatePhi / kScalarPhi / kVectorPhi / kVectorMaskPhi | SSA phi nodes (cost 0) |
| 0x0ED..0x0F0 | kTuple / kInlinedCall / kCall / kInlinedCallOperand | call / tuple structure |
| 0x0F1..0x0F4 | kPredicatePseudo .. kVectorMaskPseudo | pseudo placeholders per register class |
| 0x17C | kRelocatableConstant | link-time relocated constant |

BarnaCore (SparseCore) Family (0x1AC..0x1CC)

The 33 highest opcodes are the BarnaCore (SparseCore) instruction set: embedding scatter/gather, sparse reduce, remote scalar writes, and the BarnaCore-local vector load/store/move. These are the opcodes the MC-Emitter actually encodes with real insertBits sequences (its populated InstBits_BarnaCorePxcHwMode table covers this range), in contrast to the TensorCore opcodes, for which it returns all-zero.

| Value | Name | Role |
|---|---|---|
| 0x1AC..0x1B3 | kBarnaCoreScalarWaitDone .. kBarnaCoreScalarWaitNe | scalar wait/sync primitives |
| 0x1B4..0x1B7 | kBarnaCoreScalarSyncDoneRead .. …SyncPublicAccessWrite | sync-flag read/write |
| 0x1B8..0x1BB | kBarnaCoreScalarPop .. kBarnaCoreScalarFence | pop / FSM issue / fence |
| 0x1BC..0x1C8 | kBarnaCoreRemoteScalarWrite .. kBarnaCorePfLocalScatterGradients | remote write / scatter-gather / sparse-reduce (Pf = prefetch variants) |
| 0x1C9..0x1CC | kBarnaCoreVectorLoad .. kBarnaCoreVectorStore | BarnaCore vector load/store/move |

QUIRK — the BarnaCore vector load/store opcodes (457..460) test as vector in LloOpcodeIsVector, while the BarnaCore scalar/sync opcodes (428..456) test as non-vector. This is the only place the BarnaCore block straddles the vector/scalar partition, and it matters for register-class selection: kBarnaCoreVectorLoad/…ImmediateOffset/…MoveScalarReg/…VectorStore get the vector register file even though their family prefix is "BarnaCore", not "Vector".


Per-Generation Additions

LloOpcode is gen-invariant in its numbering (the same enum value means the same opcode on every TPU generation), but the valid subset grows with each silicon generation. The compaction encoders' vtable slot counts track this growth (vxc Viperfish ≈ 403, gxc::glc Ghostlite ≈ 623, gxc::gfc ≈ 674 slots), reflecting more legal (opcode × data-format) combinations on later gens. The codenames below are the binary's own internal strings (Jellyfish, Pufferfish, Viperfish, Ghostlite all appear verbatim in libtpu.so); the newest generation is named only by its hashed family tag 6acc60406, and the marketing names "Trillium"/"Ironwood" have zero byte occurrences in this build.

| Generation | Codename | LloOpcode additions |
|---|---|---|
| TPU v2 | Jellyfish | base set (proto-direct encoding, no Compact encoder) |
| TPU v4 | Pufferfish | F8 converts (0x061..0x063), S4/U4 int↔Bf16 (0x067/0x069/0x06B/0x06D), kCmemFence (0x01E), CMEM DMA/load opcodes |
| TPU v5p | Viperfish | stochastic-rounding converts (0x070..0x074) |
| TPU v6e | Ghostlite | vector_misc slot ops |
| TPU7x | 6acc60406 | dual matrix staging (MATPUSH target MSRA/MSRB); newest-gen-only opcodes kVectorToScalarPush (0x0A) / kSyncFlagToScalarPush (0x0B) map to the highest GhPerf rows (0x1DA) only valid on the 476-row grid |

NOTE — the enum is append-and-insert, not append-only, which is why proto and in-memory numbering diverge. New opcodes are inserted into the in-memory LloOpcode at their family's natural position (keeping families contiguous), but appended to the end of the LloOpcodeProto wire enum (to preserve wire compatibility). The result is the non-monotonic tail of the LloOpcode↔Proto map: proto value 499 (the newest wire slot) maps to in-memory 0x197 (kVectorMaskPackCompressedEven), and proto value 498 maps to 0x084 (kVectorTraceArg).


Cross-References

  • LloOpcode↔Proto — the LloOpcodeToProto / ProtoToLloOpcode wire-value map and the 38 reserved proto gaps.
  • MC-Emitter — the getBinaryCodeForInstr dispatch over the MC opcode space (LloOpcode + 499), which encodes only the BarnaCore subset and returns the TensorCore subset all-zero.
  • InstBits DB — the opcode_info property word, opcode_info_big descriptor, and the Move-exclusion / CSE / remat gates keyed on these opcodes.
  • LLO Opcode Table (appendix) — the exhaustive 461-row value↔name dump.