SCS Scalar Opcode Enumeration
Every opcode value, mask immediate, and bit position on this page was read byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
The SCS (SparseCore Scalar) sequencer has three scalar slots per 32-byte bundle — ScsScalarMisc, ScalarAlu1, ScalarAlu0 — each a 27-bit field carrying a 6-bit primary opcode at slot-relative bit +16, landing at absolute bundle bits 127, 154, and 181. This page is the opcode roster companion to the SCS Engine page, which owns the bundle byte layout and the 27-bit slot template; here we enumerate every operation each slot can encode, with its integer opcode value and the encoding form that carries it. The closest familiar analog is an ISA opcode table recovered not from a manual but from the decoder's match predicates: there is no opcode-name string table to read, so each value is reconstructed from the per-op compare immediate.
The roster is recoverable because libtpu emits one C++ type per opcode per generation — asic_sw::deepsea::<pxc>::<gen>::isa::SparseCore<Slot><OpName>Opcode — and each type carries a Matches() const predicate that masks the decoded opcode field out of the instruction struct and compares it against that op's own signature. The cmp/movabs immediate inside Matches() is the opcode value; the slot Encoder::Encode writes the same value back into the bundle via BitCopy(dst, dst_bitoff, src, src_bitoff, nbits) at the corresponding absolute bit. Because the decode predicate and the encode write agree, the Matches() immediate is the authoritative encoding. Every value below was cross-checked against a Matches() predicate in the decompiled gfc (6acc60406) namespace, with vfc (Viperfish) sampled for gen-invariance.
The opcode space is two-level. A 6-bit primary field selects either a concrete op (IntegerAdd=0x0a, BitwiseAnd=0x0e) or an op-class. When it names a class, the concrete op lives in a wider escape field that overlays the slot: control ops in an 11-bit field, register reads in a 17-bit field (ReadRegister* = 0x280..0x28d), config sets in a 16-bit field (Set* = 0x4001..0x4005), divide-push in a 0x16xxxx field. The two ALU lanes share one opcode namespace and differ only in bundle bit position and a handful of lane-exclusive ops; ScsScalarMisc carries the sync-flag / atomic / barrier family (encoded as a 6-bit base + a 5-bit sub-opcode mode) plus an integer-ALU subset, with no FP and no branch. This page documents the three predicate shapes, the per-slot rosters with their integer values, the four class escapes, and the per-generation deltas.
For reimplementation, the contract is:
- The three Matches() predicate shapes. Form A (flat 6-bit, opcode straddles a word boundary), Form B (composite: 6-bit base + 5-bit sub-opcode mode at struct-relative bit 47 or extended-ALU class at bit 58), Form C (mask-compare against the slot word, used by the ALU lanes). Reproducing the decoder means reproducing these masks.
- The flat 6-bit roster, shared across both ALU lanes. 0x0a..0x31 integer/FP/bitwise/shift/compare ops, identical values on ScalarAlu0 and ScalarAlu1; only the bundle bit (@181 vs @154) and a few lane-exclusive ops differ.
- The four class escapes and their value ranges. Control 0x00..0x1d (11-bit), register-read 0x280..0x28d (17-bit), config-set 0x4001..0x4005 (16-bit), divide-push 0x160001..0x160002. These values are slot-independent; only the field's bit base changes per lane.
- The ScsScalarMisc composite encoding. A 6-bit base names a class (Sync 0x01, SyncWatch 0x02, set-sync 0x05, barrier 0x07, Atomic 0x08, extended-ALU 0x00); the 5-bit sub-opcode at struct bit 47 (sync/atomic mode) or bit 58 (extended-ALU) picks the member.
| Slots | ScsScalarMisc (op @127) · ScalarAlu1 (op @154) · ScalarAlu0 (op @181) |
| Primary opcode width | 6 bits, slot-relative @+16 |
| Opcode→mnemonic source | per-op SparseCore<Slot><OpName>Opcode::Matches() compare immediate |
| Predicate shapes | A flat-6 · B composite (base + 5-bit sub) · C mask-compare |
| Class escapes | control 11-bit (0x00..0x1d) · reg-read 17-bit (0x280..0x28d) · config 16-bit (0x4001..0x4005) · divide-push (0x16xxxx) |
| gfc op-form counts | SparseCoreScalarMisc* 82 · ScsScalarMisc* 81 · ScalarAlu0* 78 · ScalarAlu1* 82 |
| Shared ALU namespace | ScalarAlu0 ≡ ScalarAlu1 values; lane differs only by bit position + lane-exclusive ops |
| Gen-invariance | shared op values byte-identical VF/GL/GF (IntegerAdd=0x0a on both) |
| Confidence | CONFIRMED (decompile-anchored) unless a row or callout says otherwise |
NOTE — this page enumerates the opcode roster; the bundle byte layout lives in SCS Engine. The 32-byte bundle, the absolute slot bases (111/138/165), the 27-bit slot template (x0 @+0, ScalarY @+5, x1 @+11, opcode @+16, predication header @+22), and the no-check-trailer rule are documented there and are not repeated here. The M-Register Predicate Word page owns the 3-bit/4-bit predication-header overlap that sits above each opcode field.
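The opcode bit positions quoted on this page follow from the slot bases plus the slot-relative opcode offset. The sketch below reproduces that arithmetic; the constant and function names are illustrative, and the 64-bit word view in bit_in_word is an assumption that happens to match the struct-relative bit positions (63, 26, 53) that the Matches() predicates use later on this page.

```cpp
// Slot bases and the +16 opcode offset are the values quoted from the SCS Engine page.
// All identifiers here are hypothetical helpers, not names from libtpu.
constexpr unsigned kMiscBase = 111;      // ScsScalarMisc slot base
constexpr unsigned kAlu1Base = 138;      // ScalarAlu1 slot base
constexpr unsigned kAlu0Base = 165;      // ScalarAlu0 slot base
constexpr unsigned kOpcodeOffset = 16;   // opcode sits at slot-relative bit +16

// Absolute bundle bit of the 6-bit primary opcode for a slot.
constexpr unsigned opcode_bit(unsigned slot_base) { return slot_base + kOpcodeOffset; }

// Assumption: the bundle is viewed as 64-bit words; this is the bit inside the containing word.
constexpr unsigned bit_in_word(unsigned bundle_bit) { return bundle_bit % 64; }

static_assert(opcode_bit(kMiscBase) == 127, "ScsScalarMisc opcode bit");
static_assert(opcode_bit(kAlu1Base) == 154, "ScalarAlu1 opcode bit");
static_assert(opcode_bit(kAlu0Base) == 181, "ScalarAlu0 opcode bit");
```

The three slots are contiguous (111 + 27 = 138, 138 + 27 = 165), so a table-driven decoder only needs the first base and the 27-bit slot width.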
The Opcode-Predicate Model
Why there is no opcode string table
The SC ISA carries no opcode-name array to stringify. Instead, each opcode is a distinct empty C++ class SparseCore<Slot><OpName>Opcode in the asic_sw::deepsea::<pxc>::<gen>::isa:: namespace, and the disassembler/validator asks each one "is this you?" through Matches(). The match predicate decodes the opcode field(s) from the instruction struct (a 128-bit-plus word array; the scalar slots read words at this+0x10 and this+0x18) and compares against the op's hard-coded signature. The signature is the opcode. This makes the Matches() compare immediate the authoritative opcode→mnemonic source, the SC analog of the TensorCore LLO opcode stringify path.
Three predicate shapes appear across the scalar slots. The shapes matter to a reimplementer because they reveal where the opcode bits physically sit and how a class splits into a base plus a sub-opcode.
Form A — flat 6-bit
The primary opcode straddles a 64-bit word boundary: bit 63 of word this+0x10 is the low bit, bits 0..4 of word this+0x18 are the high bits, masked to 6 bits and compared. ScsScalarMisc IntegerAdd (0x1ebabf00):
// SparseCoreScalarMiscIntegerAddOpcode::Matches (gfc 0x1ebabf00)
// shld $1, word_0x10, word_0x18 → (word_0x18<<1) | (word_0x10>>63)
return ((((word_0x18 << 1) | (word_0x10 >> 63)) & 0x3F) == 0x0A); // opcode 0x0a, 6-bit straddle
(The decompiler renders the straddle as (*((__int128*)this + 1) >> 63) & 0x3F; the __int128 view at +1 is the word pair at +0x10/+0x18.)
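As a standalone illustration of the Form A straddle, the following sketch extracts the 6-bit opcode from the two 64-bit words. The function name form_a_opcode is hypothetical; the expression is the one shown in the predicate above.

```cpp
#include <cstdint>

// Form A: bit 63 of the word at +0x10 is opcode bit 0; bits 0..4 of the word at +0x18
// are opcode bits 1..5. form_a_opcode is an illustrative name, not a libtpu symbol.
constexpr std::uint64_t form_a_opcode(std::uint64_t word_0x10, std::uint64_t word_0x18) {
    return ((word_0x18 << 1) | (word_0x10 >> 63)) & 0x3F;
}

// IntegerAdd (0x0a = 0b001010): low bit 0 in word_0x10 bit 63, high bits 0b00101 in word_0x18.
static_assert(form_a_opcode(0, 5) == 0x0a, "IntegerAdd decodes as 0x0a");
```

Only bit 63 of the lower word participates, so operand bits below it do not disturb the match.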
Form B — composite (base + sub-opcode)
A 6-bit base names a class, and a 5-bit sub-opcode picks the member. The base is reassembled as (bit63 | low5<<1); the predicate XORs it against the class value, then ANDs out the sub-field and XORs against the member signature, ORs the two halves, and tests for zero. ScsScalarMisc AtomicTileAdd (0x1ebabbe0) — base 8 (Atomic), sub 1:
// SparseCoreScalarMiscAtomicTileAddOpcode::Matches (gfc 0x1ebabbe0)
// base = (word_0x10>>63) + 2*(word_0x18 & 0x1F)
// sub mode field = word_0x10 & 0xF800000000000 (struct-rel bit 47, 5 bits)
return ( (((word_0x10>>63) + 2*(word_0x18 & 0x1F)) ^ 0x08)
| ((word_0x10 & 0xF800000000000) ^ 0x800000000000) ) == 0; // base 8, sub 1
The sync/atomic mode field is at struct-relative bit 47 (0xF800000000000). The extended-ALU class instead uses a 5-bit field at bit 58 (0x7C00000000000000); CountLeadingZeros (base 0x00, sub 14) confirms it: (word_0x10 & 0x7C00000000000000) ^ 0x3800000000000000, and 0x3800000000000000 >> 58 = 14.
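The two Form B variants can be written as one pair of helpers. This is a minimal sketch using the masks quoted above; the helper names and the (base, sub) parameterization are assumptions made for illustration, since the binary hard-codes the pre-shifted signature in each predicate.

```cpp
#include <cstdint>

constexpr std::uint64_t kModeMask47 = 0xF800000000000ull;     // 5-bit sync/atomic mode at bit 47
constexpr std::uint64_t kExtMask58  = 0x7C00000000000000ull;  // 5-bit extended-ALU class at bit 58

// 6-bit base reassembled across the word boundary, as in the predicate above.
constexpr std::uint64_t misc_base(std::uint64_t w10, std::uint64_t w18) {
    return (w10 >> 63) + 2 * (w18 & 0x1F);
}

// XOR each half against its expected value, OR the halves, test for zero.
constexpr bool matches_mode47(std::uint64_t w10, std::uint64_t w18,
                              std::uint64_t base, std::uint64_t sub) {
    return ((misc_base(w10, w18) ^ base) | ((w10 & kModeMask47) ^ (sub << 47))) == 0;
}

constexpr bool matches_ext58(std::uint64_t w10, std::uint64_t w18,
                             std::uint64_t base, std::uint64_t sub) {
    return ((misc_base(w10, w18) ^ base) | ((w10 & kExtMask58) ^ (sub << 58))) == 0;
}

static_assert((1ull << 47) == 0x800000000000ull, "AtomicTileAdd sub signature");
static_assert((14ull << 58) == 0x3800000000000000ull, "CountLeadingZeros sub signature");
```

With base 8 encoded as word_0x18 low bits 0b00100 and word_0x10 bit 63 clear, matches_mode47(1ull << 47, 4, 8, 1) holds, which corresponds to the AtomicTileAdd case.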
Form C — mask-compare (the ALU lanes)
ScalarAlu0/ScalarAlu1 use a single AND-mask + compare against the slot word. ScalarAlu0 IntegerAdd (0x1eb67660):
// SparseCoreScalarAlu0IntegerAddOpcode::Matches (gfc 0x1eb67660)
return (word_0x18 & 0x7E0000000000000) == 0x140000000000000; // 0x140.. >> 53 = 0x0a
The opcode is VAL >> tzcnt(MASK): 0x140000000000000 >> 53 = 0x0a. ScalarAlu1 reads the same opcode value from a different word — its predicates mask *((_DWORD*)this + 6) & 0xFC000000 (a 32-bit word at +0x18, bits 26..31), so AddCbreg (0x1eb7b5a0) tests == 0xCC000000 and 0xCC000000 >> 26 = 0x33. Same opcode value, lane-specific bit position — which is exactly the @181 (Alu0) vs @154 (Alu1) bundle-bit difference.
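The VAL >> tzcnt(MASK) rule can be checked mechanically. The sketch below uses a small constexpr trailing-zero count so it compiles without intrinsics; ctz64 and field are hypothetical helper names.

```cpp
#include <cstdint>

// Count trailing zero bits of a mask (returns 64 for a zero mask).
constexpr unsigned ctz64(std::uint64_t m) {
    return m == 0 ? 64u : ((m & 1u) ? 0u : 1u + ctz64(m >> 1));
}

// Form C: mask the slot word, then shift the field down to bit 0.
constexpr std::uint64_t field(std::uint64_t word, std::uint64_t mask) {
    return mask == 0 ? 0 : (word & mask) >> ctz64(mask);
}

// ScalarAlu0 IntegerAdd: 64-bit word at +0x18, opcode field at bits 53..58.
static_assert(field(0x140000000000000ull, 0x7E0000000000000ull) == 0x0a, "Alu0 IntegerAdd");
// ScalarAlu1 AddCbreg: 32-bit word at +0x18, opcode field at bits 26..31.
static_assert(field(0xCC000000ull, 0xFC000000ull) == 0x33, "Alu1 AddCbreg");
```

The same helper recovers every Form C value quoted on this page from its mask and compare immediate.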
QUIRK — the encode-side switch numbers ops differently from the Matches() immediates; trust Matches(). Inside ScalarAlu0Encoder::Encode (0x1eb693c0) the dispatch switch(*(a2+88)) uses sequential ODS enum case labels (case 0xA → AtomicTile…, case 0x13 → IntegerAdd), then writes the hardware opcode via BitCopy(dst, 181, …, 6). The case label is the high-level enum ordinal; the BitCopy value (and the Matches() immediate that reads it back) is the silicon opcode. A reimplementer who reads the switch case numbers as opcodes will mis-encode every op. The values on this page are the Matches()/BitCopy hardware values.
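To make the encode side concrete, here is a minimal stand-in for the opcode write. It is not the libtpu BitCopy: put_bits and get_bits are hypothetical helpers with a simplified signature, and the LSB-first bit order within each byte is an assumption based on the packer being described as little-endian.

```cpp
#include <cstdint>

struct Bundle { std::uint8_t bytes[32]; };  // 32-byte bundle, zero-initialized with Bundle{}

// Write the low nbits of value at absolute bit offset bitoff (assumed LSB-first per byte).
constexpr Bundle put_bits(Bundle b, unsigned bitoff, std::uint64_t value, unsigned nbits) {
    for (unsigned i = 0; i < nbits; ++i) {
        const unsigned bit = bitoff + i;
        const unsigned m = 1u << (bit % 8);
        b.bytes[bit / 8] = static_cast<std::uint8_t>(
            ((value >> i) & 1u) ? (b.bytes[bit / 8] | m) : (b.bytes[bit / 8] & ~m));
    }
    return b;
}

// Read nbits back from absolute bit offset bitoff.
constexpr std::uint64_t get_bits(const Bundle& b, unsigned bitoff, unsigned nbits) {
    std::uint64_t v = 0;
    for (unsigned i = 0; i < nbits; ++i) {
        const unsigned bit = bitoff + i;
        v |= static_cast<std::uint64_t>((b.bytes[bit / 8] >> (bit % 8)) & 1u) << i;
    }
    return v;
}

// Write the hardware opcode (0x0a for IntegerAdd), not the switch case ordinal (0x13).
static_assert(get_bits(put_bits(Bundle{}, 181, 0x0a, 6), 181, 6) == 0x0a, "round trip");
```

The point of the round trip is that the value written at bit 181 is the one a Matches()-style decoder reads back, so an encoder that writes the ordinal 0x13 would decode as a different op.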
Shared ALU Roster (ScalarAlu0 / ScalarAlu1)
The flat 6-bit primary set
The two ALU lanes share one opcode namespace: IntegerAdd=0x0a, BitwiseAnd=0x0e, CompareIntegerEq=0x1e, and the FP-compare block 0x2a..0x2f all decode to identical values on both lanes. They differ only in (a) the bundle bit the opcode lands at, @181 for ScalarAlu0 (decoded from word this+0x18 bits 53..58) and @154 for ScalarAlu1 (decoded from word this+0x18 bits 26..31), and (b) a small set of lane-exclusive ops listed below. Values are gen-invariant for shared ops (vfc IntegerAdd is ==0x0a, byte-identical to gfc).
| Opcode | Mnemonic | Class | Lane |
|---|---|---|---|
| 0x0a | IntegerAdd | integer ALU | both |
| 0x0b | IntegerAddWithOverflowCheck | integer ALU | both |
| 0x0c / 0x0d | IntegerSubtractYX / …WithOverflowCheck | integer ALU | both |
| 0x0e / 0x0f / 0x10 | BitwiseAnd / BitwiseOr / BitwiseXor | bitwise | both |
| 0x11 / 0x12 | FloatingPointAdd / FloatingPointSubtractYX | FP ALU | Alu1 |
| 0x13 | FloatingPointMultiply | FP ALU | Alu0 |
| 0x14 / 0x15 | Multiply32BitIntegers / Multiply32BitUnsignedIntsReturningHighHalf | integer mul | Alu0 |
| 0x16 | DivideWithRemainderXY | integer div | Alu0 |
| 0x17 / 0x18 / 0x19 | LogicalShiftLeft / LogicalShiftRight / ArithmeticShiftRight XByYPlaces | shift | both |
| 0x1a / 0x1b | MaxOfTwoFloatingPointValues / MinOfTwoFloatingPointValues | FP minmax | both |
| 0x1c / 0x1d | MaxOfTwoUnsignedIntValues / MinOfTwoUnsignedIntValues | int minmax | both |
| 0x1e–0x23 | CompareIntegerEq/Ne, CompareSignedIntegerGt/Gte/Lt/Lte | int compare | both |
| 0x24–0x27 | CompareUnsignedIntegerGt/Gte/Lt/Lte | uint compare | both |
| 0x28 | CarryOutFromIntegerUnsigned | carry | both |
| 0x29 | PredicateOr | predicate | both |
| 0x2a–0x2f | CompareFloatingPoint{Eq,Neq,Gt,Gte,Lt,Lte} | FP compare | both |
| 0x30 | IsInfOrNan | FP classify | both |
| 0x31 | ArithmeticShiftLeftXByYPlacesCheckOverflow | shift | both |
| 0x3e | LogicalShiftLeftOnesXByYPlaces | shift (GF-only) | Alu0 |
The compare blocks are dense: 0x1e..0x27 is the ten integer compares (Eq/Ne then signed Gt/Gte/Lt/Lte then unsigned Gt/Gte/Lt/Lte) and 0x2a..0x2f is the six FP compares. A reimplementer can decode the whole compare region by linear opcode index.
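Decoding the compare region by linear index could look like the sketch below. It applies to the ALU lanes only (on ScsScalarMisc the 0x2a..0x2f values name different ops), the type and function names are illustrative, and the mnemonic order inside each block is taken from the roster table above.

```cpp
enum class CmpDomain { None, Integer, Float };
struct CmpDecode { CmpDomain domain; int index; };

// Mnemonics in opcode order: 0x1e..0x27 (integer) and 0x2a..0x2f (floating point).
constexpr const char* kIntegerCompares[10] = {
    "CompareIntegerEq", "CompareIntegerNe",
    "CompareSignedIntegerGt", "CompareSignedIntegerGte",
    "CompareSignedIntegerLt", "CompareSignedIntegerLte",
    "CompareUnsignedIntegerGt", "CompareUnsignedIntegerGte",
    "CompareUnsignedIntegerLt", "CompareUnsignedIntegerLte"};
constexpr const char* kFloatCompares[6] = {
    "CompareFloatingPointEq", "CompareFloatingPointNeq", "CompareFloatingPointGt",
    "CompareFloatingPointGte", "CompareFloatingPointLt", "CompareFloatingPointLte"};

// Map a primary opcode to (domain, index into the matching name table).
constexpr CmpDecode decode_compare(unsigned op) {
    return (op >= 0x1e && op <= 0x27) ? CmpDecode{CmpDomain::Integer, int(op - 0x1e)}
         : (op >= 0x2a && op <= 0x2f) ? CmpDecode{CmpDomain::Float, int(op - 0x2a)}
                                      : CmpDecode{CmpDomain::None, -1};
}

static_assert(decode_compare(0x1e).index == 0, "CompareIntegerEq is the first integer compare");
```

The gap at 0x28/0x29 (CarryOutFromIntegerUnsigned, PredicateOr) is why the two blocks need separate base offsets.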
Lane-exclusive primary ALU ops
The lanes are not interchangeable. ScalarAlu1 carries the SMEM load/store, circular-buffer (CBREG), and task-request ops; ScalarAlu0 carries the FP-multiply / integer-multiply / divide ops and (in the escape fields below) the branch/call/convert/divide-push ops.
| Opcode | Mnemonic | Lane |
|---|---|---|
| 0x01 | ScalarLoadSmemY | Alu1 |
| 0x02 | ScalarLoadSmemXY | Alu1 |
| 0x03 | ScalarStoreXToSmemY | Alu1 |
| 0x09 | DescriptorBasedDma | Alu1 |
| 0x32 | ScalarStoreXToSmemSumDestAndY | Alu1 |
| 0x33 | AddCbreg | Alu1 |
| 0x34 | TaskRequestClearIbuf | Alu1 |
| 0x35 | WriteCbreg | Alu1 |
| 0x36 | ReadCbreg | Alu1 |
| 0x37 | TaskRequest | Alu1 |
| 0x3c | ScalarStoreCircularBuffer | Alu1 |
| 0x3d | ScalarLoadCircularBuffer | Alu1 |
AddCbreg=0x33 is confirmed by (word6 & 0xFC000000) == 0xCC000000, where 0xCC000000 >> 26 = 0x33; TaskRequest=0x37 is confirmed by == 0xDC000000, where 0xDC000000 >> 26 = 0x37. FloatingPointAdd=0x11 exists only as SparseCoreScalarAlu1FloatingPointAddOpcode (0x1eb7b4a0, == 0x44000000, 0x44000000 >> 26 = 0x11); there is no ScalarAlu0FloatingPointAddOpcode type in gfc, so the lane asymmetry is structural, not a labeling artifact.
GOTCHA — 0x33/0x37 are NOT shared ops; they are Alu1-only. A scheduler that treats the whole 0x00..0x3f range as a flat lane-agnostic table will place AddCbreg or TaskRequest into lane 0, which has no encoder for them. The shared region is the arithmetic/compare block (0x0a..0x31); the memory/CBREG/task ops (0x01..0x09, 0x32..0x3d) and the lane-exclusive FP/mul/div ops live on one lane only.
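A scheduler-side legality check built from the two roster tables might look like this. It is a sketch under the assumption that the tables above are complete for the flat 6-bit ops; lane_mask and can_issue are hypothetical names, and opcodes not listed in either table (including the class-escape primaries) return 0.

```cpp
constexpr unsigned kAlu0 = 1u;  // bit 0: op has a ScalarAlu0 form
constexpr unsigned kAlu1 = 2u;  // bit 1: op has a ScalarAlu1 form
constexpr unsigned kBoth = kAlu0 | kAlu1;

// Lane availability of a flat 6-bit primary opcode, per the tables above.
constexpr unsigned lane_mask(unsigned op) {
    return (op == 0x11 || op == 0x12)                            ? kAlu1   // FP add / subtract
         : (op >= 0x13 && op <= 0x16)                            ? kAlu0   // FP mul, int mul, divide
         : (op >= 0x0a && op <= 0x31)                            ? kBoth   // shared arithmetic/compare
         : (op == 0x3e)                                          ? kAlu0   // GF-only shift
         : ((op >= 0x01 && op <= 0x03) || op == 0x09)            ? kAlu1   // SMEM load/store, DMA
         : ((op >= 0x32 && op <= 0x37) || op == 0x3c || op == 0x3d) ? kAlu1 // CBREG, task, circular buffer
                                                                 : 0u;     // not in the flat roster
}

constexpr bool can_issue(unsigned op, unsigned lane) { return (lane_mask(op) & lane) != 0; }

static_assert(!can_issue(0x33, kAlu0), "AddCbreg has no Alu0 encoder");
static_assert(can_issue(0x0a, kAlu0) && can_issue(0x0a, kAlu1), "IntegerAdd is shared");
```

Checking the exclusive ranges before the shared range keeps the 0x11..0x16 carve-outs from being swallowed by the 0x0a..0x31 block.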
The Four Class Escapes
When the primary 6-bit value names a class, the concrete op lives in a wider escape field that overlays the slot. The escape values are slot-independent (Alu0 and Alu1 use the same numbers); only the field's bit base differs per lane (ScalarAlu0 higher in the struct, ScalarAlu1 lower), mirroring the @181/@154 opcode split.
| Escape | Field width | Value range | Decoded from |
|---|---|---|---|
| Control | 11-bit | 0x00..0x1d | Alu0 word +0x18 / Alu1 word +6 |
| Register-read | 17-bit | 0x280..0x28d | (word3 & 0x7FFFC0000000000) Alu0 |
| Config-set | 16-bit | 0x4001..0x4005 | (word3 & 0x7FF03E000000000) Alu0 |
| Divide-push | 0x16xxxx | 0x160001..0x160002 | (word3 & 0x7E003E000000000) Alu0 |
Control ops (11-bit escape)
These are the branch/call/fence/convert ops. The 11-bit escape field overlays the 6-bit primary and extends five bits below it, starting at Alu0 struct bit 48 (Alu1 bit 21). Halt (0x1eb67500) confirms the base: (word15 & 0x7FF) == 0 → control 0x00. BranchAbsolute (0x1eb67d40): (word3 & 0x7FF000000000000) == 0x4000000000000, 0x4000000000000 >> 48 = 0x04.
| Opcode | Mnemonic | Opcode | Mnemonic |
|---|---|---|---|
| 0x00 | Halt | 0x0d | MoveY |
| 0x02 | PopDrf | 0x0e | CountLeadingZeros |
| 0x03 | Delay | 0x0f | Ceiling |
| 0x04 | BranchAbsolute | 0x10 | Floor |
| 0x05 | BranchRelative | 0x18 | BranchRelativeRotatingPreg (GF, Alu0) |
| 0x06 | CallAbsolute | 0x1a | ScalarFenceSelect |
| 0x07 | CallRelative | 0x1c | ScalarFenceStreamHbm |
| 0x09 | ScalarFence | 0x1d | ScalarFenceStreamSpmem |
| 0x0b | ConvertInt32ToFloat32 | | |
| 0x0c | ConvertFloat32ToInt32 | | |
ScalarAlu1 adds three control ops to this set: 0x14 ReadDreg, 0x15 WriteDreg, 0x1b MoveCbreg. BranchSreg/CallSreg appear as ScalarAlu0 6-bit-field forms (values 4/5) distinct from the absolute/relative branches above.
Register-read ops (17-bit escape)
The chip hardware-register reads that gate SCS progress. ReadRegisterLccLow (0x1eb67560): (word3 & 0x7FFFC0000000000) == 0xA000000000000, 0xA000000000000 >> 42 = 0x280.
| Opcode | Mnemonic | Opcode | Mnemonic |
|---|---|---|---|
| 0x280 | ReadRegisterLccLow | 0x288 | ReadRegisterTracemark |
| 0x281 | ReadRegisterLccHigh | 0x289 | ReadRegisterTileid |
| 0x282 | ReadRegisterGtcLow | 0x28a | ReadRegisterTaskBitmap |
| 0x283 | ReadRegisterGtcHigh | 0x28b | ReadRegisterFenceStatus |
| 0x286 | ReadRegisterSparseCoreId | 0x28c | ReadRegisterDifDepthRegister |
| 0x287 | ReadRegisterTag | 0x28d | ReadRegisterDmaCreditRegister |
Config-set ops (16-bit escape)
The DMA-credit / tag / filter / throttle writes. SetDmaCredit (0x1eb67ac0): (word3 & 0x7FF03E000000000) == 0x8006000000000, and 0x8006000000000 >> 37 = 0x4003.
| Opcode | Mnemonic |
|---|---|
| 0x4001 | SetTag |
| 0x4002 | SetIndirectFilterValue |
| 0x4003 | SetDmaCredit |
| 0x4004 | SetDmaThrottleSflagRange |
| 0x4005 | SetRotatingPredicateRegister (GF, Alu0) |
Divide-push ops (Alu0 only)
The two-result divide that pushes quotient or remainder. DivideWithRemainderXYPushQuotient (0x1eb67e60): (word3 & 0x7E003E000000000) == 0x2C0002000000000; the primary part 0x2C0000000000000 >> 53 = 0x16 (the DivideWithRemainderXY base) and the sub part 0x2000000000 >> 37 = 1 give the 0x160001 form.
| Opcode | Mnemonic |
|---|---|
| 0x160001 | DivideWithRemainderXYPushQuotient |
| 0x160002 | DivideWithRemainderXYPushRemainder |
QUIRK — register-read and config-set values cannot fit the 6-bit primary; they are wider escape fields. A reimplementer reading "ReadRegisterGtcLow = 0x282" must place 0x282 in the 17-bit register-read field, not the 6-bit opcode field (which only spans 0x00..0x3f). The primary opcode marks the class (register-read / config-set); the concrete 0x28x/0x400x value lives in the overlaid escape field. The escape fields physically overlap the operand-selector bits of the slot, which is why a register-read op carries no x0/x1 operands.
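The four escape values can be recomputed from the quoted ScalarAlu0 masks and compare immediates. In the sketch below the shift amounts are the lowest set bit of each mask; that the config-set and divide-push values equal the masked word shifted right by 37 is an arithmetic observation from those immediates, and the names are illustrative.

```cpp
#include <cstdint>

// ScalarAlu0 escape masks as quoted on this page (word at +0x18).
constexpr std::uint64_t kControlMask = 0x7FF000000000000ull;  // 11 bits, lowest at bit 48
constexpr std::uint64_t kRegReadMask = 0x7FFFC0000000000ull;  // 17 bits, lowest at bit 42
constexpr std::uint64_t kConfigMask  = 0x7FF03E000000000ull;  // 11 + 5 bits, lowest at bit 37
constexpr std::uint64_t kDivPushMask = 0x7E003E000000000ull;  // 6 + 5 bits, lowest at bit 37

// Mask the slot word and shift the escape field down to bit 0.
constexpr std::uint64_t escape_value(std::uint64_t word3, std::uint64_t mask, unsigned shift) {
    return (word3 & mask) >> shift;
}

static_assert(escape_value(0x4000000000000ull, kControlMask, 48) == 0x04, "BranchAbsolute");
static_assert(escape_value(0xA000000000000ull, kRegReadMask, 42) == 0x280, "ReadRegisterLccLow");
static_assert(escape_value(0x8006000000000ull, kConfigMask, 37) == 0x4003, "SetDmaCredit");
static_assert(escape_value(0x2C0002000000000ull, kDivPushMask, 37) == 0x160001,
              "DivideWithRemainderXYPushQuotient");
```

All four masks end at bit 58, the top of the 6-bit primary field, so each escape is the primary field widened downward rather than a separate field elsewhere in the slot.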
ScsScalarMisc — the Sync / Atomic Slot
Composite encoding
ScsScalarMisc (op @127) is the sync/atomic engine. It carries no FP and no branch; it holds the sync-flag family that coordinates SC tiles with each other and with the TensorCore, plus an integer-ALU subset duplicated from the ALU lanes. Its opcode space is heavily composite (Form B): a 6-bit base names a class, and a 5-bit sub-opcode — the sync/atomic mode at struct-relative bit 47 (0xF800000000000), or an extended-ALU class at bit 58 (0x7C00000000000000) — picks the exact op.
| Base | Class | Sub field | Members (sub value) |
|---|---|---|---|
| 0x00 | extended-ALU | bit 58 | CoreInterrupt(0), MoveY(13), CountLeadingZeros(14) |
| 0x01 | Sync compare-and-set | bit 47 | SyncDone(0), SyncEqual(1), SyncNotEqual(2), SyncGreater(3), SyncGreaterOrEqual(4), SyncLess(5), SyncNotDone(6), SyncEqualOrDone(7), SyncNotEqualOrDone(8), SyncGreaterOrDone(9), SyncGreaterOrEqualOrDone(10), SyncLessOrDone(11) |
| 0x02 | SyncWatch | bit 47 | SyncWatch{Done,Equal,NotEqual,Greater,GreaterOrEqual,Less,NotDone,EqualOrDone,NotEqualOrDone,GreaterOrDone,GreaterOrEqualOrDone,LessOrDone} (modes 0..11) |
| 0x03 | SyncWatch escape | bit 58 | SyncWatchWait(0), SyncWatchWaitSelect(1) |
| 0x04 | SyncWatch escape | bit 58 | SyncWatchEnd(0), SyncWatchEndSelect(1) |
| 0x05 | set-sync | bit 47 | SetSyncFlag(0), SetSyncDone(1), AddSyncFlag(2) |
| 0x06 | read-sync | bit 58 | ReadSyncFlag(0), ReadSyncDone(1), ReadSyncPublicAccess(2) |
| 0x07 | barrier | bit 47 | SyncBarrier(0), SetPOrTState(4) |
| 0x08 | Atomic | bit 47 | AtomicTile{Write(0),Add(1),WriteSetDone(2),AddSetDone(3),WriteSetDoneInverted(4),AddSetDoneInverted(5)}, AtomicRemote{Write(6),Add(7),WriteSetDone(8),AddSetDone(9),WriteSetDoneInverted(10),AddSetDoneInverted(11)} |
SetSyncFlag (base 0x05) is confirmed by ((base) ^ 5 | (word & 0xF800000000000)) == 0 (sub 0); AtomicTileAdd by base ^8 / sub ^0x800000000000 → base 8, sub 1; CoreInterrupt is the all-zero opcode (base 0, extended-ALU sub 0); CountLeadingZeros is base 0, extended-ALU sub 0x3800000000000000 >> 58 = 14.
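A table-driven decoder needs to know, per base, which of the two sub fields to read. The following sketch encodes the "Sub field" column of the class table; misc_sub_shift and misc_sub are hypothetical names, and bases outside 0x00..0x08 are treated here as having no sub field.

```cpp
#include <cstdint>

// Struct-relative bit of the 5-bit sub field for a ScsScalarMisc base, or -1 if none.
constexpr int misc_sub_shift(unsigned base) {
    return (base == 0x01 || base == 0x02 || base == 0x05 || base == 0x07 || base == 0x08) ? 47
         : (base == 0x00 || base == 0x03 || base == 0x04 || base == 0x06)                 ? 58
                                                                                           : -1;
}

// Extract the sub-opcode from the word at +0x10, or -1 when the base has no sub field.
constexpr int misc_sub(unsigned base, std::uint64_t word_0x10) {
    return misc_sub_shift(base) < 0
               ? -1
               : int((word_0x10 >> misc_sub_shift(base)) & 0x1F);
}

static_assert(misc_sub(0x08, 0x800000000000ull) == 1, "AtomicTileAdd: base 8, sub 1");
static_assert(misc_sub(0x00, 0x3800000000000000ull) == 14, "CountLeadingZeros: base 0, sub 14");
```

A flat Form A opcode such as 0x0a falls through to -1, which signals that the 6-bit value is already the complete opcode.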
The Atomic base 0x08 sub-field is a small product: {Tile, Remote} × {Write, Add} × {plain, SetDone, SetDoneInverted}, laid out so that within each group of six the low bit selects Add-vs-Write and the next two bits select the set-done modifier (0 plain, 1 SetDone, 2 SetDoneInverted), while +6 switches Tile→Remote.
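The product layout of the Atomic sub-field can be unpacked with a modulo and two bit tests. This is a minimal sketch of that decomposition; the struct, enum, and function names are illustrative and not taken from the binary.

```cpp
enum class DoneMode { Plain = 0, SetDone = 1, SetDoneInverted = 2 };

struct AtomicOp {
    bool valid;     // sub value is within 0..11
    bool remote;    // false = AtomicTile*, true = AtomicRemote*
    bool add;       // false = Write, true = Add
    DoneMode mode;  // set-done modifier
};

// Split an Atomic (base 0x08) sub-opcode into its three factors.
constexpr AtomicOp decode_atomic(unsigned sub) {
    return sub > 11 ? AtomicOp{false, false, false, DoneMode::Plain}
                    : AtomicOp{true,
                               sub >= 6,                              // +6 switches Tile to Remote
                               ((sub % 6) & 1u) != 0,                 // low bit: Add vs Write
                               static_cast<DoneMode>((sub % 6) >> 1)};  // 0, 1, or 2
}

static_assert(decode_atomic(1).add && !decode_atomic(1).remote, "sub 1 = AtomicTileAdd");
static_assert(decode_atomic(10).mode == DoneMode::SetDoneInverted,
              "sub 10 = AtomicRemoteWriteSetDoneInverted");
```

Reducing modulo 6 before reading the bits matters: for the Remote half (6..11) the raw low bits of the sub value no longer line up with the modifier.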
Flat 6-bit ops mirrored from the ALU set
ScsScalarMisc also carries flat Form-A 6-bit ops, an integer-ALU subset plus sync-state reads and trace ops:
| Opcode | Mnemonic | Opcode | Mnemonic |
|---|---|---|---|
| 0x0a | IntegerAdd | 0x27 | CompareUnsignedIntegerLte |
| 0x0b | IntegerAddWithOverflowCheck | 0x28 | CarryOutFromIntegerUnsigned |
| 0x0c / 0x0d | IntegerSubtractYX / …WithOverflowCheck | 0x29 | PredicateOr |
| 0x0e / 0x0f / 0x10 | BitwiseAnd / BitwiseOr / BitwiseXor | 0x2a | ReadSyncStateValue |
| 0x17 / 0x18 / 0x19 | shift XByYPlaces (LSL/LSR/ASR) | 0x2b | ReadSyncStateDone |
| 0x1c / 0x1d | MaxOfTwoUnsignedIntValues / MinOf… | 0x2d | SetTracemark |
| 0x1e–0x23 | int compare Eq/Ne + signed Gt/Gte/Lt/Lte | 0x2e | Trace |
| 0x24–0x26 | unsigned compare Gt/Gte/Lt | 0x2f | SetSyncFlagPublicAccess |
| 0x31 | ArithmeticShiftLeftXByYPlacesCheckOverflow | 0x38 | SmemFetchAndAdd |
IntegerAdd=0x0a is confirmed ((word>>63) & 0x3F) == 0x0a. The integer-ALU subset is exactly the lane-agnostic arithmetic/compare block — the Misc slot omits FP, multiply, divide, branch, and SMEM load/store, which is what makes it the dedicated sync/atomic + light-integer lane.
GOTCHA — ScsScalarMisc holds two parallel op-form families in gfc. The binary emits both SparseCoreScalarMisc<Op>Opcode (82 forms) and SparseCoreScsScalarMisc<Op>Opcode (81 forms), with byte-identical Matches() predicates for the same op (e.g. SyncEqual decodes base 1 / mode 1 in both). They are the same hardware opcodes under two type spellings; the slot encoder SparseCoreScsScalarMiscEncoder (0x1eb914a0) takes a SparseCoreScalarMisc argument. A reimplementer needs one roster, not two — the duplicate type set does not double the encoding space.
Per-Generation Deltas
The scalar ISA is gen-invariant for shared ops (vfc IntegerAdd decodes ==0x0a, byte-identical to gfc); the deltas are small and concentrated in halt/yield and the rotating-predicate ring. The presence claims below are confirmed by the existence (or absence) of the corresponding Matches() type in each gen namespace: vfc SparseCoreScalarAlu0HaltYieldOpcode exists, the gfc one does not; gfc SparseCoreScalarAlu0SetRotatingPredicateRegisterOpcode exists, the vfc one does not.
| Aspect | Viperfish (vfc) | Ghostlite (glc) | 6acc60406 (gfc) |
|---|---|---|---|
| Primary opcode width | 6-bit | 6-bit | 6-bit |
| Opcode bundle bits | 127 / 154 / 181 | identical | identical |
| ScsScalarMisc op-forms | ~100 | (transitional) | 82 |
| ScalarAlu0 op-forms | 79 | (transitional) | 78 |
| ScalarAlu1 op-forms | 84 | (transitional) | 82 |
| IntegerAdd value | 0x0a | 0x0a | 0x0a |
| VF-only ops | HaltYield, HaltYieldConditional, ReadRegisterYieldRequest, ScalarFenceScmf | — | — |
| GF-only ops | — | — | BranchRelativeRotatingPreg, LogicalShiftLeftOnesXByYPlaces, SetRotatingPredicateRegister, MoveCbreg, ScalarStoreXToSmemSumDestAndY |
NOTE — 6acc60406 simplified the sync model. The VF/GL ScsScalarMisc carries a dual-channel sync family (Set{Both,Other}Sync*, Add{Both,Other}SyncFlag) and a Yieldable* sync set; the gfc roster drops both (down to 82 forms) and adds the single SetPOrTState. The interpretation is a non-yielding tile scheduler — fewer sync primitives, deterministic latency, driven by the rotating-predicate ring instead. A reimplementer targeting 6acc60406 must not emit the Yieldable* or *Both*/*Other* sync ops or the VF halt/yield ops; they have no encoder type in gfc.
Function Map
All addresses are gfc (6acc60406) unless noted; the Matches() immediate is the authoritative opcode value.
| Symbol | Address | Opcode evidence |
|---|---|---|
| SparseCoreScalarMiscIntegerAddOpcode::Matches | 0x1ebabf00 | Form A ((w>>63)&0x3F)==0x0a |
| SparseCoreScalarMiscAtomicTileAddOpcode::Matches | 0x1ebabbe0 | Form B base 8 / sub 1 |
| SparseCoreScalarMiscSyncEqualOpcode::Matches | 0x1ebab320 | Form B base 1 / mode 1 |
| SparseCoreScalarMiscSetSyncFlagOpcode::Matches | (gfc) | Form B base 5 / sub 0 |
| SparseCoreScalarMiscCoreInterruptOpcode::Matches | (gfc) | Form B base 0 / ext-ALU sub 0 |
| SparseCoreScalarMiscCountLeadingZerosOpcode::Matches | (gfc) | Form B base 0 / ext-ALU sub 14 |
| SparseCoreScalarAlu0IntegerAddOpcode::Matches | 0x1eb67660 | Form C (w&0x7E0…)==0x140… → 0x0a |
| SparseCoreScalarAlu0BitwiseAndOpcode::Matches | (gfc) | Form C → 0x0e |
| SparseCoreScalarAlu0CompareIntegerEqOpcode::Matches | (gfc) | Form C → 0x1e |
| SparseCoreScalarAlu0HaltOpcode::Matches | 0x1eb67500 | control escape (w15&0x7FF)==0 → 0x00 |
| SparseCoreScalarAlu0BranchAbsoluteOpcode::Matches | 0x1eb67d40 | control escape → 0x04 |
| SparseCoreScalarAlu0ReadRegisterLccLowOpcode::Matches | 0x1eb67560 | 17-bit escape → 0x280 |
| SparseCoreScalarAlu0SetDmaCreditOpcode::Matches | 0x1eb67ac0 | 16-bit escape → 0x4003 |
| SparseCoreScalarAlu0DivideWithRemainderXYPushQuotientOpcode::Matches | 0x1eb67e60 | divide-push → 0x160001 |
| SparseCoreScalarAlu0SetRotatingPredicateRegisterOpcode::Matches | 0x1eb67b00 | config escape, GF-only |
| SparseCoreScalarAlu1FloatingPointAddOpcode::Matches | 0x1eb7b4a0 | Form C ==0x44000000 → 0x11, Alu1-only |
| SparseCoreScalarAlu1AddCbregOpcode::Matches | 0x1eb7b5a0 | Form C (w6&0xFC000000)==0xCC000000 → 0x33 |
| SparseCoreScalarAlu1TaskRequestOpcode::Matches | 0x1eb7b620 | Form C ==0xDC000000 → 0x37 |
| SparseCoreScalarAlu0Encoder::Encode | 0x1eb693c0 | opcode BitCopy(.,181,.,6); escapes @176/170/165 |
| SparseCoreScsScalarMiscEncoder::Encode | 0x1eb914a0 | opcode BitCopy(.,127,.,6); pred @133/136/137 |
| BitCopy | 0x1fa0a900 | LE packer (dst, dst_bitoff, src, src_bitoff, nbits) |
Cross-gen anchors: vfc SparseCoreScalarMiscIntegerAddOpcode::Matches 0x1e8ff7c0 decodes ==0x0a (gen-invariance); vfc SparseCoreScalarAlu0HaltYieldOpcode::Matches 0x1ee81460 exists (VF-only). The TensorCore* and SparseCoreTec* op-form types share the gfc isa namespace — match the SparseCoreScalar/SparseCoreScsScalar prefix exactly to avoid pulling TC or TEC predicates into the SCS scalar roster.
Considerations
- Opcode source of truth is the Matches() immediate, not the encode-side switch. The Encoder::Encode dispatch uses sequential ODS enum ordinals as switch case labels; only the BitCopy value it writes (and the Matches() value that reads it back) is the silicon opcode. Decode and encode agree on the Matches() value; the switch case number is an internal ordinal.
- No FP and no branch in the Misc slot. ScsScalarMisc is sync/atomic + integer-ALU only. FP arithmetic, FP compare, branches, calls, multiply, divide, and SMEM load/store live in the ALU lanes. A scheduler must not place a sync op in an ALU lane or an FP/branch op in Misc.
- Lane asymmetry is structural. ScalarAlu1 owns SMEM load/store, CBREG, Dreg, FP add/sub, and task-request; ScalarAlu0 owns branch/call/convert/divide-push and FP/integer multiply. The shared arithmetic/compare block 0x0a..0x31 decodes identically on both, but the lane-exclusive ops have an encoder type on one lane only.
- The composite sub-opcode absolute bundle bit is partly inferred (HIGH / LOW). The 6-bit primary opcode bundle bit is confirmed (@127/@154/@181). The composite ScsScalarMisc sub-opcode field is recovered as a struct-relative offset (sync/atomic mode at bit 47, extended-ALU class at bit 58); its absolute bundle bit (slot base 111 + within-slot offset, per the SCS Engine 27-bit template) is not pinned for every one of the ~50 composite Misc ops individually. Decode each by primary opcode + the struct-relative sub field; the absolute-bit map for the sub field is a remaining gap.
- FloatingPointMultiply=0x13 and the FP minmax/compare values are HIGH, not CONFIRMED. The integer/bitwise/shift/compare and the lane-exclusive AddCbreg/TaskRequest/FloatingPointAdd values were read from their Matches() immediates directly; the FP-multiply, FP-minmax (0x1a/0x1b), and FP-compare (0x2a..0x2f) values are taken from the op-form roster ordering and a sampled subset, not a full per-op immediate sweep.
Related Components
| Name | Relationship |
|---|---|
| SparseCoreScsScalarMiscEncoder::Encode (0x1eb914a0) | writes the ScsScalarMisc opcode @127 and predication header |
| SparseCoreScalarAlu0Encoder::Encode (0x1eb693c0) | writes the ScalarAlu0 opcode @181 and the escape fields |
| BitCopy (0x1fa0a900) | the LE packer every slot encoder uses to write the opcode bits |
| per-op SparseCore<Slot><OpName>Opcode::Matches() | the opcode→mnemonic source — one type per opcode per gen |
Cross-References
- SCS (Scalar) Engine — the 32-byte bundle, the slot bases (111/138/165), and the 27-bit scalar slot template this roster's opcode field sits in.
- Vector Opcode Enumeration — the TEC vector-slot opcode roster (VectorAlu / Load / Store / Result / Extended); the vector-side analog of this page.
- TAC Engine — the tile-fetch DMA issuer (VF/GL) that reuses the SCS scalar-lane bundle layout for its Dma/Stream forms.
- SparseCore Overview — the three engine classes, per-gen presence, and the TpuSequencerType codec-template enum.
- M-Register Predicate Word — the 3-bit/4-bit predication header that overlays each scalar slot above the opcode field.
- CBREG Circular-Buffer Register — the circular-buffer registers driven by the Alu1 AddCbreg/ReadCbreg/WriteCbreg/MoveCbreg ops enumerated here.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part IX — SparseCore & BarnaCore / SparseCore ISA