V5+ EmitX Absolute Bit Positions

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so, BuildID 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 B, not stripped, .text VA == file offset). Other builds will differ.

Abstract

Jellyfish (v2) and Pufferfish (v4) build a bundle with a single monolithic encoder, EncoderJf::EncodeBundleInternal (0x1e86c7c0) and EncoderPfTensorCore::EncodeBundleInternal (0x1e8c5c40) respectively, that takes one Bundle object and packs every slot inline. Every V5+ generation in this build (Viperfish/vxc+vfc, Ghostlite/glc, 6acc60406/gfc) abandons that model entirely. There is no EncodeBundleInternal for any V5+ codec. Instead the bundle is produced by a two-stage chain: upstream, an xla::tpu::sparse_core::isa_emitter::EmitX proto-template populates a typed proto sub-message and sets a present bit; downstream, a per-slot <Slot>Encoder::Encode codec reads that proto and writes each field to a fixed absolute bit position in the flat bundle byte buffer via one universal bit-granular packer, BitCopy(dst, dst_bit, src, src_bit, nbits) (0x1fa0a900). This page consolidates the absolute bit positions for the sequencer, immediate, and predicate slots, which the per-generation bundle pages defer to this page.

The structural consequence is that the LLVM-MC InstBits table, which on a stock LLVM backend holds the fixed opcode bits of each instruction, is all zero on disk for every V5+ generation. There are no fixed instruction bits in a static table; the entire bundle layout is a flat list of (absolute-bit, width) pairs distributed across roughly two hundred per-op BitCopy helpers and orchestrated by a single EncodeBundle dispatcher. The set of per-slot BitCopy offsets is the generation's effective kIsaTable. A reimplementer who looks for a static InstBits array, or for EncoderVf::EncodeBundleInternal, will find neither.

The page is organized by the two encode stages, then by slot. The opening BitCopy and EmitImmediate sections establish the primitive and its proto-field convention; the immediate, sequencer, and predicate sections give the absolute (dst_bit, width) maps per generation; the MXU/result/EUP section consolidates the compute-slot positions; the closing per-(slot, gen) table is the single reference the Viperfish, Ghostlite, and 6acc60406 bundle pages cite for the deferred fields.

For reimplementation, the contract is:

  • The two-stage chain: EmitX proto population (Stage 1) → <Slot>Encoder::Encode BitCopy (Stage 2). No monolithic EncodeBundleInternal.
  • The BitCopy(dst, dst_bit, src, src_bit, nbits) calling convention and bit-exact, LSB-first semantics.
  • The EmitImmediate proto-slot map (which of 6 immediate slots a value lands in, and its present bit).
  • The absolute (dst_bit, width) for the sequencer (branch/call/LCC) discriminator, dest/x-target sreg, and predication fields, per generation, for the 32-byte SCS bundle and the 64-byte TC bundle.
  • The per-generation absolute-bit deltas: Viperfish baseline → Ghostlite +3-bit TC shift → 6acc60406 (its own layout, dedicated dual-predicate slot, 2-bit per-slot selector).

| Item | Value |
| --- | --- |
| Universal packer | BitCopy(void* dst, int dst_bit, const void* src, int src_bit, int nbits) @ 0x1fa0a900 (mangled _Z7BitCopyPviPKvii) |
| Calling convention | dst=rdi, dst_bit=esi, src=rdx, src_bit=ecx, nbits=r8d (System V); bit-granular, LSB-first |
| Stage-1 emitter namespace | xla::tpu::sparse_core::isa_emitter::EmitX<…> (EmitImmediate, EmitBranchOp, EmitCallOp, EmitPredicationToSlot) |
| Stage-2 codec namespace | asic_sw::deepsea::{vxc,vxc::vfc,gxc::glc,gxc::gfc}::isa::<Slot>Encoder::Encode |
| Orchestrator | EncodeBundle 0x1e838cc0 (6acc60406 codec; TC worker 0x1d371540), EncoderGlTensorCore::EncodeBundle 0x1d331d00 (Ghostlite) |
| SCS bundle | 32 B / 256 bit; branch/call offset (imm slot 0) at bit 67, all V5+ gens |
| TC bundle | 64 B / 512 bit; branch/call offset (imm slot 0) at bit 430 (vxc) / 433 (glc) / 423 (gfc) |
| InstBits table | 0x3366d90: all zero, no relocations, for every V5+ gen |

The Two-Stage Encode Chain

Purpose

Separate what an instruction means (a typed proto sub-message, generation-independent) from where its bits go (a per-generation BitCopy offset map). Stage 1 runs in the front-half emitter and produces a *Bundle proto with a populated sub-message per occupied slot. Stage 2 runs in the codec and serializes that proto into the wire bundle. The split is why the same LLO op encodes to three different bit layouts across Viperfish, Ghostlite, and 6acc60406 with no change to the front-end emitter logic — only the codec's offset literals differ.

Entry Point

LLO MCInst
  └─ isa_emitter::EmitX<Slot, OpKind>      ── Stage 1: populate proto sub-message
       │   sets present bit, writes operand fields
       ▼
  <gen> proto bundle (SparseCoreScsBundle / TensorCoreBundle / …)
       │
  EncodeBundle  0x1e838cc0  (dispatch on TpuSequencerType)
       ├─ case 0 → TensorCoreCodecBase<…>   → TC worker 0x1d371540
       ├─ case 3 → SparseCoreScsCodecBase<…> → EncoderBase::EncodeBundle
       └─ case 5 → SparseCoreTecCodecBase<…> → EncoderBase::EncodeBundle
                       │   walk per-slot Encoders in template-arg order
                       ▼
  <Slot>Encoder::Encode                    ── Stage 2: BitCopy fields → bundle bytes
       └─ BitCopy(buf, dst_bit, &field, 0, width)   0x1fa0a900

Algorithm

Stage 1, immediate population, confirmed from the demangled isa_emitter::EmitImmediate<glc::isa::SparseCoreImmediates> body (0x139f7060):

function EmitImmediate(slot_index, value, immediates_msg):    // 0x139f7060
    if value >= 0x100000:                                     // llvm::isUInt<20> RET_CHECK
        return Error("isa_emitter_base.h:587")                //   (signed 20-bit offsets are pre-masked)
    switch slot_index:                                        // jump on slot 0..5
        case 0: field = msg+0x18; present = 0x01; break       //   imm_0()
        case 1: field = msg+0x1c; present = 0x02; break       //   imm_1()
        case 2: field = msg+0x20; present = 0x04; break       //   imm_2()
        case 3: field = msg+0x24; present = 0x08; break       //   imm_3()
        case 4: field = msg+0x28; present = 0x10; break       //   imm_4()
        case 5: field = msg+0x2c; present = 0x20; break       //   imm_5()
        default: return Error("Invalid immediate: <n>")
    if (*(msg+0x10) & present) && *field != value:            // RET_CHECK: re-set must match
        return Error("imm_N() == value")
    *field = value
    *(msg+0x10) |= present                                    // set present bit at msg+0x10
    return Ok
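
The Stage 1 logic above can be modelled in a few lines of Python. This is a minimal sketch, not the library's code: the `Immediates` class and the `emit_immediate` name are hypothetical stand-ins for the proto sub-message and the emitter, and the present word at msg+0x10 is modelled as a plain integer bitmask.

```python
from dataclasses import dataclass, field

IMM_SLOTS = 6            # imm_0 .. imm_5
IMM_LIMIT = 1 << 20      # isUInt<20> bound from the RET_CHECK (0x100000)


@dataclass
class Immediates:
    """Hypothetical stand-in for the immediates proto sub-message."""
    present: int = 0                                        # models the word at msg+0x10
    imm: list = field(default_factory=lambda: [0] * IMM_SLOTS)


def emit_immediate(slot_index: int, value: int, msg: Immediates) -> None:
    """Store value in one of six logical slots and set its present bit."""
    if not 0 <= value < IMM_LIMIT:
        raise ValueError("value does not fit in 20 unsigned bits")
    if not 0 <= slot_index < IMM_SLOTS:
        raise ValueError(f"Invalid immediate: {slot_index}")
    bit = 1 << slot_index                                   # 0x01, 0x02, ... 0x20
    if msg.present & bit and msg.imm[slot_index] != value:
        raise ValueError("slot already holds a different value")
    msg.imm[slot_index] = value
    msg.present |= bit


msg = Immediates()
emit_immediate(0, 0x12345, msg)
emit_immediate(0, 0x12345, msg)    # re-setting the same value is accepted
emit_immediate(3, 7, msg)
```

Note that nothing in this stage mentions a bundle bit position; the sketch only records which logical slot is occupied.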

Stage 2, the per-slot encoder, reads each populated sub-message and emits a sequence of BitCopy calls. The encoder body is a switch on the proto opcode discriminator — at proto+0x50 on the Viperfish/Ghostlite scalar encoders and on the gfc TC sequencer, at proto+0x58 on the gfc SCS sequencer and gfc predicates encoder (the gfc proto grows a wider header); each arm is one op family's field layout. The generic shape (from the immediates encoders) is:

function <Slot>Encoder::Encode(this, proto, out_buf):
    scratch[0] = proto.imm_0                                  // stage value into 8-byte scratch
    BitCopy(out_buf, ABS_BIT_0, scratch, 0, WIDTH_0)          // place WIDTH_0 bits at ABS_BIT_0
    scratch[0] = proto.imm_1
    BitCopy(out_buf, ABS_BIT_1, scratch, 0, WIDTH_1)
    ...                                                       // one BitCopy per field
    return Ok

NOTE — EmitImmediate does not know the absolute bit. It only chooses one of six proto fields and sets a present bit. The absolute bit is decided entirely in Stage 2 by the encoder's BitCopy literal. This is why the immediate-slot bit positions differ between SCS (bit 67) and TC (bit 430) for the same logical immediate slot 0 — same proto field, different encoder.


BitCopy — The Universal Packer

Purpose

One function packs every field of every V5+ bundle. There is no per-type serializer; each <Slot>Encoder::Encode is a flat list of BitCopy calls. Recovering the layout of any slot reduces to reading the (dst_bit, nbits) immediate pair preceding each BitCopy call in the named encoder.

Calling Convention and Semantics

BitCopy is _Z7BitCopyPviPKvii, i.e. BitCopy(void* dst, int dst_bit, const void* src, int src_bit, int nbits). Under System V the arguments land in rdi, esi, rdx, ecx, r8d. The decompiled body (0x1fa0a900) confirms bit-exact, LSB-first behavior:

function BitCopy(dst, dst_bit, src, src_bit, nbits):          // 0x1fa0a900
    if nbits == 0: return                                     //   early-out, then vzeroupper
    dst_byte = dst_bit / 8                                    //   = dst_bit >> 3
    dst_off  = dst_bit & 7                                    //   bit-in-byte
    src_byte = src_bit / 8
    src_off  = src_bit & 7
    // copy nbits from src starting at (src_byte, src_off) into dst at (dst_byte, dst_off),
    // LSB-first, preserving the surrounding bits of any straddled dst byte
    // (vectorized inner loop when the run is >= 24 bits; scalar head/tail otherwise)

The body computes dst_bit / 8 and dst_bit & 7 exactly as written, masks the leading/trailing partial bytes so neighboring fields are untouched, and uses an AVX inner loop for runs of 24 bits or more. Bit 0 is the LSB of byte 0 of the bundle buffer.
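
A reference model of these semantics is short. The sketch below copies one bit at a time, so it does not reproduce the vectorized inner loop of the real function; the name `bit_copy` and the use of `bytearray` are choices made for the illustration.

```python
def bit_copy(dst: bytearray, dst_bit: int, src: bytes, src_bit: int, nbits: int) -> None:
    """Copy nbits from src (from src_bit) into dst (from dst_bit), LSB-first.

    Bit n lives in byte n >> 3 at position n & 7, so bit 0 is the LSB of byte 0.
    Bits of dst outside the written window are left untouched.
    """
    for i in range(nbits):
        s = src_bit + i
        d = dst_bit + i
        bit = (src[s >> 3] >> (s & 7)) & 1
        if bit:
            dst[d >> 3] |= 1 << (d & 7)
        else:
            dst[d >> 3] &= ~(1 << (d & 7)) & 0xFF


bundle = bytearray(32)                       # 32-byte SCS bundle, all zero
scratch = (0x12345).to_bytes(8, "little")    # value staged in an 8-byte scratch
bit_copy(bundle, 67, scratch, 0, 20)         # 20 bits at absolute bit 67
```

Reading the buffer as one little-endian integer gives the same bit numbering, which is a convenient way to check a field after packing.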

QUIRK — because every field is an independent BitCopy into a shared buffer, field order does not matter for correctness as long as windows do not overlap. The encoder writes the predication header first, then dispatches per-op, but a reimplementation may emit the calls in any order. Overlap is possible by design: on 6acc60406 SCS the 3-bit predicate selector and the 4-bit dual-predicate index both start at bit 187 (see Predicate Slot).
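
Because overlapping windows are the one case where call order matters, a reimplementation may want to flag them explicitly. The sketch below is a hypothetical helper (`overlapping_windows` is not a name from the binary) run on the four 6acc60406 SCS predication windows described on this page.

```python
from itertools import combinations


def overlapping_windows(fields):
    """Return the name pairs whose [dst_bit, dst_bit + width) ranges intersect."""
    hits = []
    for (n1, b1, w1), (n2, b2, w2) in combinations(fields, 2):
        if b1 < b2 + w2 and b2 < b1 + w1:
            hits.append((n1, n2))
    return hits


# (name, dst_bit, width) for the gfc SCS predication fields
GFC_SCS_PREDICATION = [
    ("selector", 187, 3),
    ("selector_inversion", 190, 1),
    ("dual_index", 187, 4),
    ("dual_inversion", 191, 1),
]

overlaps = overlapping_windows(GFC_SCS_PREDICATION)
```

Here the 4-bit dual index covers both the 3-bit selector and the selector's inversion bit, so only one of the two encodings can be live in a given bundle.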

Every concrete mov esi,<dst_bit>; mov r8d,<width>; call BitCopy therefore reads directly as "place width bits at absolute bit dst_bit". All positions below were recovered from those immediates in the named encoders.


Immediate Slot Map

Purpose

The branch/call/sync target offset and all other immediates live in the immediate slots, not in any opcode field. A branch's 20-bit signed target offset is immediate slot 0; the opcode discriminator in the sequencer slot only distinguishes absolute/relative/call. These are the positions for which the per-generation bundle pages defer to this page.

SCS Bundle (32 B / 256 bit)

SparseCoreImmediatesEncoder::Encode reads proto fields a2[6]..a2[11] (proto +0x18..+0x2c, i.e. imm_0..imm_5) and BitCopys each at width 20. Confirmed byte-identical across all three generations:

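| imm slot | proto field | dst_bit (hex) | dst_bit (dec) | width |
| --- | --- | --- | --- | --- |
| 0 (branch/call offset) | +0x18 | 0x43 | 67 | 20 |
| 1 | +0x1c | 0x2f | 47 | 20 |
| 2 | +0x20 | 0x1b | 27 | 20 |
| 3 | +0x24 | 0x07 | 7 | 20 |
| 4 | +0x28 | 0xd7 | 215 | 20 |
| 5 | +0x2c | 0xc3 | 195 | 20 |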

Encoder addresses: vfc 0x1ee75ee0, glc 0x1eb563c0, gfc 0x1ecd1760. Slots 4 and 5 (bits 215/195) appear only in the full SparseCoreImmediatesEncoder; the SCS branch/call path uses only slots 0..3. A second class, SparseCoreScalarImmediatesEncoder (gfc 0x1eb5bd20), packs only slots 0..3 at the same bits 67/47/27/7. It is the encoder that the gfc SCS codec template names (confirmed in the SparseCoreScsCodecBase<…> argument list of EncodeBundle 0x1e838cc0, case 3), so 0x1eb5bd20 is the function actually invoked on the gfc SCS branch path, while the full SparseCoreImmediatesEncoder::Encode (slots 0..5) lives at 0x1ecd1760. The branch/call offset (imm slot 0 = bit 67) is identical between the two encoders.

NOTE — EncodeBundle 0x1e838cc0 is the 6acc60406 codec dispatcher specifically: its default arm reports "EncodeBundle not implemented for sequencer type", and its three live cases construct gfc::isa::{TensorCore,SparseCoreScs,SparseCoreTec}CodecBase<…>. Ghostlite has its own EncoderGlTensorCore::EncodeBundle (0x1d331d00). The slot-walk mechanism is shared across v5+, but the dispatcher symbol is generation-specific.

TensorCore Bundle (64 B / 512 bit)

TensorCoreImmediatesEncoder::Encode reads imm_0..imm_5 (proto +0x18..+0x2c) and BitCopys each at width 20. The per-generation bit positions, confirmed verbatim from the encoder BitCopy literals:

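| imm slot | proto field | vxc (VF) | glc (GL) | gfc (GF) | width |
| --- | --- | --- | --- | --- | --- |
| 0 (branch/call offset) | +0x18 | 430 (0x1ae) | 433 (0x1b1) | 423 (0x1a7) | 20 |
| 1 | +0x1c | 410 | 413 | 403 | 20 |
| 2 | +0x20 | 390 | 393 | 383 | 20 |
| 3 | +0x24 | 370 | 373 | 363 | 20 |
| 4 | +0x28 | 350 | 353 | 343 | 20 |
| 5 | +0x2c | 330 | 333 | 323 | 20 |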

Encoder addresses: vxc 0x1eebee40, glc 0x1f20d520, gfc 0x1f86de20.

The per-generation delta is the central fact: Ghostlite = Viperfish + 3 bits (the TC scalar/sequencer/immediate region shifts +3 to make room for the 7→8-bit opcode widening), and 6acc60406 = Viperfish − 7 bits for the immediate block (the immediate region moves down to make room for the wider scalar/predicate region above it, which on gfc includes the dedicated dual-predicate slot at bits 496..505).

GOTCHA — the immediate slot index in EmitImmediate is logical, not a fixed bit. Slot 0 is bit 67 on SCS but bit 430/433/423 on TC. A reimplementation that hardcodes one bit for "immediate slot 0" will corrupt every cross-engine branch. Resolve the bit from the (engine, generation) pair, not the slot index alone.
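
One way to avoid the hardcoding mistake is to keep the immediate positions behind a single lookup keyed on engine and generation. The sketch below uses the values from the two tables above; the function name `imm_bit` and the string keys are illustrative choices, not names from the binary.

```python
IMM_WIDTH = 20

# SCS: identical on vfc, glc and gfc (slots 4 and 5 sit above slots 0..3)
SCS_IMM_BITS = (67, 47, 27, 7, 215, 195)

# TC: slot 0 position per generation; each further slot is 20 bits lower
TC_IMM_SLOT0 = {"vxc": 430, "glc": 433, "gfc": 423}


def imm_bit(engine: str, gen: str, slot: int) -> int:
    """Absolute dst_bit of a logical immediate slot for one (engine, generation)."""
    if not 0 <= slot < 6:
        raise ValueError(f"Invalid immediate: {slot}")
    if engine == "scs":
        return SCS_IMM_BITS[slot]
    if engine == "tc":
        return TC_IMM_SLOT0[gen] - IMM_WIDTH * slot
    raise ValueError(f"unknown engine: {engine}")


tc_gfc = [imm_bit("tc", "gfc", s) for s in range(6)]
```

The TC rows are an arithmetic progression, so only the slot 0 anchor needs to be stored per generation; the SCS positions are not a single progression and are kept as a literal tuple.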


Sequencer Slot Map

Purpose

The sequencer slot (ScalarAlu0) carries the branch/call discriminator, the call return-address (dest) sreg, the branch-by-register x-target sreg, and a predication header. The branch/call target offset is not here; it is in immediate slot 0. The per-generation bundle pages defer to this page for the sequencer-slot positions; the TC sequencer in particular could not be bit-extracted from InstBits because InstBits is empty.

Discriminator Model

Every ScalarAlu0 op writes an opcode-HIGH field (width 6, family) and an opcode-LOW field (width 5, addressing discriminator). For all branch/call control ops opcode-HIGH = 0; the scalar-ALU compute ops carry their opcode in opcode-HIGH instead (e.g. CompareIntegerEq = 0x1e). The opcode-LOW discriminator values are uniform across SCS and TC, all V5+ generations:

| opcode-LOW | Op | Extra fields |
| --- | --- | --- |
| 4 | BranchAbsolute | offset → imm slot 0 |
| 5 | BranchRelative | offset → imm slot 0 |
| 6 | CallAbsolute | offset → imm slot 0; dest (link) sreg → dest field |
| 7 | CallRelative | offset → imm slot 0; dest (link) sreg → dest field |
| 4 (family field used) | BranchSreg | x-target sreg → x-target field |
| 0x18 (24) | BranchRelativeRotatingPreg | rotating-preg index → dedicated field (gfc SCS only) |

The branch offset range is signed 20-bit (−0x80000..+0x7FFFF) for absolute, relative, and call alike; the abs/rel distinction is purely the discriminator value. A return is not a dedicated op — it is a BranchSreg reading the link sreg.
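
The signed range and the 20-bit unsigned check in Stage 1 fit together if the offset is reduced to its low 20 bits before it reaches EmitImmediate. The sketch below assumes that "pre-masked" means plain two's-complement truncation; the helper names are hypothetical.

```python
# opcode-LOW discriminators for the four offset-carrying control ops
BRANCH_DISCRIMINATOR = {
    "BranchAbsolute": 4,
    "BranchRelative": 5,
    "CallAbsolute": 6,
    "CallRelative": 7,
}


def encode_offset20(offset: int) -> int:
    """Signed offset -> 20-bit field value (assumed two's complement)."""
    if not -0x80000 <= offset <= 0x7FFFF:
        raise ValueError("offset outside the signed 20-bit range")
    return offset & 0xFFFFF


def decode_offset20(field: int) -> int:
    """20-bit field value -> signed offset."""
    return field - 0x100000 if field & 0x80000 else field


back_one = encode_offset20(-1)     # all twenty bits set
```

Every value produced this way is below 0x100000, so it passes the unsigned range check regardless of sign.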

SCS Sequencer (SparseCoreScalarAlu0Encoder::Encode)

The encoder writes a common predication header, then dispatches jmp *jt[proto+0x50] (bound 0x56 = 86 entries) to a per-op helper. Confirmed from glc 0x1e9d2140 and the BranchAbsolute helper 0x1e9d67c0; Viperfish (vfc encoder 0x1ee82ce0, BranchAbsolute helper 0x1ee873c0) is byte-identical.

| field | dst_bit | hex | width | Source / written by |
| --- | --- | --- | --- | --- |
| predication reg index | 187 | 0xbb | 4 | proto +0x20, main encoder |
| predication inversion | 191 | 0xbf | 1 | proto +0x18 (byte), main encoder |
| opcode-HIGH / family | 181 | 0xb5 | 6 | per-op helper (=0 for branch/call) |
| opcode-LOW / discriminator | 176 | 0xb0 | 5 | per-op helper (4/5/6/7/0x18) |
| x-target / 2nd operand | 170 (gfc) / 176 (vfc/glc) | 0xaa / 0xb0 | 6 / 5 | BranchSreg/aux (gfc 0x1eb6dd40, vfc 0x1ee87480) |
| call dest (return-addr) sreg | 165 | 0xa5 | 5 | CallAbsolute/CallRelative |
| rotating-preg index (gfc) | 165 | 0xa5 | 4 | BranchRelativeRotatingPreg |

On 6acc60406 (gfc encoder 0x1eb693c0) the SCS predication narrows: a 3-bit selector at bit 187 (0xbb) + inversion at bit 190 (0xbe), overlaid with a 4-bit dual-predicate index also at bit 187 + inversion at bit 191. The gfc BranchRelativeRotatingPreg helper (0x1eb6b9c0) writes discriminator 24 at bit 176 (w5), opcode-HIGH 0 at bit 181 (w6), rotating-preg index at bit 165 (w4), and a 6-bit aux at bit 170.
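
Putting the vfc/glc SCS positions together, a BranchAbsolute bundle can be assembled as below. This is a sketch under stated assumptions: `put_bits` is a hypothetical integer-based stand-in for BitCopy, only the fields named on this page are written, and the default predicate values are placeholders rather than a known "always" encoding.

```python
def put_bits(buf: bytearray, dst_bit: int, width: int, value: int) -> None:
    """Write the low `width` bits of value at absolute bit dst_bit (bit 0 = LSB of byte 0)."""
    word = int.from_bytes(buf, "little")
    mask = ((1 << width) - 1) << dst_bit
    word = (word & ~mask) | ((value << dst_bit) & mask)
    buf[:] = word.to_bytes(len(buf), "little")


def encode_scs_branch_absolute(offset20: int, pred_reg: int = 0, pred_inv: int = 0) -> bytes:
    """vfc/glc SCS layout: predication header, opcode pair, then imm slot 0."""
    buf = bytearray(32)                 # 32-byte SCS bundle
    put_bits(buf, 187, 4, pred_reg)     # predication reg index
    put_bits(buf, 191, 1, pred_inv)     # predication inversion
    put_bits(buf, 181, 6, 0)            # opcode-HIGH = 0 for branch/call
    put_bits(buf, 176, 5, 4)            # opcode-LOW = 4 (BranchAbsolute)
    put_bits(buf, 67, 20, offset20)     # branch offset, imm slot 0
    return bytes(buf)


bundle = encode_scs_branch_absolute(0x12345, pred_reg=3, pred_inv=1)
```

The sequencer fields and the offset end up more than a hundred bits apart in the same buffer, which is the two-field split the discriminator model describes.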

TC Sequencer (TensorCoreScalarAlu0Encoder::Encode)

The slot InstBits could not hold. Confirmed from vxc 0x1eecb900 (and helper 0x1eecf960), glc 0x1f219b40 (and helper 0x1f21da40), and gfc 0x1f87b420.

| field | vxc (VF) | glc (GL) | gfc (GF) |
| --- | --- | --- | --- |
| predication reg index | bit 499 w4 | bit 502 w4 | - (2-bit selector) |
| predication inversion | bit 503 w1 | bit 506 w1 | - |
| predication 2-bit selector | - | - | bit 489 w2 |
| opcode-HIGH / family | bit 493 w6 | bit 496 w6 | bit 483 w6 |
| opcode-LOW / discriminator | bit 488 w5 | bit 491 w5 | bit 478 w5 |
| x-target / 2nd operand | bit 482 w6 | bit 485 w6 | bit 472 w6 |
| call dest (return-addr) sreg | bit 477 w5 | bit 480 w5 | bit 467 w5 |

Ghostlite shifts the entire TC scalar/sequencer region +3 bits above Viperfish, in lockstep with the +3-bit TC immediate-block shift (433 vs 430): the whole TC scalar/sequencer/immediate block translates as one rigid window to absorb the 7→8-bit opcode widening (glc::isa::TensorCoreScalarAlu0Encoder::Encode 0x1f219b40 and its BranchAbsolute helper 0x1f21da40: predicate reg @ 502 not 499, inversion @ 506 not 503, opcode-HIGH @ 496 not 493, opcode-LOW @ 491 not 488, x-target aux @ 485 not 482, call dest @ 480 not 477). The SCS sequencer is not shifted: there glc (0x1e9d2140) is byte-identical to vfc (0x1ee82ce0); see Ghostlite Bundle. 6acc60406's TC scalar slot is the widest: it adds the 6-bit operand at bit 472 and shrinks per-slot predication to a 2-bit selector at bit 489, with the actual 16-register predicate pool moved to the dedicated TensorCorePredicates slot.

NOTE — the TC vxc x-target spans two distinct fields. Bit 488 (w5) is the opcode-LOW discriminator (shared with the BranchSreg x-target value); the secondary-operand / LCC-read aux field is a separate 6-bit field at bit 482 (0x1da), per the vxc encoder body (0x1eecb900). A BranchSreg overwrites the discriminator window with the x() sreg, but an LCC read uses both the 5-bit discriminator at 488 and the 6-bit aux at 482.


Predicate Slot Map

Purpose

Pin the exact bit offset of the predicate field within each V5+ bundle slot, the position that prior predicate-field analysis left open (it was "governed by InstBits", which is empty). The model differs between the Viperfish/Ghostlite per-slot scheme and the 6acc60406 dedicated dual-predicate slot.

Viperfish / Ghostlite — Per-Slot 4+1 Field

Every populated functional slot carries its own predicate: a 4-bit register index plus a 1-bit inversion (the 2-bit extension of the encodePredicateOperand layout is the high end of the field, 0 in non-rotating code). At the top of the scalar slot:

| Slot | reg index | inversion | Encoder |
| --- | --- | --- | --- |
| TC ScalarAlu0 (vxc) | bit 499 w4 | bit 503 w1 | 0x1eecb900 |
| TC ScalarAlu0 (glc) | bit 502 w4 | bit 506 w1 | 0x1f219b40 |
| SCS ScalarAlu0 (glc/vfc) | bit 187 w4 | bit 191 w1 | 0x1e9d2140 / 0x1ee82ce0 |

6acc60406 — Dedicated Dual-Predicate Slot

TensorCorePredicatesEncoder::Encode (gfc 0x1f86e500) writes two per-bundle predicates into the very top of the 64-byte TC bundle; each functional slot then carries only a 2-bit selector choosing among {pred_0, pred_1, always, never}:

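| field | dst_bit | hex | width | proto src |
| --- | --- | --- | --- | --- |
| pred_0 reg | 501 | 0x1f5 | 4 | msg word |
| pred_0 inversion | 505 | 0x1f9 | 1 | msg +0x20 (byte) |
| pred_1 reg | 496 | 0x1f0 | 4 | msg word |
| pred_1 inversion | 500 | 0x1f4 | 1 | msg +0x21 (byte) |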

The 16-register predicate pool (the PredicationSlot enum 0..15) is encoded into pred_0/pred_1: each entry names one pool register plus an inversion bit, and the per-slot 2-bit selector then chooses among pred_0, pred_1, always, and never. The "predication overflow" condition (both predicate slots already taken) is exactly the state where these two (4-bit reg, 1-bit inversion) entries at bits 496..505 are full. The exact value→{pred_0,pred_1,always,never} mapping of the 2-bit selector was not decoded (the selector's own jump table was not walked); only the selector field offsets are confirmed, so the value semantics carry LOW confidence.
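
The four dual-predicate fields tile bits 496..505 exactly, which a short packing sketch makes visible. `put_bits` is a hypothetical integer-based stand-in for BitCopy, and the function name and tuple arguments are illustrative; the per-slot 2-bit selector is deliberately left out because its value mapping is not decoded.

```python
def put_bits(buf: bytearray, dst_bit: int, width: int, value: int) -> None:
    """Write the low `width` bits of value at absolute bit dst_bit (bit 0 = LSB of byte 0)."""
    word = int.from_bytes(buf, "little")
    mask = ((1 << width) - 1) << dst_bit
    word = (word & ~mask) | ((value << dst_bit) & mask)
    buf[:] = word.to_bytes(len(buf), "little")


def encode_gfc_tc_predicates(pred_0, pred_1) -> bytes:
    """Pack two (reg, inversion) pairs into the top of a 64-byte gfc TC bundle."""
    buf = bytearray(64)
    reg0, inv0 = pred_0
    reg1, inv1 = pred_1
    put_bits(buf, 501, 4, reg0)    # pred_0 reg
    put_bits(buf, 505, 1, inv0)    # pred_0 inversion
    put_bits(buf, 496, 4, reg1)    # pred_1 reg
    put_bits(buf, 500, 1, inv1)    # pred_1 inversion
    return bytes(buf)


bundle = encode_gfc_tc_predicates(pred_0=(5, 1), pred_1=(10, 0))
top10 = (int.from_bytes(bundle, "little") >> 496) & 0x3FF
```

Read from bit 496 upward, the 10-bit window is pred_1 reg, pred_1 inversion, pred_0 reg, pred_0 inversion.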


Compute Slots — MXU, Result, EUP

Purpose

The MXU matmul/push/latch (VectorExtended), the matres/EUP pop (VectorResult), and the transcendental push (VALU slot 3) come from the same <Slot>Encoder::Encode + BitCopy mechanism. They are documented in full on MXU Slot, VPU Slot, and EUP/Transcendental Slot; the absolute bit positions are consolidated here.

MXU VectorExtended Slot

Two MXU slots (VectorExtended0, VectorExtended1) — one per physical matrix unit. The two slots share the source-vreg (systolic-feed) region but have opcode/control regions offset by a 25-bit slot stride on 6acc60406. The opcode field widens 7→8 bits across generations, mirroring the VALU slot. Confirmed from gfc MatrixMultiplyBf16 0x1f99a920:

| field | vxc | glc | gfc VEx0 | gfc VEx1 |
| --- | --- | --- | --- | --- |
| MXU-id (unit) [proto +0x1c] | bit 64 w4 | bit 66 w4 | bit 70 w2 | bit 45 w2 |
| opcode-HIGH | bit 57 w7 | bit 58 w8 | bit 62 w8 | bit 37 w8 |
| data-format sub-disc | bit 51 w4 | bit 52 w4 | bit 57 w4 | bit 32 w4 |
| done-gains/latch flag | bit 55 w2 | bit 56 w1 | bit 61 w1 | bit 36 w1 |
| control (3-bit) | bit 48 w3 | bit 49 w3 | bit 54 w3 | bit 29 w3 |
| primary operand | bit 180 w6 | bit 183 w6 | bit 47 w7 | bit 22 w7 |
| src vregs (8 × 6-bit, gfc) | - | - | 156/276/287/243/254/210/221/177 | (same) |
| opcode bound (#ops) | 0x66 (103) | 0x70 (113) | 0x54 (85) | 0x54 (85) |

Encoder addresses: vxc 0x1efa0f60, glc 0x1f32fd00, gfc VEx0 0x1f996940 / VEx1 0x1f9d3800. The weight latch is LoadMatrixRegister{Gmr,Lmr}{Msra,Msrb} (opcode-HIGH 0x37 on gfc, 0x1f9a04a0); the moving-operand push is PushMatrix{fmt} (opcode-HIGH 0xe). The 8 source-vreg fields being byte-identical between VEx0 and VEx1 is the encoding statement that both MXUs draw the same vector read ports.
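
The 25-bit slot stride on 6acc60406 can be checked directly against the table: every VEx1 control field is the VEx0 field minus 25, and the six VEx0 fields tile exactly 25 contiguous bits. The sketch below only restates the table values; the dictionary keys are illustrative names.

```python
VEX_STRIDE = 25

# gfc VectorExtended0 control fields: name -> (dst_bit, width)
GFC_VEX0 = {
    "primary_operand": (47, 7),
    "control": (54, 3),
    "data_format": (57, 4),
    "latch_flag": (61, 1),
    "opcode_high": (62, 8),
    "mxu_id": (70, 2),
}


def vex1_field(name: str):
    """Derive the VectorExtended1 position from VectorExtended0."""
    bit, width = GFC_VEX0[name]
    return bit - VEX_STRIDE, width


vex1 = {name: vex1_field(name) for name in GFC_VEX0}

# The VEx0 fields cover bits 47..71 with no gap and no overlap
covered = sorted(b for bit, w in GFC_VEX0.values() for b in range(bit, bit + w))
```

This derivation applies only to the opcode/control region; the eight source-vreg fields are shared and do not move between the two slots.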

VectorResult Slot (matres pop / EUP pop)

VectorResult0Encoder::Encode reads a result-type discriminator (proto +0x1c), dispatches jmp *jt[proto+0x50], sets the per-result-type sub-message present tag, then a common tail BitCopys the dest vreg. Confirmed from gfc 0x1fa01820:

| field | vxc | glc | gfc |
| --- | --- | --- | --- |
| result-type discriminator | bit 24 w4 | bit 24 w4 | bit 20 w2 |
| dest vreg | bit 14 w6 | bit 14 w6 | bit 11 w6 |
| PopMxu accum-mode/format | (per +0x1c) | (per +0x1c) | bit 323 w8 |
| result-opcode bound | 0x8 (9) | 0x8 (9) | 0x7 (8) |

The matres-pop opcode is 6 (PopMxuResult) and the EUP-pop opcode is 7 (PopEupResult) on all generations. Ghostlite adds PopAddMxu01Result (fused matres+accumulate, K>128 multi-pass); Viperfish adds PopCcrfResult (scalar/CRF pop). The result slot's own predication field accessor is TensorCoreVectorResult1PredicationField::GetConcatenatedValue (gfc 0x1fa02520); its exact bit was not individually walked (the adjacent result-mode/format fields at bits 17..21 are decoded).

EUP / Transcendental Push — VALU Slot 3

On all V5+ generations the transcendental push is a VALU slot-3 (Alu3) op, not a VectorExtended op — confirmed by the EUP helpers existing only in the Alu3 set. The VALU opcode field selects the EUP-push family (value 0x0); a 5-bit EUP-function selector picks the transcendental. Confirmed from gfc F32Tanh 0x1f96ae40:

| field | vxc | glc | gfc |
| --- | --- | --- | --- |
| VALU opcode (EUP-push family = 0x0) | bit 197 w7 | bit 194 w8 | bit 194 w8 |
| EUP-function selector | bit 186 w5 | bit 183 w5 | bit 183 w5 |
| src vreg | (slot-3) | bit 188 w6 | bit 188 w6 |

The gfc 5-bit function selector value map (VALU op = 0x0, selector @ bit 183), confirmed per-helper:

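| function | F32 selector | Bf16 selector |
| --- | --- | --- |
| Erf | 0x0e | 0x0f |
| ReciprocalSqrt | 0x10 | 0x0c |
| PowTwo (2^x) | 0x11 | 0x19 |
| LogTwo (log2) | 0x12 | 0x1a |
| Tanh | 0x13 | 0x1b |
| ShiftedSigmoid | 0x14 | 0x1c |
| Reciprocal | 0x15 | 0x1d |
| Sinq (sin) | 0x17 | 0x1e |
| Cosq (cos) | 0x18 | 0x1f |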

The push-pop protocol is bit-exact: PUSH in bundle N is the VALU slot-3 op above; POP in bundle N+k is a VectorResult op with result-opcode 7 (PopEupResult), dest vreg at bit 11 (gfc). The EUP is single-issue, so only VALU slot 3 sources the EUP push.
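
The gfc push side can be sketched from the three fields and the selector map above. As before, `put_bits` is a hypothetical stand-in for BitCopy, the function and key names are illustrative, and only the fields listed on this page are written (predication and the other slots are left at zero).

```python
def put_bits(buf: bytearray, dst_bit: int, width: int, value: int) -> None:
    """Write the low `width` bits of value at absolute bit dst_bit (bit 0 = LSB of byte 0)."""
    word = int.from_bytes(buf, "little")
    mask = ((1 << width) - 1) << dst_bit
    word = (word & ~mask) | ((value << dst_bit) & mask)
    buf[:] = word.to_bytes(len(buf), "little")


# gfc EUP-function selector values: name -> (F32, Bf16)
GFC_EUP_SELECTOR = {
    "Erf": (0x0E, 0x0F), "ReciprocalSqrt": (0x10, 0x0C), "PowTwo": (0x11, 0x19),
    "LogTwo": (0x12, 0x1A), "Tanh": (0x13, 0x1B), "ShiftedSigmoid": (0x14, 0x1C),
    "Reciprocal": (0x15, 0x1D), "Sinq": (0x17, 0x1E), "Cosq": (0x18, 0x1F),
}


def encode_gfc_eup_push(function: str, bf16: bool, src_vreg: int) -> bytes:
    """Write the VALU slot-3 EUP push fields into a 64-byte gfc TC bundle."""
    buf = bytearray(64)
    selector = GFC_EUP_SELECTOR[function][1 if bf16 else 0]
    put_bits(buf, 194, 8, 0x0)         # VALU opcode: EUP-push family
    put_bits(buf, 183, 5, selector)    # EUP-function selector
    put_bits(buf, 188, 6, src_vreg)    # source vreg
    return bytes(buf)


push = encode_gfc_eup_push("Tanh", bf16=False, src_vreg=9)
```

The selector, source vreg, and opcode occupy bits 183..201 back to back, so the three writes never touch each other's windows.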


Per-(Slot, Generation) Absolute-Bit-Position Table

The consolidated reference cited by the per-generation bundle pages. All positions are bit <n> w<width>; bit 0 = LSB of byte 0.

SCS Bundle (32 B / 256 bit)

| slot / field | vfc (VF) | glc (GL) | gfc (GF) |
| --- | --- | --- | --- |
| branch/call offset (imm 0) | 67 w20 | 67 w20 | 67 w20 |
| imm slot 1 | 47 w20 | 47 w20 | 47 w20 |
| imm slot 2 | 27 w20 | 27 w20 | 27 w20 |
| imm slot 3 | 7 w20 | 7 w20 | 7 w20 |
| seq opcode-HIGH | 181 w6 | 181 w6 | 181 w6 |
| seq opcode-LOW (discriminator) | 176 w5 | 176 w5 | 176 w5 |
| seq x-target / 2nd operand | 176 w5 | 176 w5 | 170 w6 |
| seq call dest sreg | 165 w5 | 165 w5 | 165 w5 |
| seq predicate reg | 187 w4 | 187 w4 | 187 w3 (selector) / w4 (dual) |
| seq predicate inversion | 191 w1 | 191 w1 | 190 w1 (selector) / 191 w1 (dual) |
| rotating-preg index | - | - | 165 w4 |

TensorCore Bundle (64 B / 512 bit)

| slot / field | vxc (VF) | glc (GL) | gfc (GF) |
| --- | --- | --- | --- |
| branch/call offset (imm 0) | 430 w20 | 433 w20 | 423 w20 |
| imm slots 1..5 | 410/390/370/350/330 | 413/393/373/353/333 | 403/383/363/343/323 |
| seq opcode-HIGH | 493 w6 | 496 w6 | 483 w6 |
| seq opcode-LOW (discriminator) | 488 w5 | 491 w5 | 478 w5 |
| seq x-target / 2nd operand (aux) | 482 w6 | 485 w6 | 472 w6 |
| seq call dest sreg | 477 w5 | 480 w5 | 467 w5 |
| seq per-slot predicate | reg 499 w4 + inv 503 w1 | reg 502 w4 + inv 506 w1 | 2-bit selector @ 489 |
| dual predicate pred_0 | - | - | reg 501 w4 + inv 505 w1 |
| dual predicate pred_1 | - | - | reg 496 w4 + inv 500 w1 |
| MXU opcode-HIGH (VEx0) | 57 w7 | 58 w8 | 62 w8 |
| MXU data-format sub-disc | 51 w4 | 52 w4 | 57 w4 |
| MXU-id (unit) | 64 w4 | 66 w4 | 70 w2 |
| VectorResult discriminator | 24 w4 | 24 w4 | 20 w2 |
| VectorResult dest vreg | 14 w6 | 14 w6 | 11 w6 |
| EUP-push VALU opcode (Alu3) | 197 w7 | 194 w8 | 194 w8 |
| EUP-function selector | 186 w5 | 183 w5 | 183 w5 |

The generation deltas, in one line: Ghostlite shifts the entire TC scalar/sequencer/immediate window +3 bits (immediate slot 0 bit 430 → 433; sequencer opcode-HIGH 493 → 496) as one rigid block to absorb the 7→8-bit opcode widening; 6acc60406 shifts the TC immediate block −7 bits (bit 430 → 423) and the scalar/sequencer region down to clear room for the dedicated dual-predicate slot at bits 496..505 and the wider scalar operand at bit 472.
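
The deltas can be recomputed from the table rather than remembered. The sketch below subtracts the Viperfish position from each generation's position for the immediate and sequencer fields; the per-field gfc sequencer delta it prints is plain arithmetic on the table values, not a figure stated elsewhere on this page.

```python
# TC absolute bit positions per generation, taken from the table above
TC_BITS = {
    "imm0":        {"vxc": 430, "glc": 433, "gfc": 423},
    "opcode_high": {"vxc": 493, "glc": 496, "gfc": 483},
    "opcode_low":  {"vxc": 488, "glc": 491, "gfc": 478},
    "x_target":    {"vxc": 482, "glc": 485, "gfc": 472},
    "call_dest":   {"vxc": 477, "glc": 480, "gfc": 467},
}


def delta_vs_viperfish(gen: str) -> dict:
    """Per-field bit shift of one generation relative to vxc."""
    return {name: bits[gen] - bits["vxc"] for name, bits in TC_BITS.items()}


glc_delta = delta_vs_viperfish("glc")
gfc_delta = delta_vs_viperfish("gfc")

for name in TC_BITS:
    print(f"{name:12s} glc {glc_delta[name]:+d}  gfc {gfc_delta[name]:+d}")
```

On these numbers Ghostlite is a uniform +3, while on 6acc60406 the immediate block moves by 7 bits and the four sequencer fields listed move together by 10, so the gfc layout is not a single rigid translation of Viperfish.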


Delay-Slot and Loop Notes

Two fields a stock LLVM-MC mental model expects to be encoded are not in-bundle bit fields on V5+:

  • Delay-slot count. No V5+ branch/call helper (Ghostlite, Viperfish, or 6acc60406) emits a delay_slots BitCopy. The branch/call helpers write only {opcode-HIGH, opcode-LOW, offset (imm 0), dest}. The delay-slot count is a bundle-packer pad count (empty bundles appended after the branch), gated by the LLVM-MC verifier bound (delay_slots <= 5) on the packer, not an encoded slot field.
  • Hardware-loop length. V5+ has no hardware-loop setup slot bit field. A loop is the LCC hardware counter read (ReadRegisterLccLow/High, the sequencer-slot opcode at bit 181 + dst at bit 176) feeding a conditional BranchRelative. The "loop counter" is a register, not an encoded loop-length field.

GOTCHA — a reimplementation that allocates a 3-bit delay-slot field inside the bundle (as a BarnaCore-style v4 layout would have) will desynchronize every subsequent field. On V5+, after the branch's dest field the next bits belong to the next slot, not to a delay count.
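
In a reimplementation, delay slots therefore belong in the bundle stream, not in the bundle. The sketch below is a hypothetical packer step: the function name is invented, and the contents of an "empty" bundle are not documented on this page, so the caller supplies them.

```python
MAX_DELAY_SLOTS = 5    # verifier bound: delay_slots <= 5


def pad_delay_slots(branch_bundle: bytes, delay_slots: int, empty_bundle: bytes) -> list:
    """Return the branch bundle followed by delay_slots empty bundles."""
    if not 0 <= delay_slots <= MAX_DELAY_SLOTS:
        raise ValueError("delay_slots must be in 0..5")
    if len(empty_bundle) != len(branch_bundle):
        raise ValueError("empty bundle must match the branch bundle size")
    return [branch_bundle] + [empty_bundle] * delay_slots


# Placeholder bytes only: an all-zero empty bundle is an assumption for the demo
stream = pad_delay_slots(b"\x01" * 32, 3, bytes(32))
```

The branch bundle itself is unchanged by the pad count, which is the point: no bits inside it are reserved for a delay field.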


Cross-References

  • IsaEmitter Registry — the isa_emitter::EmitX template family and per-generation codec registration that produce the protos this page serializes
  • Viperfish 64B Bundle — the vxc/vfc slot map; cites this page for the sequencer/immediate/predicate positions
  • Ghostlite Bundle — the glc slot map and the +3-bit TC shift
  • 6acc60406 Bundle — the gfc slot map, dedicated dual-predicate slot, and 2-bit per-slot selector
  • MC-Emitter — the MCInst stream that feeds Stage 1 EmitX
  • Record Format — the on-disk record framing around the encoded bundle bytes
  • Sequencer Slot — branch/call/LCC op semantics for the discriminator values mapped here
  • Predicate Slot — the predication model whose absolute offsets this page pins
  • MXU Slot / VPU Slot / EUP/Transcendental Slot — compute-slot semantics for the consolidated MXU/result/EUP positions
  • InstBits Master DB — the all-zero V5+ InstBits table that confirms the bits live in the emitter path