V5+ EmitX Absolute Bit Positions
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so, BuildID 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 B, not stripped, .text VA == file offset). Other builds will differ.
Abstract
Jellyfish (v2) and Pufferfish (v4) build a bundle with a single monolithic encoder — EncoderJf::EncodeBundleInternal (0x1e86c7c0) and EncoderPfTensorCore::EncodeBundleInternal (0x1e8c5c40) — that takes one Bundle object and packs every slot inline. Every V5+ generation in this build (Viperfish/vxc+vfc, Ghostlite/glc, 6acc60406/gfc) abandons that model entirely. There is no EncodeBundleInternal for any V5+ codec. Instead the bundle is produced by a two-stage chain: upstream, an xla::tpu::sparse_core::isa_emitter::EmitX proto-template populates a typed proto sub-message and sets a present bit; downstream, a per-slot <Slot>Encoder::Encode codec reads that proto and writes each field to a fixed absolute bit position in the flat bundle byte buffer via one universal bit-granular packer, BitCopy(dst, dst_bit, src, src_bit, nbits) (0x1fa0a900). This page consolidates those absolute bit positions for the sequencer, immediate, and predicate slots that the per-generation bundle pages defer to it.
The structural consequence is that the LLVM-MC InstBits table, which on a stock LLVM backend holds the fixed opcode bits of each instruction, is all zero on disk for every V5+ generation. There are no fixed instruction bits in a static table; the entire bundle layout is a flat list of (absolute-bit, width) pairs distributed across roughly two hundred per-op BitCopy helpers and orchestrated by a single EncodeBundle dispatcher. The set of per-slot BitCopy offsets is the generation's effective kIsaTable. A reimplementer who looks for a static InstBits array, or for EncoderVf::EncodeBundleInternal, will find neither.
The page is organized by the two encode stages, then by slot. The opening BitCopy and EmitImmediate sections establish the primitive and its proto-field convention; the immediate, sequencer, and predicate sections give the absolute (dst_bit, width) maps per generation; the MXU/result/EUP section consolidates the compute-slot positions; the closing per-(slot, gen) table is the single reference the Viperfish, Ghostlite, and 6acc60406 bundle pages cite for the deferred fields.
For reimplementation, the contract is:
- The two-stage chain: EmitX proto population (Stage 1) → <Slot>Encoder::Encode BitCopy (Stage 2). No monolithic EncodeBundleInternal.
- The BitCopy(dst, dst_bit, src, src_bit, nbits) calling convention and bit-exact, LSB-first semantics.
- The EmitImmediate proto-slot map (which of 6 immediate slots a value lands in, and its present bit).
- The absolute (dst_bit, width) for the sequencer (branch/call/LCC) discriminator, dest/x-target sreg, and predication fields, per generation, for the 32-byte SCS bundle and the 64-byte TC bundle.
- The per-generation absolute-bit deltas: Viperfish baseline → Ghostlite +3-bit TC shift → 6acc60406 (its own layout, dedicated dual-predicate slot, 2-bit per-slot selector).
| Item | Value |
|---|---|
| Universal packer | BitCopy(void* dst, int dst_bit, const void* src, int src_bit, int nbits) @ 0x1fa0a900 (mangled _Z7BitCopyPviPKvii) |
| Calling convention | dst=rdi, dst_bit=esi, src=rdx, src_bit=ecx, nbits=r8d (System V); bit-granular, LSB-first |
| Stage-1 emitter namespace | xla::tpu::sparse_core::isa_emitter::EmitX<…> (EmitImmediate, EmitBranchOp, EmitCallOp, EmitPredicationToSlot) |
| Stage-2 codec namespace | asic_sw::deepsea::{vxc,vxc::vfc,gxc::glc,gxc::gfc}::isa::<Slot>Encoder::Encode |
| Orchestrator | EncodeBundle 0x1e838cc0 (6acc60406 codec; TC worker 0x1d371540), EncoderGlTensorCore::EncodeBundle 0x1d331d00 (Ghostlite) |
| SCS bundle | 32 B / 256 bit; branch/call offset (imm slot 0) at bit 67, all V5+ gens |
| TC bundle | 64 B / 512 bit; branch/call offset (imm slot 0) at bit 430 (vxc) / 433 (glc) / 423 (gfc) |
| InstBits table | 0x3366d90 — all zero, no relocations, for every V5+ gen |
The Two-Stage Encode Chain
Purpose
Separate what an instruction means (a typed proto sub-message, generation-independent) from where its bits go (a per-generation BitCopy offset map). Stage 1 runs in the front-half emitter and produces a *Bundle proto with a populated sub-message per occupied slot. Stage 2 runs in the codec and serializes that proto into the wire bundle. The split is why the same LLO op encodes to three different bit layouts across Viperfish, Ghostlite, and 6acc60406 with no change to the front-end emitter logic — only the codec's offset literals differ.
Entry Point
LLO MCInst
└─ isa_emitter::EmitX<Slot, OpKind> ── Stage 1: populate proto sub-message
│ sets present bit, writes operand fields
▼
<gen> proto bundle (SparseCoreScsBundle / TensorCoreBundle / …)
│
EncodeBundle 0x1e838cc0 (dispatch on TpuSequencerType)
├─ case 0 → TensorCoreCodecBase<…> → TC worker 0x1d371540
├─ case 3 → SparseCoreScsCodecBase<…> → EncoderBase::EncodeBundle
└─ case 5 → SparseCoreTecCodecBase<…> → EncoderBase::EncodeBundle
│ walk per-slot Encoders in template-arg order
▼
<Slot>Encoder::Encode ── Stage 2: BitCopy fields → bundle bytes
└─ BitCopy(buf, dst_bit, &field, 0, width) 0x1fa0a900
Algorithm
Stage 1, immediate population, confirmed from the demangled isa_emitter::EmitImmediate<glc::isa::SparseCoreImmediates> body (0x139f7060):
function EmitImmediate(slot_index, value, immediates_msg): // 0x139f7060
if value >= 0x100000: // llvm::isUInt<20> RET_CHECK
return Error("isa_emitter_base.h:587") // (signed 20-bit offsets are pre-masked)
switch slot_index: // jump on slot 0..5
case 0: field = msg+0x18; present = 0x01; break // imm_0()
case 1: field = msg+0x1c; present = 0x02; break // imm_1()
case 2: field = msg+0x20; present = 0x04; break // imm_2()
case 3: field = msg+0x24; present = 0x08; break // imm_3()
case 4: field = msg+0x28; present = 0x10; break // imm_4()
case 5: field = msg+0x2c; present = 0x20; break // imm_5()
default: return Error("Invalid immediate: <n>")
if (msg+0x10 & present) && *field != value: // RET_CHECK: re-set must match
return Error("imm_N() == value")
*field = value
msg+0x10 |= present // set present bit at msg+0x10
return Ok
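The slot-to-field mapping above can be modeled in a few lines of Python. This is an illustrative sketch only: the Immediates class and the emit_immediate name are hypothetical stand-ins for the proto sub-message and the EmitImmediate template. Only the six field offsets, the present-bit values, the 20-bit bound, and the re-set check are taken from the pseudocode above.

```python
from dataclasses import dataclass, field

# Proto field offsets for imm_0 .. imm_5, as listed in the pseudocode above.
# They are kept here only to tie slot indices back to the message layout.
IMM_FIELD_OFFSETS = (0x18, 0x1C, 0x20, 0x24, 0x28, 0x2C)


@dataclass
class Immediates:
    """Hypothetical toy model of the immediates sub-message."""
    present: int = 0  # models the present-bit word at msg+0x10
    imm: list = field(default_factory=lambda: [0] * 6)


def emit_immediate(msg: Immediates, slot: int, value: int) -> None:
    """Store value in logical slot 0..5 and set that slot's present bit."""
    if not 0 <= value < (1 << 20):  # mirrors the isUInt<20> check
        raise ValueError("value does not fit in 20 unsigned bits")
    if not 0 <= slot < len(IMM_FIELD_OFFSETS):
        raise ValueError(f"Invalid immediate: {slot}")
    present_bit = 1 << slot  # 0x01, 0x02, 0x04, 0x08, 0x10, 0x20
    if msg.present & present_bit and msg.imm[slot] != value:
        raise ValueError(f"imm_{slot}() == value check failed")
    msg.imm[slot] = value
    msg.present |= present_bit
```

Note that nothing in this stage mentions a bundle bit position; the model only records which logical slot is populated.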
Stage 2, the per-slot encoder, reads each populated sub-message and emits a sequence of BitCopy calls. The encoder body is a switch on the proto opcode discriminator — at proto+0x50 on the Viperfish/Ghostlite scalar encoders and on the gfc TC sequencer, at proto+0x58 on the gfc SCS sequencer and gfc predicates encoder (the gfc proto grows a wider header); each arm is one op family's field layout. The generic shape (from the immediates encoders) is:
function <Slot>Encoder::Encode(this, proto, out_buf):
scratch[0] = proto.imm_0 // stage value into 8-byte scratch
BitCopy(out_buf, ABS_BIT_0, scratch, 0, WIDTH_0) // place WIDTH_0 bits at ABS_BIT_0
scratch[0] = proto.imm_1
BitCopy(out_buf, ABS_BIT_1, scratch, 0, WIDTH_1)
... // one BitCopy per field
return Ok
NOTE —
EmitImmediate does not know the absolute bit. It only chooses one of six proto fields and sets a present bit. The absolute bit is decided entirely in Stage 2 by the encoder's BitCopy literal. This is why the immediate-slot bit positions differ between SCS (bit 67) and TC (bit 430 on vxc) for the same logical immediate slot 0: same proto field, different encoder.
BitCopy — The Universal Packer
Purpose
One function packs every field of every V5+ bundle. There is no per-type serializer; each <Slot>Encoder::Encode is a flat list of BitCopy calls. Recovering the layout of any slot reduces to reading the (dst_bit, nbits) immediate pair preceding each BitCopy call in the named encoder.
Calling Convention and Semantics
BitCopy is _Z7BitCopyPviPKvii — BitCopy(void* dst, int dst_bit, const void* src, int src_bit, int nbits). Under System V the arguments land in rdi, esi, rdx, ecx, r8d. The decompiled body (0x1fa0a900) confirms bit-exact, LSB-first behavior:
function BitCopy(dst, dst_bit, src, src_bit, nbits): // 0x1fa0a900
if nbits == 0: return // early-out, then vzeroupper
dst_byte = dst_bit / 8 // = dst_bit >> 3
dst_off = dst_bit & 7 // bit-in-byte
src_byte = src_bit / 8
src_off = src_bit & 7
// copy nbits from src starting at (src_byte, src_off) into dst at (dst_byte, dst_off),
// LSB-first, preserving the surrounding bits of any straddled dst byte
// (vectorized inner loop when the run is >= 24 bits; scalar head/tail otherwise)
The body computes dst_bit / 8 and dst_bit & 7 exactly as written, masks the leading/trailing partial bytes so neighboring fields are untouched, and uses an AVX inner loop for runs of 24 bits or more. Bit 0 is the LSB of byte 0 of the bundle buffer.
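To make the LSB-first semantics concrete, here is a minimal bit-at-a-time Python model. It is a sketch of the behavior described above, not a transcription of the binary: it omits the vectorized fast path, and the names bit_copy and read_bits are illustrative.

```python
def bit_copy(dst: bytearray, dst_bit: int, src: bytes, src_bit: int, nbits: int) -> None:
    """Copy nbits from src (starting at src_bit) into dst (starting at dst_bit), LSB-first."""
    for i in range(nbits):  # nbits == 0 falls straight through, like the early-out
        s_byte, s_off = divmod(src_bit + i, 8)
        d_byte, d_off = divmod(dst_bit + i, 8)
        bit = (src[s_byte] >> s_off) & 1
        # Clear the destination bit, then OR in the source bit; neighboring bits are untouched.
        dst[d_byte] = (dst[d_byte] & ~(1 << d_off) & 0xFF) | (bit << d_off)


def read_bits(buf: bytes, bit: int, nbits: int) -> int:
    """Read nbits starting at absolute bit position `bit` (bit 0 = LSB of byte 0)."""
    return (int.from_bytes(buf, "little") >> bit) & ((1 << nbits) - 1)


# Place a 20-bit value at absolute bit 67 of a 32-byte buffer pre-filled with ones.
bundle = bytearray(b"\xff" * 32)
scratch = (0x12345).to_bytes(8, "little")  # 8-byte scratch, as in the encoder shape
bit_copy(bundle, 67, scratch, 0, 20)
```

Pre-filling the buffer with ones makes it easy to check that the bits on either side of the 20-bit window (bits 66 and 87) survive the write.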
QUIRK — because every field is an independent
BitCopy into a shared buffer, field order does not matter for correctness as long as windows do not overlap. The encoder writes the predication header first, then dispatches per-op, but a reimplementation may emit the calls in any order. Overlap is possible by design: on 6acc60406 SCS the 3-bit predicate selector and the 4-bit dual-predicate index both start at bit 187 (see Predicate Slot).
Every concrete mov esi,<dst_bit>; mov r8d,<width>; call BitCopy therefore reads directly as "place width bits at absolute bit dst_bit". All positions below were recovered from those immediates in the named encoders.
Immediate Slot Map
Purpose
The branch/call/sync target offset and all other immediates live in the immediate slots, not in any opcode field. A branch's 20-bit signed target offset is immediate slot 0; the opcode discriminator in the sequencer slot only distinguishes absolute/relative/call. These are the positions that the per-generation bundle pages defer to this page.
SCS Bundle (32 B / 256 bit)
SparseCoreImmediatesEncoder::Encode reads proto fields a2[6]..a2[11] (proto +0x18..+0x2c, i.e. imm_0..imm_5) and BitCopys each at width 20. Confirmed byte-identical across all three generations:
| imm slot | proto field | dst_bit (hex) | dst_bit (dec) | width |
|---|---|---|---|---|
| 0 (branch/call offset) | +0x18 | 0x43 | 67 | 20 |
| 1 | +0x1c | 0x2f | 47 | 20 |
| 2 | +0x20 | 0x1b | 27 | 20 |
| 3 | +0x24 | 0x07 | 7 | 20 |
| 4 | +0x28 | 0xd7 | 215 | 20 |
| 5 | +0x2c | 0xc3 | 195 | 20 |
Encoder addresses: vfc 0x1ee75ee0, glc 0x1eb563c0, gfc 0x1ecd1760. Slots 4 and 5 (bits 215/195) appear in the full SparseCoreImmediatesEncoder; the SCS branch/call path uses only slots 0..3. A second class, SparseCoreScalarImmediatesEncoder (gfc 0x1eb5bd20), packs only slots 0..3 at the same bits 67/47/27/7. The gfc SCS codec template (confirmed in the EncodeBundle 0x1e838cc0 case 3 SparseCoreScsCodecBase<…> argument list) selects SparseCoreScalarImmediatesEncoder, so 0x1eb5bd20 is the function actually invoked on the gfc SCS branch path, while the full SparseCoreImmediatesEncoder::Encode (slots 0..5, bits 215/195) lives at 0x1ecd1760. The branch/call offset (imm slot 0 = bit 67) is identical between the two encoders.
NOTE —
EncodeBundle 0x1e838cc0 is the 6acc60406 codec dispatcher specifically: its default arm reports "EncodeBundle not implemented for sequencer type", and its three live cases construct gfc::isa::{TensorCore,SparseCoreScs,SparseCoreTec}CodecBase<…>. Ghostlite has its own EncoderGlTensorCore::EncodeBundle (0x1d331d00). The slot-walk mechanism is shared across V5+, but the dispatcher symbol is generation-specific.
TensorCore Bundle (64 B / 512 bit)
TensorCoreImmediatesEncoder::Encode reads imm_0..imm_5 (proto +0x18..+0x2c) and BitCopys each at width 20. The per-generation bit positions, confirmed verbatim from the encoder BitCopy literals:
| imm slot | proto field | vxc (VF) | glc (GL) | gfc (GF) | width |
|---|---|---|---|---|---|
| 0 (branch/call offset) | +0x18 | 430 (0x1ae) | 433 (0x1b1) | 423 (0x1a7) | 20 |
| 1 | +0x1c | 410 | 413 | 403 | 20 |
| 2 | +0x20 | 390 | 393 | 383 | 20 |
| 3 | +0x24 | 370 | 373 | 363 | 20 |
| 4 | +0x28 | 350 | 353 | 343 | 20 |
| 5 | +0x2c | 330 | 333 | 323 | 20 |
Encoder addresses: vxc 0x1eebee40, glc 0x1f20d520, gfc 0x1f86de20.
The per-generation delta is the central fact: Ghostlite = Viperfish + 3 bits (the TC scalar/sequencer/immediate region shifts +3 to make room for the 7→8-bit opcode widening), and 6acc60406 = Viperfish − 7 bits for the immediate block (the immediate region moves down to make room for the wider scalar/predicate region above it, which on gfc includes the dedicated dual-predicate slot at bits 496..505).
GOTCHA — the immediate slot index in
EmitImmediate is logical, not a fixed bit. Slot 0 is bit 67 on SCS but bit 430/433/423 on TC. A reimplementation that hardcodes one bit for "immediate slot 0" will corrupt every cross-engine branch. Resolve the bit from the (engine, generation) pair, not the slot index alone.
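One way to follow that advice is to key the lookup on (engine, generation, slot). The sketch below is a hypothetical helper, not code from the binary; the numbers are copied from the two tables above. The TC rule "slot k sits 20 bits below slot k-1" is an observation read off the table rather than something the encoder computes.

```python
IMM_WIDTH = 20

# SCS positions are the same for vfc, glc and gfc (slots 0..5).
SCS_IMM_BITS = (67, 47, 27, 7, 215, 195)

# TC immediate slot 0 per generation; slots 1..5 step down by IMM_WIDTH in the table.
TC_IMM0_BIT = {"vxc": 430, "glc": 433, "gfc": 423}


def imm_dst_bit(engine: str, gen: str, slot: int) -> int:
    """Return the absolute dst_bit for a logical immediate slot (hypothetical helper)."""
    if not 0 <= slot <= 5:
        raise ValueError(f"Invalid immediate: {slot}")
    if engine == "scs":
        return SCS_IMM_BITS[slot]
    if engine == "tc":
        return TC_IMM0_BIT[gen] - IMM_WIDTH * slot
    raise ValueError(f"unknown engine: {engine}")


# The per-generation deltas fall out of the table directly.
glc_delta = imm_dst_bit("tc", "glc", 0) - imm_dst_bit("tc", "vxc", 0)   # +3
gfc_delta = imm_dst_bit("tc", "gfc", 0) - imm_dst_bit("tc", "vxc", 0)   # -7
```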
Sequencer Slot Map
Purpose
The sequencer slot (ScalarAlu0) carries the branch/call discriminator, the call return-address (dest) sreg, the branch-by-register x-target sreg, and a predication header. The branch/call target offset is not here; it is in immediate slot 0. The sequencer-slot positions are what the per-generation bundle pages defer to this page; the TC sequencer in particular could not be bit-extracted from InstBits because InstBits is empty.
Discriminator Model
Every ScalarAlu0 op writes an opcode-HIGH field (width 6, family) and an opcode-LOW field (width 5, addressing discriminator). For all branch/call control ops opcode-HIGH = 0; the scalar-ALU compute ops carry their opcode in opcode-HIGH instead (e.g. CompareIntegerEq = 0x1e). The opcode-LOW discriminator values are uniform across SCS and TC, all V5+ generations:
| opcode-LOW | Op | Extra fields |
|---|---|---|
| 4 | BranchAbsolute | offset → imm slot 0 |
| 5 | BranchRelative | offset → imm slot 0 |
| 6 | CallAbsolute | offset → imm slot 0; dest (link) sreg → dest field |
| 7 | CallRelative | offset → imm slot 0; dest (link) sreg → dest field |
| 4 (family field used) | BranchSreg | x-target sreg → x-target field |
| 0x18 (24) | BranchRelativeRotatingPreg | rotating-preg index → dedicated field (gfc SCS only) |
The branch offset range is signed 20-bit (−0x80000..+0x7FFFF) for absolute, relative, and call alike; the abs/rel distinction is purely the discriminator value. A return is not a dedicated op — it is a BranchSreg reading the link sreg.
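The range check and the discriminator values can be sketched as follows. The two's-complement masking is an assumption consistent with the "pre-masked" remark in the EmitImmediate pseudocode; the helper names are hypothetical.

```python
# opcode-LOW discriminator values from the table above.
OPCODE_LOW = {
    "BranchAbsolute": 4,
    "BranchRelative": 5,
    "CallAbsolute": 6,
    "CallRelative": 7,
}
OFFSET_BITS = 20


def encode_offset(offset: int) -> int:
    """Range-check a signed offset and mask it to the unsigned 20-bit immediate (assumed two's complement)."""
    lo = -(1 << (OFFSET_BITS - 1))       # -0x80000
    hi = (1 << (OFFSET_BITS - 1)) - 1    # +0x7FFFF
    if not lo <= offset <= hi:
        raise ValueError(f"offset {offset} outside signed {OFFSET_BITS}-bit range")
    return offset & ((1 << OFFSET_BITS) - 1)


def decode_offset(raw: int) -> int:
    """Sign-extend a 20-bit immediate back to a Python int."""
    sign = 1 << (OFFSET_BITS - 1)
    return (raw ^ sign) - sign
```

The masked value is what would pass the unsigned 20-bit check in Stage 1; the discriminator alone tells the decoder whether to treat the target as absolute, relative, or a call.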
SCS Sequencer (SparseCoreScalarAlu0Encoder::Encode)
The encoder writes a common predication header, then dispatches jmp *jt[proto+0x50] (bound 0x56 = 86 entries) to a per-op helper. Confirmed from glc 0x1e9d2140 and the BranchAbsolute helper 0x1e9d67c0; Viperfish (vfc encoder 0x1ee82ce0, BranchAbsolute helper 0x1ee873c0) is byte-identical.
| field | dst_bit | hex | width | Source / written by |
|---|---|---|---|---|
| predication reg index | 187 | 0xbb | 4 | proto +0x20, main encoder |
| predication inversion | 191 | 0xbf | 1 | proto +0x18 (byte), main encoder |
| opcode-HIGH / family | 181 | 0xb5 | 6 | per-op helper (=0 for branch/call) |
| opcode-LOW / discriminator | 176 | 0xb0 | 5 | per-op helper (4/5/6/7/0x18) |
| x-target / 2nd operand | 170 (gfc) / 176 (vfc/glc) | 0xaa / 0xb0 | 6 / 5 | BranchSreg/aux (gfc 0x1eb6dd40, vfc 0x1ee87480) |
| call dest (return-addr) sreg | 165 | 0xa5 | 5 | CallAbsolute/CallRelative |
| rotating-preg index (gfc) | 165 | 0xa5 | 4 | BranchRelativeRotatingPreg |
On 6acc60406 (gfc encoder 0x1eb693c0) the SCS predication narrows: a 3-bit selector at bit 187 (0xbb) + inversion at bit 190 (0xbe), overlaid with a 4-bit dual-predicate index also at bit 187 + inversion at bit 191. The gfc BranchRelativeRotatingPreg helper (0x1eb6b9c0) writes discriminator 24 at bit 176 (w5), opcode-HIGH 0 at bit 181 (w6), rotating-preg index at bit 165 (w4), and a 6-bit aux at bit 170.
TC Sequencer (TensorCoreScalarAlu0Encoder::Encode)
This is the slot whose layout InstBits could not supply. Confirmed from vxc 0x1eecb900 (and helper 0x1eecf960), glc 0x1f219b40 (and helper 0x1f21da40), and gfc 0x1f87b420.
| field | vxc (VF) | glc (GL) | gfc (GF) |
|---|---|---|---|
| predication reg index | bit 499 w4 | bit 502 w4 | — (2-bit selector) |
| predication inversion | bit 503 w1 | bit 506 w1 | — |
| predication 2-bit selector | — | — | bit 489 w2 |
| opcode-HIGH / family | bit 493 w6 | bit 496 w6 | bit 483 w6 |
| opcode-LOW / discriminator | bit 488 w5 | bit 491 w5 | bit 478 w5 |
| x-target / 2nd operand | bit 482 w6 | bit 485 w6 | bit 472 w6 |
| call dest (return-addr) sreg | bit 477 w5 | bit 480 w5 | bit 467 w5 |
Ghostlite shifts the entire TC scalar/sequencer region +3 bits above Viperfish, in lockstep with the +3-bit TC immediate-block shift (433 vs 430) — the whole TC scalar/sequencer/immediate block translates as one rigid window to absorb the 7→8-bit opcode widening (glc::isa::TensorCoreScalarAlu0Encoder::Encode 0x1f219b40 and its BranchAbsolute helper 0x1f21da40: predicate reg @ 502 not 499, inversion @ 506 not 503, opcode-HIGH @ 496 not 493, opcode-LOW @ 491 not 488, x-target aux @ 485 not 482, call dest @ 480 not 477). The SCS sequencer is not shifted — there glc is byte-identical to vxc (0x1e9d2140); see Ghostlite Bundle. 6acc60406's TC scalar slot is the widest: it adds the 6-bit operand at bit 472 and shrinks per-slot predication to a 2-bit selector at bit 489, with the actual 16-register predicate pool moved to the dedicated TensorCorePredicates slot.
NOTE — the TC
vxc x-target spans two distinct fields. Bit 488 (w5) is the opcode-LOW discriminator (shared with the BranchSreg x-target value); the secondary-operand / LCC-read aux field is a separate 6-bit field at bit 482 (0x1e2), per the vxc encoder body (0x1eecb900). A BranchSreg overwrites the discriminator window with the x() sreg, but an LCC read uses both the 5-bit discriminator at 488 and the 6-bit aux at 482.
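As a worked example of the TC table, the sketch below packs a CallRelative into a zeroed 64-byte buffer for each generation. It is a simplified illustration: the predication header and every other slot are omitted, the put/get helpers are hypothetical stand-ins for BitCopy, and the offset is assumed to be masked two's complement.

```python
# (opcode_high, opcode_low, call_dest, imm0) absolute bits from the tables above.
TC_SEQ_BITS = {
    "vxc": (493, 488, 477, 430),
    "glc": (496, 491, 480, 433),
    "gfc": (483, 478, 467, 423),
}
CALL_RELATIVE = 7  # opcode-LOW discriminator


def put(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of value at absolute bit `bit`, LSB-first."""
    for i in range(width):
        byte, off = divmod(bit + i, 8)
        buf[byte] = (buf[byte] & ~(1 << off) & 0xFF) | (((value >> i) & 1) << off)


def get(buf: bytes, bit: int, width: int) -> int:
    """Read `width` bits starting at absolute bit `bit`."""
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


def encode_call_relative(gen: str, link_sreg: int, offset: int) -> bytearray:
    """Pack only the four CallRelative fields; all other bundle bits stay zero."""
    hi, lo, dest, imm0 = TC_SEQ_BITS[gen]
    buf = bytearray(64)                    # 64 B / 512-bit TC bundle
    put(buf, hi, 6, 0)                     # opcode-HIGH = 0 for branch/call
    put(buf, lo, 5, CALL_RELATIVE)
    put(buf, dest, 5, link_sreg)           # return-address sreg
    put(buf, imm0, 20, offset & 0xFFFFF)   # target offset lives in imm slot 0
    return buf
```

The same four logical fields land at three different sets of absolute bits, which is the point of keeping the offsets in a per-generation table rather than in the encoding logic.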
Predicate Slot Map
Purpose
Pin the exact bit offset of the predicate field within each V5+ bundle slot. This is the position prior predicate-field analysis left open (it was "governed by InstBits", which is empty). The model differs between the Viperfish/Ghostlite per-slot scheme and the 6acc60406 dedicated dual-predicate slot.
Viperfish / Ghostlite — Per-Slot 4+1 Field
Every populated functional slot carries its own predicate: a 4-bit register index plus a 1-bit inversion (the 2-bit extension of the encodePredicateOperand layout is the high end of the field, 0 in non-rotating code). At the top of the scalar slot:
| Slot | reg index | inversion | Encoder |
|---|---|---|---|
| TC ScalarAlu0 (vxc) | bit 499 w4 | bit 503 w1 | 0x1eecb900 |
| TC ScalarAlu0 (glc) | bit 502 w4 | bit 506 w1 | 0x1f219b40 |
| SCS ScalarAlu0 (glc/vfc) | bit 187 w4 | bit 191 w1 | 0x1e9d2140 / 0x1ee82ce0 |
6acc60406 — Dedicated Dual-Predicate Slot
TensorCorePredicatesEncoder::Encode (gfc 0x1f86e500) writes two per-bundle predicates into the very top of the 64-byte TC bundle; each functional slot then carries only a 2-bit selector choosing among {pred_0, pred_1, always, never}:
| field | dst_bit | hex | width | proto src |
|---|---|---|---|---|
| pred_0 reg | 501 | 0x1f5 | 4 | msg word |
| pred_0 inversion | 505 | 0x1f9 | 1 | msg +0x20 (byte) |
| pred_1 reg | 496 | 0x1f0 | 4 | msg word |
| pred_1 inversion | 500 | 0x1f4 | 1 | msg +0x21 (byte) |
The 16-register predicate pool (the PredicationSlot enum 0..15) is encoded into pred_0/pred_1; the per-slot 2-bit selector then chooses between those two entries or the always/never constants. The "predication overflow — both predicate slots already taken" condition is exactly the state where these two (4-bit reg, 1-bit inversion) entries at bits 496..505 are full. The selector field offsets are confirmed, but the exact value→{pred_0,pred_1,always,never} mapping of the 2-bit selector was not decoded (the selector's own jump table was not walked), so the value semantics are LOW confidence.
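The four dual-predicate fields tile bits 496..505 exactly, which the following sketch checks by packing both entries into a zeroed 64-byte buffer. The helper names are hypothetical and the per-slot 2-bit selector is deliberately left out, since its value mapping was not decoded.

```python
# (reg_bit, inversion_bit) for each per-bundle predicate, from the table above.
PRED_FIELDS = {
    "pred_0": (501, 505),
    "pred_1": (496, 500),
}


def put(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of value at absolute bit `bit`, LSB-first."""
    for i in range(width):
        byte, off = divmod(bit + i, 8)
        buf[byte] = (buf[byte] & ~(1 << off) & 0xFF) | (((value >> i) & 1) << off)


def get(buf: bytes, bit: int, width: int) -> int:
    """Read `width` bits starting at absolute bit `bit`."""
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


def encode_dual_predicates(pred_0: tuple, pred_1: tuple) -> bytearray:
    """Pack two (reg, inverted) pairs; reg is a 0..15 predicate register index."""
    buf = bytearray(64)
    for name, (reg, inverted) in (("pred_0", pred_0), ("pred_1", pred_1)):
        if not 0 <= reg <= 15:
            raise ValueError(f"{name}: register index {reg} out of range")
        reg_bit, inv_bit = PRED_FIELDS[name]
        put(buf, reg_bit, 4, reg)
        put(buf, inv_bit, 1, int(inverted))
    return buf


# Setting every field to its maximum should light up bits 496..505 and nothing else.
full = encode_dual_predicates((15, True), (15, True))
```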
Compute Slots — MXU, Result, EUP
Purpose
The MXU matmul/push/latch (VectorExtended), the matres/EUP pop (VectorResult), and the transcendental push (VALU slot 3) come from the same <Slot>Encoder::Encode + BitCopy mechanism. They are documented in full on MXU Slot, VPU Slot, and EUP/Transcendental Slot; the absolute bit positions are consolidated here.
MXU VectorExtended Slot
Two MXU slots (VectorExtended0, VectorExtended1) — one per physical matrix unit. The two slots share the source-vreg (systolic-feed) region but have opcode/control regions offset by a 25-bit slot stride on 6acc60406. The opcode field widens 7→8 bits across generations, mirroring the VALU slot. Confirmed from gfc MatrixMultiplyBf16 0x1f99a920:
| field | vxc | glc | gfc VEx0 | gfc VEx1 |
|---|---|---|---|---|
| MXU-id (unit) [proto +0x1c] | bit 64 w4 | bit 66 w4 | bit 70 w2 | bit 45 w2 |
| opcode-HIGH | bit 57 w7 | bit 58 w8 | bit 62 w8 | bit 37 w8 |
| data-format sub-disc | bit 51 w4 | bit 52 w4 | bit 57 w4 | bit 32 w4 |
| done-gains/latch flag | bit 55 w2 | bit 56 w1 | bit 61 w1 | bit 36 w1 |
| control (3-bit) | bit 48 w3 | bit 49 w3 | bit 54 w3 | bit 29 w3 |
| primary operand | bit 180 w6 | bit 183 w6 | bit 47 w7 | bit 22 w7 |
| src vregs (8 × 6-bit, gfc) | — | — | 156/276/287/243/254/210/221/177 | (same) |
| opcode bound (#ops) | 0x66 (103) | 0x70 (113) | 0x54 (85) | 0x54 (85) |
Encoder addresses: vxc 0x1efa0f60, glc 0x1f32fd00, gfc VEx0 0x1f996940 / VEx1 0x1f9d3800. The weight latch is LoadMatrixRegister{Gmr,Lmr}{Msra,Msrb} (opcode-HIGH 0x37 on gfc, 0x1f9a04a0); the moving-operand push is PushMatrix{fmt} (opcode-HIGH 0xe). The 8 source-vreg fields being byte-identical between VEx0 and VEx1 is the encoding statement that both MXUs draw the same vector read ports.
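The 25-bit stride between the two gfc MXU slots can be checked mechanically against the table. The sketch below is a hypothetical helper that derives the VEx1 opcode/control positions from the VEx0 ones; the shared source-vreg fields are excluded because they do not move.

```python
# gfc VEx0 opcode/control fields as (dst_bit, width), from the table above.
GFC_VEX0 = {
    "mxu_id": (70, 2),
    "opcode_high": (62, 8),
    "latch_flag": (61, 1),
    "data_format": (57, 4),
    "control": (54, 3),
    "primary_operand": (47, 7),
}
VEX_STRIDE = 25  # VEx1 sits 25 bits below VEx0 on gfc


def vex_field(slot: int, name: str) -> tuple:
    """Return (dst_bit, width) of an opcode/control field for VEx slot 0 or 1."""
    if slot not in (0, 1):
        raise ValueError(f"no such VectorExtended slot: {slot}")
    bit, width = GFC_VEX0[name]
    return bit - VEX_STRIDE * slot, width


vex1_layout = {name: vex_field(1, name) for name in GFC_VEX0}
```

A table-driven layout like this keeps the two slot encoders in one place; whether the binary's helpers share code in the same way is not established here.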
VectorResult Slot (matres pop / EUP pop)
VectorResult0Encoder::Encode reads a result-type discriminator (proto +0x1c), dispatches jmp *jt[proto+0x50], sets the per-result-type sub-message present tag, then a common tail BitCopys the dest vreg. Confirmed from gfc 0x1fa01820:
| field | vxc | glc | gfc |
|---|---|---|---|
| result-type discriminator | bit 24 w4 | bit 24 w4 | bit 20 w2 |
| dest vreg | bit 14 w6 | bit 14 w6 | bit 11 w6 |
| PopMxu accum-mode/format | (per +0x1c) | (per +0x1c) | bit 323 w8 |
| result-opcode bound | 0x8 (9) | 0x8 (9) | 0x7 (8) |
The matres-pop opcode is 6 (PopMxuResult) and the EUP-pop opcode is 7 (PopEupResult) on all generations. Ghostlite adds PopAddMxu01Result (fused matres+accumulate, K>128 multi-pass); Viperfish adds PopCcrfResult (scalar/CRF pop). The result slot's own predication field accessor is TensorCoreVectorResult1PredicationField::GetConcatenatedValue (gfc 0x1fa02520); its exact bit was not individually walked (the adjacent result-mode/format fields at bits 17..21 are decoded).
EUP / Transcendental Push — VALU Slot 3
On all V5+ generations the transcendental push is a VALU slot-3 (Alu3) op, not a VectorExtended op — confirmed by the EUP helpers existing only in the Alu3 set. The VALU opcode field selects the EUP-push family (value 0x0); a 5-bit EUP-function selector picks the transcendental. Confirmed from gfc F32Tanh 0x1f96ae40:
| field | vxc | glc | gfc |
|---|---|---|---|
| VALU opcode (EUP-push family = 0x0) | bit 197 w7 | bit 194 w8 | bit 194 w8 |
| EUP-function selector | bit 186 w5 | bit 183 w5 | bit 183 w5 |
| src vreg | (slot-3) | bit 188 w6 | bit 188 w6 |
The gfc 5-bit function selector value map (VALU op = 0x0, selector @ bit 183), confirmed per-helper:
| function | F32 selector | Bf16 selector |
|---|---|---|
| Erf | 0x0e | 0x0f |
| ReciprocalSqrt | 0x10 | 0x0c |
| PowTwo (2^x) | 0x11 | 0x19 |
| LogTwo (log2) | 0x12 | 0x1a |
| Tanh | 0x13 | 0x1b |
| ShiftedSigmoid | 0x14 | 0x1c |
| Reciprocal | 0x15 | 0x1d |
| Sinq (sin) | 0x17 | 0x1e |
| Cosq (cos) | 0x18 | 0x1f |
The push-pop protocol is bit-exact: PUSH in bundle N is the VALU slot-3 op above; POP in bundle N+k is a VectorResult op with result-opcode 7 (PopEupResult), dest vreg at bit 11 (gfc). The XLU is single-issue, so only VALU slot 3 sources the EUP push.
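The PUSH half of that protocol can be sketched for gfc as three field writes. This is an illustration under stated assumptions: only the three fields from the tables above are written, the slot's predication and any other operands are omitted, and the helper names are hypothetical.

```python
# gfc VALU slot-3 fields as (dst_bit, width), from the table above.
GFC_VALU_OPCODE = (194, 8)
GFC_SRC_VREG = (188, 6)
GFC_EUP_SELECTOR = (183, 5)
EUP_PUSH_FAMILY = 0x0

# function -> (F32 selector, Bf16 selector), from the selector value map above.
EUP_SELECTORS = {
    "Erf": (0x0E, 0x0F),
    "ReciprocalSqrt": (0x10, 0x0C),
    "PowTwo": (0x11, 0x19),
    "LogTwo": (0x12, 0x1A),
    "Tanh": (0x13, 0x1B),
    "ShiftedSigmoid": (0x14, 0x1C),
    "Reciprocal": (0x15, 0x1D),
    "Sinq": (0x17, 0x1E),
    "Cosq": (0x18, 0x1F),
}


def put(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of value at absolute bit `bit`, LSB-first."""
    for i in range(width):
        byte, off = divmod(bit + i, 8)
        buf[byte] = (buf[byte] & ~(1 << off) & 0xFF) | (((value >> i) & 1) << off)


def get(buf: bytes, bit: int, width: int) -> int:
    """Read `width` bits starting at absolute bit `bit`."""
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


def encode_eup_push(function: str, dtype: str, src_vreg: int) -> bytearray:
    """Pack a gfc EUP push; dtype is 'F32' or 'Bf16'."""
    f32_sel, bf16_sel = EUP_SELECTORS[function]
    selector = {"F32": f32_sel, "Bf16": bf16_sel}[dtype]
    buf = bytearray(64)
    put(buf, *GFC_VALU_OPCODE, EUP_PUSH_FAMILY)
    put(buf, *GFC_EUP_SELECTOR, selector)
    put(buf, *GFC_SRC_VREG, src_vreg)
    return buf


tanh_push = encode_eup_push("Tanh", "F32", 9)
```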
Per-(Slot, Generation) Absolute-Bit-Position Table
The consolidated reference cited by the per-generation bundle pages. All positions are bit <n> w<width>; bit 0 = LSB of byte 0.
SCS Bundle (32 B / 256 bit)
| slot / field | vfc (VF) | glc (GL) | gfc (GF) |
|---|---|---|---|
| branch/call offset (imm 0) | 67 w20 | 67 w20 | 67 w20 |
| imm slot 1 | 47 w20 | 47 w20 | 47 w20 |
| imm slot 2 | 27 w20 | 27 w20 | 27 w20 |
| imm slot 3 | 7 w20 | 7 w20 | 7 w20 |
| seq opcode-HIGH | 181 w6 | 181 w6 | 181 w6 |
| seq opcode-LOW (discriminator) | 176 w5 | 176 w5 | 176 w5 |
| seq x-target / 2nd operand | 176 w5 | 176 w5 | 170 w6 |
| seq call dest sreg | 165 w5 | 165 w5 | 165 w5 |
| seq predicate reg | 187 w4 | 187 w4 | 187 w3 (selector) / w4 (dual) |
| seq predicate inversion | 191 w1 | 191 w1 | 190 w1 (selector) / 191 w1 (dual) |
| rotating-preg index | — | — | 165 w4 |
TensorCore Bundle (64 B / 512 bit)
| slot / field | vxc (VF) | glc (GL) | gfc (GF) |
|---|---|---|---|
| branch/call offset (imm 0) | 430 w20 | 433 w20 | 423 w20 |
| imm slots 1..5 | 410/390/370/350/330 | 413/393/373/353/333 | 403/383/363/343/323 |
| seq opcode-HIGH | 493 w6 | 496 w6 | 483 w6 |
| seq opcode-LOW (discriminator) | 488 w5 | 491 w5 | 478 w5 |
| seq x-target / 2nd operand (aux) | 482 w6 | 485 w6 | 472 w6 |
| seq call dest sreg | 477 w5 | 480 w5 | 467 w5 |
| seq per-slot predicate | reg 499 w4 + inv 503 w1 | reg 502 w4 + inv 506 w1 | 2-bit selector @ 489 |
| dual predicate pred_0 | — | — | reg 501 w4 + inv 505 w1 |
| dual predicate pred_1 | — | — | reg 496 w4 + inv 500 w1 |
| MXU opcode-HIGH (VEx0) | 57 w7 | 58 w8 | 62 w8 |
| MXU data-format sub-disc | 51 w4 | 52 w4 | 57 w4 |
| MXU-id (unit) | 64 w4 | 66 w4 | 70 w2 |
| VectorResult discriminator | 24 w4 | 24 w4 | 20 w2 |
| VectorResult dest vreg | 14 w6 | 14 w6 | 11 w6 |
| EUP-push VALU opcode (Alu3) | 197 w7 | 194 w8 | 194 w8 |
| EUP-function selector | 186 w5 | 183 w5 | 183 w5 |
The generation deltas, in one line: Ghostlite shifts the entire TC scalar/sequencer/immediate window +3 bits (immediate slot 0 bit 430 → 433; sequencer opcode-HIGH 493 → 496) as one rigid block to absorb the 7→8-bit opcode widening; 6acc60406 shifts the TC immediate block −7 bits (bit 430 → 423) and the scalar/sequencer region down to clear room for the dedicated dual-predicate slot at bits 496..505 and the wider scalar operand at bit 472.
Delay-Slot and Loop Notes
Two fields a stock LLVM-MC mental model expects to be encoded are not in-bundle bit fields on V5+:
- Delay-slot count. No V5+ branch/call helper (Ghostlite, Viperfish, or 6acc60406) emits a delay_slots BitCopy. The branch/call helpers write only {opcode-HIGH, opcode-LOW, offset (imm 0), dest}. The delay-slot count is a bundle-packer pad count (empty bundles appended after the branch), gated by the LLVM-MC verifier bound (delay_slots <= 5) on the packer, not an encoded slot field.
- Hardware-loop length. V5+ has no hardware-loop setup slot bit field. A loop is the LCC hardware counter read (ReadRegisterLccLow/High, the sequencer-slot opcode at bit 181 + dst at bit 176) feeding a conditional BranchRelative. The "loop counter" is a register, not an encoded loop-length field.
GOTCHA — a reimplementation that allocates a 3-bit delay-slot field inside the bundle (as a
BarnaCore-style v4 layout would have) will desynchronize every subsequent field. On V5+, after the branch's dest field the next bits belong to the next slot, not to a delay count.
Cross-References
- IsaEmitter Registry — the isa_emitter::EmitX template family and per-generation codec registration that produce the protos this page serializes
- Viperfish 64B Bundle — the vxc/vfc slot map; cites this page for the sequencer/immediate/predicate positions
- Ghostlite Bundle — the glc slot map and the +3-bit TC shift
- 6acc60406 Bundle — the gfc slot map, dedicated dual-predicate slot, and 2-bit per-slot selector
- MC-Emitter — the MCInst stream that feeds Stage 1 EmitX
- Record Format — the on-disk record framing around the encoded bundle bytes
- Sequencer Slot — branch/call/LCC op semantics for the discriminator values mapped here
- Predicate Slot — the predication model whose absolute offsets this page pins
- MXU Slot / VPU Slot / EUP/Transcendental Slot — compute-slot semantics for the consolidated MXU/result/EUP positions
- InstBits Master DB — the all-zero V5+ InstBits table that confirms the bits live in the emitter path