BCS 32-Byte Bundle

All addresses, offsets, and bit positions on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id 89edbbe81c5b328a958fe628a9f2207d). The ELF is not stripped; full C++ symbols are present. .text VMA equals file offset (0xe63c000); all addresses are analysis VMAs. Other versions will differ.

Abstract

BarnaCore is the legacy TPU embedding accelerator (chip generations Jellyfish through Pufferfish). Its Pufferfish-generation (pxc/pfc) instruction word is a 32-byte / 256-bit VLIW bundle, encoded by the same bit-packing machinery the SparseCore codec uses — a single BitCopy(dst, dst_bitoff, src, src_bitoff, nbits) primitive (@0x1fa0a900) writing fields at absolute bundle-bit positions. There are two bundle personalities sharing the same 32-byte word width: the Sequencer bundle (BarnaCoreSequencerBundle), a 2-wide dual-scalar control/memory word, and the Channel bundle (BarnaCoreChannelBundle), a 6-slot vector-datapath word that performs one fully-pipelined embedding-row transform per cycle. Both are produced by pufferfish::isa::EncoderPfBarnaCore{Sequencer,Channel}, whose BundleSizeBytes() accessors both return 32 (@0x1d229220, @0x1d22bb00 — confirmed in decompile).

The reverse-engineering technique mirrors the SparseCore SCS/TEC work: the per-personality BarnaCore{Sequencer,Channel}CodecBase::Encode dispatcher holds the output buffer Span (pointer + length) in fixed registers and passes it unchanged to every per-slot Encoder::Encode, so each slot writes at bundle-relative — i.e. absolute — bit offsets. Each field's (dst_bitoff, nbits) pair comes from the mov esi,IMM / mov r8d,IMM immediates before its BitCopy call; the per-op 6-bit hardware opcode value comes from the mov QWORD PTR[rbp-0x18],N constant written into the opcode field. The codec template name is the structural ground truth: BarnaCoreChannelCodecBase<BarnaCoreChannelBundle, BarnacoreChannelPredication, …VectorExtendedResult{Decoder,Encoder}, …VectorStore…, …VectorLoad…, …VectorAlu0…, …VectorAlu1…, …Scalar…>::Encode (@0x1d22c560) enumerates the six slots in dispatch order.

This page documents four artifacts a reimplementer must reproduce: (1) the Sequencer bundle bit-layout (two 27-bit scalar slots plus four 16-bit immediates); (2) the Channel bundle bit-layout (six slots plus four immediates); (3) the BcsMetadataAccessor descriptor format, a 14-type flat SMEM-word table indexed by BcsProgramMetadataType; and (4) the MigrateInstruction ordinal map, the Alu0/Alu1 vector-lane-swap set that pins which 54 ops are lane-agnostic and which 7 are lane-locked. The per-op opcode rosters in proto-oneof order, the embedding-lowering datapath, and the SparseCore bundle this mirrors are owned by sibling pages (see Cross-References).

For reimplementation, the contract is:

  • The bit-packing model: every field is BitCopy(buf, absolute_bitoff, src, src_bitoff, nbits); the dispatcher reuses one Span across all slots, so bit offsets are bundle-absolute, not slot-relative.
  • The two 32-byte bundle layouts: Sequencer (bits 3..132 used) and Channel (bits 12..238 used), each with its slot/opcode/predication/immediate placement.
  • The 27-bit scalar-slot template shared by both Sequencer pipes (and bit-identical to SparseCore SCS): operand-y / operand-x / dest / OPCODE / predication.
  • The BcsMetadataAccessor: per-type SMEM-word base/count arrays + per-field offset enums; a field read is Sld(SmemWordAddress(base(type) + offset_enum)).
  • The MigrateInstruction map: 54 ops × 2 directions = 108 templates that serialize/deserialize an op between ALU lanes; 7 ops have no template (lane-locked).

| Item | Value |
|---|---|
| Bundle size | 32 bytes / 256 bits (both personalities) |
| Size accessors | EncoderPfBarnaCoreSequencer::BundleSizeBytes @0x1d229220 → 32; …Channel::BundleSizeBytes @0x1d22bb00 → 32 |
| Bit packer | BitCopy(void*, int dst_bitoff, void const*, int src_bitoff, int nbits) @0x1fa0a900 |
| Sequencer codec | BarnaCoreSequencerCodecBase<…>::Encode @0x1d229780 (2 scalar slots) |
| Channel codec | BarnaCoreChannelCodecBase<…>::Encode @0x1d22c560 (6 slots) |
| Scalar slot encoders | Scalar0 @0x1ee51ec0 (base bit 106); Scalar1 @0x1ee69000 (base bit 79) |
| Channel slot encoders | ChannelScalar @0x1e87e6a0, VectorAlu0 @0x1e8927c0, VectorAlu1 @0x1e8b4ec0, VectorStore @0x1e8c5640, VectorLoad @0x1e8c4900, VectorExtendedResult @0x1e8c3c60 |
| Metadata accessor | BcsMetadataAccessor ctor @0xf9d8b80; MetadataSmemWordBase @0xf9d8ba0; MetadataSmemWordCount @0xf9d8c40 |
| Migrate map | PufferfishBarnaCoreChannelEmitter::MigrateInstruction<Alu0_X, Alu1_X>: 108 instances (54 ops × 2) @0x140d17c0..0x140d6c00 |
| DMA buffer width | TpuTypedHostDmaBuffer<unsigned char, 32> (the Li32E template arg on the channel program encoder); 32-byte alignment |

1. The Bundle Encoding Model

Purpose

Both BarnaCore bundles are flat 256-bit words whose fields are written by a shared bit-copy primitive at absolute positions. Understanding the encoding model is the prerequisite for reading every offset table below: a reimplementer who treats the per-slot bit offsets as slot-relative will mis-place every field, because the dispatcher does not rebase the buffer per slot.

Entry Point

EncoderPfBarnaCoreChannel::EncodeBundle (0x1d22b780)
  └─ EncodeBundleInternal (0x1e87d580)
       └─ BarnaCoreChannelCodecBase<…>::Encode (0x1d22c560)   ── buf.ptr in %r14, buf.len in %rbx
            ├─ VectorExtendedResultEncoder::Encode (0x1e8c3c60)   ── same Span
            ├─ VectorStoreEncoder::Encode          (0x1e8c5640)   ── same Span
            ├─ VectorLoadEncoder::Encode           (0x1e8c4900)   ── same Span
            ├─ VectorAlu0Encoder::Encode           (0x1e8927c0)   ── same Span (called per oneof form)
            ├─ VectorAlu1Encoder::Encode           (0x1e8b4ec0)   ── same Span
            └─ ChannelScalarEncoder::Encode        (0x1e87e6a0)   ── same Span
                 └─ (each field) BitCopy (0x1fa0a900)

The Sequencer path is identical in shape: EncoderPfBarnaCoreSequencer::EncodeBundle (0x1d228ea0) → BarnaCoreSequencerCodecBase<…>::Encode (0x1d229780) → {Scalar0,Scalar1}Encoder::Encode.

Algorithm

// Models BarnaCoreChannelCodecBase<…>::Encode @0x1d22c560.
// a3 = buf.ptr, a4 = buf.len  (absl::Span<unsigned char>).
function ChannelCodec_Encode(bundle, buf_ptr, buf_len):       // %r14 = buf_ptr, %rbx = buf_len
    // The Span is NEVER rebased: every slot encoder receives (buf_ptr, buf_len)
    // unchanged, so each writes at bundle-absolute bit offsets.
    EncodeSlot(bundle.extended_result, buf_ptr, buf_len)       // 0x1e8c3c60
    EncodeSlot(bundle.store,           buf_ptr, buf_len)       // 0x1e8c5640
    EncodeSlot(bundle.load,            buf_ptr, buf_len)       // 0x1e8c4900
    EncodeSlot(bundle.alu0,            buf_ptr, buf_len)       // 0x1e8927c0
    EncodeSlot(bundle.alu1,            buf_ptr, buf_len)       // 0x1e8b4ec0
    EncodeSlot(bundle.scalar,          buf_ptr, buf_len)       // 0x1e87e6a0
    // tail: validation logging only — no separate predication / check-byte write;
    // predication is emitted inline by each slot encoder.

function EncodeField(buf, op):                                 // per-op helper
    BitCopy(buf, DST_BITOFF, &src, SRC_BITOFF, NBITS)          // 0x1fa0a900
    //          ^esi imm      ^      ^ecx imm   ^r8d imm
    // OPCODE value comes from `mov QWORD PTR[rbp-0x18], N` before the opcode BitCopy.

QUIRK — the bit offsets in every table below are bundle-absolute, recovered from the mov esi,IMM immediate feeding BitCopy. They are not slot-relative. The slot-relative template (operand-y@+0 … OPCODE@+16 … predication@+22) is the same 27-bit shape for both scalar pipes; the absolute placement differs only by the slot base (Scalar1 at 79, Scalar0 at 106).
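
The absolute-offset model can be illustrated with a small Python sketch. It is a stand-in for the BitCopy primitive, not a transcription of it: the LSB-first bit order within each byte is an assumption (this page pins bit positions, not byte-level bit order), and `write_field` / `read_field` are hypothetical helper names.

```python
def bit_copy(dst: bytearray, dst_bitoff: int, src: bytes, src_bitoff: int, nbits: int) -> None:
    """Copy nbits from src (starting at src_bitoff) into dst at dst_bitoff.

    ASSUMPTION: bit i lives in byte i // 8 at position i % 8 (LSB-first).
    """
    if dst_bitoff + nbits > len(dst) * 8 or src_bitoff + nbits > len(src) * 8:
        raise ValueError("field exceeds buffer")
    for i in range(nbits):
        s = src_bitoff + i
        d = dst_bitoff + i
        bit = (src[s // 8] >> (s % 8)) & 1
        # Clear the destination bit, then set it to the source bit.
        dst[d // 8] = (dst[d // 8] & ~(1 << (d % 8)) & 0xFF) | (bit << (d % 8))


def write_field(bundle: bytearray, bitoff: int, width: int, value: int) -> None:
    """Hypothetical helper: write the low `width` bits of value at an absolute offset."""
    bit_copy(bundle, bitoff, value.to_bytes(8, "little"), 0, width)


def read_field(bundle: bytes, bitoff: int, width: int) -> int:
    """Hypothetical helper: read a field back under the same bit-order assumption."""
    word = int.from_bytes(bundle, "little")
    return (word >> bitoff) & ((1 << width) - 1)


# One 32-byte buffer is shared by every slot; offsets are bundle-absolute.
bundle = bytearray(32)
write_field(bundle, 122, 6, 0x20)  # Scalar0 OPCODE field (ScalarIntAdd value)
write_field(bundle, 95, 6, 0x06)   # Scalar1 OPCODE field (ScalarStoreSmemAbsolute value)
```

Both writes target the same buffer with no per-slot rebasing, which is the property the dispatcher relies on.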

NOTE — the codec dispatcher's tail is pure absl::log_internal validation, not a framing-byte write. PufferfishCodecMetadata::BundleCheckByte/BundleSizeBytes (the TC-style 0x55 check byte / 0x33=51 framing) belong to TpuSequencerType=0; the BarnaCore encoders route through TpuSequencerType=1 and their own BundleSizeBytes() returns 32 directly with no check byte.

Function Map

| Function | Address | Role |
|---|---|---|
| BitCopy | 0x1fa0a900 | Bit-granular field writer; (dst, dst_bitoff, src, src_bitoff, nbits) |
| BarnaCoreSequencerCodecBase<…>::Encode | 0x1d229780 | Sequencer dispatcher; reuses one Span across both scalar slots |
| BarnaCoreChannelCodecBase<…>::Encode | 0x1d22c560 | Channel dispatcher; reuses one Span across 6 slots |
| EncoderPfBarnaCoreSequencer::BundleSizeBytes | 0x1d229220 | return 32 |
| EncoderPfBarnaCoreChannel::BundleSizeBytes | 0x1d22bb00 | return 32 |

2. The Sequencer Bundle (InstBits_BarnaCorePxcHwMode)

Purpose

The Sequencer bundle is a 2-wide dual-scalar VLIW carrying the control-flow and SMEM-access ISA: branches, calls, fences, sync/done, DMA descriptors, SMEM load/store, and the scalar integer/float ALU. Two 27-bit scalar slots (Scalar0, Scalar1) stack 27 bits apart and share a region of four 16-bit immediate slots in the bundle's low bits. The symbol InstBits_BarnaCorePxcHwMode (@0x33931f0, a static .rodata table inside the LLVM backend's TPUMCCodeEmitter::getBinaryCodeForInstr — the per-instruction bit-pattern table for this hardware-mode bundle) is the independent compiler-side ground truth for the layout decoded below from the proto encoders.

Encoding

The bundle uses only bits 3..132 of its 256 bits, which touches the lower 17 bytes (bytes 0..16). Bits 0..2 and 133..255 are reserved/unwritten; this was confirmed by sweeping every scalar op encoder and both dispatcher heads, and the maximum written bit is 132. The upper 15 bytes (bytes 17..31) are pure padding, present so the Sequencer and Channel bundles share the same physical word width (and therefore the same 32-byte DMA transfer unit), even though the Sequencer fills only slightly more than the lower half.
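
A quick calculation shows which bytes the written bit range touches. This is a sketch that assumes bit i lives in byte i // 8; `bytes_touched` is a hypothetical helper name.

```python
BUNDLE_BYTES = 32


def bytes_touched(first_bit: int, last_bit: int) -> range:
    """Byte indices covered by the inclusive bit range [first_bit, last_bit]."""
    return range(first_bit // 8, last_bit // 8 + 1)


seq = bytes_touched(3, 132)            # Sequencer bundle written range
untouched_tail = BUNDLE_BYTES - seq.stop  # whole bytes above the last written byte
```

Bit 132 falls in byte 16, so the written range covers bytes 0..16 and leaves bytes 17..31 entirely unwritten.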

Each scalar slot is the 27-bit template (slot-relative): operand-y @+0/w5, operand-x @+5/w6 (a ScalarY-style 6-bit selector that names either a scalar register or one of the four immediate slots), dest @+11/w5, OPCODE @+16/w6, predication @+22/w5. This is bit-identical to the SparseCore SCS scalar template; the only delta is that BarnaCore predication is a single 5-bit field (BarnaCoreSequencerScalar{0,1}PredicationField, both confirmed in symbols) rather than the SCS multi-field quad.
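
The slot-relative template and the two slot bases combine into absolute positions as sketched below. The field names and the `pack_slot` helper are illustrative, and the 256-bit word is modelled as a Python integer under an assumed little-endian, LSB-first bit order.

```python
# (name, slot-relative offset, width) for the shared 27-bit scalar template.
SCALAR_TEMPLATE = (
    ("operand_y", 0, 5),
    ("operand_x", 5, 6),
    ("dest", 11, 5),
    ("opcode", 16, 6),
    ("predication", 22, 5),
)
SLOT_BASE = {"Scalar0": 106, "Scalar1": 79}


def absolute_layout(slot: str) -> dict:
    """Map each field name to its (absolute bit, width) for the given slot."""
    base = SLOT_BASE[slot]
    return {name: (base + off, width) for name, off, width in SCALAR_TEMPLATE}


def pack_slot(word: int, slot: str, **values: int) -> int:
    """OR the given field values into a 256-bit word; omitted fields stay zero."""
    for name, (bit, width) in absolute_layout(slot).items():
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << bit
    return word


word = pack_slot(0, "Scalar0", operand_y=1, operand_x=2, dest=3, opcode=0x20)
word = pack_slot(word, "Scalar1", opcode=0x06)
bundle = word.to_bytes(32, "little")  # ASSUMPTION: little-endian byte order
```

The computed positions reproduce the per-slot field bases used in the rest of this section (for example, opcode at 122 for Scalar0 and 95 for Scalar1).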

Sequencer bundle (32 B / 256 bits; only bits 3..132 written)

bit:  15      31      47      63        79 ............ 105   106 ........... 132   133 .......... 255
     +-------+-------+-------+-------+   +------------------+  +------------------+  +---------------+
     | Imm0  | Imm1  | Imm2  | Imm3  |   |   Scalar1 slot   |  |   Scalar0 slot   |  |   RESERVED    |
     | w16   | w16   | w16   | w16   |   |   27 bits @79    |  |   27 bits @106   |  |  (unwritten)  |
     +-------+-------+-------+-------+   +------------------+  +------------------+  +---------------+
                                          opcode@95  pred@101    opcode@122 pred@128

Scalar0 slot (base 106; opcode value = §4 Table)

| Field | Base bit | Width | Slot offset | Role |
|---|---|---|---|---|
| SyField (operand y) | 106 | 5 | +0 | scalar-reg selector |
| SxField (operand x) | 111 | 6 | +5 | scalar-reg OR immediate-slot (ScalarY-style) |
| DestField | 117 | 5 | +11 | result scalar-reg selector |
| OPCODE | 122 | 6 | +16 | 6-bit hardware opcode |
| PredicationField | 128 | 5 | +22 | BarnaCoreSequencerScalar0PredicationField |

Scalar1 slot (base 79; opcode value = §4 Table)

| Field | Base bit | Width | Slot offset | Role |
|---|---|---|---|---|
| operand y | 79 | 5 | +0 | scalar-reg selector |
| operand x | 84 | 6 | +5 | scalar-reg OR immediate (ScalarY-style) |
| dest | 90 | 5 | +11 | result scalar-reg selector |
| OPCODE | 95 | 6 | +16 | 6-bit hardware opcode |
| PredicationField | 101 | 5 | +22 | BarnaCoreSequencerScalar1PredicationField |

Shared immediate slots (both scalar pipes)

| Field | Base bit | Width |
|---|---|---|
| Imm0Field | 15 | 16 |
| Imm1Field | 31 | 16 |
| Imm2Field | 47 | 16 |
| Imm3Field | 63 | 16 |
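
The four immediate slots are back-to-back 16-bit fields, which a short sketch makes explicit. `imm_extents` is a hypothetical helper; the base and width come from the table above.

```python
def imm_extents(first_base: int, count: int = 4, width: int = 16) -> list:
    """Inclusive (first_bit, last_bit) extent of each contiguous immediate slot."""
    return [(first_base + i * width, first_base + (i + 1) * width - 1)
            for i in range(count)]


SEQUENCER_IMMS = imm_extents(15)  # Imm0..Imm3 of the Sequencer bundle
```

The last immediate ends at bit 78, directly below the Scalar1 slot base at bit 79, so the immediates and the two scalar slots tile bits 15..132 without gaps.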

GOTCHA — DMA ops are a oneof-of-lane: ScalarDmaSimple/ScalarSingleStridedDma/ScalarGeneralDma do not fit a single scalar slot. Their opcode rides the Scalar0 opcode field (@122), but the descriptor (source/dest address, core id, memory id, length, dest sync flag, dma type — names from the ScalarDmaSimple{Source,Dest}{Address,CoreId,MemoryId}/Length/DestSyncFlag/DmaType0 field accessors) spills through the Scalar1 region, the four immediate slots, and extra low bits. ScalarGeneralDma (@0x1ee55120) is the widest, spanning bits 0..127 (the entire lower 16 bytes). A reimplementer must treat a DMA bundle as consuming the entire dual-scalar capacity, not one lane.
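
One way a reimplementer might enforce this rule is a scheduler-side occupancy check. The sketch below is hypothetical: the function name, its arguments, and the decision to reject any immediate use alongside a DMA op are assumptions derived from the description above, not recovered code.

```python
DMA_OPS = frozenset({"ScalarDmaSimple", "ScalarSingleStridedDma", "ScalarGeneralDma"})


def sequencer_bundle_is_encodable(scalar0_op, scalar1_op, immediates_used=0):
    """Return True if the op pairing respects the DMA whole-bundle rule.

    scalar0_op / scalar1_op are op names, or None for an empty slot.
    immediates_used counts immediate slots claimed by non-DMA operands.
    """
    if scalar1_op in DMA_OPS:
        # The DMA opcode rides the Scalar0 opcode field, never Scalar1.
        return False
    if scalar0_op in DMA_OPS:
        # The descriptor spills through Scalar1 and the immediate slots.
        return scalar1_op is None and immediates_used == 0
    return 0 <= immediates_used <= 4
```

For example, pairing ScalarGeneralDma in Scalar0 with any Scalar1 op is rejected, while two ordinary scalar ops sharing up to four immediates are accepted.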

Function Map

| Function | Address | Role |
|---|---|---|
| BarnaCoreSequencerScalar0Encoder::Encode | 0x1ee51ec0 | Scalar0 slot encoder; base 106, jump table @0xb84559c (bound ≤0x3e) |
| BarnaCoreSequencerScalar1Encoder::Encode | 0x1ee69000 | Scalar1 slot encoder; base 79, jump table @0xb845698 (bound ≤0x3c) |
| ScalarIntAdd (Scalar0) | 0x1ee55ee0 | op=0x20; Sy@106/Sx@111/Dest@117 |
| ScalarGeneralDma (Scalar0) | 0x1ee55120 | widest op; spans bits 0..127 |
| ScalarStoreSmemAbsolute (Scalar1) | 0x1ee6ace0 | op=0x6 |

3. The Channel Bundle

Purpose

The Channel bundle is the 6-slot vector datapath. In one cycle it performs a fully pipelined embedding-row transform: advance the feature-length loop, run two vector-ALU ops on the two lanes, store the previous row, load the next row, and drain an EUP (transcendental unit) result. The six slots are placed at ascending absolute bit positions; the codec template (@0x1d22c560) lists them in the order ExtendedResult / Store / Load / Alu0 / Alu1 / Scalar.

Encoding

The bundle fills bits 12..238 of its 256-bit frame; bits 0..11 and 239..255 are reserved (max written bit = 238, where the fourth immediate ends). The two vector-ALU lanes are 33 bits apart (a 31-bit slot plus a 2-bit inter-lane gap). Every VectorAlu encoder also writes a shared 2-bit field group at bits 35/37/39 — the vector lane/mask header at the vector-region base (role inferred from the VxField/YSrc accessor neighborhood; bit positions exact).

Channel bundle (32 B / 256 bits; bits 12..238 written)

+----------+----------+------------+------------+-------------+------------+----------+-------+-------+-------+-------+
| Channel  | 2-bit    | VectorAlu0 | VectorAlu1 | VectorStore | VectorLoad | VExtRes  | Imm0  | Imm1  | Imm2  | Imm3  |
| Scalar   | lane     | pred@62    | pred@95    | form@126    | form@147   | pred@167 | w16   | w16   | w16   | w16   |
| loop ctl | header   | opcode@67  | opcode@100 | pred@128    | pred@149   | @172,173 | @175  | @191  | @207  | @223  |
| 12..59   | 35/37/39 | ..92       | ..125      | 126..147    | 147..167   | 167..174 |       |       |       | ..238 |
+----------+----------+------------+------------+-------------+------------+----------+-------+-------+-------+-------+

| Slot | Encoder | Bit extent | Opcode / pred bits | Role |
|---|---|---|---|---|
| ChannelScalar | 0x1e87e6a0 | 12 .. 59 | type@12/w2; count@16/w8 | feature-length loop control (LoopStart/NormalOp/OpBranch) |
| VectorAlu0 | 0x1e8927c0 | 35 .. 92 | opcode@67/w6; pred@62/w5 | vector ALU lane 0 (carries VectorFloatMul; 54 shared ops) |
| VectorAlu1 | 0x1e8b4ec0 | 35 .. 125 | opcode@100/w6; pred@95/w5 | vector ALU lane 1 (carries VectorFloatAdd/Sub + 4 shifts) |
| VectorStore | 0x1e8c5640 | 126 .. 147 | form@126/w2; pred@128/w5 | embedding-row store (Base/Source/FeatureLen/Stride) |
| VectorLoad | 0x1e8c4900 | 147 .. 167 | form@147/w2; pred@149/w5 | embedding-row load (Base/Dest/FeatureLen) |
| VectorExtendedResult | 0x1e8c3c60 | 167 .. 174 | pred@167/w5; @172/w1, @173/w2 | EUP-result drain |
| ScalarImmediates | (in op enc.) | 175 .. 238 | 4 × 16-bit @175/191/207/223 | shared literals (Imm0..Imm3) |

The VectorAlu template

Slot-relative to the opcode field: OPCODE w6 at the base, four 5-bit VREG operand selectors above it (Dest/Vx/YSrc/YSrcVreg — names from the VectorFloatMul{Dest,Vx,YSrc,YSrcVreg}Field accessors, all confirmed in symbols), and the 5-bit predication field below it. VectorFloatMul also declares Imm0..Imm5Field (confirmed).
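
The per-lane layout can be sketched as follows. Two points are assumptions: the four operand selectors are taken to be packed contiguously directly above the opcode (consistent with the slot extents ending at bits 92 and 125), and they are labelled `operand_0..3` because the mapping of Dest/Vx/YSrc/YSrcVreg onto those four positions is not pinned here.

```python
# Absolute base of the predication field (the lowest field) in each lane.
LANE_BASE = {"VectorAlu0": 62, "VectorAlu1": 95}


def alu_layout(lane: str) -> dict:
    """Map field name to (absolute bit, width) for one vector-ALU lane."""
    base = LANE_BASE[lane]
    layout = {
        "predication": (base, 5),
        "opcode": (base + 5, 6),
    }
    for i in range(4):
        # ASSUMPTION: contiguous 5-bit selectors immediately above the opcode.
        layout[f"operand_{i}"] = (base + 11 + 5 * i, 5)
    return layout


def slot_extent(lane: str) -> tuple:
    """Inclusive (first_bit, last_bit) of the lane's 31-bit slot."""
    fields = alu_layout(lane).values()
    return min(b for b, _ in fields), max(b + w - 1 for b, w in fields)
```

Under these assumptions the Alu0 slot spans 62..92 and the Alu1 slot spans 95..125, a stride of 33 bits with bits 93 and 94 left as the inter-lane gap.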

The ChannelScalar loop controller

The ChannelScalar slot is the per-cycle feature-length loop sequencer. Its fields (all confirmed as BarnaCoreChannelScalar*Field symbols): LoopStartLoopInstructionCount, LoopStartPipelineDepth, OpBranch{BranchTargetPc,BranchType,BranchPred}, OpShift{,PipelineStage}, OpPush, PredFeatureId, ProgEnd, and a family of AddLoopIndexTo{ShiftAmount,VldDst,VstSrc,V0Dst,V0X,V0YReg,V0VselectMask,V1Dst,V1X,V1YReg,V1VselectMask} selectors that add the loop index to per-lane operand registers.

NOTE — the decoder declares six channel immediate fields (Imm0..Imm5Field, e.g. VectorFloatMulImm4Field/Imm5Field confirmed in symbols), but every sampled op encoder writes only the four at 175/191/207/223 (4 × 16-bit fills bits 175..238, exactly the bundle's written maximum). Imm4/Imm5 are declared-but-unwritten in v0.0.40 — reserved or op-form-specific (LOW confidence on their placement, since the encode side never sets them).

Function Map

| Function | Address | Role |
|---|---|---|
| BarnaCoreChannelVectorAlu0Encoder::Encode | 0x1e8927c0 | Alu0 slot; opcode@67/w6, pred@62/w5; jump table @0xb834e8c (bound ≤0x2e) |
| BarnaCoreChannelVectorAlu1Encoder::Encode | 0x1e8b4ec0 | Alu1 slot; opcode@100/w6, pred@95/w5 |
| BarnaCoreChannelScalarEncoder::Encode | 0x1e87e6a0 | loop controller; type@12/w2, count@16/w8 |
| VectorFloatMul (Alu0) | 0x1e894420 | opcode value 0x7 |
| VectorTanh (Alu0) | 0x1e89edc0 | opcode value 0x33 |
| VectorReciprocal (Alu0) | 0x1e89ee40 | opcode value 0x34 |

4. Hardware Opcode Values (6-bit field)

The 6-bit opcode field is written with a literal value (mov QWORD PTR[rbp-0x18],N) per op. These are the hardware opcode values, a separate ordering from the proto-oneof ordinals; the two orderings are independent and both now pinned.

Scalar0 / Scalar1 dispatch dimensions

The scalar pipes share a common ALU tail at identical values and split on pipe-only ops in the gaps:

| Axis | Values | Source |
|---|---|---|
| Shared header | 0x00..0x03 = Noop/Sync(0x1)/Pop(0x2)/Delay(0x3) | both jump tables |
| Shared ALU tail | IntAdd=0x20, IntSub=0x21, And=0x22, Or=0x23, Xor=0x24, Move=0x2e, IntEqual=0x30 (same value in both pipes) | per-op mov QWORD |
| Scalar0-only | Branch{Abs,Rel,Reg}=0x8/0x9/0xa, Call=0xc.., Fence=0x10, Dma=0x12, IssueFsm=0x15, ReadRegs=0x1d, ConvI2F=0x1e, FloatMul=0x27, UintMul=0x28, FloatMax=0x29, IsInfOrNan=0x3e | Scalar0 encoders |
| Scalar1-only | LoadSmem=0x4, LoadSmemOffset=0x5, StoreSmemAbsolute=0x6, ReadDone=0x16, WriteDone=0x17, ReadPublicAccess=0x18, WritePublicAccess=0x19, FloatAdd=0x25, FloatSub=0x26 | Scalar1 encoders |

All seven Sync* ops share opcode 0x1 and are distinguished by a sub-form field. The largest opcode value listed, 0x3e (IsInfOrNan), still fits the 6-bit field (maximum 0x3f).
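
The dispatch dimensions above can be captured as lookup tables. This is a sketch using the short op names from the table; Call is omitted because its value range is elided in the source (0xc..), and `pipes_for` / `opcode_for` are hypothetical helper names.

```python
SHARED = {
    "Noop": 0x00, "Sync": 0x01, "Pop": 0x02, "Delay": 0x03,
    "IntAdd": 0x20, "IntSub": 0x21, "And": 0x22, "Or": 0x23, "Xor": 0x24,
    "Move": 0x2E, "IntEqual": 0x30,
}
SCALAR0_ONLY = {
    "BranchAbs": 0x08, "BranchRel": 0x09, "BranchReg": 0x0A,
    "Fence": 0x10, "Dma": 0x12, "IssueFsm": 0x15, "ReadRegs": 0x1D,
    "ConvI2F": 0x1E, "FloatMul": 0x27, "UintMul": 0x28, "FloatMax": 0x29,
    "IsInfOrNan": 0x3E,
}
SCALAR1_ONLY = {
    "LoadSmem": 0x04, "LoadSmemOffset": 0x05, "StoreSmemAbsolute": 0x06,
    "ReadDone": 0x16, "WriteDone": 0x17, "ReadPublicAccess": 0x18,
    "WritePublicAccess": 0x19, "FloatAdd": 0x25, "FloatSub": 0x26,
}
PIPE_TABLE = {
    "Scalar0": {**SHARED, **SCALAR0_ONLY},
    "Scalar1": {**SHARED, **SCALAR1_ONLY},
}


def pipes_for(op: str) -> set:
    """Pipes that can carry the op, per the tables above."""
    return {pipe for pipe, table in PIPE_TABLE.items() if op in table}


def opcode_for(pipe: str, op: str) -> int:
    """6-bit opcode value; raises KeyError if the pipe cannot carry the op."""
    return PIPE_TABLE[pipe][op]


# Sanity checks: every value fits 6 bits and is unique within its pipe.
for table in PIPE_TABLE.values():
    assert all(0 <= v < 64 for v in table.values())
    assert len(set(table.values())) == len(table)
```

A scheduler can use `pipes_for` to decide whether an op may be dual-issued in either slot or must be steered to one pipe.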

Channel VectorAlu0

| Opcode | Op |
|---|---|
| 0x03 | VectorOr |
| 0x04 | VectorXor |
| 0x07 | VectorFloatMul |
| 0x08 | VectorFloatMax |
| 0x09 | VectorFloatMin |
| 0x18 | VectorLaneId |
| 0x1e | VectorRelux |
| 0x1f | VectorMove |
| 0x20 | VectorIntEqual |
| 0x27 | CreateSublaneMask |
| 0x2f | CreateLaneMask |
| 0x30 | VectorReciprocalSquareRoot |
| 0x31 | VectorPow2 |
| 0x32 | VectorLog2 |
| 0x33 | VectorTanh |
| 0x34 | VectorReciprocal |
| 0x35 | MoveDataUnchanged |

QUIRK — the EUP transcendental block is contiguous in the hardware opcode space at 0x30..0x34 (rsqrt → pow2 → log2 → tanh → recip), exactly as it is contiguous in the proto oneof. This is independent confirmation — from the 6-bit opcode value, not the proto ordinal — of the within-block EUP order. VectorRelux (0x1e) sits in the early ALU range, separate from the transcendental run, mirroring its role as an embedding activation rather than a transcendental.
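
The contiguity claim is easy to check mechanically from the sampled opcode values. The sketch below uses only the VectorAlu0 values listed in this section; the variable names are illustrative.

```python
VECTOR_ALU0_OPCODES = {
    0x03: "VectorOr", 0x04: "VectorXor", 0x07: "VectorFloatMul",
    0x08: "VectorFloatMax", 0x09: "VectorFloatMin", 0x18: "VectorLaneId",
    0x1E: "VectorRelux", 0x1F: "VectorMove", 0x20: "VectorIntEqual",
    0x27: "CreateSublaneMask", 0x2F: "CreateLaneMask",
    0x30: "VectorReciprocalSquareRoot", 0x31: "VectorPow2",
    0x32: "VectorLog2", 0x33: "VectorTanh", 0x34: "VectorReciprocal",
    0x35: "MoveDataUnchanged",
}
# EUP ops in their stated order: rsqrt, pow2, log2, tanh, recip.
EUP_BLOCK = ["VectorReciprocalSquareRoot", "VectorPow2", "VectorLog2",
             "VectorTanh", "VectorReciprocal"]

by_name = {name: code for code, name in VECTOR_ALU0_OPCODES.items()}
eup_codes = [by_name[name] for name in EUP_BLOCK]
# Contiguous means each code is exactly one above its predecessor.
is_contiguous = eup_codes == list(range(eup_codes[0], eup_codes[0] + len(eup_codes)))
```

The same `by_name` map doubles as an encoder-side lookup when emitting the 6-bit opcode field.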


5. The BcsMetadataAccessor Descriptor Format

Purpose

Per-op embedding metadata (the programming surface a gather/scatter op needs: where its rows live, how big the buffers are, which partition column, which HBM address) is not in the bundle — it lives in SMEM as a flat per-type word array. BcsMetadataAccessor is the runtime reader. This is the BarnaCore analog of the SparseCore TAC descriptor table: the descriptor lives in SMEM (programmed by the driver's SetBarnaCoreFeature* path), the emitter loads it into scalar registers, and the address math runs on the shared scalar ALU.

Encoding

The accessor object holds two driver-populated int32 arrays indexed by BcsProgramMetadataType (confirmed enum symbol): the SMEM-word base at this[+0x10] (length at [+0x18]) and the word count at this[+0x28] (length at [+0x30]). MetadataSmemWordBase (@0xf9d8ba0) and MetadataSmemWordCount (@0xf9d8c40) are the bounds-checked lookups. A descriptor field read is:

// field(type, offset_enum) =
value = Sld( SmemWordAddress( MetadataSmemWordBase(type) + offset_enum ) );
//            SmemWordAddress @0xf9d9720 ; LloRegionBuilder::Sld @0x1d516a20
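
A minimal Python model of this read path is sketched below. SMEM is modelled as a plain list of words, the class and method names are hypothetical, and the offset-versus-count check is an assumption (the source only states that the base and count lookups are bounds-checked by type).

```python
class MetadataAccessorModel:
    """Toy model: per-type base/count arrays plus a flat SMEM word array."""

    def __init__(self, word_base, word_count, smem):
        self.word_base = list(word_base)    # indexed by BcsProgramMetadataType
        self.word_count = list(word_count)  # indexed by BcsProgramMetadataType
        self.smem = list(smem)

    def metadata_smem_word_base(self, type_id: int) -> int:
        if not 0 <= type_id < len(self.word_base):
            raise IndexError(f"metadata type {type_id:#x} out of range")
        return self.word_base[type_id]

    def metadata_smem_word_count(self, type_id: int) -> int:
        if not 0 <= type_id < len(self.word_count):
            raise IndexError(f"metadata type {type_id:#x} out of range")
        return self.word_count[type_id]

    def load_field(self, type_id: int, offset: int = 0) -> int:
        base = self.metadata_smem_word_base(type_id)
        count = self.metadata_smem_word_count(type_id)
        # ASSUMPTION: offsets are validated against the per-type word count.
        if not 0 <= offset < count:
            raise IndexError(f"offset {offset} outside {count}-word region")
        return self.smem[base + offset]  # stands in for Sld(SmemWordAddress(...))


# Example tables: 15 entries so that type numbers 0x1..0xe index directly.
base = [0] * 15
count = [0] * 15
base[0x2], count[0x2] = 8, 8    # illustrative placement, not recovered values
base[0x8], count[0x8] = 40, 1
accessor = MetadataAccessorModel(base, count, smem=range(1000, 1064))
```

With these illustrative tables, a no-offset read of type 0x8 returns the word at SMEM index 40, and offset 3 of type 0x2 returns the word at index 11.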

The type number for each accessor comes from the mov esi,N; call MetadataSmemWordBase site inside the accessor body. The 14 types (mangled accessor symbols all confirmed in …_names.json):

| Type | BcsProgramMetadataType | Accessor method | Address | Offset enum |
|---|---|---|---|---|
| 0x1 | DedupTransfer | LoadDedupTransferMetadata | 0xf9d90e0 | BcsDedupTransferMetadataOffset |
| 0x2 | PassHeader | LoadPassHeaderMetadata | 0xf9d8d40 | BcsPassHeaderMetadataOffset {2,3,4,5,6,..} |
| 0x3 | PayloadLocation | LoadPayloadLocationMetadata | 0xf9d8da0 | BcsPayloadLocationMetadataOffset (per-id) |
| 0x4 | LocalBuffer | LoadLocalBufferSize | 0xf9d8e40 | |
| 0x5 | RemoteBufferSize | LoadRemoteBufferSize | 0xf9d8f80 | |
| 0x6 | BmemWordAddress | LoadBmemWordAddressFromMetadata | 0xf9d9140 | |
| 0x7 | RemoteBufferOffset | LoadRemoteBufferOffset | 0xf9d8fe0 | |
| 0x8 | PartitionColumn | LoadPartitionColumn | 0xf9d92e0 | (base+0, no offset) |
| 0x9 | ScatterGroup | LoadScatterGroup | 0xf9d9340 | |
| 0xa | BarnaCoreLocation | LoadBarnaCoreLocation | 0xf9d93a0 | |
| 0xb | TensorCoreLocation | LoadTensorCoreLocation | 0xf9d9460 | |
| 0xc | AbsoluteHbm | GetBarnaCoreAbsoluteHbmAddress | 0xf9d9520 | |
| 0xd | TensorCoreDmaAddr | GetBarnaCoreAbsoluteHbmAddressForTensorCoreDma | 0xf9d95c0 | |
| 0xe | BackwardPassSlotSelector | Load/StoreBackwardPassSlotSelector | 0xf9d9660 / 0xf9d96c0 | |

LoadCommandMetadata (@0xf9d8ce0) takes the type as a runtime arg (BcsCommandMetadataOffset enum, confirmed). The constructor is BcsMetadataAccessor(TpuTopology const*, int, int, absl::Span<int const>) (@0xf9d8b80).

NOTE — the symbol table also carries LoadTensorCoreBufferSelector and LoadTpuBatchCount accessors not enumerated above — additional metadata readers on the same SMEM-word model (their type numbers were not separately decoded; HIGH confidence they follow the same base(type)+offset pattern, LOW on their specific type indices).


6. The MigrateInstruction Ordinal Map

Purpose

PufferfishBarnaCoreChannelEmitter::MigrateInstruction<Alu0_X, Alu1_X> (and the reverse <Alu1_X, Alu0_X>) is the vector-ALU lane-swap mechanism. The scheduler uses it to move a vector op between the two ALU lanes when one is free. The map is the runtime, message-level proof of the Alu0/Alu1 lane asymmetry: a Migrate template exists exactly for the lane-agnostic ops, and is absent for the lane-locked ops.

Encoding

Each MigrateInstruction body is a proto round-trip: SerializeToString (@0x210580c0) the source-lane message, then ParseFromString (@0x21057460) into the destination-lane message type. The two lane types share their field layout, so the round-trip is lossless. The op message types live in asic_sw::deepsea::pxc::pfc::isa::BarnaCoreChannelVectorAlu{0,1}_<Op>.

There are 108 instances = 54 ops × 2 directions (@0x140d1400..0x140d6c00; <Alu0_Noop,Alu1_Noop> @0x140d17c0, <Alu0_VectorTanh,Alu1_VectorTanh> @0x140d4c80). The 54 migratable ops (all confirmed as BarnaCoreChannelVectorAlu0_<Op> / BarnaCoreChannelVectorAlu1_<Op> template pairs in symbols):

Noop, VectorIntAdd, VectorIntSub, VectorAnd, VectorOr, VectorXor,
VectorFloatMax, VectorFloatMin, VectorLaneId, VectorSublaneCircularRotateDown,
VectorRelux, VectorMove, VectorClampSymmetric, VectorPopCount,
VectorCountLeadingZeros, VectorSelectVmsk0..7 (8), VectorPackAsHalfFloats{Interleaved,Compressed},
VectorUnpackHalfFloats{Upper,Lower}, VectorConvert{IntToFloat,FloatToInt},
VectorExtractExponent, VectorExtractSignificand, VectorComposeFloat,
VectorInt{Equal,NotEqual,Greater,GreaterEqual,Less,LessEqual}, VectorIntAddCarryOut,
CreateSublaneMask, VectorFloat{Equal,NotEqual,Greater,GreaterEqual,Less,LessEqual},
VectorFloatIsInfOrNan, CreateLaneMask, VectorReciprocalSquareRoot, VectorPow2,
VectorLog2, VectorTanh, VectorReciprocal, MoveDataUnchanged
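
The brace and range shorthand in the list above can be expanded and counted with a short sketch. The expanded spellings (for example `VectorSelectVmsk0` through `VectorSelectVmsk7`) are an assumption about how the shorthand maps to symbol names, and `expand` is a hypothetical helper.

```python
CMP = ["Equal", "NotEqual", "Greater", "GreaterEqual", "Less", "LessEqual"]


def expand(prefix, suffixes):
    """Expand a shorthand like Prefix{A,B} into [PrefixA, PrefixB]."""
    return [prefix + str(s) for s in suffixes]


MIGRATABLE_OPS = (
    ["Noop", "VectorIntAdd", "VectorIntSub", "VectorAnd", "VectorOr", "VectorXor",
     "VectorFloatMax", "VectorFloatMin", "VectorLaneId",
     "VectorSublaneCircularRotateDown", "VectorRelux", "VectorMove",
     "VectorClampSymmetric", "VectorPopCount", "VectorCountLeadingZeros"]
    + expand("VectorSelectVmsk", range(8))
    + expand("VectorPackAsHalfFloats", ["Interleaved", "Compressed"])
    + expand("VectorUnpackHalfFloats", ["Upper", "Lower"])
    + expand("VectorConvert", ["IntToFloat", "FloatToInt"])
    + ["VectorExtractExponent", "VectorExtractSignificand", "VectorComposeFloat"]
    + expand("VectorInt", CMP)
    + ["VectorIntAddCarryOut", "CreateSublaneMask"]
    + expand("VectorFloat", CMP)
    + ["VectorFloatIsInfOrNan", "CreateLaneMask", "VectorReciprocalSquareRoot",
       "VectorPow2", "VectorLog2", "VectorTanh", "VectorReciprocal",
       "MoveDataUnchanged"]
)

# One template instance per op per direction.
TEMPLATE_PAIRS = (
    [(f"Alu0_{op}", f"Alu1_{op}") for op in MIGRATABLE_OPS]
    + [(f"Alu1_{op}", f"Alu0_{op}") for op in MIGRATABLE_OPS]
)
```

The expansion yields 54 distinct op names and 108 directed template pairs, matching the instance count stated above.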

The 7 lane-locked ops have no Migrate template — the scheduler cannot move them:

OpLocked toReason
VectorFloatMulAlu0float-multiply lane
VectorFloatAddAlu1float-add lane
VectorFloatSubAlu1float-add lane
VectorLogicalShiftLeftAlu1shift unit
VectorLogicalShiftRightAlu1shift unit
VectorArithmeticShiftRightAlu1shift unit
VectorRoundingArithmeticShiftRightAlu1shift unit

QUIRK — the asymmetry is real hardware structure: float-mul is Alu0-only, float-add/sub and all four shifts are Alu1-only. A scheduler that assumes the two lanes are interchangeable will mis-place these seven ops. The presence/absence of a Migrate template is the lane-capability table.
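
A scheduler can consume this as a lane-capability lookup. The sketch below is hypothetical: `allowed_lanes` and `place` are illustrative names, and any op not in the lane-locked table is treated as lane-agnostic, which presumes the caller passes a valid op name.

```python
LANES = ("Alu0", "Alu1")

# The seven ops with no Migrate template, and the lane each is pinned to.
LANE_LOCKED = {
    "VectorFloatMul": "Alu0",
    "VectorFloatAdd": "Alu1",
    "VectorFloatSub": "Alu1",
    "VectorLogicalShiftLeft": "Alu1",
    "VectorLogicalShiftRight": "Alu1",
    "VectorArithmeticShiftRight": "Alu1",
    "VectorRoundingArithmeticShiftRight": "Alu1",
}


def allowed_lanes(op: str) -> tuple:
    """Lanes the op may occupy: one lane if locked, both otherwise."""
    locked = LANE_LOCKED.get(op)
    return (locked,) if locked else LANES


def place(op: str, free_lanes: set):
    """Pick the first permitted free lane, or None if the op cannot issue."""
    for lane in allowed_lanes(op):
        if lane in free_lanes:
            return lane
    return None
```

For instance, VectorFloatMul cannot issue when only Alu1 is free, whereas a lane-agnostic op such as VectorTanh simply takes whichever lane is available.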

NOTE — there are zero Scalar0/Scalar1 Migrate instances. Scalar dual-issue is resolved statically at schedule time by the FindFreeScalarSlot<Scalar0_X, Scalar1_Y> template instantiations, never by a runtime message migration. (The whole-binary MigrateInstruction<…> template-instance count is 222: the 108 BarnaCore-channel pairs above plus 114 from PufferfishTensorCoreEmitter; the BarnaCore-channel set is the 108 in the address range above.)


| Name | Relationship |
|---|---|
| BarnaCoreSequencerCodecBase / BarnaCoreChannelCodecBase | The two per-personality bundle dispatchers; reuse one Span across slots |
| EncoderPfBarnaCore{Sequencer,Channel} | The pufferfish::isa encoders; BundleSizeBytes() == 32 |
| BitCopy (@0x1fa0a900) | The shared SparseCore/BarnaCore bit packer |
| PufferfishBarnaCoreChannelEmitter | Owns MigrateInstruction; the lane-swap scheduler hook |

Cross-References

  • Overview — BarnaCore, the legacy embedding accelerator: where this bundle sits in the pipeline
  • BCS Scalar0/Scalar1 ISA — the per-op control+memory ISA rosters whose opcodes §4 gives the hardware values for
  • Merged-ALU Bit Layout: VectorResultDestination / BaseAddressEncoding, the vector-ALU operand selectors referenced in §3
  • JF/DF 16-Byte Address-Handler Bundle — the legacy (non-Pufferfish) BarnaCore bundle; a separate 16-byte word filled by direct struct write, not BitCopy
  • Index — Part IX — SparseCore & BarnaCore / BarnaCore (legacy v2–v4)