OneSlot Scalar Router

Every address, opcode value, slot-flag bit, and jump-table bound on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d) — from the decompiled ConsumeOneSlotInstruction body, its callee tails, and the .rodata jump table. Other versions differ.

Abstract

ConsumeOneSlotInstruction is the SparseCore SC-MLO emitter's scalar-slot router: given one decoded MCInst whose opcode names a scalar (non-vector) SC operation, it decides which physical issue slot of the bundle the op occupies — Stream, ScalarMisc, ScalarAlu (and its dual sub-slots S0/S1), or DMA — and then tail-calls the matching Consume<Slot>Instruction leaf to lower the op into that slot's proto field. It is not a per-op encoder; it is the dispatch seam one level above the per-slot consumers, the SparseCore analog of an LLVM MCInst → functional-unit binding pass. Where a TableGen-driven backend would carry the slot as an itinerary class on the instruction, libtpu carries it two ways at once: a 4019-entry jump table maps the opcode to a slot class, and for the ops that can issue on more than one scalar slot, a per-MCInst flag word (getSlotFlagsFromMCInst, MCInst+0x4) picks the concrete sub-slot at emit time.

The router is bundle-invariant. The three SC bundle types that have a scalar region (SCS, TAC, and the scalar half of the TEC) instantiate ConsumeOneSlotInstruction<Bundle> from one template, and all three share an identical opcode→arm distribution: same jump-table base (opcode − 0x1f3), same bound (0xfb2, the highest valid index, giving 4019 entries), same ten arms with the same op-counts. Only the bundle template argument on the slot accessors differs (GetStreamSlot<…,SparseCoreScsBundle> vs …TacBundle vs …TecBundle). A reimplementer writes the router once and parameterizes it on the bundle.

This page documents the router's classify-by-opcode logic (the ten arms and their slot classes), the per-MCInst slot-flag sub-routing for the multi-slot ops, the three special arms (DMA materialization, optional-skip, no-op), the default-error path, and — because the page's job is to show how a single bundle slot reaches its sub-encoder — how the vector slot dispatch is a separate mechanism: the TEC bundle's vector slots are routed by ConsumeOneTecBundleInstruction, which reaches the 142-op VectorAlu table through ConsumeVectorAluInstruction. The vector opcode roster itself is owned by the TEC Vector Opcode Enumeration page and is linked, not duplicated, here.

For reimplementation, the contract is:

  • The classify table: opcode − 0x1f3, bound 0xfb2, ten arms. Read the slot-flags word from MCInst+0x4 and the opcode from DWORD[MCInst]; index the jump table at 0xae8dce4 (TAC) with opcode − 0x1f3; dispatch to one of ten arms. Each non-special arm selects a slot class, calls Get<Slot>Slot and EmitPredicationToSlot<…>, and tail-calls Consume<Slot>Instruction.
  • The slot-flag sub-routing for multi-slot ops. The 54-opcode multi-scalar arm has no fixed slot; it tests the slot-flag bits (flags & 1 → S0, & 2 → S1, & 4 → ScalarMisc) and builds a std::variant value-visitor over {SparseCoreScalarAlu*, SparseCoreScalarMisc*}. No bit set is a hard error.
  • The DMA, optional-skip, and no-op arms. The 70-opcode DMA arm materializes a SparseCoreDma into the bundle's scalar_instruction oneof under a LogFatal-guarded precondition, then visits a {SparseCoreDma*, SparseCoreTecDma*} variant. Four opcodes are silently skipped when the consumer's bool argument is set; one opcode (0x264) returns OK with no emission.
  • The vector path is separate. ConsumeOneSlotInstruction handles only scalar slots; the TEC vector slots (VectorAlu/VectorLoad/VectorStore/VectorExtended) are dispatched by ConsumeOneTecBundleInstruction, and VectorAlu reaches its 142-op table via ConsumeVectorAluInstruction (jt base 0xb26).

| Field | Value |
| --- | --- |
| Router | ConsumeOneSlotInstruction<Bundle> (per-bundle scalar-slot dispatcher) |
| TAC entry | 0x139f1360; jt 0xae8dce4, base 0x1f3, bound 0xfb2 (4019 entries) |
| SCS / TEC entries | 0x13a50540 (jt 0xaea9fb4) · 0x13a15500 (jt 0xaea4ba0); identical arm map |
| Classify input | DWORD[MCInst] (opcode) + getSlotFlagsFromMCInst (MCInst+0x4, slot-flag bits) |
| Arms | 10: Stream 888 · ScalarMisc 92 · ScalarAlu 49 · …S1 27 · …S0 17 · DMA 70 · multi-scalar 54 · skip 4 · no-op 1 · default 2817 |
| Slot-flag bits | SLOT_S0 = 1 · SLOT_S1 = 2 · SLOT_SM = 4 (llvm::TPU::SparseCoreMCSlot) |
| Vector path | ConsumeOneTecBundleInstruction 0x13a08e00 → ConsumeVectorAluInstruction 0x13a0b580 (separate) |
| Source | platforms/xla/sparse_core/ghostlite/isa_emitter.cc |
| Confidence | CONFIRMED (decompile-anchored) unless a row or callout says otherwise |

NOTE — "OneSlot" means "one scalar issue slot," not "one instruction." The name is the router's job: take one MCInst, place it into exactly one of the bundle's scalar slots. It is the scalar peer of the vector dispatcher. The 64-byte bundle layout, the scalar slot byte/bit bases (Misc @111, Alu1 @138, Alu0 @165), and the dual-issue S0/S1 geometry live on the TEC Engine and SCS Engine pages and are not repeated here.


The Classify Logic

Purpose

The router answers one question per MCInst: which scalar slot does this op issue on? The SC scalar opcode space is partitioned into contiguous-ish runs by slot class — a long Stream block, a ScalarMisc block, a ScalarAlu block with interleaved S0/S1 dual-issue runs — and the router is the table that decodes that partition. Slot assignment is the second half of SC instruction placement: the engine (which sequencer: SCS/TAC/TEC) is chosen upstream by the section-classifier; the slot within the bundle is chosen here.

Entry Point

ConsumeOneTecBundleInstruction (0x13a08e00)        ── per-MCInst TEC bundle dispatcher
  ├─ ConsumeOneSlotInstruction<Bundle> (0x139f1360 TAC)   ── THIS router: scalar slots
  │    ├─ GetStreamSlot       → ConsumeStreamInstruction        (0x139fa940)
  │    ├─ GetScalarMiscSlot   → ConsumeScalarMiscInstruction    (0x139eeca0)
  │    ├─ GetScalarAluSlot    → ConsumeScalarAluInstruction     (0x139f09c0)
  │    ├─ GetScalarAluSlotS0  → ConsumeScalarAluSlotS0Instruction (0x139f9480)
  │    ├─ GetScalarAluSlotS1  → ConsumeScalarAluSlotS1Instruction (0x139f9be0)
  │    └─ DefaultConstruct<SparseCoreDma> → variant{Dma,TecDma}  (0x13a04820)
  └─ ConsumeVectorAluInstruction (0x13a0b580)        ── separate: VectorAlu (142 ops)

Algorithm

// ConsumeOneSlotInstruction<SparseCoreTacBundle>   // glc 0x139f1360
//   args: (printer, mcinst, &bundle, bool tolerate_skip)
function ConsumeOneSlotInstruction(printer, mcinst, bundle, tolerate_skip):
    flags  = getSlotFlagsFromMCInst(mcinst)          // 0x13c798e0 → *(u32*)(mcinst+0x4)
    opcode = mcinst.opcode                            // DWORD[mcinst]
    idx    = opcode - 0x1f3                           // jt base 0x1f3
    if (unsigned)idx > 0xfb2:                          // bound check (4019 entries)
        goto DEFAULT
    switch jt[idx]:                                   // jt @0xae8dce4 (TAC), 4019×int32 rel

      MULTI_SCALAR:                                   // 54 opcodes, slot chosen by flags
        if flags & 1:   slot = GetScalarAluSlotS0(flags, bundle); vidx = 0   // SparseCoreScalarAlu
        elif flags & 2: slot = GetScalarAluSlotS1(flags, bundle); vidx = 0
        elif flags & 4: slot = GetScalarMiscSlot(flags, bundle);  vidx = 1   // SparseCoreScalarMisc
        else:           return Error("Invalid slot. Expected Scalar Slot. "  // line 5882
                                     "MCInst Flags: $0", flags)
        return variant_visit[vidx](slot, mcinst)      // {ScalarAlu*, ScalarMisc*} value-visitor

      SCALAR_ALU_S0:                                  // 17 opcodes, fixed S0
        slot = GetScalarAluSlotS0(flags, bundle)
        if EmitPredicationToSlot<…ScalarAlu>(mcinst, slot) != OK: return log(5969)
        return ConsumeScalarAluSlotS0Instruction(printer, …)

      SCALAR_ALU_S1:                                  // 27 opcodes, fixed S1
        slot = GetScalarAluSlotS1(flags, bundle)
        if EmitPredicationToSlot<…ScalarAlu>(mcinst, slot) != OK: return log(6003)
        return ConsumeScalarAluSlotS1Instruction(printer, …)

      SCALAR_ALU:                                     // 49 opcodes, generic ScalarAlu
        slot = GetScalarAluSlot(flags, bundle)        // returns StatusOr
        if EmitPredicationToSlot<…ScalarAlu>(mcinst, slot) != OK: return log(5945)
        return ConsumeScalarAluInstruction(printer, …, bundle)

      STREAM:                                         // 888 opcodes (DMA-stream descriptors)
        slot = GetStreamSlot(flags, bundle)
        if EmitPredicationToSlot<…Stream>(mcinst, slot) != OK: return log(7288)
        return ConsumeStreamInstruction(printer)

      SCALAR_MISC:                                    // 92 opcodes (sync/atomic/barrier/watch)
        slot = GetScalarMiscSlot(flags, bundle)
        if EmitPredicationToSlot<…ScalarMisc>(mcinst, slot) != OK: return log(6102)
        return ConsumeScalarMiscInstruction(printer, …, slot, bundle)

      DMA:                                            // 70 opcodes 0xfa1..0x1024 — see below
        ...materialize SparseCoreDma into scalar_instruction oneof...

      OPTIONAL_SKIP:                                  // 4 opcodes 0x100d/0x100e/0x1015/0x10f2
        if !tolerate_skip: goto DEFAULT
        return OK                                      // silently drop

      NO_OP:    return OK                              // opcode 0x264 only

      DEFAULT:                                         // 2817 opcodes (and OOB)
        return Error("Unsupported opcode while consuming slot instruction: "
                     "$0 : $1", opcode, getOpcodeName(opcode))   // line 7307

The decompile renders the jump table as a C switch, but the prologue is a true indirect jump (lea ecx,[r12-0x1f3]; cmp ecx,0xfb2; ja default; movsxd rcx,[rdx+rcx*4]; add rcx,rdx; jmp rcx), so the 4019 entries are signed 32-bit offsets, relative to the table base, that resolve to ten arm targets; the per-arm op-counts are tabulated in the arm map below. The EmitPredicationToSlot<…> call on every non-special arm stamps the op's predicate guard into the slot's predication header before the leaf consumer fills the slot body; a non-OK status from it converts to a logged status at the per-arm isa_emitter.cc line number.
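The relative-offset lookup in that prologue can be sketched in Python. This is a minimal illustration over a synthetic four-entry table, assuming little-endian int32 entries (the wheel is x86-64); `resolve_arm`, the toy table, and the toy addresses are hypothetical and are not data read from libtpu.so.

```python
import struct

JT_BASE_OPCODE = 0x1F3   # subtracted from the opcode, as in the prologue
JT_BOUND = 0xFB2         # highest valid index; anything above goes to default

def resolve_arm(opcode, table_bytes, table_addr, default_addr, bound=JT_BOUND):
    """Mimic: lea idx,[op-base]; cmp idx,bound; ja default; movsxd; add; jmp."""
    idx = (opcode - JT_BASE_OPCODE) & 0xFFFFFFFF    # unsigned 32-bit wraparound
    if idx > bound:                                 # 'ja' is an unsigned compare
        return default_addr
    (rel,) = struct.unpack_from("<i", table_bytes, idx * 4)  # signed int32 entry
    return table_addr + rel                         # target = table base + offset

# Synthetic table: each entry is an offset relative to the table's own address.
toy_table_addr = 0x1000
toy_table = struct.pack("<4i", -0x200, -0x180, -0x200, 0x40)
```

Because the comparison is unsigned, an opcode below the base wraps to a huge index and takes the default path without a separate lower-bound test; several indices may also share one target, which is how 4019 entries collapse onto ten arms.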

Arm Map

The ten arms, byte-confirmed against the TAC body and its jump table; the SCS and TEC routers have the identical distribution.

| Arm | Opcodes | Slot class → action |
| --- | --- | --- |
| Stream | 888 | GetStreamSlot + EmitPredicationToSlot<…Stream> + ConsumeStreamInstruction |
| ScalarMisc | 92 | GetScalarMiscSlot + …<…ScalarMisc> + ConsumeScalarMiscInstruction |
| ScalarAlu | 49 | GetScalarAluSlot (StatusOr) + …<…ScalarAlu> + ConsumeScalarAluInstruction |
| ScalarAlu-S1 | 27 | GetScalarAluSlotS1 + ConsumeScalarAluSlotS1Instruction (fixed S1) |
| ScalarAlu-S0 | 17 | GetScalarAluSlotS0 + ConsumeScalarAluSlotS0Instruction (fixed S0) |
| DMA | 70 | guard oneof → clear_scalar_instruction → DefaultConstruct<SparseCoreDma> → variant |
| Multi-scalar (flag) | 54 | flags & 1 → S0 / & 2 → S1 / & 4 → Misc; none → error |
| Optional-skip | 4 | if tolerate_skip return OK; else DEFAULT (0x100d/0x100e/0x1015/0x10f2) |
| No-op | 1 | return OK (opcode 0x264) |
| Default / OOB | 2817 | MakeErrorImpl "Unsupported opcode while consuming slot instruction: $0 : $1" |

GOTCHA — the slot class is in the jump table, but for 54 ops the sub-slot is in the MCInst flags, not the opcode. The five fixed-slot arms (Stream, ScalarMisc, ScalarAlu, S0, S1) decide the slot from the opcode alone. The multi-scalar arm does not: it carries no fixed slot and reads the per-MCInst flag word to pick S0/S1/Misc. A reimplementer who maps opcode→slot statically will mis-route every one of those 54 ops, because the same opcode can land on a different scalar sub-slot in two different bundles depending on the scheduler's flag stamp.
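A quick consistency check on the arm map, written as a small Python sketch. The counts are the ones listed above; the dictionary keys and the `needs_flags` helper are illustrative names, not symbols from the binary.

```python
# Per-arm op-counts as listed in the arm map.
ARM_COUNTS = {
    "stream": 888, "scalar_misc": 92, "scalar_alu": 49,
    "scalar_alu_s1": 27, "scalar_alu_s0": 17, "dma": 70,
    "multi_scalar": 54, "optional_skip": 4, "no_op": 1, "default": 2817,
}
JT_BOUND = 0xFB2   # highest valid jump-table index

# Arms whose slot follows from the opcode alone.
FIXED_SLOT_ARMS = {"stream", "scalar_misc", "scalar_alu",
                   "scalar_alu_s0", "scalar_alu_s1"}
# Arms that also consult the per-MCInst slot-flag word.
FLAG_DEPENDENT_ARMS = {"multi_scalar", "dma"}

def table_entries(bound=JT_BOUND):
    """Indices 0..bound inclusive, so the table has bound + 1 entries."""
    return bound + 1

def needs_flags(arm):
    """True when a static opcode->slot map is not enough for this arm."""
    return arm in FLAG_DEPENDENT_ARMS
```

The ten counts add up to exactly the table size, which is a cheap way to validate a re-extracted table: every index must land in one of the ten arms.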


The Slot-Flag Sub-Routing

The flag word and its bits

The router's second input is the slot-flags word, read by getSlotFlagsFromMCInst (0x13c798e0), whose entire body is return *((u32*)mcinst + 1) — i.e. the flags live at MCInst+0x4. The low three bits are the llvm::TPU::SparseCoreMCSlot enumeration, stamped upstream by the scheduler (setSlotFlagInMCInst, see SCS scalar opcode page):

| Bit | Mask | SparseCoreMCSlot | Meaning |
| --- | --- | --- | --- |
| 0 | 0x1 | SLOT_S0 | issue on scalar-ALU sub-slot 0 |
| 1 | 0x2 | SLOT_S1 | issue on scalar-ALU sub-slot 1 |
| 2 | 0x4 | SLOT_SM | issue on the ScalarMisc slot |
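Reading the two router inputs from a raw MCInst can be sketched as follows. This assumes only what the page states (opcode DWORD at +0x0, flag DWORD at +0x4) plus little-endian byte order for the x86-64 build; `read_opcode_and_flags` and `slot_names` are hypothetical helpers, and the rest of the MCInst layout is not modelled.

```python
import struct

SLOT_S0, SLOT_S1, SLOT_SM = 0x1, 0x2, 0x4   # llvm::TPU::SparseCoreMCSlot bits

def read_opcode_and_flags(mcinst_bytes):
    """Return (opcode, flags): the u32 at +0x0 and the u32 at +0x4."""
    opcode, flags = struct.unpack_from("<II", mcinst_bytes, 0)
    return opcode, flags

def slot_names(flags):
    """Decode the low three bits into their SparseCoreMCSlot names."""
    table = ((SLOT_S0, "SLOT_S0"), (SLOT_S1, "SLOT_S1"), (SLOT_SM, "SLOT_SM"))
    return [name for mask, name in table if flags & mask]
```

For example, a buffer whose first eight bytes encode opcode 0x264 and flags 0x5 decodes to `(0x264, 0x5)`, and `slot_names(0x5)` lists SLOT_S0 and SLOT_SM.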

The multi-scalar dispatch

For the 54 multi-slot opcodes, the arm is a priority test on those bits, building a two-element std::variant value-visitor whose index selects the slot accessor's proto type:

// multi-scalar arm  (glc 0x139f14f8)
if (flags & 1):                                   // SLOT_S0 — highest priority
    slot = GetScalarAluSlotS0<SparseCoreScalarAlu,Bundle>(flags, bundle)
    visitor_index = 0                              // → SparseCoreScalarAlu*
else if (flags & 2):                               // SLOT_S1
    slot = GetScalarAluSlotS1<SparseCoreScalarAlu,Bundle>(flags, bundle)
    visitor_index = 0                              // → SparseCoreScalarAlu*
else if (flags & 4):                               // SLOT_SM
    slot = GetScalarMiscSlot<SparseCoreScalarMisc,Bundle>(flags, bundle)
    visitor_index = 1                              // → SparseCoreScalarMisc*
else:                                              // no slot bit set
    return MakeError("Invalid slot. Expected Scalar Slot. MCInst Flags: $0", flags)
return __variant_dispatch[visitor_index](visitor, slot)   // {ScalarAlu*, ScalarMisc*}

The else branch — no SLOT_S0/S1/SM bit set on an op the router believes is scalar — formats the flags through FastIntToBuffer and SubstituteAndAppendArray into the second of the router's two error strings and returns it (MakeErrorImpl<9> at isa_emitter.cc:151, source-location 5882). This is the router's internal-consistency check: the jump table claims the op is scalar, but the scheduler stamped no scalar slot, so emission cannot proceed.

QUIRK — S0 wins over S1 wins over Misc; the test order is the policy. The flag bits are checked in fixed priority S0 → S1 → SM, not as a one-hot. An MCInst with both SLOT_S0 and SLOT_S1 set routes to S0. Whether the scheduler ever sets more than one bit on a multi-scalar op was not traced (LOW), but the router's behavior if it does is deterministic and is the test order above, not an error.
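The priority test is small enough to restate as runnable Python. This is a sketch of the documented test order only; the function name, the returned labels, and the exception class are illustrative, and the real arm returns a status rather than raising.

```python
SLOT_S0, SLOT_S1, SLOT_SM = 0x1, 0x2, 0x4

class InvalidSlotError(ValueError):
    """Stand-in for the router's 'Invalid slot' error status."""

def pick_multi_scalar_slot(flags):
    """Return (slot_label, variant_index) using the S0 > S1 > SM test order."""
    if flags & SLOT_S0:
        return "ScalarAluS0", 0      # variant index 0: SparseCoreScalarAlu*
    if flags & SLOT_S1:
        return "ScalarAluS1", 0      # same proto type as S0
    if flags & SLOT_SM:
        return "ScalarMisc", 1       # variant index 1: SparseCoreScalarMisc*
    raise InvalidSlotError(
        f"Invalid slot. Expected Scalar Slot. MCInst Flags: {flags}")
```

With flags 0x3 (both ALU bits) the result is the S0 slot, and with flags 0x6 it is S1, which makes the "first match wins" policy explicit instead of leaving it implied by branch order.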


The DMA, Optional-Skip, and No-Op Arms

DMA materialization (70 opcodes, 0xfa1..0x1024)

The DMA arm does not call a Get<Slot>Slot accessor. A DMA op is not a slot fill; it is a descriptor that the router materializes into the bundle's scalar_instruction oneof, then dispatches through a DMA-type variant visitor.

// DMA arm  (glc 0x139f1467)
oneof = *(u32*)(bundle + 0x38)                 // scalar_instruction oneof tag
switch oneof:                                   // precondition: must be empty
    case 2: LogFatal("!bundle.has_dma()",        isa_emitter.cc:157)
    case 6: LogFatal("!bundle.has_stream()",     isa_emitter.cc:158)
    case 1: LogFatal("!bundle.has_scalar_alu()", isa_emitter.cc:159)
if !(flags & 1): LogFatal("flags & SLOT_S0", isa_emitter.cc:161)   // requires S0
if !(flags & 2): LogFatal("flags & SLOT_S1", isa_emitter.cc:162)   // and S1
clear_scalar_instruction(bundle)               // 0x1fb59220
*(u32*)(bundle + 0x38) = 2                       // set oneof = dma
dma = Arena::DefaultConstruct<SparseCoreDma>(bundle.arena)   // 0x1fb5a480
*(u64*)(bundle + 0x30) = dma
return __variant_dispatch[0](visitor, dma)     // {SparseCoreDma*, SparseCoreTecDma*}

NOTE — the DMA arm requires both SLOT_S0 and SLOT_S1 set. Beyond the oneof-empty precondition, the decompile shows two further LogFatal asserts: a DMA op must carry both scalar-ALU slot-flag bits (SLOT_S0 and SLOT_S1). The DMA descriptor occupies the width of both dual-issue scalar lanes, so the scheduler must reserve both; a DMA MCInst missing either bit is a fatal compiler invariant violation, not a recoverable status. The variant over {SparseCoreDma*, SparseCoreTecDma*} then routes the simple/general/strided/iova DMA sub-consumers downstream.
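The DMA arm's precondition ladder can be modelled with a toy bundle. In this sketch the bundle is a plain dict, the exception stands in for LogFatal (the real router aborts the process), and the tag value 0 for "nothing set" is an assumption; only the tags 1, 2, and 6 and the two flag checks come from the pseudocode above.

```python
SLOT_S0, SLOT_S1 = 0x1, 0x2
ONEOF_SCALAR_ALU, ONEOF_DMA, ONEOF_STREAM = 1, 2, 6   # tags from the DMA arm
ONEOF_UNSET = 0   # assumption: 0 means no scalar_instruction is set

class FatalInvariant(RuntimeError):
    """Stand-in for LogFatal; not a recoverable status in the real router."""

_OCCUPIED_CHECKS = {
    ONEOF_DMA: "!bundle.has_dma()",
    ONEOF_STREAM: "!bundle.has_stream()",
    ONEOF_SCALAR_ALU: "!bundle.has_scalar_alu()",
}

def materialize_dma(bundle, flags):
    """bundle is a toy dict {'oneof': int, 'payload': object}."""
    tag = bundle["oneof"]
    if tag in _OCCUPIED_CHECKS:                 # oneof must not already be in use
        raise FatalInvariant(_OCCUPIED_CHECKS[tag])
    if not flags & SLOT_S0:                     # both ALU slot bits are required
        raise FatalInvariant("flags & SLOT_S0")
    if not flags & SLOT_S1:
        raise FatalInvariant("flags & SLOT_S1")
    bundle["payload"] = {"kind": "SparseCoreDma"}   # stands in for DefaultConstruct
    bundle["oneof"] = ONEOF_DMA
    return bundle["payload"]
```

Keeping the checks as exceptions in a model like this makes it easy to test that an upstream scheduler never produces a DMA op with a single slot bit or an occupied oneof.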

Optional-skip (4 opcodes) and no-op (1 opcode)

// optional-skip arm  (glc 0x139f17bf)
case 0x100d: case 0x100e: case 0x1015: case 0x10f2:
    if (!tolerate_skip) goto DEFAULT             // bool = consumer's 4th argument
    return OK                                     // silently drop the op

// no-op arm  (glc 0x139f176a)
case 0x264:
    return OK                                     // no slot fill, no descriptor

The optional-skip arm gates four opcodes on the consumer's bool argument (char a4 in the decompile, the 4th parameter): when set, the four ops are silently accepted and produce no emission; when clear, they fall into the default-error arm. The single no-op opcode 0x264 always returns OK with no emission: it stands for a bundle slot that carries no encoding (the all-zero NOP described on the TEC Engine page).

QUIRK — the optional-skip bool flips four opcodes between "silently dropped" and "hard error." The meaning of the flag (tolerate-padding vs. speculative-decode vs. a per-gen feature gate) was not traced to its caller (LOW). What is certain: with the flag clear, opcodes 0x100d/0x100e/0x1015/0x10f2 are unsupported; with it set, they vanish from the bundle without error. A reimplementer must thread this bool from the bundle-consume loop, or a stream containing those four ops either errors or round-trips inconsistently.
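The two special arms reduce to a short predicate. A minimal sketch, assuming only the opcode values listed above; `route_special` and its string results are hypothetical, chosen to make the drop-versus-error behaviour easy to test.

```python
OPTIONAL_SKIP_OPCODES = frozenset({0x100D, 0x100E, 0x1015, 0x10F2})
NO_OP_OPCODE = 0x264

def route_special(opcode, tolerate_skip):
    """Return 'ok' (no emission), 'error', or None for a non-special opcode."""
    if opcode == NO_OP_OPCODE:
        return "ok"                          # always accepted, nothing emitted
    if opcode in OPTIONAL_SKIP_OPCODES:
        return "ok" if tolerate_skip else "error"   # the bool flips the outcome
    return None                              # handled by one of the other arms
```

A caller that forgets to thread the bool will see the same input stream succeed in one configuration and fail in the other, so the flag belongs in the signature of whatever loop drives the router.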


Bundle Invariance

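ConsumeOneSlotInstruction is a template on the bundle type, and the SC emitter instantiates it three times, once each for the SCS, TAC, and TEC scalar regions. The three instances live at different addresses with separate jump tables, but their classify structure is identical: same base, same bound, same arm distribution.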

| Bundle | Router entry | Jump table | Base | Bound | Arm distribution |
| --- | --- | --- | --- | --- | --- |
| SparseCoreTacBundle | 0x139f1360 | 0xae8dce4 | 0x1f3 | 0xfb2 | 2817/888/92/70/54/49/27/17/4/1 |
| SparseCoreScsBundle | 0x13a50540 | 0xaea9fb4 | 0x1f3 | 0xfb2 | identical |
| SparseCoreTecBundle | 0x13a15500 | 0xaea4ba0 | 0x1f3 | 0xfb2 | identical |

The opcode→arm map is the same across all three; only the bundle template argument on GetStreamSlot<…,Bundle>, GetScalarAluSlot<…,Bundle>, etc. differs. All three bundles carry the dual scalar-ALU sub-slots (GetScalarAluSlotS0/S1 are present and reached in each), so the S0/S1 dual-issue scalar geometry is a property of the SC scalar slot, not of any one engine.

NOTE — the TEC bundle's vector slots are routed elsewhere. ConsumeOneSlotInstruction<SparseCoreTecBundle> handles only the TEC bundle's scalar region (the same Stream/Misc/Alu/DMA slots SCS and TAC have). The TEC's vector slots are dispatched by the separate ConsumeOneTecBundleInstruction (0x13a08e00), described next. A reimplementer must not look for VectorAlu in the OneSlot router; it is not there.
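One way to express "write the router once, parameterize on the bundle" is a small record per instance. The addresses below are the ones tabulated above; the `RouterInstance` class and the `table_index` helper are illustrative, not structures recovered from the binary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouterInstance:
    bundle: str
    entry: int          # router entry address
    jump_table: int     # .rodata jump-table address
    base: int = 0x1F3   # shared by all three instances
    bound: int = 0xFB2  # shared by all three instances

ROUTERS = {
    "tac": RouterInstance("SparseCoreTacBundle", 0x139F1360, 0xAE8DCE4),
    "scs": RouterInstance("SparseCoreScsBundle", 0x13A50540, 0xAEA9FB4),
    "tec": RouterInstance("SparseCoreTecBundle", 0x13A15500, 0xAEA4BA0),
}

def table_index(router, opcode):
    """Jump-table index for an opcode, or None when it is out of bounds."""
    idx = opcode - router.base
    return idx if 0 <= idx <= router.bound else None
```

Only the per-instance addresses vary; the classify parameters are defaults on the record, so a single dispatch routine can serve all three bundles.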


Reaching VectorAlu — the Separate Vector Path

Why the vector dispatch is a different function

The TEC bundle is the only SC bundle with a vector compute region, and its vector slots (VectorAlu0/1/2, VectorLoad, VectorStore, VectorExtended, VectorResult) are routed by ConsumeOneTecBundleInstruction (0x13a08e00), which sits beside ConsumeOneSlotInstruction under the per-MCInst TEC dispatcher. The scalar router and the vector router share the same classify idiom (read DWORD[MCInst], subtract a base, bound-check, indirect-jump through a .rodata table), but they are distinct functions with distinct tables, because the scalar opcodes (table base 0x1f3) and the VectorAlu opcodes (table base 0xb26) form disjoint sets. The two index windows do overlap numerically (0x1f3..0x11a5 for the scalar table, 0xb26..0x10f5 for VectorAlu), so it is the table contents, not the range check, that keep the two sets apart.

ConsumeVectorAluInstruction — the 142-op reach

The VectorAlu slot's consumer is ConsumeVectorAluInstruction<glc::SparseCoreTecBundle> (0x13a0b580), reached from ConsumeOneTecBundleInstruction for any opcode in the VectorAlu block. Its dispatch is the same shape as the scalar router, with the vector base and bound:

// ConsumeVectorAluInstruction   // glc 0x13a0b580
//   args: (printer, mcinst, &vregports /*btree_set<SparsecoreVregReadPort>*/, &proto, &bundle)
function ConsumeVectorAluInstruction(printer, mcinst, vregports, proto, bundle):
    idx = mcinst.opcode - 0xb26                   // jt base 0xb26
    if (unsigned)idx > 0x5cf:                       // bound 0x5cf (1488 entries)
        return Error("Unsupported opcode for Vector Alu slot: $0 : $1", …)
    switch jt[idx]:                                // jt @0xae9d3dc, 143 targets
        case 0xb26: proto.mutable_vector_add_bf16();
                    return EmitVectorBinop<…VectorAddBf16,SparsecoreVregReadPort>(mcinst)
        case 0xc9f: proto.mutable_cosq_f32();
                    return EmitExtendedVectorVxUnop<…CosqF32>(mcinst)
        case 0xe87: GetOperandAndVsEncoding(mcinst, 1);
                    proto.mutable_pack_compressed_b16_to_b8();
                    return EmitPackVectorBinop<…PackCompressedB16ToB8>(mcinst)
        // 7 f32 compares + VectorMove share one oneof-dispatch chain [proto+0x50]
        default:    return Error(…)                 // 1213 opcodes

Each arm calls SparseCoreTecVectorAlu::_internal_mutable_<op>() to select the proto oneof field, then tail-jumps to one of nine Emit* templates. The fifth template parameter SparsecoreVregReadPort (carried as a btree_set argument) is the per-bundle read-port reservation the bundle scheduler must satisfy across the three concurrent lanes. The 142 reachable ops (135 single-op arms plus seven reached through one shared f32-compare/move oneof chain), along with their opcode values, emission templates, and per-generation deltas, are the subject of the TEC Vector Opcode Enumeration page and are not duplicated here.

NOTE — the scalar and vector default-error strings are distinct .rodata literals. The VectorAlu default-error string is "Unsupported opcode for Vector Alu slot: $0 : $1". The scalar OneSlot router uses the "while consuming slot instruction" phrasing; the vector consumer uses "for Vector Alu slot." Both are byte-confirmed against the decompiled bodies.

GOTCHA — scalar and vector dispatch differ in signature, not just table. ConsumeOneSlotInstruction takes (printer, mcinst, &bundle, bool) and returns after filling one scalar slot. ConsumeVectorAluInstruction takes (printer, mcinst, &vregports, &proto, &bundle) — it additionally threads the SparsecoreVregReadPort btree and a pre-selected SparseCoreTecVectorAlu proto, because a vector op binds read ports the scheduler tracks per bundle. A reimplementer cannot reuse the scalar router's calling convention for the vector slots.
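The two bound checks can be compared side by side in a few lines of Python. The bases and bounds are the ones quoted on this page; `opcode_span` and `valu_index` are hypothetical helpers used only to make the arithmetic checkable.

```python
SCALAR_BASE, SCALAR_BOUND = 0x1F3, 0xFB2   # OneSlot router
VALU_BASE, VALU_BOUND = 0xB26, 0x5CF       # ConsumeVectorAluInstruction

def opcode_span(base, bound):
    """Inclusive opcode window covered by a table with this base and bound."""
    return base, base + bound

def valu_index(opcode):
    """VectorAlu jump-table index, or None when outside the window."""
    idx = opcode - VALU_BASE
    return idx if 0 <= idx <= VALU_BOUND else None
```

The windows work out to 0x1f3..0x11a5 and 0xb26..0x10f5, so passing a bound check says nothing about which router owns an opcode; ownership is decided by whether the table entry points at a real arm or at the default-error arm.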


Function Map

| Symbol | Address | Role |
| --- | --- | --- |
| ConsumeOneSlotInstruction<…TacBundle> | 0x139f1360 | the scalar-slot router (this page); jt base 0x1f3, bound 0xfb2 |
| ConsumeOneSlotInstruction<…ScsBundle> | 0x13a50540 | SCS instance; jt 0xaea9fb4, identical arm map |
| ConsumeOneSlotInstruction<…TecBundle> | 0x13a15500 | TEC scalar instance; jt 0xaea4ba0, identical arm map |
| getSlotFlagsFromMCInst | 0x13c798e0 | return *(u32*)(mcinst+0x4); the slot-flag word source |
| OneSlot jump table (TAC) | 0xae8dce4 | 4019×int32 rel offsets; 10 arm targets |
| GetStreamSlot<…,TacBundle> | 0x139fa760 | Stream slot accessor |
| GetScalarMiscSlot<…,TacBundle> | 0x139eeac0 | ScalarMisc slot accessor |
| GetScalarAluSlot<…,TacBundle> | 0x139f0800 | generic ScalarAlu slot accessor (StatusOr) |
| GetScalarAluSlotS0 / …S1 | 0x139f7300 / 0x139f74a0 | dual-issue sub-slot accessors |
| ConsumeStreamInstruction | 0x139fa940 | Stream slot leaf consumer |
| ConsumeScalarMiscInstruction | 0x139eeca0 | ScalarMisc slot leaf consumer |
| ConsumeScalarAluInstruction | 0x139f09c0 | generic ScalarAlu leaf consumer |
| ConsumeScalarAluSlotS0Instruction / …S1 | 0x139f9480 / 0x139f9be0 | dual-issue leaf consumers |
| clear_scalar_instruction | 0x1fb59220 | DMA arm: clears the scalar_instruction oneof |
| Arena::DefaultConstruct<SparseCoreDma> | 0x1fb5a480 | DMA arm: materializes the DMA descriptor |
| DMA variant dispatcher | 0x13a04820 | {SparseCoreDma*, SparseCoreTecDma*} value-visitor |
| MakeErrorImpl<9> | 0x2111e900 | both router error paths |
| ConsumeOneTecBundleInstruction | 0x13a08e00 | the separate TEC vector-slot dispatcher |
| ConsumeVectorAluInstruction<…TecBundle> | 0x13a0b580 | reaches the 142-op VectorAlu table; jt 0xae9d3dc, base 0xb26 |

Error strings (.rodata): "Unsupported opcode while consuming slot instruction: $0 : $1" (0x9e6fbec, default arm) and "Invalid slot. Expected Scalar Slot. MCInst Flags: $0" (0x9fbf02c, multi-scalar no-flag arm). Source file platforms/xla/sparse_core/ghostlite/isa_emitter.cc (0x8762dbb).


Considerations

  • Slot class is opcode-driven; sub-slot is flag-driven. The five fixed-slot arms decode from the opcode alone; the 54-op multi-scalar arm and the DMA arm read the SparseCoreMCSlot flag word (MCInst+0x4). A correct reimplementation needs both the 4019-entry table and the upstream flag-stamping discipline, or multi-slot ops mis-route.
  • The router is the placement seam, not the encoder. It chooses the slot and stamps predication (EmitPredicationToSlot); the per-slot Consume<Slot>Instruction leaf fills the slot body, and the <Slot>Encoder::Encode (BitCopy) writes the absolute bundle bits below that. The byte-level encoding lives on the per-slot pages, not here.
  • DMA preconditions are LogFatal, not status. The DMA arm's oneof-empty check and its SLOT_S0/SLOT_S1 requirement are compiler invariants (LogMessageFatal), so a violation aborts rather than returning an error status. A reimplementer must guarantee these upstream; they are not recoverable at the router.
  • Vector dispatch is a parallel mechanism, not a sub-arm. ConsumeOneSlotInstruction never reaches VectorAlu; the vector slots are routed by ConsumeOneTecBundleInstruction → ConsumeVectorAluInstruction, with a different signature (threading the SparsecoreVregReadPort btree and the SparseCoreTecVectorAlu proto). Treat the two routers as siblings under the TEC bundle dispatcher.
  • The optional-skip bool is an untraced policy input (LOW). Four opcodes flip between drop and error on the consumer's bool. The bit is byte-confirmed; its caller-side meaning is not. Thread it from the bundle-consume loop and treat the four opcodes as conditionally supported.

| Name | Relationship |
| --- | --- |
| ConsumeOneTecBundleInstruction (0x13a08e00) | the per-MCInst TEC dispatcher above both this router and the vector consumer |
| ConsumeVectorAluInstruction (0x13a0b580) | the sibling vector-slot consumer reaching the 142-op VectorAlu table |
| getSlotFlagsFromMCInst (0x13c798e0) | the MCInst+0x4 slot-flag word the multi-scalar and DMA arms sub-route on |
| Get<Slot>Slot / Consume<Slot>Instruction family | the slot accessors and leaf consumers each arm tail-calls |
| EmitPredicationToSlot<…> | stamps the op's predicate guard into the slot before the leaf consumer fills it |

Cross-References

  • TEC (Vector) Engine — the 64-byte bundle, the scalar slot byte/bit bases, and the dual-issue S0/S1 geometry the router places ops into.
  • SCS Engine — the scalar control sequencer; ConsumeOneSlotInstruction<SparseCoreScsBundle> is the same router for the SCS scalar region.
  • TEC Vector Opcode Enumeration — the 142-op VectorAlu roster and its emission templates, reached through ConsumeVectorAluInstruction (linked, not duplicated, here).
  • SCS Scalar Opcode Enumeration — the scalar opcode roster and the setSlotFlagInMCInst discipline that stamps the SparseCoreMCSlot bits this router reads.
  • VectorLoad Slot — a TEC vector slot routed by ConsumeOneTecBundleInstruction, not this scalar router.
  • VectorStore Slot — the tile vector-store + scatter-add slot, likewise a vector-path slot.
  • VectorExtended (VEX) — the scan/sort/dedup vector slot, also reached through the vector dispatcher.
  • SparseCore Overview — the three engine classes, per-generation presence, and the codec-template sequencer enum.
  • Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)