ResultFifo and ArchRegister Enums

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

The TensorCore LLO layer names two flat enumerations that the per-opcode metadata table (opcode_info_big) and the cross-lane-unit (XLU) scheduler reference constantly: ResultFifo, the 25 hardware result FIFOs a bundle slot can push to or pop from, and ArchRegister, the 50-member physical architectural-register numbering that the per-opcode read/written register lists are expressed in. Neither is the same as the virtual LLO register space. ResultFifo is a hardware-resource enum (matmul staging banks, transpose banks, the EUP and cross-lane drains); ArchRegister is an instance-resolved physical slot index, the bottom layer beneath the gen-specific arch-register numbering documented in ArchRegno Numbering.

If you know LLVM, the closest analogy for ArchRegister is MCRegister — a target-physical register number — and for ResultFifo, a fixed pool of architecturally-visible queues like the x87 stack or a SIMD accumulator file, except that here several FIFOs are banked (multiple physical instances selected by a runtime instance index). The enums themselves carry no per-target sizing; the numbering that turns an ArchRegister ordinal into a printable v3 / s12 / vm5 / p2 is built at Target construction (see ArchRegno Numbering). This page pins both enums by name, decodes the banked-vs-single split, and traces the one consumer that reads them together — LloXluGraphOptimizer::ComputeXluOperations, which walks the dependency graph and emits one XLU operation per cross-lane opcode with its read and written ArchRegister sets attached.

For reimplementation, the contract is:

  • The 25 ResultFifo ordinals and what each FIFO is, plus the FifoInstance bank arithmetic that maps a (ResultFifo, instance) pair to a flat physical FIFO id, and the per-FIFO per-version depth (ResultFifoEntryCount).
  • The RegisterType 5-member enum and the ArchRegister 50-member physical numbering: which 12 ordinals are multi-instance banks, which 38 are single registers, and the (RegisterType, regno) pair each resolves to.
  • ComputeXluOperations: the topological walk, the 21-opcode XLU selection, the opcode_info_big +0x08 read-list / +0x14 written-list construction, and the variant<TransposeTile, RpuOperation, XluControlOperation> it emits per op.

| Item | Symbol @ address | Notes |
|---|---|---|
| ResultFifo stringify | ResultFifoToString @ 0x14441340 | 25 inline-immediate arms |
| ResultFifo instance resolver | internal::FifoInstance @ 0x14446040 | 5 banked arms + pass-through |
| ResultFifo depth | ResultFifoEntryCount @ 0x1d631520 | 25 arms, TpuVersion < 6 gate |
| RegisterType stringifiers | RegisterTypeToString @ 0x1d640560, RegisterTypeToMnemonic @ 0x1d640600 | |
| ArchRegister instance resolver | internal::ArchRegisterInstance @ 0x126b3240 | 12 banked arms + default pass-through |
| Per-opcode metadata | opcode_info_big @ 0x227b5570 | LloOpcodeBigInfo[461] (symbol size 0x326c = 12908 B = 461 × 28), 28-byte stride |
| XLU consumer | LloXluGraphOptimizer::ComputeXluOperations @ 0x126d9780 | |

RegisterType

RegisterType is the 5-member register-file class enum. Two stringifiers spell it byte-exactly, both as plain switch tables:

  • RegisterTypeToString @ 0x1d640560 writes the long name into a small-string buffer (with the length stored at buffer[23]).
  • RegisterTypeToMnemonic @ 0x1d640600 writes the one- or two-character prefix that arch-register names are built from.
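
| Ordinal | ToString | ToMnemonic | Length (name / mnemonic) | Meaning | Opcode count |
|---|---|---|---|---|---|
| 0 | "none" | "" | 4 / 0 | No GPR result | 216 |
| 1 | "pregs" | "p" | 5 / 1 | Predicate register | 11 |
| 2 | "sregs" | "s" | 5 / 1 | Scalar register | 58 |
| 3 | "vmregs" | "vm" | 6 / 2 | Vector-mask register | 19 |
| 4 | "vregs" | "v" | 5 / 1 | Vector register | 157 |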

The string immediates are byte-confirmed: case 1 stores the DWORD "preg" then 's', case 2 "sreg"+'s', case 4 "vreg"+'s'; cases 0 and 3 use a strcpy of "none" / "vmregs"; the mnemonic arm uses the two-byte immediates 'p', 's', 'v' and a strcpy("vm"). The "opcode count" column is the per-RegisterType histogram of which class each LLO opcode produces.
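The table above can be modeled as a small lookup. This is a minimal sketch, not the binary's code: the member names follow the kPreg / kVmreg / kSreg / kVreg / kNone spellings seen in the assertion strings, while the Python function names are invented for illustration.

```python
from enum import IntEnum


class RegisterType(IntEnum):
    # Ordinals as listed in the RegisterType table.
    kNone = 0
    kPreg = 1
    kSreg = 2
    kVmreg = 3
    kVreg = 4


# (long name, mnemonic prefix, opcode count) per ordinal, copied from the table.
_INFO = {
    RegisterType.kNone: ("none", "", 216),
    RegisterType.kPreg: ("pregs", "p", 11),
    RegisterType.kSreg: ("sregs", "s", 58),
    RegisterType.kVmreg: ("vmregs", "vm", 19),
    RegisterType.kVreg: ("vregs", "v", 157),
}


def register_type_to_string(t: int) -> str:
    """Hypothetical stand-in for RegisterTypeToString."""
    return _INFO[RegisterType(t)][0]


def register_type_to_mnemonic(t: int) -> str:
    """Hypothetical stand-in for RegisterTypeToMnemonic."""
    return _INFO[RegisterType(t)][1]


def total_opcodes() -> int:
    """Sum of the per-class opcode histogram."""
    return sum(count for _, _, count in _INFO.values())
```

The histogram sums to 461, which matches the entry count of opcode_info_big.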

NOTE — the register allocator splits these five classes into two groups, witnessed by two assertion strings in the binary: "type == RegisterType::kPreg || type == RegisterType::kVmreg" (the non-spillable predicate/mask group) and "type == RegisterType::kSreg || type == RegisterType::kVreg" (the spillable scalar/vector data group). kNone is never allocated.


ArchRegister and its Physical Numbering

ArchRegister is a 50-member enum (ordinals 1..0x32) that is not a name set. It is a physical numbering: internal::ArchRegisterInstance(ArchRegister, optional<int> instance) @ 0x126b3240 maps an (enum, instance-index) pair to a flat physical arch-register slot. The function is a switch with twelve explicit arms — one per banked ordinal — over the value space 1..0x32; every other ordinal hits the default arm and is returned unchanged.

Banked versus single registers

Twelve ordinals are multi-instance banks — registers that exist in several physical copies and require a runtime instance selector. Their switch arms validate the instance index (*unit_id < count) then return ordinal + instance, so each bank occupies a contiguous physical run [ordinal .. ordinal + count - 1]. The other 38 ordinals fall through to the default arm and return the ordinal directly.

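| Banked ordinal | Instance count | Physical base | Slot range |
|---|---|---|---|
| 0x01 | 3 | 1 | 1..3 |
| 0x05 | 3 | 5 | 5..7 |
| 0x0b | 4 | 11 | 11..14 |
| 0x0f | 4 | 15 | 15..18 |
| 0x13 | 4 | 19 | 19..22 |
| 0x17 | 4 | 23 | 23..26 |
| 0x1b | 4 | 27 | 27..30 |
| 0x1f | 4 | 31 | 31..34 |
| 0x26 | 4 | 38 | 38..41 |
| 0x2a | 4 | 42 | 42..45 |
| 0x2e | 4 | 46 | 46..49 |
| 0x32 | 4 | 50 | 50..53 |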

The two count-3 banks (0x01, 0x05) assert *unit_id < 3; the ten count-4 banks assert *unit_id < 4. The base always equals the ordinal (return ordinal + instance).
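The bank arithmetic can be sketched as follows. The bank table is copied from above; the function name is a Python stand-in, and raising an error when a banked ordinal arrives without an instance is an assumption, since the page does not say what the binary does in that case.

```python
from typing import Optional, Tuple

# Banked ordinal -> instance count, from the banked-ordinal table.
BANK_COUNT = {
    0x01: 3, 0x05: 3,
    0x0B: 4, 0x0F: 4, 0x13: 4, 0x17: 4, 0x1B: 4, 0x1F: 4,
    0x26: 4, 0x2A: 4, 0x2E: 4, 0x32: 4,
}


def arch_register_instance(ordinal: int, instance: Optional[int] = None) -> int:
    """Map (ArchRegister ordinal, instance) to a flat physical slot."""
    count = BANK_COUNT.get(ordinal)
    if count is None:
        return ordinal  # default arm: returned unchanged
    if instance is None:
        # Assumption: a banked arm needs an instance; real behavior not documented here.
        raise ValueError("banked ordinal requires an instance index")
    if not 0 <= instance < count:
        raise ValueError("instance index out of range for this bank")
    return ordinal + instance  # base always equals the ordinal


def slot_range(ordinal: int) -> Tuple[int, int]:
    """Inclusive physical slot range occupied by a banked ordinal."""
    return ordinal, ordinal + BANK_COUNT[ordinal] - 1
```

For example, arch_register_instance(0x0B, 3) lands on slot 14, the top of the 11..14 run, while an unbanked ordinal such as 0x09 comes back as 9.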

Because the binary contains no ArchRegister stringifier, a per-ordinal symbolic name (loop counter vs. sync-flag bank, etc.) cannot be resolved from this enum alone.

GOTCHA — there is no static ArchRegister-to-name table. The printable name of any arch register is produced one level up by RegisterNumbering::ToArchRegString @ 0x1275e2a0, which prints "<RegisterTypeToMnemonic(type)><regno>" (e.g. v3, s12, vm5, p2) after resolving the slot through the per-Target numbering table built by Target::InitRegisterNumbering. So "ArchRegister 23" does not have a fixed name — it prints as vN / sN / etc. only once the target's numbering is bound. See ArchRegno Numbering.
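The printing rule is simple once a (RegisterType, regno) pair is in hand. The sketch below shows only the formatting step; the per-Target numbering lookup that produces the pair is not modeled, and the function name is a hypothetical Python stand-in.

```python
# Mnemonic prefixes per RegisterType ordinal, as spelled by RegisterTypeToMnemonic.
MNEMONIC = {0: "", 1: "p", 2: "s", 3: "vm", 4: "v"}


def to_arch_reg_string(register_type: int, regno: int) -> str:
    """Format "<mnemonic><regno>" for an already-resolved register."""
    return f"{MNEMONIC[register_type]}{regno}"
```

With this, (4, 3) prints as v3, (2, 12) as s12, (3, 5) as vm5 and (1, 2) as p2, matching the examples in the note.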

The MRB pseudo-register namespace

Above the ~50 real arch registers sits a pseudo-register namespace for matmul-result-buffer (MRB) entries. PseudoArchRegisterFromMrbEntry @ 0x1443c860 returns (mrb_id << 9) + mrb_entry + 0x36 with mrb_entry < 0x200 and mrb_id < 4 — four banks of 512 entries, based at physical slot 0x36, immediately above the real arch registers. These are the FIFO-backed pseudo-register slots the scheduler tracks for result-buffer liveness.
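The formula is easy to check numerically. This is a sketch of the arithmetic only; the Python name and the use of ValueError for the range checks are illustrative choices, not the binary's behavior.

```python
MRB_BASE = 0x36        # first pseudo-register slot
MRB_BANKS = 4          # mrb_id < 4
MRB_ENTRIES = 0x200    # mrb_entry < 0x200 (512 per bank)


def pseudo_arch_register_from_mrb_entry(mrb_id: int, mrb_entry: int) -> int:
    """Compute (mrb_id << 9) + mrb_entry + 0x36 with the stated bounds."""
    if not 0 <= mrb_id < MRB_BANKS:
        raise ValueError("mrb_id must be in 0..3")
    if not 0 <= mrb_entry < MRB_ENTRIES:
        raise ValueError("mrb_entry must be in 0..0x1ff")
    return (mrb_id << 9) + mrb_entry + MRB_BASE
```

The first pseudo-register is slot 54 (0x36), one past the top of the 50..53 bank, and the last is 2101 (0x835), so the namespace spans 2048 slots.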


ResultFifo

ResultFifo enumerates the 25 hardware result FIFOs (ordinals 0..0x18). The authoritative name source is ResultFifoToString @ 0x14441340, a pure switch where each arm writes the name as inline ASCII immediates and stores the length at buffer[23]. The count is independently cross-confirmed by ResultFifoEntryCount @ 0x1d631520, which has the same 25 valid arms.

| Ordinal | Name | FIFO class |
|---|---|---|
| 0x00–0x03 | kMsrA0..kMsrA3 | Matmul staging-result, bank A, instances 0..3 |
| 0x04–0x07 | kMsrB0..kMsrB3 | Matmul staging-result, bank B, instances 0..3 |
| 0x08–0x0b | kMrf0..kMrf3 | Matmul result FIFO, instances 0..3 |
| 0x0c–0x0e | kTsf0..kTsf2 | Transpose staging FIFO, instances 0..2 |
| 0x0f–0x11 | kTrf0..kTrf2 | Transpose result FIFO, instances 0..2 |
| 0x12 | kErf | EUP result FIFO (transcendental/activation drain) |
| 0x13 | kV2sf | Vector-to-scalar FIFO (vector→scalar bridge) |
| 0x14 | kSfrf | Sync-flag result FIFO (sync-flag read-back) |
| 0x15 | kCrf | Cross-lane result FIFO (XLU permute/reduce result) |
| 0x16 | kDrf | DivRem result FIFO (scalar divide/remainder) |
| 0x17 | kSccf | Cross-core / channel result FIFO |
| 0x18 | kCcrf | Cmem / cross-core result FIFO |

The names are byte-exact: arms 0..7 are strcpy("kMsrA0") through strcpy("kMsrB3"); arms 8..17 store the DWORD prefix ("kMrf", "kTsf", "kTrf") plus a single digit byte; arms 18..24 store "kErf", "kV2sf", "kSfrf", "kCrf", "kDrf", "kSccf", "kCcrf". The class gloss for the 18 staging/result FIFOs (ordinals 0x00..0x11) is anchored by both the name prefix and the FifoInstance bank arithmetic; the kSccf / kCcrf class follows the name prefix alone.
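The ordinal-to-name mapping can be regenerated from the prefixes and instance counts above. This is a sketch with an invented Python function name; it reproduces the names, not the switch-with-immediates encoding.

```python
# Banked prefixes with their instance counts, in ordinal order.
_BANKED = [("kMsrA", 4), ("kMsrB", 4), ("kMrf", 4), ("kTsf", 3), ("kTrf", 3)]
_SINGLE = ["kErf", "kV2sf", "kSfrf", "kCrf", "kDrf", "kSccf", "kCcrf"]

RESULT_FIFO_NAMES = [
    f"{prefix}{i}" for prefix, count in _BANKED for i in range(count)
] + _SINGLE


def result_fifo_to_string(ordinal: int) -> str:
    """Hypothetical stand-in for ResultFifoToString (ordinals 0..0x18)."""
    if not 0 <= ordinal < len(RESULT_FIFO_NAMES):
        raise ValueError("ResultFifo ordinal out of range")
    return RESULT_FIFO_NAMES[ordinal]
```

The list has 25 entries, with kErf at 0x12 and kCcrf at 0x18, in agreement with the table.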

QUIRK — a sibling stringifier MsrToString @ 0x1d629720 confirms the two matmul staging-result banks are "msra" / "msrb", matching kMsrA* / kMsrB*. A separate XmrToString @ 0x1d629740 names the matrix-register file "gmra" (gain-matrix register) and "lmr" (latch-matrix register) — these are a different register file from ResultFifo, surfaced in Slot: MXU.

FifoInstance — the bank arithmetic

internal::FifoInstance(ResultFifo, optional<int> instance) @ 0x14446040 collapses a (base-FIFO, instance) pair into a flat physical FIFO id. Only the multi-instance staging/result FIFOs have dedicated switch arms; everything else passes through unchanged.

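| Base ordinal | FIFO | Instance count | Returns |
|---|---|---|---|
| 0x00 | kMsrA0 | 4 (*unit_id < 4) | instance |
| 0x04 | kMsrB0 | 4 | instance + 4 |
| 0x08 | kMrf0 | 4 | instance + 8 |
| 0x0c | kTsf0 | 3 (*unit_id < 3) | instance + 12 |
| 0x0f | kTrf0 | 3 | instance + 15 |
| (other) | | | ordinal (pass-through) |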

So the instance-bearing physical domain is the matmul + transpose staging/result banks. The higher ordinals (0x10 and up: kTrf1/kTrf2 when passed as named enum values, then the single-instance FIFOs kErf, kV2sf, kSfrf, kCrf, kDrf, kSccf, kCcrf) have no switch arm and are returned unchanged, bypassing the bank arithmetic.
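A compact model of the five banked arms plus pass-through is given below. The bank table comes from the table above; the Python name is a stand-in, and rejecting a banked base with no instance is an assumption rather than documented behavior.

```python
from typing import Optional

# Base ordinal -> instance count for the five banked arms.
FIFO_BANKS = {0x00: 4, 0x04: 4, 0x08: 4, 0x0C: 3, 0x0F: 3}


def fifo_instance(fifo: int, instance: Optional[int] = None) -> int:
    """Collapse (base FIFO, instance) into a flat physical FIFO id."""
    count = FIFO_BANKS.get(fifo)
    if count is None:
        return fifo  # no dedicated arm: pass through unchanged
    if instance is None:
        # Assumption: a banked base needs an instance index.
        raise ValueError("banked FIFO requires an instance index")
    if not 0 <= instance < count:
        raise ValueError("instance index out of range for this bank")
    return fifo + instance
```

So (kTsf0, 2) resolves to 14 (kTsf2) and (kTrf0, 2) to 17 (kTrf2), while kCrf (0x15) is returned as 0x15.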

GOTCHA — FifoInstance's cmp edi,0xf is not the enum size: that 0..0xf bound is the physical-instance sub-domain (the banks that take an instance index). ResultFifoToString and ResultFifoEntryCount both show 25 members (0..0x18) — the enum is 25 wide, and only the first 16 ordinals participate in FifoInstance arithmetic.

ResultFifoEntryCount — per-FIFO, per-version depth

ResultFifoEntryCount(ResultFifo, TpuVersion) @ 0x1d631520 returns the buffer depth of a FIFO. It is a 25-arm switch over the FIFO ordinal; each arm gates on TpuVersion < 6 (versions 6 and above hit an "invalid platform type" fatal). The depth resolution falls into two patterns:

  • Constant depth, independent of version: kTsf0/kTsf1/kTsf2 (cases 12–14) all return 16; kSfrf (case 19) returns 128.
  • Version-keyed table: every other FIFO indexes a per-FIFO int[] table by TpuVersion. Several arms share a table (e.g. cases 0–2 read one table, 4–6 another), reflecting FIFO groups whose depths track silicon generation together.
function ResultFifoEntryCount(fifo, version):        // sub_1d631520
    switch (fifo):
        case kTsf0..kTsf2:                           // cases 12-14
            if (version >= 6) Fatal("invalid platform type")
            return 16                                // depth fixed across gens
        case kSfrf:                                  // case 19
            if (version >= 6) Fatal("invalid platform type")
            return 128
        case <grouped FIFOs>:
            if (version >= 6) Fatal("invalid platform type")
            return depth_table_for_group[version]    // version-indexed int[]

NOTE — the TpuVersion < 6 gate means this build's depth table covers kJellyfish(0) through a version-5 codename; version 6+ is a not-yet-supported platform. The full 25 × TpuVersion depth matrix is not enumerated cell-by-cell here.
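The two depth patterns can be sketched with the version-keyed tables supplied by the caller, since their contents are not enumerated on this page. The FIFOs are keyed by name here; the function name, the depth_tables parameter and the ValueError in place of the fatal are all illustrative assumptions.

```python
from typing import Dict, Sequence

# Depths stated above as independent of version.
CONSTANT_DEPTH = {"kTsf0": 16, "kTsf1": 16, "kTsf2": 16, "kSfrf": 128}

MAX_VERSIONS = 6  # the TpuVersion < 6 gate


def result_fifo_entry_count(
    fifo_name: str,
    version: int,
    depth_tables: Dict[str, Sequence[int]],
) -> int:
    """Return the FIFO depth; `depth_tables` is caller-provided, not recovered data."""
    if not 0 <= version < MAX_VERSIONS:
        # Stand-in for the "invalid platform type" fatal.
        raise ValueError("invalid platform type")
    if fifo_name in CONSTANT_DEPTH:
        return CONSTANT_DEPTH[fifo_name]
    return depth_tables[fifo_name][version]  # version-indexed int[] per FIFO group
```

Passing the tables in keeps the sketch honest: a reimplementation would fill them from the binary rather than from this page.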


opcode_info_big — Where the Enums Are Consumed

Both enums are referenced from the per-opcode metadata table opcode_info_big @ 0x227b5570, an LloOpcodeBigInfo[461] array with a 28-byte stride (symbol size 0x326c = 12908 B = 461 × 28), indexed by LloOpcode. ComputeXluOperations bounds the index with opcode < 0x1CE (462) and traps otherwise, so the bound admits one index past the 461-entry table. Each record carries three sentinel-terminated int8 lists:

struct LloOpcodeBigInfo {              // sizeof 28 (0x1c)
    int8_t result_fifos[8];            // +0x00 : ResultFifo 0..0x18, neg-terminated (forward reader)
    int8_t arch_registers_read[12];    // +0x08 : ArchRegister 1..0x32, neg-terminated (-12 counter reader)
    int8_t arch_registers_written[8];  // +0x14 : ArchRegister 1..0x32, neg-terminated (forward reader)
};

The read list at +0x08 is read by a loop that starts a counter at -12 and indexes record[counter + 0x14], so it sweeps offsets +0x08..+0x13 (12 entries) and stops at the first negative byte. The written list at +0x14 and the result-fifo list at +0x00 are read forward. Each ArchRegister code is resolved to a physical slot through ArchRegisterInstance, with the instance index drawn from the LLO value's metadata word (below). See opcode_info_big for the full descriptor.

NOTE — the +0x08..+0x13 field is the arch_registers_read[12] list. Its three readers — ComputeXluOperations and both GetPseudoArchRegistersRead<…> instantiations — each start a loop counter at -12 (0xfffffffffffffff4) and add it to a +0x14 displacement, which lands the read at +0x08, not +0x14. The consumer named GetPseudoArchRegistersRead confirms the field is the registers-read list.
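Decoding one record makes the offsets concrete. The sketch below follows the field layout and the -12 counter loop described above; the function names are invented, the example bytes are hypothetical, and capping the forward lists at their 8-byte field size is an assumption.

```python
import struct

RECORD_SIZE = 28      # 0x1c
TABLE_ENTRIES = 461


def _forward_list(fields, start, length):
    """Read a forward list, stopping at the first negative byte."""
    out = []
    for code in fields[start:start + length]:  # assumption: bounded by field size
        if code < 0:
            break
        out.append(code)
    return out


def decode_record(record: bytes) -> dict:
    """Split a 28-byte LloOpcodeBigInfo record into its three lists."""
    if len(record) != RECORD_SIZE:
        raise ValueError("expected a 28-byte record")
    fields = struct.unpack("28b", record)  # 28 signed bytes

    # Read list: the counter runs -12..-1 and is added to a +0x14
    # displacement, so the bytes touched are +0x08..+0x13.
    read = []
    for counter in range(-12, 0):
        code = fields[counter + 0x14]
        if code < 0:
            break
        read.append(code)

    return {
        "result_fifos": _forward_list(fields, 0x00, 8),
        "arch_registers_read": read,
        "arch_registers_written": _forward_list(fields, 0x14, 8),
    }


# Hypothetical record contents, invented purely to exercise the decoder.
example = struct.pack(
    "28b",
    *([0x15, -1] + [0] * 6 + [0x0B, 0x13, -1] + [0] * 9 + [0x1F, -1] + [0] * 6),
)
```

Note that RECORD_SIZE * TABLE_ENTRIES equals 0x326c, the symbol size quoted above.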


LloXluGraphOptimizer::ComputeXluOperations

Purpose

ComputeXluOperations @ 0x126d9780 builds the list of cross-lane-unit (XLU) operations plus, for each, the read and written ArchRegister sets. That per-op register dataflow is what the XLU scheduler turns into a cross-XLU dependency graph (see ArchRegno Numbering). This is the single consumer that reads both ArchRegister lists and emits the XLU-operation variant.

Algorithm

function ComputeXluOperations(this):                 // sub_126d9780
    nodes = LloDependencyGraph::NodesInTopologicalOrder(false)  // sub_1442b8c0
    for node in nodes:
        op = WORD[node.value]                         // LloOpcode
        // dispatch: lea ecx,[op-0x8b]; cmp 0xca; ja skip → JT[op-0x8b]
        if not (0x8b <= op <= 0x155) or JT[op-0x8b] == skip_arm:
            continue                                  // 182 of 203 band opcodes are non-XLU
        // ---- this is one of the 21 XLU opcodes ----
        instance = nullopt
        meta = WORD[node.value + 0xb]
        if (meta & 0x400):                            // explicit-instance flag
            instance = ((meta >> 8) & 3) | has_value  // optional<int>

        // READ set: opcode_info_big +0x08 list, -12 loop, neg-terminated
        record = &opcode_info_big[op]                 // LloOpcodeBigInfo[461], 28-byte stride
        for c = -12; c < 0; ++c:
            code = record[c + 0x14]                   // offsets +0x08..+0x13
            if code < 0: break
            read_set.emplace(ArchRegisterInstance(code, instance))   // InlinedVector<ArchRegister,2>

        // WRITTEN set: forward +0x14 list
        written_set = LloOpcodeArchRegistersWritten(op, instance)    // sub_126b2ea0

        // emit one variant per op (stride 0x48, discriminant at +0x40)
        emit variant<TransposeTile, RpuOperation, XluControlOperation>(op, read_set, written_set)
    return StatusOr<vector<variant<...>>>

The dispatch is a jump table at 0xadf5504 indexed by op - 0x8b (203 entries) with only two arms: an XLU arm (21 opcodes) and a skip arm (182 opcodes). Out-of-band opcodes (< 0x8b or > 0x155) skip too.
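The band check and two-arm table can be modeled directly. This sketch rebuilds the table from the 21 opcodes listed in the next subsection; the names are Python stand-ins, and the 32-bit mask imitates the unsigned compare implied by the lea / cmp / ja sequence.

```python
BAND_LO = 0x8B
BAND_SPAN = 0xCA  # 0x155 - 0x8b, the cmp bound

XLU_OPCODES = frozenset(
    [0x08B, 0x08C, 0x0A6, 0x0A7, 0x14F, 0x150, 0x154, 0x155]
    + list(range(0x0F5, 0x102))  # contiguous reduce family 0xf5..0x101
)

# One entry per band opcode: "xlu" or "skip".
JUMP_TABLE = [
    "xlu" if BAND_LO + i in XLU_OPCODES else "skip"
    for i in range(BAND_SPAN + 1)
]


def is_xlu_opcode(op: int) -> bool:
    """Return True when the dispatch would select the XLU arm."""
    index = (op - BAND_LO) & 0xFFFFFFFF  # unsigned wrap, so op < 0x8b becomes huge
    if index > BAND_SPAN:
        return False  # out-of-band opcodes skip
    return JUMP_TABLE[index] == "xlu"
```

The rebuilt table has 203 entries, 21 of them XLU and 182 skip, consistent with the counts above.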

The 21 XLU opcodes

| Opcode | Name | Variant |
|---|---|---|
| 0x08b | kVectorSetPermutePattern | XluControlOperation |
| 0x08c | kVectorSetSegmentPattern | XluControlOperation |
| 0x0a6 | kVectorTranspose | TransposeTile |
| 0x0a7 | kVectorTransposeBinary | TransposeTile |
| 0x0f5 | kVectorMinReduceF32 | RpuOperation |
| 0x0f6 | kVectorMaxReduceF32 | RpuOperation |
| 0x0f7 | kVectorAddReduceF32 | RpuOperation |
| 0x0f8 | kVectorMaxIndexReduceF32 | RpuOperation |
| 0x0f9 | kVectorMinIndexReduceF32 | RpuOperation |
| 0x0fa | kVectorMaxSegmentReduceF32 | RpuOperation |
| 0x0fb | kVectorMinSegmentReduceF32 | RpuOperation |
| 0x0fc | kVectorAddSegmentReduceF32 | RpuOperation |
| 0x0fd | kVectorMinReduceBf16 | RpuOperation |
| 0x0fe | kVectorMaxReduceBf16 | RpuOperation |
| 0x0ff | kVectorAddReduceBf16 | RpuOperation |
| 0x100 | kVectorMaxIndexReduceBf16 | RpuOperation |
| 0x101 | kVectorMinIndexReduceBf16 | RpuOperation |
| 0x14f | kVectorXlaneResult | XluControlOperation (pops kCrf) |
| 0x150 | kVectorPermuteResult | XluControlOperation |
| 0x154 | kVectorTransposeResult | TransposeTile (pops kTrf*) |
| 0x155 | kVectorTransposeClear | TransposeTile |

The variant assignment is decided at emission time by four byte-exact classifiers (LloOpcodeUsesTranspose, LloOpcodeUsesRpu, LloOpcodeIsRpuControl, LloOpcodeIsRpuResult); the per-opcode → variant cells are confirmed in ArchRegno Numbering. See XLU Op Roster for the opcode→factory mapping at the encode side.

NOTE — the reduce family (0xf5..0x101) is contiguous: F32 min/max/add reduce, F32 max/min index-reduce, F32 max/min/add segment-reduce, then the bf16 mirror (min/max/add reduce + max/min index-reduce). This contiguity is exactly the range LloOpcodeUsesRpu tests with (op - 0xf5) < 0xd.
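The range test and the variant column can be combined into a small classifier. Only the (op - 0xf5) < 0xd test is taken from the binary; the other two sets are read off the opcode table, and the internals of the remaining classifiers are not reproduced here.

```python
from typing import Optional

# Opcode sets taken from the "21 XLU opcodes" table.
TRANSPOSE_TILE = frozenset([0x0A6, 0x0A7, 0x154, 0x155])
XLU_CONTROL = frozenset([0x08B, 0x08C, 0x14F, 0x150])


def uses_rpu(op: int) -> bool:
    """Unsigned range test: true for 0xf5..0x101 inclusive."""
    return ((op - 0xF5) & 0xFFFFFFFF) < 0xD


def variant_for(op: int) -> Optional[str]:
    """Name the variant an XLU opcode maps to, or None for non-XLU opcodes."""
    if uses_rpu(op):
        return "RpuOperation"
    if op in TRANSPOSE_TILE:
        return "TransposeTile"
    if op in XLU_CONTROL:
        return "XluControlOperation"
    return None
```

The subtraction-then-compare idiom covers both range ends with one unsigned comparison, which is why opcodes below 0xf5 fail the test as well as those above 0x101.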

A worked example — kVectorAddReduceF32 (0xf7)

  1. Topological walk reaches the node; opcode 0xf7. Dispatch index 0xf7 - 0x8b = 0x6c; the jump table selects the XLU arm.
  2. Read set: opcode_info_big[0xf7] + 0x08 list, each code resolved via ArchRegisterInstance(code, instance) where instance = (WORD[value+0xb] >> 8) & 3 if bit 0x400 is set, else nullopt. This op reads the vreg holding the reduce source.
  3. Written set: LloOpcodeArchRegistersWritten(0xf7, instance) reads the +0x14 forward list: the cross-lane-result FIFO (kCrf) backing register the result lands in.
  4. Emit RpuOperation into the XLU-op vector. The read/written sets feed the cross-XLU dependency tracker, so a later kVectorXlaneResult (0x14f) that reads kCrf is ordered after this reduce that writes it.
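The arithmetic in steps 1 and 2 can be checked with a few lines. This is a sketch of the dispatch index and the instance decode only; the function names are invented, and the meta word is passed in as a plain integer rather than read from a node.

```python
from typing import Optional


def dispatch_index(op: int) -> int:
    """Jump-table index for an in-band opcode (op - 0x8b)."""
    return op - 0x8B


def decode_instance(meta: int) -> Optional[int]:
    """Decode the optional instance from the value's metadata word.

    Bit 0x400 is the explicit-instance flag; bits 8..9 hold the index.
    """
    if meta & 0x400:
        return (meta >> 8) & 3
    return None
```

For kVectorAddReduceF32, dispatch_index(0xF7) is 0x6C. A meta word of 0x0600 decodes to instance 2, while 0x0300 has the flag clear and yields no instance.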

| Component | Relationship |
|---|---|
| opcode_info_big (record format) | Holds the result_fifos / arch_registers_read / arch_registers_written lists keyed by these enums |
| RegisterNumbering (archregno numbering) | Turns an ArchRegister physical slot into a printable (RegisterType, regno) per target |
| XLU scheduler (archregno numbering) | Consumes the per-op read/written ArchRegister sets ComputeXluOperations builds |

Cross-References

  • ArchRegno Numbering — how ArchRegister ordinals become per-gen (RegisterType, regno) arch-register numbers, and the cross-XLU dependency tracker that consumes the read/written sets
  • XLU Op Roster — the opcode→factory table for the cross-lane unit ops named here
  • Slot: VPU — the vector-processing slot whose ops produce the vregs these FIFOs drain
  • Slot: MXU — the matmul slot that fills the kMsr* / kMrf* staging and result FIFOs
  • MC Emitter — the machine-code emitter that lays these ops into bundle slots
  • opcode_info_big Record Format — the 28-byte per-opcode descriptor the enums are stored in
  • Bundle Model — the VLIW bundle these slots are issued from