ResultFifo and ArchRegister Enums
Every offset, value, and address on this page was read byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
The TensorCore LLO layer names two flat enumerations that the per-opcode metadata table (opcode_info_big) and the cross-lane-unit (XLU) scheduler reference constantly: ResultFifo, the 25 hardware result FIFOs a bundle slot can push to or pop from, and ArchRegister, the 50-member physical architectural-register numbering that the per-opcode read/written register lists are expressed in. Neither is the same as the virtual LLO register space. ResultFifo is a hardware-resource enum (matmul staging banks, transpose banks, the EUP and cross-lane drains); ArchRegister is an instance-resolved physical slot index, the bottom layer beneath the gen-specific arch-register numbering documented in ArchRegno Numbering.
If you know LLVM, the closest analogy for ArchRegister is MCRegister — a target-physical register number — and for ResultFifo, a fixed pool of architecturally-visible queues like the x87 stack or a SIMD accumulator file, except that here several FIFOs are banked (multiple physical instances selected by a runtime instance index). The enums themselves carry no per-target sizing; the numbering that turns an ArchRegister ordinal into a printable v3 / s12 / vm5 / p2 is built at Target construction (see ArchRegno Numbering). This page pins both enums by name, decodes the banked-vs-single split, and traces the one consumer that reads them together — LloXluGraphOptimizer::ComputeXluOperations, which walks the dependency graph and emits one XLU operation per cross-lane opcode with its read and written ArchRegister sets attached.
For reimplementation, the contract is:
- The 25 ResultFifo ordinals and what each FIFO is, plus the FifoInstance bank arithmetic that maps a (ResultFifo, instance) pair to a flat physical FIFO id, and the per-FIFO per-version depth (ResultFifoEntryCount).
- The RegisterType 5-member enum and the ArchRegister 50-member physical numbering: which 12 ordinals are multi-instance banks, which 38 are single registers, and the (RegisterType, regno) pair each resolves to.
- ComputeXluOperations: the topological walk, the 21-opcode XLU selection, the opcode_info_big +0x08 read-list / +0x14 written-list construction, and the variant<TransposeTile, RpuOperation, XluControlOperation> it emits per op.
| Item | Symbol and address |
|---|---|
| ResultFifo stringifier | ResultFifoToString @ 0x14441340, 25 inline-immediate arms |
| ResultFifo instance resolver | internal::FifoInstance @ 0x14446040, 5 banked arms + pass-through |
| ResultFifo depth | ResultFifoEntryCount @ 0x1d631520, 25 arms, TpuVersion < 6 gate |
| RegisterType stringifiers | RegisterTypeToString @ 0x1d640560, RegisterTypeToMnemonic @ 0x1d640600 |
| ArchRegister instance resolver | internal::ArchRegisterInstance @ 0x126b3240, 12 banked arms + default pass-through |
| Per-opcode metadata | opcode_info_big @ 0x227b5570, LloOpcodeBigInfo[461] (symbol size 0x326c = 12908 B = 461 × 28), 28-byte stride |
| XLU consumer | LloXluGraphOptimizer::ComputeXluOperations @ 0x126d9780 |
RegisterType
RegisterType is the 5-member register-file class enum. Two stringifiers spell it byte-exactly, both as plain switch tables:
- RegisterTypeToString @ 0x1d640560 writes the long name into a small-string buffer (with the length stored at buffer[23]).
- RegisterTypeToMnemonic @ 0x1d640600 writes the one- or two-character prefix that arch-register names are built from.
| Ordinal | ToString | ToMnemonic | Length | Meaning | Opcode count |
|---|---|---|---|---|---|
| 0 | "none" | "" | 4 / 0 | No GPR result | 216 |
| 1 | "pregs" | "p" | 5 / 1 | Predicate register | 11 |
| 2 | "sregs" | "s" | 5 / 1 | Scalar register | 58 |
| 3 | "vmregs" | "vm" | 6 / 2 | Vector-mask register | 19 |
| 4 | "vregs" | "v" | 5 / 1 | Vector register | 157 |
The string immediates are byte-confirmed: case 1 stores the DWORD "preg" then 's', case 2 "sreg"+'s', case 4 "vreg"+'s'; cases 0 and 3 use a strcpy of "none" / "vmregs"; the mnemonic arm uses the two-byte immediates 'p', 's', 'v' and a strcpy("vm"). The "opcode count" column is the per-RegisterType histogram of which class each LLO opcode produces.
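The two stringifiers and the histogram column can be mirrored in a few lines of Python. This is a minimal sketch, not the binary's code: the member names follow the kNone / kPreg / kSreg / kVmreg / kVreg spelling used by the assertion strings on this page, and `format_arch_reg` is a hypothetical helper that only reproduces the mnemonic-plus-number formatting, without the per-Target numbering lookup.

```python
from enum import IntEnum


class RegisterType(IntEnum):
    """Register-file class; ordinals copied from the table above."""
    kNone = 0
    kPreg = 1
    kSreg = 2
    kVmreg = 3
    kVreg = 4


# ordinal -> (long name, mnemonic, opcode count), copied from the table above.
_REGISTER_TYPE_INFO = {
    RegisterType.kNone: ("none", "", 216),
    RegisterType.kPreg: ("pregs", "p", 11),
    RegisterType.kSreg: ("sregs", "s", 58),
    RegisterType.kVmreg: ("vmregs", "vm", 19),
    RegisterType.kVreg: ("vregs", "v", 157),
}


def register_type_to_string(reg_type: int) -> str:
    """Long name, as RegisterTypeToString spells it."""
    return _REGISTER_TYPE_INFO[RegisterType(reg_type)][0]


def register_type_to_mnemonic(reg_type: int) -> str:
    """One- or two-character prefix, as RegisterTypeToMnemonic spells it."""
    return _REGISTER_TYPE_INFO[RegisterType(reg_type)][1]


def format_arch_reg(reg_type: int, regno: int) -> str:
    """Hypothetical helper: '<mnemonic><regno>' formatting only."""
    return f"{register_type_to_mnemonic(reg_type)}{regno}"


def opcode_histogram_total() -> int:
    """Sum of the 'Opcode count' column."""
    return sum(info[2] for info in _REGISTER_TYPE_INFO.values())
```

The histogram total is a useful self-check: the five counts sum to 461, the same number of records as the opcode_info_big table.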
NOTE — the register allocator splits the four allocatable classes into two groups, witnessed by two assertion strings in the binary:
"type == RegisterType::kPreg || type == RegisterType::kVmreg" (the non-spillable predicate/mask group) and "type == RegisterType::kSreg || type == RegisterType::kVreg" (the spillable scalar/vector data group). kNone is never allocated.
ArchRegister and its Physical Numbering
ArchRegister is a 50-member enum (ordinals 1..0x32) that is not a name set. It is a physical numbering: internal::ArchRegisterInstance(ArchRegister, optional<int> instance) @ 0x126b3240 maps an (enum, instance-index) pair to a flat physical arch-register slot. The function is a switch with twelve explicit arms — one per banked ordinal — over the value space 1..0x32; every other ordinal hits the default arm and is returned unchanged.
Banked versus single registers
Twelve ordinals are multi-instance banks — registers that exist in several physical copies and require a runtime instance selector. Their switch arms validate the instance index (*unit_id < count) then return ordinal + instance, so each bank occupies a contiguous physical run [ordinal .. ordinal + count - 1]. The other 38 ordinals fall through to the default arm and return the ordinal directly.
| Banked ordinal | Instance count | Physical base | Slot range |
|---|---|---|---|
| 0x01 | 3 | 1 | 1..3 |
| 0x05 | 3 | 5 | 5..7 |
| 0x0b | 4 | 11 | 11..14 |
| 0x0f | 4 | 15 | 15..18 |
| 0x13 | 4 | 19 | 19..22 |
| 0x17 | 4 | 23 | 23..26 |
| 0x1b | 4 | 27 | 27..30 |
| 0x1f | 4 | 31 | 31..34 |
| 0x26 | 4 | 38 | 38..41 |
| 0x2a | 4 | 42 | 42..45 |
| 0x2e | 4 | 46 | 46..49 |
| 0x32 | 4 | 50 | 50..53 |
The two count-3 banks (0x01, 0x05) assert *unit_id < 3; the ten count-4 banks assert *unit_id < 4. The base always equals the ordinal (return ordinal + instance).
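The bank arithmetic above is small enough to sketch directly. The following Python is an illustration under stated assumptions, not a transcription of the binary: the bank table is copied from this page, while raising `ValueError` for an out-of-range index or for a banked ordinal called without an instance is an assumption of the sketch (the page only records that the arms assert `*unit_id < count`).

```python
from typing import Optional, Tuple

# Banked ordinal -> instance count, copied from the table above.
ARCH_REGISTER_BANKS = {
    0x01: 3, 0x05: 3,
    0x0B: 4, 0x0F: 4, 0x13: 4, 0x17: 4, 0x1B: 4, 0x1F: 4,
    0x26: 4, 0x2A: 4, 0x2E: 4, 0x32: 4,
}


def arch_register_instance(ordinal: int, instance: Optional[int] = None) -> int:
    """Map an (ordinal, instance) pair to a flat physical slot."""
    count = ARCH_REGISTER_BANKS.get(ordinal)
    if count is None:
        return ordinal  # default arm: single register, returned unchanged
    if instance is None:
        # ASSUMPTION: behaviour for a banked ordinal without an instance
        # is not documented on this page; the sketch rejects it.
        raise ValueError(f"banked ordinal {ordinal:#x} needs an instance")
    if not 0 <= instance < count:
        raise ValueError(f"instance {instance} out of range for {ordinal:#x}")
    return ordinal + instance  # base always equals the ordinal


def bank_slot_range(ordinal: int) -> Tuple[int, int]:
    """Inclusive physical slot range occupied by a banked ordinal."""
    return ordinal, ordinal + ARCH_REGISTER_BANKS[ordinal] - 1
```

For example, `arch_register_instance(0x0B, 2)` resolves to slot 13, inside the 11..14 run, and `bank_slot_range(0x32)` gives `(50, 53)`, the last slots below the MRB pseudo-register base.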
No static ArchRegister stringifier (ArchRegister->ToString) exists in the binary, so a per-ordinal symbolic name (loop counter vs sync-flag bank, etc.) is not resolvable from this enum alone.
GOTCHA — there is no static ArchRegister-to-name table. The printable name of any arch register is produced one level up by RegisterNumbering::ToArchRegString @ 0x1275e2a0, which prints "<RegisterTypeToMnemonic(type)><regno>" (e.g. v3, s12, vm5, p2) after resolving the slot through the per-Target numbering table built by Target::InitRegisterNumbering. So "ArchRegister 23" does not have a fixed name: it prints as vN / sN / etc. only once the target's numbering is bound. See ArchRegno Numbering.
The MRB pseudo-register namespace
Above the ~50 real arch registers sits a pseudo-register namespace for matmul-result-buffer (MRB) entries. PseudoArchRegisterFromMrbEntry @ 0x1443c860 returns (mrb_id << 9) + mrb_entry + 0x36 with mrb_entry < 0x200 and mrb_id < 4 — four banks of 512 entries, based at physical slot 0x36, immediately above the real arch registers. These are the FIFO-backed pseudo-register slots the scheduler tracks for result-buffer liveness.
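The slot formula can be checked with a short Python sketch. The arithmetic and bounds come from the paragraph above; the function name is a snake_case rendering of the symbol, and raising `ValueError` on an out-of-range argument is an assumption about the failure mode, which this page does not describe.

```python
MRB_BASE_SLOT = 0x36          # first pseudo-register slot
MRB_ENTRIES_PER_BANK = 0x200  # 512 entries per bank
MRB_BANK_COUNT = 4


def pseudo_arch_register_from_mrb_entry(mrb_id: int, mrb_entry: int) -> int:
    """Flat pseudo-register slot for one matmul-result-buffer entry."""
    if not 0 <= mrb_id < MRB_BANK_COUNT:
        raise ValueError("mrb_id must be < 4")       # assumed failure mode
    if not 0 <= mrb_entry < MRB_ENTRIES_PER_BANK:
        raise ValueError("mrb_entry must be < 0x200")  # assumed failure mode
    return (mrb_id << 9) + mrb_entry + MRB_BASE_SLOT
```

Under these bounds the namespace spans slots 0x36 (bank 0, entry 0) through 2101 (bank 3, entry 0x1ff), so the four banks are contiguous and never collide with the real arch-register slots 1..53.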
ResultFifo
ResultFifo enumerates the 25 hardware result FIFOs (ordinals 0..0x18). The authoritative name source is ResultFifoToString @ 0x14441340, a pure switch where each arm writes the name as inline ASCII immediates and stores the length at buffer[23]. The count is independently cross-confirmed by ResultFifoEntryCount @ 0x1d631520, which has the same 25 valid arms.
| Ordinal | Name | FIFO class |
|---|---|---|
| 0x00–0x03 | kMsrA0..kMsrA3 | Matmul staging-result, bank A, instances 0..3 |
| 0x04–0x07 | kMsrB0..kMsrB3 | Matmul staging-result, bank B, instances 0..3 |
| 0x08–0x0b | kMrf0..kMrf3 | Matmul result FIFO, instances 0..3 |
| 0x0c–0x0e | kTsf0..kTsf2 | Transpose staging FIFO, instances 0..2 |
| 0x0f–0x11 | kTrf0..kTrf2 | Transpose result FIFO, instances 0..2 |
| 0x12 | kErf | EUP result FIFO (transcendental/activation drain) |
| 0x13 | kV2sf | Vector-to-scalar FIFO (vector→scalar bridge) |
| 0x14 | kSfrf | Sync-flag result FIFO (sync-flag read-back) |
| 0x15 | kCrf | Cross-lane result FIFO (XLU permute/reduce result) |
| 0x16 | kDrf | DivRem result FIFO (scalar divide/remainder) |
| 0x17 | kSccf | Cross-core / channel result FIFO |
| 0x18 | kCcrf | Cmem / cross-core result FIFO |
The names are byte-exact: arms 0..7 are strcpy("kMsrA0") … strcpy("kMsrB3"); arms 8..17 store the DWORD prefix ("kMrf", "kTsf", "kTrf") plus a single digit byte; arms 18..24 store "kErf", "kV2sf", "kSfrf", "kCrf", "kDrf", "kSccf", "kCcrf". The class gloss for the 18 banked staging/result FIFOs (0x00..0x11) is anchored by both the name prefix and the FifoInstance bank arithmetic; the kSccf / kCcrf class follows the name prefix.
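For tooling that needs the ordinal-to-name mapping, the table can be rebuilt as a Python `IntEnum`. This is a sketch that only restates the names and ordinals listed above; `result_fifo_to_string` is a hypothetical stand-in for ResultFifoToString and does not model the small-string buffer.

```python
from enum import IntEnum

# Names in ordinal order, copied from the table above.
_RESULT_FIFO_NAMES = (
    [f"kMsrA{i}" for i in range(4)]
    + [f"kMsrB{i}" for i in range(4)]
    + [f"kMrf{i}" for i in range(4)]
    + [f"kTsf{i}" for i in range(3)]
    + [f"kTrf{i}" for i in range(3)]
    + ["kErf", "kV2sf", "kSfrf", "kCrf", "kDrf", "kSccf", "kCcrf"]
)

# Functional enum API; start=0 makes kMsrA0 ordinal 0.
ResultFifo = IntEnum("ResultFifo", _RESULT_FIFO_NAMES, start=0)


def result_fifo_to_string(ordinal: int) -> str:
    """Name for a ResultFifo ordinal; raises ValueError outside 0..0x18."""
    return ResultFifo(ordinal).name
```

Building the enum from the name list keeps the ordinals implicit, so a miscounted bank shows up immediately as a wrong value for a late member such as `ResultFifo.kCrf`.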
QUIRK — a sibling stringifier MsrToString @ 0x1d629720 confirms the two matmul staging-result banks are "msra" / "msrb", matching kMsrA* / kMsrB*. A separate XmrToString @ 0x1d629740 names the matrix-register file "gmra" (gain-matrix register) and "lmr" (latch-matrix register); these are a different register file from ResultFifo, surfaced in Slot: MXU.
FifoInstance — the bank arithmetic
internal::FifoInstance(ResultFifo, optional<int> instance) @ 0x14446040 collapses a (base-FIFO, instance) pair into a flat physical FIFO id. Only the multi-instance staging/result FIFOs have dedicated switch arms; everything else passes through unchanged.
| Base ordinal | FIFO | Instance count | Returns |
|---|---|---|---|
| 0x00 | kMsrA0 | 4 (*unit_id < 4) | instance |
| 0x04 | kMsrB0 | 4 | instance + 4 |
| 0x08 | kMrf0 | 4 | instance + 8 |
| 0x0c | kTsf0 | 3 (*unit_id < 3) | instance + 12 |
| 0x0f | kTrf0 | 3 | instance + 15 |
| (other) | — | — | ordinal (pass-through) |
So the instance-bearing physical domain is the matmul + transpose staging/result banks; the higher ordinals (kTrf1/2 as named enum values, kErf, kV2sf, kSfrf, kCrf, kDrf, kSccf, kCcrf) are single-instance FIFOs that do not pass through the bank arithmetic.
GOTCHA — FifoInstance's cmp edi,0xf is not the enum size: that 0..0xf bound is the physical-instance sub-domain (the banks that take an instance index). ResultFifoToString and ResultFifoEntryCount both show 25 members (0..0x18): the enum is 25 wide, and only the first 16 ordinals participate in FifoInstance arithmetic.
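The FifoInstance table reduces to "base ordinal plus instance" for five base ordinals and pass-through for everything else. The Python below is a hedged sketch of that rule: the bank table is copied from this page, while rejecting a banked base without an instance (and the `ValueError` failure mode generally) is an assumption of the sketch.

```python
from typing import Optional

# Base ordinal -> instance count, copied from the FifoInstance table.
FIFO_BANKS = {
    0x00: 4,  # kMsrA0
    0x04: 4,  # kMsrB0
    0x08: 4,  # kMrf0
    0x0C: 3,  # kTsf0
    0x0F: 3,  # kTrf0
}


def fifo_instance(fifo: int, instance: Optional[int] = None) -> int:
    """Collapse a (base FIFO, instance) pair into a flat physical FIFO id."""
    count = FIFO_BANKS.get(fifo)
    if count is None:
        return fifo  # pass-through arm
    if instance is None:
        # ASSUMPTION: the binary's handling of a missing instance on a
        # banked base is not documented here; the sketch rejects it.
        raise ValueError(f"banked FIFO {fifo:#x} needs an instance")
    if not 0 <= instance < count:
        raise ValueError(f"instance {instance} out of range for {fifo:#x}")
    return fifo + instance
```

Enumerating every valid (base, instance) pair yields exactly the ids 0..17, which matches the 18 named staging/result members of the enum.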
ResultFifoEntryCount — per-FIFO, per-version depth
ResultFifoEntryCount(ResultFifo, TpuVersion) @ 0x1d631520 returns the buffer depth of a FIFO. It is a 25-arm switch over the FIFO ordinal; each arm gates on TpuVersion < 6 (versions 6 and above hit an "invalid platform type" fatal). The depth resolution falls into two patterns:
- Constant depth, independent of version: kTsf0/kTsf1/kTsf2 (cases 12–14) all return 16; kSfrf (case 19) returns 128.
- Version-keyed table: every other FIFO indexes a per-FIFO int[] table by TpuVersion. Several arms share a table (e.g. cases 0–2 read one table, 4–6 another), reflecting FIFO groups whose depths track silicon generation together.
function ResultFifoEntryCount(fifo, version): // sub_1d631520
    switch (fifo):
        case kTsf0..kTsf2: // cases 12-14
            if (version >= 6) Fatal("invalid platform type")
            return 16 // depth fixed across gens
        case kSfrf: // case 19
            if (version >= 6) Fatal("invalid platform type")
            return 128
        case <grouped FIFOs>:
            if (version >= 6) Fatal("invalid platform type")
            return depth_table_for_group[version] // version-indexed int[]
NOTE — the TpuVersion < 6 gate means this build's depth table covers kJellyfish (0) through a version-5 codename; version 6+ is a not-yet-supported platform. The full 25 × TpuVersion depth matrix is not enumerated cell-by-cell here.
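A runnable approximation of the depth lookup is sketched below in Python. Only the two constant-depth groups are filled in, because the version-keyed tables are not enumerated on this page; the `tables` parameter, the `DepthNotEnumerated` exception, and the use of `RuntimeError` for the fatal path are all hypothetical conveniences of the sketch.

```python
from typing import Mapping, Optional, Sequence

# Constant-depth arms documented above: kTsf0..kTsf2 and kSfrf.
CONSTANT_DEPTH = {12: 16, 13: 16, 14: 16, 19: 128}
RESULT_FIFO_COUNT = 25
FIRST_UNSUPPORTED_VERSION = 6


class DepthNotEnumerated(LookupError):
    """Raised when a version-keyed depth table was not supplied."""


def result_fifo_entry_count(
    fifo: int,
    version: int,
    tables: Optional[Mapping[int, Sequence[int]]] = None,
) -> int:
    """Depth of one FIFO for one TpuVersion, as far as this page pins it."""
    if not 0 <= fifo < RESULT_FIFO_COUNT:
        raise ValueError(f"not a ResultFifo ordinal: {fifo}")
    if version >= FIRST_UNSUPPORTED_VERSION:
        # Stands in for the "invalid platform type" fatal.
        raise RuntimeError("invalid platform type")
    if fifo in CONSTANT_DEPTH:
        return CONSTANT_DEPTH[fifo]
    if tables is None or fifo not in tables:
        # The per-group int[] tables must be recovered from the binary.
        raise DepthNotEnumerated(f"no depth table for FIFO {fifo}")
    return tables[fifo][version]
```

Callers who have recovered a group's table can pass it through `tables`, keyed by FIFO ordinal, with one depth per TpuVersion 0..5.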
opcode_info_big — Where the Enums Are Consumed
Both enums are referenced from the per-opcode metadata table opcode_info_big @ 0x227b5570 (LloOpcodeBigInfo[461] — symbol size 0x326c = 12908 B = 461 × 28 — 28-byte stride, indexed by LloOpcode; ComputeXluOperations bounds the index with opcode < 0x1CE and traps otherwise, so the bound admits one index past the 461-entry table). Each record carries three sentinel-terminated int8 lists:
struct LloOpcodeBigInfo { // sizeof 28 (0x1c)
    int8_t result_fifos[8]; // +0x00 : ResultFifo 0..0x18, neg-terminated (forward reader)
    int8_t arch_registers_read[12]; // +0x08 : ArchRegister 1..0x32, neg-terminated (-12 counter reader)
    int8_t arch_registers_written[8]; // +0x14 : ArchRegister 1..0x32, neg-terminated (forward reader)
};
The read list at +0x08 is read by a loop that starts a counter at -12 and indexes record[counter + 0x14], so it sweeps offsets +0x08..+0x13 (12 entries) and stops at the first negative byte. The written list at +0x14 and the result-fifo list at +0x00 are read forward. Each ArchRegister code is resolved to a physical slot through ArchRegisterInstance, with the instance index drawn from the LLO value's metadata word (below). See opcode_info_big for the full descriptor.
NOTE — the +0x08..+0x13 field is the arch_registers_read[12] list. Its three readers (ComputeXluOperations and both GetPseudoArchRegistersRead<…> instantiations) each start a loop counter at -12 (0xfffffffffffffff4) and add it to a +0x14 displacement, which lands the read at +0x08, not +0x14. The consumer named GetPseudoArchRegistersRead confirms the field is the registers-read list.
LloXluGraphOptimizer::ComputeXluOperations
Purpose
ComputeXluOperations @ 0x126d9780 builds the list of cross-lane-unit (XLU) operations plus, for each, the read and written ArchRegister sets. That per-op register dataflow is what the XLU scheduler turns into a cross-XLU dependency graph (see ArchRegno Numbering). This is the single consumer that reads both ArchRegister lists and emits the XLU-operation variant.
Algorithm
function ComputeXluOperations(this): // sub_126d9780
    nodes = LloDependencyGraph::NodesInTopologicalOrder(false) // sub_1442b8c0
    for node in nodes:
        op = WORD[node.value] // LloOpcode
        // dispatch: lea ecx,[op-0x8b]; cmp 0xca; ja skip → JT[op-0x8b]
        if not (0x8b <= op <= 0x155) or JT[op-0x8b] == skip_arm:
            continue // 182 of 203 band opcodes are non-XLU
        // ---- this is one of the 21 XLU opcodes ----
        instance = nullopt
        meta = WORD[node.value + 0xb]
        if (meta & 0x400): // explicit-instance flag
            instance = ((meta >> 8) & 3) | has_value // optional<int>
        // READ set: opcode_info_big +0x08 list, -12 loop, neg-terminated
        record = &opcode_info_big[28 * op]
        for c = -12; c < 0; ++c:
            code = record[c + 0x14] // offsets +0x08..+0x13
            if code < 0: break
            read_set.emplace(ArchRegisterInstance(code, instance)) // InlinedVector<ArchRegister,2>
        // WRITTEN set: forward +0x14 list
        written_set = LloOpcodeArchRegistersWritten(op, instance) // sub_126b2ea0
        // emit one variant per op (stride 0x48, discriminant at +0x40)
        emit variant<TransposeTile, RpuOperation, XluControlOperation>(op, read_set, written_set)
    return StatusOr<vector<variant<...>>>
The dispatch is a jump table at 0xadf5504 indexed by op - 0x8b (203 entries) with only two arms: an XLU arm (21 opcodes) and a skip arm (182 opcodes). Out-of-band opcodes (< 0x8b or > 0x155) skip too.
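The band check and the instance decode are the two pieces of bit arithmetic in the walk, and both are easy to verify in isolation. The Python below is a sketch with hypothetical function names: it emulates the unsigned `cmp 0xca; ja` compare with a 32-bit mask and decodes the optional instance from the metadata word as described in the pseudocode.

```python
from typing import Optional

BAND_BASE = 0x8B        # first opcode covered by the jump table
BAND_LAST_INDEX = 0xCA  # 203 entries: indices 0..0xca


def dispatch_index(opcode: int) -> Optional[int]:
    """Jump-table index for an opcode, or None when it is out of band."""
    # lea ecx,[op-0x8b]; cmp ecx,0xca; ja skip  (unsigned compare)
    index = (opcode - BAND_BASE) & 0xFFFFFFFF
    return index if index <= BAND_LAST_INDEX else None


def decode_instance(meta: int) -> Optional[int]:
    """Instance selector from the metadata word; None models nullopt."""
    if meta & 0x400:             # explicit-instance flag
        return (meta >> 8) & 3   # two-bit instance index
    return None
```

For the worked opcode 0xf7 the index is 0x6c, while 0x8a wraps to a large unsigned value and is skipped, the same way the `ja` branch treats it.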
The 21 XLU opcodes
| Opcode | Name | Variant |
|---|---|---|
| 0x08b | kVectorSetPermutePattern | XluControlOperation |
| 0x08c | kVectorSetSegmentPattern | XluControlOperation |
| 0x0a6 | kVectorTranspose | TransposeTile |
| 0x0a7 | kVectorTransposeBinary | TransposeTile |
| 0x0f5 | kVectorMinReduceF32 | RpuOperation |
| 0x0f6 | kVectorMaxReduceF32 | RpuOperation |
| 0x0f7 | kVectorAddReduceF32 | RpuOperation |
| 0x0f8 | kVectorMaxIndexReduceF32 | RpuOperation |
| 0x0f9 | kVectorMinIndexReduceF32 | RpuOperation |
| 0x0fa | kVectorMaxSegmentReduceF32 | RpuOperation |
| 0x0fb | kVectorMinSegmentReduceF32 | RpuOperation |
| 0x0fc | kVectorAddSegmentReduceF32 | RpuOperation |
| 0x0fd | kVectorMinReduceBf16 | RpuOperation |
| 0x0fe | kVectorMaxReduceBf16 | RpuOperation |
| 0x0ff | kVectorAddReduceBf16 | RpuOperation |
| 0x100 | kVectorMaxIndexReduceBf16 | RpuOperation |
| 0x101 | kVectorMinIndexReduceBf16 | RpuOperation |
| 0x14f | kVectorXlaneResult | XluControlOperation (pops kCrf) |
| 0x150 | kVectorPermuteResult | XluControlOperation |
| 0x154 | kVectorTransposeResult | TransposeTile (pops kTrf*) |
| 0x155 | kVectorTransposeClear | TransposeTile |
The variant assignment is decided at emission time by four byte-exact classifiers (LloOpcodeUsesTranspose, LloOpcodeUsesRpu, LloOpcodeIsRpuControl, LloOpcodeIsRpuResult); the per-opcode → variant cells are confirmed in ArchRegno Numbering. See XLU Op Roster for the opcode→factory mapping at the encode side.
NOTE — the reduce family (0xf5..0x101) is contiguous: F32 min/max/add reduce, F32 max/min index-reduce, F32 max/min/add segment-reduce, then the bf16 mirror (min/max/add reduce + max/min index-reduce). This contiguity is exactly the range LloOpcodeUsesRpu tests with (op - 0xf5) < 0xd.
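The opcode roster and the contiguity claim can be cross-checked mechanically. The Python sketch below restates the 21-row table as a dictionary and models the `(op - 0xf5) < 0xd` range test; `uses_rpu` and `xlu_variant` are hypothetical names, and treating the compare as unsigned 32-bit is an assumption about the classifier.

```python
from typing import Optional

# Opcode -> emitted variant, copied from the 21-row table above.
XLU_VARIANTS = {
    0x08B: "XluControlOperation",
    0x08C: "XluControlOperation",
    0x0A6: "TransposeTile",
    0x0A7: "TransposeTile",
    **{op: "RpuOperation" for op in range(0x0F5, 0x102)},
    0x14F: "XluControlOperation",
    0x150: "XluControlOperation",
    0x154: "TransposeTile",
    0x155: "TransposeTile",
}


def uses_rpu(opcode: int) -> bool:
    """Range test modelled on (op - 0xf5) < 0xd, assumed unsigned."""
    return ((opcode - 0xF5) & 0xFFFFFFFF) < 0xD


def xlu_variant(opcode: int) -> Optional[str]:
    """Variant name for an XLU opcode, or None for a skipped opcode."""
    return XLU_VARIANTS.get(opcode)
```

With this table, every opcode for which `uses_rpu` is true maps to RpuOperation and no other entry does, so the single range compare is sufficient to classify the whole reduce family.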
A worked example — kVectorAddReduceF32 (0xf7)
- Topological walk reaches the node; opcode 0xf7. Dispatch index 0xf7 - 0x8b = 0x6c; the jump table selects the XLU arm.
- Read set: opcode_info_big[0xf7] + 0x08 list, each code resolved via ArchRegisterInstance(code, instance) where instance = (WORD[value+0xb] >> 8) & 3 if bit 0x400 is set, else nullopt. This op reads the vreg holding the reduce source.
- Written set: LloOpcodeArchRegistersWritten(0xf7) reads the +0x14 forward list: the cross-lane-result FIFO (kCrf) backing register the result lands in.
- Emit RpuOperation into the XLU-op vector. The read/written sets feed the cross-XLU dependency tracker, so a later kVectorXlaneResult (0x14f) that reads kCrf is ordered after this reduce that writes it.
Related Components
| Component | Relationship |
|---|---|
| opcode_info_big (record format) | Holds the result_fifos / arch_registers_read / arch_registers_written lists keyed by these enums |
| RegisterNumbering (archregno numbering) | Turns an ArchRegister physical slot into a printable (RegisterType, regno) per target |
| XLU scheduler (archregno numbering) | Consumes the per-op read/written ArchRegister sets ComputeXluOperations builds |
Cross-References
- ArchRegno Numbering: how ArchRegister ordinals become per-gen (RegisterType, regno) arch-register numbers, and the cross-XLU dependency tracker that consumes the read/written sets
- XLU Op Roster: the opcode→factory table for the cross-lane unit ops named here
- Slot: VPU: the vector-processing slot whose ops produce the vregs these FIFOs drain
- Slot: MXU: the matmul slot that fills the kMsr* / kMrf* staging and result FIFOs
- MC Emitter: the machine-code emitter that lays these ops into bundle slots
- opcode_info_big Record Format: the 28-byte per-opcode descriptor the enums are stored in
- Bundle Model: the VLIW bundle these slots are issued from