ArchRegno Numbering
Every offset, value, and address on this page was read byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
The TensorCore encoder needs a single dense integer per architectural register at the moment it emits a bundle slot, but the LLO IR carries virtual registers typed by RegisterType (preg / sreg / vmreg / vreg) plus a regno within that class. The bridge is RegisterNumbering: a per-Target object that assigns every (RegisterType, regno) a sequential arch-register number (arch_regno), keeping both a forward map ((type, regno) → arch_regno, read by ToRegNum) and a reverse vector (arch_regno → (type, regno), read by ToArchRegno). Because the register-file sizes differ per silicon generation, the numbering is rebuilt at every Target construction from four generation-specific count fields. This page documents that build — Target::InitRegisterNumbering and RegisterNumbering::Init / AddRegister — and the gen-specific count inputs that shape it.
If you know LLVM, the closest analogy is a TargetRegisterInfo that assigns MCRegister numbers, except the layout here is not generated by TableGen — it is computed at runtime by prefix-summing four per-class register counts into a contiguous arch-register space, in a fixed class order, with arch_regno 0 reserved as a null sentinel. There are three of these RegisterNumbering objects per Target, one per sequencer, each sized from a different subset of the count fields. The physical layer below this numbering — the 50-member ArchRegister enum that ArchRegisterInstance resolves — is documented in ResultFifo and ArchRegister Enums; this page is the layer above it.
For reimplementation, the contract is:
- The three-RegisterNumbering-objects-per-Target layout and the four count source fields each consumes, plus the per-class name strings drawn from the TargetEnvironment.
- RegisterNumbering::Init: prefix-sum the four class counts into total_registers_, reserve arch_regno 0 as kNoRegister, then assign sequential arch_regnos in kPreg → kSreg → kVmreg → kVreg order, building both maps.
- ToArchRegno / ToRegNum / ToArchRegString: the read paths, their bounds, and the (type, regno) print format.
- The per-opcode variant classification (LloOpcodeUsesTranspose / UsesRpu / IsRpuControl / IsRpuResult) and how the per-op ArchRegister read/written sets feed the CrossXluOperationsDataDependencyTracker.
| Build entry | Target::InitRegisterNumbering @ 0x1d614200 |
| Per-class numberer | RegisterNumbering::Init @ 0x1d622520 |
| Register assigner | RegisterNumbering::AddRegister @ 0x1d622bc0 |
| Reverse read | RegisterNumbering::ToArchRegno @ 0x1275f580 ([this+0x80][idx]) |
| Forward read | RegisterNumbering::ToRegNum @ 0x1d5a9000 ([this+0x98] map) |
| Name print | RegisterNumbering::ToArchRegString @ 0x1275e2a0 |
| Class order | kAllocatableRegisterTypes @ 0xadf7f54 = {kPreg=1, kSreg=2, kVmreg=3, kVreg=4} |
| RegisterNumbering size | 0x130 bytes |
| Source file | platforms/xla/service/jellyfish/register_numbering.{h,cc} |
Target::InitRegisterNumbering — the Build
Purpose
InitRegisterNumbering @ 0x1d614200 is called once per Target to populate three RegisterNumbering objects embedded in the Target. Each captures one sequencer's register file. The object offsets and stride are byte-exact:
| Map | RegisterNumbering this | Sequencer role | Classes present |
|---|---|---|---|
| 1 | Target + 0x008 | Primary / TensorCore | all four (preg, sreg, vmreg, vreg) |
| 2 | Target + 0x138 | non-primary (preg+sreg only) | preg, sreg |
| 3 | Target + 0x268 | non-primary (vmreg+vreg only) | vmreg, vreg |
The stride 0x130 is the RegisterNumbering size (0x138 - 0x008 = 0x268 - 0x138 = 0x130). The decompile confirms three RegisterNumbering::Init calls at this+8, this+312 (0x138), and this+616 (0x268).
Algorithm
Each map is a freshly-built absl::flat_hash_map<RegisterType, pair<int count, string name>> with four entries (keys 1..4), handed to RegisterNumbering::Init. The per-class count comes from the Target register-count field block; the per-class name comes from the TargetEnvironment.
function InitRegisterNumbering(target): // sub_1d614200
env = target[0x940] // TargetEnvironment
// four per-class name strings, copied from the environment:
name_preg = env[0x18]; name_sreg = env[0x20]
name_vmreg = env[0x28]; name_vreg = env[0x30]
// ---- Map 1: primary / TensorCore sequencer (all 4 classes) ----
map = { kPreg: (target[0x4a4], name_preg), // *((DWORD*)target+297)
kSreg: (target[0x498], name_sreg), // *((DWORD*)target+294)
kVmreg: (target[0x4a0], name_vmreg), // *((DWORD*)target+296)
kVreg: (target[0x49c], name_vreg) } // *((DWORD*)target+295)
RegisterNumbering::Init(target + 0x008, map)
// ---- Map 2: preg+sreg-only sequencer ----
map = { kPreg: (target[0x4c0], ...), kSreg: (target[0x4b4], ...),
kVmreg: (0, ...), kVreg: (0, ...) }
RegisterNumbering::Init(target + 0x138, map)
// ---- Map 3: vmreg+vreg-only sequencer ----
map = { kPreg: (0, ...), kSreg: (0, ...),
kVmreg: (target[0x4bc], ...), kVreg: (target[0x4b8], ...) }
RegisterNumbering::Init(target + 0x268, map)
The count-field offsets are byte-confirmed in the decompile via the *((_DWORD *)this + N) index arithmetic: this+294 = 0x498 (Sreg), +295 = 0x49c (Vreg), +296 = 0x4a0 (Vmreg), +297 = 0x4a4 (Preg); the non-primary maps use +301 = 0x4b4, +302 = 0x4b8, +303 = 0x4bc, +304 = 0x4c0. The whole [0x498..0x4c0] block is zeroed by the Target constructor and populated from the chip-parts sequencer descriptor during Target::Init.
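The index arithmetic above can be checked mechanically. The following is a minimal sketch, assuming only that the decompile treats the Target as an array of 4-byte DWORDs; the helper and dictionary names are hypothetical and do not come from the binary.

```python
# Sketch only: names are illustrative, not symbols from libtpu.
DWORD_SIZE = 4


def dword_index_to_offset(index: int) -> int:
    """Convert a *((_DWORD *)this + N) index into a byte offset."""
    return index * DWORD_SIZE


# DWORD indices quoted in the text, keyed by (map number, register class).
COUNT_FIELD_INDEX = {
    (1, "sreg"): 294, (1, "vreg"): 295, (1, "vmreg"): 296, (1, "preg"): 297,
    (2, "sreg"): 301, (3, "vreg"): 302, (3, "vmreg"): 303, (2, "preg"): 304,
}

COUNT_FIELD_OFFSET = {
    key: dword_index_to_offset(index) for key, index in COUNT_FIELD_INDEX.items()
}

for (map_no, reg_class), offset in sorted(COUNT_FIELD_OFFSET.items()):
    print(f"map {map_no} {reg_class:5s} -> Target+{offset:#x}")
```

Each computed offset should match the byte offset used in the pseudocode for the corresponding map.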
Function Map
| Function | Address | Role |
|---|---|---|
| Target::InitRegisterNumbering | 0x1d614200 | Builds the 3 RegisterNumbering objects |
| Target::RegisterCount | 0x1d617120 | Read side of the same 4 count fields, (seq, type)-keyed |
| Target::SregCount | 0x1d6152c0 | mov 0x498(rdi),eax — confirms Sreg offset |
| Target::VregCount | 0x1d6152e0 | mov 0x49c(rdi),eax — confirms Vreg offset |
Target::RegisterCount @ 0x1d617120 is the read-side accessor and cross-confirms the (seq, type) → offset map byte-for-byte: sequencer 0 maps {kPreg→0x4a4, kSreg→0x498, kVmreg→0x4a0, kVreg→0x49c}; sequencer 1 {kPreg→0x4c0, kSreg→0x4b4}; sequencer 2 {kVmreg→0x4bc, kVreg→0x4b8} — matching the three maps exactly, so each RegisterNumbering object is precisely one sequencer's register file.
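A minimal sketch of the read-side lookup follows, treating the Target as a raw little-endian byte image. The function name, the fake image, and the choice to return 0 for a (sequencer, type) pair with no field are assumptions made for illustration; the source does not state what RegisterCount does for an absent pair.

```python
import struct

K_PREG, K_SREG, K_VMREG, K_VREG = 1, 2, 3, 4

# (sequencer, RegisterType) -> byte offset of the int32 count field.
REGISTER_COUNT_OFFSET = {
    (0, K_PREG): 0x4A4, (0, K_SREG): 0x498, (0, K_VMREG): 0x4A0, (0, K_VREG): 0x49C,
    (1, K_PREG): 0x4C0, (1, K_SREG): 0x4B4,
    (2, K_VMREG): 0x4BC, (2, K_VREG): 0x4B8,
}


def register_count(target_image, seq: int, reg_type: int) -> int:
    """Read one per-class count from a Target byte image."""
    offset = REGISTER_COUNT_OFFSET.get((seq, reg_type))
    if offset is None:
        return 0  # ASSUMPTION: a class absent from this sequencer counts as 0
    return struct.unpack_from("<i", target_image, offset)[0]


# Hypothetical Target image, just large enough to cover the count block.
fake_target = bytearray(0x4C4)
struct.pack_into("<i", fake_target, 0x4A4, 15)  # primary preg count
struct.pack_into("<i", fake_target, 0x49C, 32)  # primary vreg count
struct.pack_into("<i", fake_target, 0x4B4, 7)   # sequencer 1 sreg (made-up value)

print(register_count(fake_target, 0, K_PREG))
print(register_count(fake_target, 1, K_VREG))
```

Keeping the table keyed by (sequencer, type) makes the correspondence with the three build-side maps easy to audit.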
NOTE: the per-class name strings are configurable per TargetEnvironment (env+0x18/0x20/0x28/0x30), defaulting to the {p, s, vm, v} mnemonics of RegisterTypeToMnemonic. A reimplementation must read the names from the environment, not hardcode them, although every shipped target uses the defaults.
GOTCHA: the second and third RegisterNumbering objects are partial. Map 2 has only preg+sreg counts and map 3 only vmreg+vreg, with the other two classes set to count 0. The non-primary sequencer's RegisterNumbering therefore numbers a strict subset of register classes. The precise TpuSequencerType label of these two (a BarnaCore address-handler vs a SparseCore-tile sequencer) is inferred from the count subsets and the RegisterCount sequencer arms, not from a decoded sequencer-type table.
RegisterNumbering::Init — Numbering the Classes
Purpose
RegisterNumbering::Init @ 0x1d622520 is handed the flat_hash_map<RegisterType, (count, name)> for one sequencer and assigns a sequential arch_regno to every register, building both directions of the map.
Object layout
struct RegisterNumbering { // sizeof 0x130
int32 total_registers_; // +0x00 prefix-sum of the 4 class counts
InlinedBitVector<128> allocatable[5]; // +0x08 per-RegisterType masks, stride 0x18
vector<pair<RegisterType,int>> idx_to_regno_; // +0x80 arch_regno -> (type, regno)
flat_hash_map<pair<RegisterType,int>, int> regno_to_idx_; // +0x98 (type, regno) -> arch_regno
};
The per-type bit-vector sub-structs start at +0x08 with stride 0x18 (AddRegister computes &this[8 + 0x18*type]); five types occupy 0x08..0x80. The reverse vector is at +0x80 (int* index 32) and the forward map at +0x98 (int* index 38) — both byte-confirmed in AddRegister's push_back and insert targets.
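The layout arithmetic can be replayed in a few lines. This is a sketch under the stated offsets only; the constant and function names are hypothetical.

```python
# Sketch only: constants restate the offsets quoted in the text.
REGISTER_NUMBERING_SIZE = 0x130
ALLOCATABLE_BASE = 0x08
ALLOCATABLE_STRIDE = 0x18
NUM_REGISTER_TYPES = 5  # kNone plus the four allocatable classes
INT_SIZE = 4


def allocatable_offset(reg_type: int) -> int:
    """Byte offset of the per-type bit-vector sub-struct."""
    return ALLOCATABLE_BASE + ALLOCATABLE_STRIDE * reg_type


per_type_offsets = [allocatable_offset(t) for t in range(NUM_REGISTER_TYPES)]
allocatable_end = allocatable_offset(NUM_REGISTER_TYPES)

# The decompile addresses the two maps as int* indices 32 and 38.
reverse_vector_offset = 32 * INT_SIZE
forward_map_offset = 38 * INT_SIZE

# The three objects embedded in Target, one stride apart.
object_bases = [0x008 + i * REGISTER_NUMBERING_SIZE for i in range(3)]

print([hex(o) for o in per_type_offsets], hex(allocatable_end))
print(hex(reverse_vector_offset), hex(forward_map_offset))
print([hex(b) for b in object_bases])
```

The bit-vector array ends exactly where the reverse vector begins, which is what makes the five-entry reading of the array plausible.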
Algorithm
function RegisterNumbering::Init(this, class_count_map): // sub_1d622520
this.total_registers_ = 1
running = 1
for type in kAllocatableRegisterTypes: // {kPreg, kSreg, kVmreg, kVreg}, stride 4, bound 16
running += class_count_map.at(type).count // map "at" lookup; .count = pair.first
this.total_registers_ = running // [this+0] = prefix-sum total
// NOTE: total starts at 1 — arch_regno 0 is reserved below, so the count
// includes the null sentinel.
AddRegister(this, kNone=0, regno=0, arch_regno=0, is_pseudo=false) // reserves arch_regno 0
next_arch_regno = 1
for type in kAllocatableRegisterTypes: // SECOND pass, same order
mnemonic = RegisterTypeToMnemonic(type)
spec = RangeSpec(...) // config-driven name/range filter
for k in [0, class_count_map.at(type).count):
name = StrCat(mnemonic, FastIntToBuffer(k)) // "v3", "s12", ...
is_pseudo = RangeSpec::Match(spec, k, name, 1)
AddRegister(this, type, regno=k, arch_regno=next_arch_regno + k, is_pseudo)
next_arch_regno += class_count_map.at(type).count
// post-conditions (assertion strings in the binary):
assert index_to_regno_[kNoRegister].first == RegisterType::kNone
assert GetMask(RegisterType::kNone).count() == 0
assert total_registers_ == reg_num // reg_num = final next_arch_regno
The assignment order is fixed by kAllocatableRegisterTypes @ 0xadf7f54 (four int32s = {1, 2, 3, 4}): the arch-register space is laid out as [sentinel] [Preg block] [Sreg block] [Vmreg block] [Vreg block], each block sized by its class count.
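The two-pass numbering can be sketched as follows. This is an illustrative reimplementation of the prefix-sum and block assignment only; it omits the RangeSpec pseudo filter and the bit vectors, and the function name and the toy counts are assumptions.

```python
K_NONE, K_PREG, K_SREG, K_VMREG, K_VREG = 0, 1, 2, 3, 4
ALLOCATABLE_REGISTER_TYPES = (K_PREG, K_SREG, K_VMREG, K_VREG)


def number_registers(class_counts):
    """Return (total_registers, index_to_regno) for one sequencer."""
    # Pass 1: prefix-sum the class counts, starting at 1 for the sentinel.
    total = 1
    for reg_type in ALLOCATABLE_REGISTER_TYPES:
        total += class_counts[reg_type]

    # arch_regno 0 is the kNoRegister sentinel.
    index_to_regno = [(K_NONE, 0)]

    # Pass 2: same class order, sequential arch_regnos from 1.
    next_arch_regno = 1
    for reg_type in ALLOCATABLE_REGISTER_TYPES:
        for k in range(class_counts[reg_type]):
            # The list index doubles as the arch_regno being assigned.
            assert len(index_to_regno) == next_arch_regno + k
            index_to_regno.append((reg_type, k))
        next_arch_regno += class_counts[reg_type]

    assert index_to_regno[0][0] == K_NONE
    assert total == next_arch_regno  # mirrors total_registers_ == reg_num
    return total, index_to_regno


# Toy counts, chosen small for readability (not a real generation).
total, table = number_registers({K_PREG: 2, K_SREG: 3, K_VMREG: 0, K_VREG: 1})
print(total, table)
```

A class with count 0, as in the partial maps, simply contributes an empty block and leaves the following block bases unchanged.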
RegisterNumbering::AddRegister
function AddRegister(this, type, regno, arch_regno, is_pseudo): // sub_1d622bc0
assert type <= 4 // "reg_class < kNumberRegisterTypes"
assert arch_regno < this.total_registers_ // "reg_num < total_registers_"
// 1. size the per-type bit vector at &this[8 + 0x18*type] to total_registers_ bits (the bit itself is set in step 4)
InlinedBitVector::resize(&this.allocatable[type], total_registers_)
// 2. reverse map: push (type, regno) at index arch_regno
this.idx_to_regno_.push_back({type, regno}) // [this+0x80]
// 3. forward map: (regno<<32 | type) -> arch_regno
this.regno_to_idx_.insert((regno << 32) | type, arch_regno) // [this+0x98]
// 4. if not the sentinel and not a pseudo, record it in the per-type allocatable list
if type != kNone and not is_pseudo:
set_bit(per_type_mask[type], arch_regno)
per_type_allocatable[type].push_back(arch_regno)
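A compact model of AddRegister is sketched below. The class name is hypothetical, a Python int stands in for each InlinedBitVector, and `setdefault` is used to imitate a hash-map insert that keeps an existing entry; the check that the push_back index equals arch_regno is this sketch's own consistency check, not an assertion quoted from the binary.

```python
K_NONE = 0
NUM_REGISTER_TYPES = 5


class RegisterNumberingSketch:
    def __init__(self, total_registers: int):
        self.total_registers = total_registers
        self.index_to_regno = []                  # arch_regno -> (type, regno)
        self.regno_to_index = {}                  # packed key -> arch_regno
        self.masks = [0] * NUM_REGISTER_TYPES     # one bit set per allocatable arch_regno
        self.allocatable = [[] for _ in range(NUM_REGISTER_TYPES)]

    @staticmethod
    def pack_key(reg_type: int, regno: int) -> int:
        """Forward-map key: regno in the high 32 bits, type in the low 32."""
        return (regno << 32) | reg_type

    def add_register(self, reg_type, regno, arch_regno, is_pseudo):
        assert reg_type < NUM_REGISTER_TYPES
        assert arch_regno < self.total_registers
        # Sketch-only check: registers are added in arch_regno order.
        assert arch_regno == len(self.index_to_regno)
        self.index_to_regno.append((reg_type, regno))
        self.regno_to_index.setdefault(self.pack_key(reg_type, regno), arch_regno)
        if reg_type != K_NONE and not is_pseudo:
            self.masks[reg_type] |= 1 << arch_regno
            self.allocatable[reg_type].append(arch_regno)


rn = RegisterNumberingSketch(total_registers=4)
rn.add_register(K_NONE, 0, 0, False)  # sentinel
rn.add_register(1, 0, 1, False)       # p0, allocatable
rn.add_register(1, 1, 2, True)        # p1, pseudo: numbered but not allocatable
rn.add_register(2, 0, 3, False)       # s0
print(rn.index_to_regno, rn.allocatable)
```

The pseudo register still occupies an arch_regno and both maps, which is why the nominal and allocatable counts can differ.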
Function Map
| Function | Address | Role |
|---|---|---|
| RegisterNumbering::Init | 0x1d622520 | Prefix-sum + two-pass assignment |
| RegisterNumbering::AddRegister | 0x1d622bc0 | Assigns one arch_regno, builds both maps |
| RegisterTypeToMnemonic | 0x1d640600 | Class prefix for the name (p/s/vm/v) |
| FastIntToBuffer | 0x211719e0 | Integer→ASCII for the regno suffix |
| RangeSpec::Match | 0x1d624c80 | Config-driven per-regno filter (sets is_pseudo) |
| ~RegisterNumbering | 0x1d491e60 | Frees the bit-vector / map buffers (confirms 0x130 size) |
QUIRK: the prefix-sum starts total_registers_ at 1 and the second pass assigns arch_regnos starting at 1, while AddRegister(kNone, 0, 0, false) claims arch_regno 0. So index 0 is always the kNoRegister sentinel, the final assigned count equals total_registers_, and the assertion total_registers_ == reg_num enforces it. A reimplementation that numbers from 0 will collide with the sentinel and fail this check.
NOTE: RangeSpec::Match decides whether each regno is a pseudo (excluded from the allocatable masks/lists but still numbered). Its config source is a TargetEnvironment register name/range filter that can reserve or rename registers; the match result is the is_pseudo flag passed to AddRegister. The nominal class count may therefore exceed the usable (allocatable) count.
Per-Gen Register-File Sizes
The four count fields the numbering consumes are populated from the chip-parts sequencer descriptor per TpuVersion. The field offsets and the prefix-sum layout are byte-exact; the numeric counts below are the chip-parts values keyed by generation.
| Generation (codename) | TpuVersion | SREG (0x498) | VREG (0x49c) | VMREG (0x4a0) | PREG (0x4a4) | Arch regs (approx) |
|---|---|---|---|---|---|---|
| v2 / v3 / v4 (jellyfish / dragonfish / pufferfish) | kJellyfish=0, kDragonfish=1, kPufferfish=2 | 32 | 32 | 8 | 15 | ~87 (+1 sentinel = 88) |
| v5p (+v5e lite) / v6e / v7 (viperfish / ghostlite / 6acc60406) | kViperfish=3, kGhostlite=4, gen-5=5 | 32 | 64 | 16 | 14 | ~126 (+1 sentinel = 127) |
The TpuVersion enum has six values: {kJellyfish=0, kDragonfish=1, kPufferfish=2, kViperfish=3, kGhostlite=4, <gen-5>=5} (the sixth, the v7 generation, has codename 6acc60406 in this binary). The total arch-register count is 1 + PREG + SREG + VMREG + VREG (the +1 is arch_regno 0). The assignment order is always Preg → Sreg → Vmreg → Vreg, so the Preg block occupies arch_regno [1, 1+PREG), the Sreg block [1+PREG, 1+PREG+SREG), and so on.
| Count-field offsets (0x498/0x49c/0x4a0/0x4a4) | SregCount/VregCount + RegisterCount jump-table decode |
| Prefix-sum total, Preg→Sreg→Vmreg→Vreg block order | Init decompile |
| Numeric counts per generation | Chip-parts sequencer descriptor |
QUIRK: VREG doubles from 32 (v2/v3/v4) to 64 (v5p/v6e/v7) and VMREG doubles 8→16, while PREG actually shrinks 15→14. A reimplementer numbering registers for a v5+ target who assumes the v2-era file sizes will mis-place every Sreg, Vmreg, and Vreg arch_regno, because the block bases shift with the per-class counts.
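To make the base shift concrete, the sketch below computes the half-open arch_regno range of each class block for the two count sets in the table. The counts are copied from the table; the dictionary keys and function name are hypothetical.

```python
CLASS_ORDER = ("preg", "sreg", "vmreg", "vreg")

# Per-class counts as listed in the per-generation table.
COUNTS = {
    "v2/v3/v4": {"preg": 15, "sreg": 32, "vmreg": 8, "vreg": 32},
    "v5p/v6e/v7": {"preg": 14, "sreg": 32, "vmreg": 16, "vreg": 64},
}


def block_ranges(counts):
    """Return ({class: (first, one_past_last)}, total including the sentinel)."""
    ranges = {}
    base = 1  # arch_regno 0 is the sentinel
    for reg_class in CLASS_ORDER:
        ranges[reg_class] = (base, base + counts[reg_class])
        base += counts[reg_class]
    return ranges, base


for generation, counts in COUNTS.items():
    ranges, total = block_ranges(counts)
    print(generation, ranges, "total:", total)
```

With these counts the Sreg block starts at 16 on the older set and at 15 on the newer one, so every block after Preg moves even though the Sreg count itself is unchanged.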
Read Paths
ToArchRegno — reverse lookup
function ToArchRegno(this, arch_regno): // sub_1275f580
if arch_regno == 0: Fatal("reg_num != kNoRegister")
if arch_regno >= this.total_registers_: Fatal("reg_num < total_registers_")
if arch_regno >= this.idx_to_regno_.size: BUG()
return this.idx_to_regno_[arch_regno] // [this+0x80][arch_regno] -> (RegisterType, regno)
ToRegNum — forward lookup
function ToRegNum(this, type, regno): // sub_1d5a9000
if type == kNone: Fatal("register_type != RegisterType::kNone")
key = (regno << 32) | type
it = this.regno_to_idx_.find(key) // [this+0x98]
if not found: return 0
assert it.value <= 255 // "iter->second <= UINT8_MAX"
return it.value // arch_regno
ToArchRegString — printable name
ToArchRegString @ 0x1275e2a0 resolves a slot to its printable form: StrCat(RegisterTypeToMnemonic(type), FastIntToBuffer(regno)) after (type, regno) = ToArchRegno(arch_regno). So arch register arch_regno prints as "<p|s|vm|v><regno>" — e.g. v3, s12, vm5, p2.
| Function | Address | Reads |
|---|---|---|
| ToArchRegno | 0x1275f580 | [this+0x80] reverse vector |
| ToRegNum | 0x1d5a9000 | [this+0x98] forward map |
| ToArchRegString | 0x1275e2a0 | ToArchRegno + mnemonic |
GOTCHA: ToRegNum returns 0 (the kNoRegister sentinel) on a miss, not a fatal error, and asserts the found arch_regno fits in a uint8. A virtual register whose (type, regno) was never numbered for this target silently resolves to "no register". ToArchRegno, by contrast, fatals on arch_regno == 0. The two directions are not symmetric in their error handling.
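The asymmetry between the two read paths is easiest to see side by side. The sketch below models the three read functions over a small hand-built table; `FatalError`, the free-function signatures, and the toy table are assumptions for illustration, and a Python exception stands in for the binary's fatal check.

```python
K_NONE = 0
MNEMONIC = {1: "p", 2: "s", 3: "vm", 4: "v"}  # default class prefixes


class FatalError(RuntimeError):
    """Stand-in for a fatal check failure."""


def to_arch_regno(index_to_regno, total_registers, arch_regno):
    if arch_regno == 0:
        raise FatalError("reg_num != kNoRegister")
    if arch_regno >= total_registers:
        raise FatalError("reg_num < total_registers_")
    return index_to_regno[arch_regno]


def to_reg_num(regno_to_index, reg_type, regno):
    if reg_type == K_NONE:
        raise FatalError("register_type != RegisterType::kNone")
    value = regno_to_index.get((regno << 32) | reg_type)
    if value is None:
        return 0  # a miss silently yields the sentinel
    assert value <= 255  # iter->second <= UINT8_MAX
    return value


def to_arch_reg_string(index_to_regno, total_registers, arch_regno):
    reg_type, regno = to_arch_regno(index_to_regno, total_registers, arch_regno)
    return f"{MNEMONIC[reg_type]}{regno}"


# Toy numbering: sentinel, p0, s0, v0, v1 (no vmreg registers at all).
INDEX_TO_REGNO = [(0, 0), (1, 0), (2, 0), (4, 0), (4, 1)]
TOTAL = len(INDEX_TO_REGNO)
REGNO_TO_INDEX = {(r << 32) | t: i for i, (t, r) in enumerate(INDEX_TO_REGNO)}

print(to_reg_num(REGNO_TO_INDEX, 4, 1))                 # numbered register
print(to_reg_num(REGNO_TO_INDEX, 3, 0))                 # never numbered
print(to_arch_reg_string(INDEX_TO_REGNO, TOTAL, 4))
```

A caller that round-trips through both functions should therefore check for the sentinel after the forward lookup, because passing it on to the reverse lookup is fatal.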
Per-Opcode Variant Classification
Each XLU opcode ComputeXluOperations selects (see ResultFifo and ArchRegister Enums) is wrapped in a std::variant<TransposeTile, RpuOperation, XluControlOperation>. The variant is 0x48 bytes wide (stride confirmed by lea (idx,idx,8); shl 3 = ×72), with the discriminant byte at +0x40 (the emitter writes movb $0/$1/$2, 0x40(...)). The index is chosen by four byte-exact classifier predicates:
| Classifier | Address | Predicate | Selects | Variant idx |
|---|---|---|---|---|
| LloOpcodeUsesTranspose | 0x1d60bda0 | (op-0xa6)<2 \|\| (op-0x154)<2 | {0xa6, 0xa7, 0x154, 0x155} | 0 TransposeTile |
| LloOpcodeUsesRpu | 0x1d60c2c0 | (op-0xf5)<0xd (plus a low-band bitmask, below) | {0xf5..0x101} | 1 RpuOperation |
| LloOpcodeIsRpuControl | 0x1d60c1e0 | (op-0x8b)<2 | {0x8b, 0x8c} | 2 XluControlOperation |
| LloOpcodeIsRpuResult | 0x1d60c420 | (op-0x14f)<2 | {0x14f, 0x150} | 2 XluControlOperation |
Variant bodies
- TransposeTile (index 0) is the only owning alternative. Layout: InlinedVector<ArchRegister,2> read-set at +0x00, InlinedVector<ArchRegister,2> written-set at +0x18, an i64 partner/tile field at +0x30, a packed i32 at +0x37, discriminant at +0x40. The destructor frees the two heap-backed register sets when their size exceeds the inline capacity of 2.
- RpuOperation (index 1) is trivially copyable (≤0x28 bytes). It holds an LloValue* at +0x10 plus inline fusion-key metadata (RpuOperationMetadata: u16 opcode at +0x00, i64 pattern/segment key at +0x08, optional<i64> at +0x10 gated by [+0x18]==1). Two RpuOperations with equal metadata are fusion candidates.
- XluControlOperation (index 2) is trivially destructible: an LloValue* at +0x00 plus inline control metadata, no owning vectors.
LloOpcodeUsesRpu has a secondary low-opcode arm: for op <= 0x3b it tests _bittest64(0xC40000000000000, op), a mask whose set bits are {0x36, 0x3a, 0x3b} (RPU ops outside the cross-lane reduce band). The primary (op-0xf5)<0xd arm covers the 13 reduce ops.
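The four predicates and the low-band mask can be restated as a sketch. The unsigned compare `(op - base) < n` is modelled as the range test `0 <= op - base < n`, opcodes are assumed non-negative, and the Python function names are hypothetical. Mapping the three low-band opcodes to variant index 1 is an assumption; the table lists only the 0xf5..0x101 band for RpuOperation.

```python
LOW_BAND_MASK = 0xC40000000000000  # constant tested by _bittest64


def in_band(op: int, base: int, width: int) -> bool:
    """Model of the unsigned compare (op - base) < width."""
    return 0 <= op - base < width


def uses_transpose(op: int) -> bool:
    return in_band(op, 0xA6, 2) or in_band(op, 0x154, 2)


def uses_rpu(op: int) -> bool:
    if 0 <= op <= 0x3B:
        return bool((LOW_BAND_MASK >> op) & 1)
    return in_band(op, 0xF5, 0xD)


def is_rpu_control(op: int) -> bool:
    return in_band(op, 0x8B, 2)


def is_rpu_result(op: int) -> bool:
    return in_band(op, 0x14F, 2)


def variant_index(op: int):
    """0 TransposeTile, 1 RpuOperation, 2 XluControlOperation, None otherwise."""
    if uses_transpose(op):
        return 0
    if uses_rpu(op):  # ASSUMPTION: low-band ops also become RpuOperation
        return 1
    if is_rpu_control(op) or is_rpu_result(op):
        return 2
    return None


low_band_bits = [bit for bit in range(64) if (LOW_BAND_MASK >> bit) & 1]
print([hex(b) for b in low_band_bits])
```

The four opcode sets are disjoint, so the order in which `variant_index` applies the predicates does not affect the result.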
CrossXlu Data-Dependency Tracker
The per-op ArchRegister read/written sets ComputeXluOperations attaches to each variant are not used by the numbering itself — they drive a list-scheduler over the cross-lane ops. CrossXluOperationsDataDependencyTracker::Create @ 0x126cc9a0 walks the LloDependencyGraph in topological order, marks the nodes whose XLU-operation variant is in the input set, allocates a 0x330-byte tracker (operator new(0x330)), and connects each XLU op to the other XLU ops it has a true data dependency with.
function CrossXlu::Create(graph, xlu_ops, reverse): // sub_126cc9a0
nodes = graph.NodesInTopologicalOrder(reverse)
// pass 1: mark nodes whose variant is in xlu_ops (flat_hash_set::find)
for node in nodes:
if xlu_ops.contains(node.value.variant): node.is_xlu = true
tracker = new CrossXluOperationsDataDependencyTracker(region) // 0x330 bytes
// pass 2: per marked node, build cross-XLU edges
for node in marked_nodes:
tracker.internal_graph.AddUnsequencedNode(node.instruction) // [tracker+0x100]
for each def-use edge:
if LloDependencyGraphNode::IsOperandOf(other): // sub_14427d60
add edge (producer -> consumer); ++consumer.in_edges
return StatusOr<unique_ptr<...>>
The tracker constructor (0x126cda80) builds an internal LloDependencyGraph at tracker+0x100 and a LatencyTable (created from TpuVersion, the Target+0x398 field) at tracker+0xf8. Readiness is a standard in-edge-count test:
function XluOperationIsReady(tracker, op): // sub_126cd920
// dispatch on the variant discriminant at op[0x40]:
switch op[0x40]:
case 0: value = TransposeTile.read_set.last_element // InlinedVector at op+0x08
case 1: value = RpuOperation.lloValue // op+0x10
case 2: value = XluControlOperation.lloValue // op+0x00
default: Fatal("control != nullptr")
node = tracker.internal_graph.GetNode(value) // [tracker+0x100]
return node.in_edge_count == 0 // all predecessor XLU ops scheduled
RemoveScheduledXluOperation @ 0x126ccfa0 is the "fire" step — it removes a scheduled op and decrements its successors' in-edge counts. The edges are exactly the LloValue def-use chains: a producer XLU op whose written ArchRegister is read by a later XLU op becomes its predecessor (IsOperandOf confirms the genuine RAW/WAR relation). The consumers of the tracker are ComputeCombinablePairs (0x126d2480, fuses ops with equal RpuOperationMetadata), ReorderToShortenCriticalPath (0x126d3460), AssignXlu (0x126d3100), and AssignSourceBus (0x126d70e0).
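The readiness test and fire step amount to in-edge counting over a dependency graph. The sketch below shows that technique in isolation; the class, the greedy driver loop, and the op names are hypothetical, and the driver is not a model of any of the tracker's real consumers.

```python
from collections import defaultdict


class InEdgeTracker:
    """Tracks unscheduled ops and how many unscheduled producers each has."""

    def __init__(self, ops, edges):
        self.successors = defaultdict(list)
        self.in_edges = {op: 0 for op in ops}
        for producer, consumer in edges:
            self.successors[producer].append(consumer)
            self.in_edges[consumer] += 1

    def is_ready(self, op) -> bool:
        # Ready once every predecessor has been scheduled.
        return self.in_edges[op] == 0

    def remove_scheduled(self, op) -> None:
        # Fire step: drop the op and release its successors.
        assert self.is_ready(op)  # sketch-only sanity check
        del self.in_edges[op]
        for successor in self.successors.pop(op, []):
            self.in_edges[successor] -= 1


def greedy_order(tracker):
    """Repeatedly fire the first ready op (illustrative driver only)."""
    order = []
    while tracker.in_edges:
        ready = [op for op in tracker.in_edges if tracker.is_ready(op)]
        if not ready:
            raise ValueError("dependency cycle")
        tracker.remove_scheduled(ready[0])
        order.append(ready[0])
    return order


OPS = ["transpose_a", "reduce_b", "control_c"]
EDGES = [("transpose_a", "reduce_b"), ("transpose_a", "control_c"),
         ("reduce_b", "control_c")]
print(greedy_order(InEdgeTracker(OPS, EDGES)))
```

Because readiness is a single integer compare, a scheduler can poll it cheaply for every candidate op on each step.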
| Function | Address | Role |
|---|---|---|
| CrossXlu...::Create | 0x126cc9a0 | Topo-mark → new tracker → per-node edge build |
| CrossXlu...::ctor | 0x126cda80 | Internal LloDependencyGraph + LatencyTable |
| XluOperationIsReady | 0x126cd920 | Readiness = in-edge count == 0 |
| RemoveScheduledXluOperation | 0x126ccfa0 | List-scheduler fire step |
| LloDependencyGraphNode::IsOperandOf | 0x14427d60 | Confirms RAW/WAR data dependency |
NOTE: the edge weights the dependency tracker feeds the critical-path reorder are the per-version op latencies in the LatencyTable (LatencyTable::Create @ 0x1c89fba0).
Related Components
| Component | Relationship |
|---|---|
| ArchRegister / ArchRegisterInstance (result-fifo / arch-register enums) | The physical-slot layer this numbering maps to (RegisterType, regno) |
| ComputeXluOperations (result-fifo / arch-register enums) | Builds the per-op ArchRegister read/written sets the tracker uses |
| Target register-count fields | The four per-class counts (0x498..0x4c0) the numbering prefix-sums |
Cross-References
- ResultFifo and ArchRegister Enums: the 50-member ArchRegister physical numbering and the 25 result FIFOs this numbering sits above
- XLU Op Roster: the cross-lane opcode→factory table whose ops the variant classifiers tag
- Slot: VPU: the vector-processing slot whose registers this numbering assigns
- Slot: MXU: the matmul slot, its gain/latch matrix-register file distinct from these GPR classes
- MC Emitter: the machine-code emitter that consumes arch_regno at encode time
- opcode_info_big Record Format: the per-opcode descriptor holding the ArchRegister read/written lists
- Bundle Model: the per-gen VLIW bundle these arch registers are encoded into