ArchRegno Numbering

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

The TensorCore encoder needs a single dense integer per architectural register at the moment it emits a bundle slot, but the LLO IR carries virtual registers typed by RegisterType (preg / sreg / vmreg / vreg) plus a regno within that class. The bridge is RegisterNumbering: a per-Target object that assigns every (RegisterType, regno) a sequential arch-register number (arch_regno), keeping both a forward map ((type, regno) → arch_regno, read by ToRegNum) and a reverse vector (arch_regno → (type, regno), read by ToArchRegno). Because the register-file sizes differ per silicon generation, the numbering is rebuilt at every Target construction from four generation-specific count fields. This page documents that build — Target::InitRegisterNumbering and RegisterNumbering::Init / AddRegister — and the gen-specific count inputs that shape it.

If you know LLVM, the closest analogy is a TargetRegisterInfo that assigns MCRegister numbers, except the layout here is not generated by TableGen — it is computed at runtime by prefix-summing four per-class register counts into a contiguous arch-register space, in a fixed class order, with arch_regno 0 reserved as a null sentinel. There are three of these RegisterNumbering objects per Target, one per sequencer, each sized from a different subset of the count fields. The physical layer below this numbering — the 50-member ArchRegister enum that ArchRegisterInstance resolves — is documented in ResultFifo and ArchRegister Enums; this page is the layer above it.

For reimplementation, the contract is:

The three-RegisterNumbering-objects-per-Target layout and the count source fields each consumes (four for the primary object, two for each non-primary one), plus the per-class name strings drawn from the TargetEnvironment.
RegisterNumbering::Init: prefix-sum the four class counts into total_registers_, reserve arch_regno 0 as kNoRegister, then assign sequential arch_regnos in kPreg → kSreg → kVmreg → kVreg order, building both maps.
ToArchRegno / ToRegNum / ToArchRegString: the read paths, their bounds, and the (type, regno) print format.
The per-opcode variant classification (LloOpcodeUsesTranspose / UsesRpu / IsRpuControl / IsRpuResult) and how the per-op ArchRegister read/written sets feed the CrossXluOperationsDataDependencyTracker.


Build entry	`Target::InitRegisterNumbering` @ `0x1d614200`
Per-class numberer	`RegisterNumbering::Init` @ `0x1d622520`
Register assigner	`RegisterNumbering::AddRegister` @ `0x1d622bc0`
Reverse read	`RegisterNumbering::ToArchRegno` @ `0x1275f580` (`[this+0x80][idx]`)
Forward read	`RegisterNumbering::ToRegNum` @ `0x1d5a9000` (`[this+0x98]` map)
Name print	`RegisterNumbering::ToArchRegString` @ `0x1275e2a0`
Class order	`kAllocatableRegisterTypes` @ `0xadf7f54` = `{kPreg=1, kSreg=2, kVmreg=3, kVreg=4}`
`RegisterNumbering` size	`0x130` bytes
Source file	`platforms/xla/service/jellyfish/register_numbering.{h,cc}`

Target::InitRegisterNumbering — the Build

Purpose

InitRegisterNumbering @ 0x1d614200 is called once per Target to populate three RegisterNumbering objects embedded in the Target. Each captures one sequencer's register file. The object offsets and stride are byte-exact:

Map	`RegisterNumbering` `this`	Sequencer role	Classes present
1	`Target + 0x008`	Primary / TensorCore	all four (preg, sreg, vmreg, vreg)
2	`Target + 0x138`	non-primary (preg+sreg only)	preg, sreg
3	`Target + 0x268`	non-primary (vmreg+vreg only)	vmreg, vreg

The stride 0x130 is the RegisterNumbering size (0x138 - 0x008 = 0x268 - 0x138 = 0x130). The decompile confirms three RegisterNumbering::Init calls at this+8, this+312 (0x138), and this+616 (0x268).

Algorithm

Each map is a freshly-built absl::flat_hash_map<RegisterType, pair<int count, string name>> with four entries (keys 1..4), handed to RegisterNumbering::Init. The per-class count comes from the Target register-count field block; the per-class name comes from the TargetEnvironment.

function InitRegisterNumbering(target):              // sub_1d614200
    env = target[0x940]                              // TargetEnvironment
    // four per-class name strings, copied from the environment:
    name_preg  = env[0x18]; name_sreg = env[0x20]
    name_vmreg = env[0x28]; name_vreg = env[0x30]

    // ---- Map 1: primary / TensorCore sequencer (all 4 classes) ----
    map = { kPreg:  (target[0x4a4], name_preg),       // *((DWORD*)target+297)
            kSreg:  (target[0x498], name_sreg),       // *((DWORD*)target+294)
            kVmreg: (target[0x4a0], name_vmreg),      // *((DWORD*)target+296)
            kVreg:  (target[0x49c], name_vreg) }      // *((DWORD*)target+295)
    RegisterNumbering::Init(target + 0x008, map)

    // ---- Map 2: preg+sreg-only sequencer ----
    map = { kPreg: (target[0x4c0], ...), kSreg: (target[0x4b4], ...),
            kVmreg: (0, ...), kVreg: (0, ...) }
    RegisterNumbering::Init(target + 0x138, map)

    // ---- Map 3: vmreg+vreg-only sequencer ----
    map = { kPreg: (0, ...), kSreg: (0, ...),
            kVmreg: (target[0x4bc], ...), kVreg: (target[0x4b8], ...) }
    RegisterNumbering::Init(target + 0x268, map)

The count-field offsets are byte-confirmed in the decompile via the *((_DWORD *)this + N) index arithmetic: this+294 = 0x498 (Sreg), +295 = 0x49c (Vreg), +296 = 0x4a0 (Vmreg), +297 = 0x4a4 (Preg); the non-primary maps use +301 = 0x4b4, +302 = 0x4b8, +303 = 0x4bc, +304 = 0x4c0. The whole [0x498..0x4c0] block is zeroed by the Target constructor and populated from the chip-parts sequencer descriptor during Target::Init.
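
The index arithmetic above can be checked mechanically. The following is a minimal sketch, assuming a 4-byte `_DWORD`; the dictionary and function names are illustrative and do not come from the binary.

```python
# DWORD indices taken from the decompile's *((_DWORD *)this + N) expressions.
DWORD_SIZE = 4  # bytes per _DWORD (assumption: the usual 32-bit DWORD)

PRIMARY = {"sreg": 294, "vreg": 295, "vmreg": 296, "preg": 297}
NON_PRIMARY = {"sreg": 301, "vreg": 302, "vmreg": 303, "preg": 304}


def byte_offset(dword_index: int) -> int:
    """Convert a _DWORD-pointer index into a byte offset from `this`."""
    return dword_index * DWORD_SIZE


primary_offsets = {name: hex(byte_offset(i)) for name, i in PRIMARY.items()}
non_primary_offsets = {name: hex(byte_offset(i)) for name, i in NON_PRIMARY.items()}

print(primary_offsets)
print(non_primary_offsets)
```

Multiplying each index by four reproduces the byte offsets quoted in the pseudocode comments, which is a cheap sanity check when porting the offsets to another build.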

Function Map

Function	Address	Role
`Target::InitRegisterNumbering`	`0x1d614200`	Builds the 3 `RegisterNumbering` objects
`Target::RegisterCount`	`0x1d617120`	Read side of the same count fields, `(seq, type)`-keyed
`Target::SregCount`	`0x1d6152c0`	`mov 0x498(rdi),eax` — confirms Sreg offset
`Target::VregCount`	`0x1d6152e0`	`mov 0x49c(rdi),eax` — confirms Vreg offset

Target::RegisterCount @ 0x1d617120 is the read-side accessor and cross-confirms the (seq, type) → offset map byte-for-byte: sequencer 0 maps {kPreg→0x4a4, kSreg→0x498, kVmreg→0x4a0, kVreg→0x49c}; sequencer 1 {kPreg→0x4c0, kSreg→0x4b4}; sequencer 2 {kVmreg→0x4bc, kVreg→0x4b8} — matching the three maps exactly, so each RegisterNumbering object is precisely one sequencer's register file.

NOTE — the per-class name strings are configurable per TargetEnvironment (env+0x18/0x20/0x28/0x30), defaulting to the {p, s, vm, v} mnemonics of RegisterTypeToMnemonic. A reimplementation must read the names from the environment, not hardcode them, although every shipped target uses the defaults.

GOTCHA — the second and third RegisterNumbering objects are partial — map 2 has only preg+sreg counts, map 3 only vmreg+vreg, with the other two classes set to count 0. The non-primary sequencer's RegisterNumbering therefore numbers a strict subset of register classes. The precise TpuSequencerType label of these two (a BarnaCore address-handler vs a SparseCore-tile sequencer) is inferred from the count subsets and the RegisterCount sequencer arms, not from a decoded sequencer-type table.

RegisterNumbering::Init — Numbering the Classes

Purpose

RegisterNumbering::Init @ 0x1d622520 is handed the flat_hash_map<RegisterType, (count, name)> for one sequencer and assigns a sequential arch_regno to every register, building both directions of the map.

Object layout

struct RegisterNumbering {                  // sizeof 0x130
    int32 total_registers_;                 // +0x00  prefix-sum of the 4 class counts
    InlinedBitVector<128> allocatable[5];   // +0x08  per-RegisterType masks, stride 0x18
    vector<pair<RegisterType,int>> idx_to_regno_;  // +0x80  arch_regno -> (type, regno)
    flat_hash_map<pair<RegisterType,int>, int> regno_to_idx_;  // +0x98  (type, regno) -> arch_regno
};

The per-type bit-vector sub-structs start at +0x08 with stride 0x18 (AddRegister computes &this[8 + 0x18*type]); five types occupy 0x08..0x80. The reverse vector is at +0x80 (int* index 32) and the forward map at +0x98 (int* index 38) — both byte-confirmed in AddRegister's push_back and insert targets.
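
The layout arithmetic can be restated as a small calculation. This is a sketch only: the constant and function names are hypothetical, and the five-type count assumes the `kNone` plus four allocatable classes described on this page.

```python
MASK_BASE = 0x08        # first per-type bit-vector sub-struct
MASK_STRIDE = 0x18      # bytes per sub-struct
NUM_REGISTER_TYPES = 5  # kNone, kPreg, kSreg, kVmreg, kVreg


def mask_offset(register_type: int) -> int:
    """Byte offset of the per-type bit vector, mirroring 8 + 0x18*type."""
    if not 0 <= register_type < NUM_REGISTER_TYPES:
        raise ValueError("reg_class < kNumberRegisterTypes")
    return MASK_BASE + MASK_STRIDE * register_type


mask_offsets = [mask_offset(t) for t in range(NUM_REGISTER_TYPES)]
end_of_masks = MASK_BASE + MASK_STRIDE * NUM_REGISTER_TYPES

# int* indices quoted for the reverse vector and the forward map.
reverse_vector_index = 0x80 // 4
forward_map_index = 0x98 // 4

print([hex(o) for o in mask_offsets], hex(end_of_masks))
```

The five sub-structs end exactly where the reverse vector begins, which is consistent with the `+0x80` offset read from `AddRegister`.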

Algorithm

function RegisterNumbering::Init(this, class_count_map):   // sub_1d622520
    this.total_registers_ = 1
    running = 1
    for type in kAllocatableRegisterTypes:        // {kPreg, kSreg, kVmreg, kVreg}, stride 4, bound 16
        running += class_count_map.at(type).count // map "at" lookup; .count = pair.first
        this.total_registers_ = running           // [this+0] = prefix-sum total
    // NOTE: total starts at 1 — arch_regno 0 is reserved below, so the count
    // includes the null sentinel.

    AddRegister(this, kNone=0, regno=0, arch_regno=0, is_pseudo=false)  // reserves arch_regno 0

    next_arch_regno = 1
    for type in kAllocatableRegisterTypes:        // SECOND pass, same order
        mnemonic = RegisterTypeToMnemonic(type)
        spec     = RangeSpec(...)                  // config-driven name/range filter
        for k in [0, class_count_map.at(type).count):
            name      = StrCat(mnemonic, FastIntToBuffer(k))  // "v3", "s12", ...
            is_pseudo = RangeSpec::Match(spec, k, name, 1)
            AddRegister(this, type, regno=k, arch_regno=next_arch_regno + k, is_pseudo)
        next_arch_regno += class_count_map.at(type).count

    // post-conditions (assertion strings in the binary):
    assert index_to_regno_[kNoRegister].first == RegisterType::kNone
    assert GetMask(RegisterType::kNone).count() == 0
    assert total_registers_ == reg_num   // reg_num = final next_arch_regno

The assignment order is fixed by kAllocatableRegisterTypes @ 0xadf7f54 (four int32s = {1, 2, 3, 4}): the arch-register space is laid out as [sentinel] [Preg block] [Sreg block] [Vmreg block] [Vreg block], each block sized by its class count.
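
The two-pass build can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the binary's implementation: it omits the `RangeSpec` pseudo-register filter and the per-type bit vectors, the class and method names are illustrative, and the sample counts are made up for the example rather than taken from any chip.

```python
from enum import IntEnum


class RegisterType(IntEnum):
    NONE = 0
    PREG = 1
    SREG = 2
    VMREG = 3
    VREG = 4


# Fixed class order, as in kAllocatableRegisterTypes.
ALLOCATABLE_TYPES = (RegisterType.PREG, RegisterType.SREG,
                     RegisterType.VMREG, RegisterType.VREG)
NO_REGISTER = 0  # arch_regno 0 is the null sentinel


class RegisterNumbering:
    def __init__(self, counts):
        # Pass 1: prefix-sum the class counts; the total starts at 1
        # because the sentinel occupies arch_regno 0.
        self.total_registers = 1 + sum(counts[t] for t in ALLOCATABLE_TYPES)
        self.idx_to_regno = []   # arch_regno -> (type, regno)
        self.regno_to_idx = {}   # (type, regno) -> arch_regno
        self._add(RegisterType.NONE, 0, NO_REGISTER)

        # Pass 2: assign sequential arch_regnos, one block per class.
        next_arch_regno = 1
        for reg_type in ALLOCATABLE_TYPES:
            for k in range(counts[reg_type]):
                self._add(reg_type, k, next_arch_regno + k)
            next_arch_regno += counts[reg_type]

        # Post-conditions mirroring the assertions quoted above.
        assert self.idx_to_regno[NO_REGISTER][0] == RegisterType.NONE
        assert next_arch_regno == self.total_registers

    def _add(self, reg_type, regno, arch_regno):
        assert arch_regno < self.total_registers
        assert arch_regno == len(self.idx_to_regno)  # appended in order
        self.idx_to_regno.append((reg_type, regno))
        self.regno_to_idx[(reg_type, regno)] = arch_regno


# Hypothetical counts chosen only to keep the example small.
example = RegisterNumbering({RegisterType.PREG: 2, RegisterType.SREG: 3,
                             RegisterType.VMREG: 1, RegisterType.VREG: 2})
print(example.total_registers, example.idx_to_regno)
```

With these sample counts the Sreg block starts at arch_regno 3 and the Vreg block at 7, showing how each block base is the running sum of the classes before it plus the sentinel.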

RegisterNumbering::AddRegister

function AddRegister(this, type, regno, arch_regno, is_pseudo):  // sub_1d622bc0
    assert type <= 4               // "reg_class < kNumberRegisterTypes"
    assert arch_regno < this.total_registers_   // "reg_num < total_registers_"
    // 1. grow the per-type bit vector at &this[8 + 0x18*type] to total_registers_ bits
    InlinedBitVector::resize(&this.allocatable[type], total_registers_)
    // 2. reverse map: push (type, regno); Init adds in arch_regno order, so it lands at index arch_regno
    this.idx_to_regno_.push_back({type, regno})           // [this+0x80]
    // 3. forward map: (regno<<32 | type) -> arch_regno
    this.regno_to_idx_.insert((regno << 32) | type, arch_regno)   // [this+0x98]
    // 4. if not the sentinel and not a pseudo, set its bit in the per-type mask
    //    and record it in the per-type allocatable list
    if type != kNone and not is_pseudo:
        set_bit(this.allocatable[type], arch_regno)
        per_type_allocatable[type].push_back(arch_regno)
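
The forward-map key packing in step 3 can be illustrated as follows. This is a sketch with hypothetical helper names; it assumes both fields are non-negative and fit in 32 bits. Reading the key as a little-endian pair of two 32-bit ints (type in the low half, regno in the high half) is an interpretation, not something confirmed from the binary.

```python
LOW_32 = 0xFFFFFFFF


def pack_key(reg_type: int, regno: int) -> int:
    """Build the 64-bit forward-map key: regno in the high half, type in the low."""
    return (regno << 32) | reg_type


def unpack_key(key: int):
    """Recover (reg_type, regno) from a packed key."""
    return key & LOW_32, key >> 32


key = pack_key(4, 3)  # kVreg = 4, regno 3
print(hex(key), unpack_key(key))
```

Because the type occupies the low word, two registers with the same regno in different classes always produce distinct keys.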

Function Map

Function	Address	Role
`RegisterNumbering::Init`	`0x1d622520`	Prefix-sum + two-pass assignment
`RegisterNumbering::AddRegister`	`0x1d622bc0`	Assigns one `arch_regno`, builds both maps
`RegisterTypeToMnemonic`	`0x1d640600`	Class prefix for the name (`p`/`s`/`vm`/`v`)
`FastIntToBuffer`	`0x211719e0`	Integer→ASCII for the regno suffix
`RangeSpec::Match`	`0x1d624c80`	Config-driven per-regno filter (sets `is_pseudo`)
`~RegisterNumbering`	`0x1d491e60`	Frees the bit-vector / map buffers (confirms `0x130` size)

QUIRK — the prefix-sum starts total_registers_ at 1 and the second pass assigns arch_regnos starting at 1, while AddRegister(kNone, 0, 0, false) claims arch_regno 0. So index 0 is always the kNoRegister sentinel, the final assigned count equals total_registers_, and the assertion total_registers_ == reg_num enforces it. A reimplementation that numbers from 0 will collide with the sentinel and fail this check.

NOTE — RangeSpec::Match does not decide whether a regno is numbered: every regno in [0, count) receives an arch_regno. It decides only whether the register is a pseudo, which is excluded from the allocatable masks and lists but still numbered. Its config source is a TargetEnvironment register name/range filter that can reserve or rename registers; its result is the is_pseudo flag passed to AddRegister. The nominal class count may therefore exceed the usable (allocatable) count.

Per-Gen Register-File Sizes

The four count fields the numbering consumes are populated from the chip-parts sequencer descriptor per TpuVersion. The field offsets and the prefix-sum layout are byte-exact; the numeric counts below are the chip-parts values keyed by generation.

Generation (codename)	`TpuVersion`	SREG (`0x498`)	VREG (`0x49c`)	VMREG (`0x4a0`)	PREG (`0x4a4`)	Arch regs (incl. sentinel)
v2 / v3 / v4 (jellyfish / dragonfish / pufferfish)	`kJellyfish`=0, `kDragonfish`=1, `kPufferfish`=2	32	32	8	15	88 (87 + 1 sentinel)
v5p (+v5e lite) / v6e / v7 (viperfish / ghostlite / `6acc60406`)	`kViperfish`=3, `kGhostlite`=4, gen-5=5	32	64	16	14	127 (126 + 1 sentinel)

The TpuVersion enum has six values: {kJellyfish=0, kDragonfish=1, kPufferfish=2, kViperfish=3, kGhostlite=4, <gen-5>=5} (the sixth, the v7 generation, has codename 6acc60406 in this binary). The total arch-register count is 1 + PREG + SREG + VMREG + VREG (the +1 is arch_regno 0). The assignment order is always Preg → Sreg → Vmreg → Vreg, so the Preg block occupies arch_regno [1, 1+PREG), the Sreg block [1+PREG, 1+PREG+SREG), and so on.
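
The block ranges follow directly from the counts in the table. A minimal sketch, using the table's counts as inputs and a hypothetical helper name:

```python
def block_ranges(preg: int, sreg: int, vmreg: int, vreg: int):
    """Return half-open arch_regno ranges per class and the total count."""
    ranges = {}
    base = 1  # arch_regno 0 is the sentinel
    for name, count in (("preg", preg), ("sreg", sreg),
                        ("vmreg", vmreg), ("vreg", vreg)):
        ranges[name] = (base, base + count)
        base += count
    return ranges, base


# Counts copied from the per-generation table above.
older, older_total = block_ranges(preg=15, sreg=32, vmreg=8, vreg=32)
newer, newer_total = block_ranges(preg=14, sreg=32, vmreg=16, vreg=64)

print(older, older_total)
print(newer, newer_total)
```

For the older generations this places the Sreg block at [16, 48) and the Vreg block at [56, 88); for the newer ones the Sreg block moves to [15, 47) and the Vreg block to [63, 127).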


Claim	Evidence
Count-field offsets (`0x498/0x49c/0x4a0/0x4a4`)	`SregCount`/`VregCount` + `RegisterCount` jump-table decode
Prefix-sum total, `Preg→Sreg→Vmreg→Vreg` block order	`Init` decompile
Numeric counts per generation	Chip-parts sequencer descriptor

QUIRK — VREG doubles from 32 (v2/v3/v4) to 64 (v5p/v6e/v7), VMREG doubles 8→16, while PREG actually shrinks 15→14. A reimplementer numbering registers for a v5+ target who assumes the v2-era file sizes will mis-place every Sreg, Vmreg, and Vreg arch_regno, because the block bases shift with the per-class counts.

Read Paths

ToArchRegno — reverse lookup

function ToArchRegno(this, arch_regno):              // sub_1275f580
    if arch_regno == 0: Fatal("reg_num != kNoRegister")
    if arch_regno >= this.total_registers_: Fatal("reg_num < total_registers_")
    if arch_regno >= this.idx_to_regno_.size: BUG()
    return this.idx_to_regno_[arch_regno]            // [this+0x80][arch_regno] -> (RegisterType, regno)

ToRegNum — forward lookup

function ToRegNum(this, type, regno):                // sub_1d5a9000
    if type == kNone: Fatal("register_type != RegisterType::kNone")
    key = (regno << 32) | type
    it = this.regno_to_idx_.find(key)                // [this+0x98]
    if not found: return 0
    assert it.value <= 255                            // "iter->second <= UINT8_MAX"
    return it.value                                   // arch_regno

ToArchRegString — printable name

ToArchRegString @ 0x1275e2a0 resolves a slot to its printable form: StrCat(RegisterTypeToMnemonic(type), FastIntToBuffer(regno)) after (type, regno) = ToArchRegno(arch_regno). So arch register arch_regno prints as "<p|s|vm|v><regno>" — e.g. v3, s12, vm5, p2.

Function	Address	Reads
`ToArchRegno`	`0x1275f580`	`[this+0x80]` reverse vector
`ToRegNum`	`0x1d5a9000`	`[this+0x98]` forward map
`ToArchRegString`	`0x1275e2a0`	`ToArchRegno` + mnemonic

GOTCHA — ToRegNum returns 0 (the kNoRegister sentinel) on a miss, not a fatal error, and asserts the found arch_regno fits in a uint8. A virtual register whose (type, regno) was never numbered for this target silently resolves to "no register". ToArchRegno, by contrast, fatals on arch_regno == 0. The two directions are not symmetric in their error handling.
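
The asymmetry between the two directions is easy to see in a small model. The sketch below uses plain Python containers and a tiny hand-built numbering (sentinel, p0, s0, s1) whose counts are made up for illustration; the function names are Python spellings of the accessors above, and exceptions stand in for the binary's fatal checks.

```python
MNEMONIC = {1: "p", 2: "s", 3: "vm", 4: "v"}

# Hypothetical numbering: arch_regno -> (type, regno).
idx_to_regno = [(0, 0), (1, 0), (2, 0), (2, 1)]
regno_to_idx = {pair: i for i, pair in enumerate(idx_to_regno)}
total_registers = len(idx_to_regno)


def to_reg_num(reg_type: int, regno: int) -> int:
    """(type, regno) -> arch_regno; a miss returns the sentinel 0."""
    if reg_type == 0:
        raise RuntimeError("register_type != RegisterType::kNone")
    arch_regno = regno_to_idx.get((reg_type, regno), 0)
    assert arch_regno <= 0xFF  # "iter->second <= UINT8_MAX"
    return arch_regno


def to_arch_regno(arch_regno: int):
    """arch_regno -> (type, regno); the sentinel and out-of-range are fatal."""
    if arch_regno == 0:
        raise RuntimeError("reg_num != kNoRegister")
    if arch_regno >= total_registers:
        raise RuntimeError("reg_num < total_registers_")
    return idx_to_regno[arch_regno]


def to_arch_reg_string(arch_regno: int) -> str:
    reg_type, regno = to_arch_regno(arch_regno)
    return f"{MNEMONIC[reg_type]}{regno}"


print(to_reg_num(2, 1), to_reg_num(4, 0), to_arch_reg_string(3))
```

Feeding the silent `0` from a missed `to_reg_num` straight into `to_arch_regno` raises, so a caller that chains the two should check for the sentinel first.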

Per-Opcode Variant Classification

Each XLU opcode that ComputeXluOperations selects (see ResultFifo and ArchRegister Enums) is wrapped in a std::variant<TransposeTile, RpuOperation, XluControlOperation>. The variant is 0x48 bytes wide (stride confirmed by lea (idx,idx,8); shl 3, i.e. ×9 then ×8 = ×72 = 0x48), with the discriminant byte at +0x40 (the emitter writes movb $0/$1/$2, 0x40(...)). The index is chosen by four byte-exact classifier predicates:

Classifier	Address	Predicate	Selects	Variant idx
`LloOpcodeUsesTranspose`	`0x1d60bda0`	`(op-0xa6)<2 || (op-0x154)<2`	`{0xa6, 0xa7, 0x154, 0x155}`	0 `TransposeTile`
`LloOpcodeUsesRpu`	`0x1d60c2c0`	`(op-0xf5)<0xd` (plus a low-band bitmask, below)	`{0xf5..0x101}`	1 `RpuOperation`
`LloOpcodeIsRpuControl`	`0x1d60c1e0`	`(op-0x8b)<2`	`{0x8b, 0x8c}`	2 `XluControlOperation`
`LloOpcodeIsRpuResult`	`0x1d60c420`	`(op-0x14f)<2`	`{0x14f, 0x150}`	2 `XluControlOperation`

Variant bodies

TransposeTile (index 0) is the only owning alternative. Layout: InlinedVector<ArchRegister,2> read-set at +0x00, InlinedVector<ArchRegister,2> written-set at +0x18, an i64 partner/tile field at +0x30, a packed i32 at +0x37, discriminant at +0x40. The destructor frees the two heap-backed register sets when their size exceeds the inline capacity of 2.
RpuOperation (index 1) is trivially copyable (≤0x28 bytes). It holds an LloValue* at +0x10 plus inline fusion-key metadata (RpuOperationMetadata: u16 opcode at +0x00, i64 pattern/segment key at +0x08, optional<i64> at +0x10 gated by [+0x18]==1). Two RpuOperations with equal metadata are fusion candidates.
XluControlOperation (index 2) is trivially destructible — an LloValue* at +0x00 plus inline control metadata, no owning vectors.

LloOpcodeUsesRpu has a secondary low-opcode arm: for op <= 0x3b it tests _bittest64(0xC40000000000000, op), a mask whose set bits are {0x36, 0x3a, 0x3b} (RPU ops outside the cross-lane reduce band). The primary (op-0xf5)<0xd arm covers the 13 reduce ops 0xf5..0x101.
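
The four predicates and the low-band bitmask can be modelled directly. This is a sketch under assumptions: the range tests are treated as unsigned 32-bit compares (the usual compiled form of `(op - base) < width`), the function names are Python spellings of the classifiers, and the order in which `variant_index` tries them is illustrative, since the opcode bands are disjoint.

```python
U32 = 0xFFFFFFFF
RPU_LOW_MASK = 0x0C40000000000000  # the _bittest64 constant


def _in_band(op: int, base: int, width: int) -> bool:
    """Unsigned (op - base) < width; values below base wrap and fail."""
    return ((op - base) & U32) < width


def uses_transpose(op: int) -> bool:
    return _in_band(op, 0xA6, 2) or _in_band(op, 0x154, 2)


def uses_rpu(op: int) -> bool:
    if _in_band(op, 0xF5, 0xD):
        return True
    return op <= 0x3B and bool((RPU_LOW_MASK >> op) & 1)


def is_rpu_control(op: int) -> bool:
    return _in_band(op, 0x8B, 2)


def is_rpu_result(op: int) -> bool:
    return _in_band(op, 0x14F, 2)


def variant_index(op: int):
    """Variant discriminant for an opcode, or None if no classifier matches."""
    if uses_transpose(op):
        return 0
    if uses_rpu(op):
        return 1
    if is_rpu_control(op) or is_rpu_result(op):
        return 2
    return None


low_band = [op for op in range(0x40) if uses_rpu(op)]
reduce_band_size = sum(uses_rpu(op) for op in range(0xF5, 0x102))
print([hex(op) for op in low_band], reduce_band_size)
```

Enumerating the low band recovers exactly the three opcodes named above, and the primary arm accepts 13 opcodes.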

CrossXlu Data-Dependency Tracker

The per-op ArchRegister read/written sets ComputeXluOperations attaches to each variant are not used by the numbering itself — they drive a list-scheduler over the cross-lane ops. CrossXluOperationsDataDependencyTracker::Create @ 0x126cc9a0 walks the LloDependencyGraph in topological order, marks the nodes whose XLU-operation variant is in the input set, allocates a 0x330-byte tracker (operator new(0x330)), and connects each XLU op to the other XLU ops it has a true data dependency with.

function CrossXlu::Create(graph, xlu_ops, reverse):  // sub_126cc9a0
    nodes = graph.NodesInTopologicalOrder(reverse)
    // pass 1: mark nodes whose variant is in xlu_ops (flat_hash_set::find)
    for node in nodes:
        if xlu_ops.contains(node.value.variant): node.is_xlu = true
    tracker = new CrossXluOperationsDataDependencyTracker(region)  // 0x330 bytes
    // pass 2: per marked node, build cross-XLU edges
    for node in marked_nodes:
        tracker.internal_graph.AddUnsequencedNode(node.instruction)  // [tracker+0x100]
        for each def-use edge:
            if LloDependencyGraphNode::IsOperandOf(other):           // sub_14427d60
                add edge (producer -> consumer); ++consumer.in_edges
    return StatusOr<unique_ptr<...>>

The tracker constructor (0x126cda80) builds an internal LloDependencyGraph at tracker+0x100 and a LatencyTable (created from TpuVersion, the Target+0x398 field) at tracker+0xf8. Readiness is a standard in-edge-count test:

function XluOperationIsReady(tracker, op):            // sub_126cd920
    // dispatch on the variant discriminant at op[0x40]:
    switch op[0x40]:
        case 0: value = TransposeTile.read_set.last_element  // InlinedVector at op+0x08
        case 1: value = RpuOperation.lloValue                // op+0x10
        case 2: value = XluControlOperation.lloValue         // op+0x00
        default: Fatal("control != nullptr")
    node = tracker.internal_graph.GetNode(value)     // [tracker+0x100]
    return node.in_edge_count == 0                   // all predecessor XLU ops scheduled

RemoveScheduledXluOperation @ 0x126ccfa0 is the "fire" step — it removes a scheduled op and decrements its successors' in-edge counts. The edges are exactly the LloValue def-use chains: a producer XLU op whose written ArchRegister is read by a later XLU op becomes its predecessor (IsOperandOf confirms the genuine RAW/WAR relation). The consumers of the tracker are ComputeCombinablePairs (0x126d2480, fuses ops with equal RpuOperationMetadata), ReorderToShortenCriticalPath (0x126d3460), AssignXlu (0x126d3100), and AssignSourceBus (0x126d70e0).
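
The readiness test and the fire step together form an ordinary in-edge-count list scheduler. The sketch below models only that mechanism; the class name, the string-labelled ops, and the alphabetical tie-break are illustrative assumptions, whereas the real consumers choose among ready ops using latency and critical-path information.

```python
from collections import defaultdict


class InEdgeTracker:
    def __init__(self, edges):
        # edges: iterable of (producer, consumer) pairs.
        self.successors = defaultdict(list)
        self.in_edges = defaultdict(int)
        self.pending = set()
        for producer, consumer in edges:
            self.successors[producer].append(consumer)
            self.in_edges[consumer] += 1
            self.pending.update((producer, consumer))

    def is_ready(self, op) -> bool:
        """Ready when every predecessor has already been scheduled."""
        return op in self.pending and self.in_edges[op] == 0

    def remove_scheduled(self, op) -> None:
        """Fire step: retire op and release its successors."""
        assert self.is_ready(op)
        self.pending.remove(op)
        for succ in self.successors[op]:
            self.in_edges[succ] -= 1


def schedule(tracker: InEdgeTracker):
    order = []
    while tracker.pending:
        ready = sorted(op for op in tracker.pending if tracker.is_ready(op))
        if not ready:
            raise ValueError("dependency cycle")
        op = ready[0]  # illustrative tie-break only
        tracker.remove_scheduled(op)
        order.append(op)
    return order


# Diamond-shaped dependency: a feeds b and c, which both feed d.
order = schedule(InEdgeTracker([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
print(order)
```

In the diamond example `d` only becomes ready after both `b` and `c` have been removed, which is the behaviour the in-edge counter is there to guarantee.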

Function	Address	Role
`CrossXlu...::Create`	`0x126cc9a0`	Topo-mark → new tracker → per-node edge build
`CrossXlu...::ctor`	`0x126cda80`	Internal `LloDependencyGraph` + `LatencyTable`
`XluOperationIsReady`	`0x126cd920`	Readiness = in-edge count == 0
`RemoveScheduledXluOperation`	`0x126ccfa0`	List-scheduler fire step
`LloDependencyGraphNode::IsOperandOf`	`0x14427d60`	Confirms RAW/WAR data dependency

NOTE — the edge weights the dependency tracker feeds the critical-path reorder are the per-version op latencies in the LatencyTable (LatencyTable::Create @ 0x1c89fba0).

Component	Relationship
`ArchRegister` / `ArchRegisterInstance` (result-fifo / arch-register enums)	The physical-slot layer this numbering maps to `(RegisterType, regno)`
`ComputeXluOperations` (result-fifo / arch-register enums)	Builds the per-op `ArchRegister` read/written sets the tracker uses
`Target` register-count fields	The four per-class counts (`0x498..0x4c0`) the numbering prefix-sums

Cross-References

ResultFifo and ArchRegister Enums — the 50-member ArchRegister physical numbering and the 25 result FIFOs this numbering sits above
XLU Op Roster — the cross-lane opcode→factory table whose ops the variant classifiers tag
Slot: VPU — the vector-processing slot whose registers this numbering assigns
Slot: MXU — the matmul slot, its gain/latch matrix-register file distinct from these GPR classes
MC Emitter — the machine-code emitter that consumes arch_regno at encode time
opcode_info_big Record Format — the per-opcode descriptor holding the ArchRegister read/written lists
Bundle Model — the per-gen VLIW bundle these arch registers are encoded into


libtpu Internals — Reverse-Engineering Reference