ArchRegno Numbering

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

The TensorCore encoder needs a single dense integer per architectural register at the moment it emits a bundle slot, but the LLO IR carries virtual registers typed by RegisterType (preg / sreg / vmreg / vreg) plus a regno within that class. The bridge is RegisterNumbering: a per-Target object that assigns every (RegisterType, regno) a sequential arch-register number (arch_regno), keeping both a forward map ((type, regno) → arch_regno, read by ToRegNum) and a reverse vector (arch_regno → (type, regno), read by ToArchRegno). Because the register-file sizes differ per silicon generation, the numbering is rebuilt at every Target construction from four generation-specific count fields. This page documents that build — Target::InitRegisterNumbering and RegisterNumbering::Init / AddRegister — and the gen-specific count inputs that shape it.

If you know LLVM, the closest analogy is a TargetRegisterInfo that assigns MCRegister numbers, except the layout here is not generated by TableGen — it is computed at runtime by prefix-summing four per-class register counts into a contiguous arch-register space, in a fixed class order, with arch_regno 0 reserved as a null sentinel. There are three of these RegisterNumbering objects per Target, one per sequencer, each sized from a different subset of the count fields. The physical layer below this numbering — the 50-member ArchRegister enum that ArchRegisterInstance resolves — is documented in ResultFifo and ArchRegister Enums; this page is the layer above it.

For reimplementation, the contract is:

  • The three-RegisterNumbering-objects-per-Target layout and the four count source fields each consumes, plus the per-class name strings drawn from the TargetEnvironment.
  • RegisterNumbering::Init: prefix-sum the four class counts into total_registers_, reserve arch_regno 0 as kNoRegister, then assign sequential arch_regnos in kPreg → kSreg → kVmreg → kVreg order, building both maps.
  • ToArchRegno / ToRegNum / ToArchRegString: the read paths, their bounds, and the (type, regno) print format.
  • The per-opcode variant classification (LloOpcodeUsesTranspose / UsesRpu / IsRpuControl / IsRpuResult) and how the per-op ArchRegister read/written sets feed the CrossXluOperationsDataDependencyTracker.

Build entry | Target::InitRegisterNumbering @ 0x1d614200
Per-class numberer | RegisterNumbering::Init @ 0x1d622520
Register assigner | RegisterNumbering::AddRegister @ 0x1d622bc0
Reverse read | RegisterNumbering::ToArchRegno @ 0x1275f580 ([this+0x80][idx])
Forward read | RegisterNumbering::ToRegNum @ 0x1d5a9000 ([this+0x98] map)
Name print | RegisterNumbering::ToArchRegString @ 0x1275e2a0
Class order | kAllocatableRegisterTypes @ 0xadf7f54 = {kPreg=1, kSreg=2, kVmreg=3, kVreg=4}
RegisterNumbering size | 0x130 bytes
Source file | platforms/xla/service/jellyfish/register_numbering.{h,cc}

Target::InitRegisterNumbering — the Build

Purpose

InitRegisterNumbering @ 0x1d614200 is called once per Target to populate three RegisterNumbering objects embedded in the Target. Each captures one sequencer's register file. The object offsets and stride are byte-exact:

Map | RegisterNumbering this | Sequencer role | Classes present
--- | --- | --- | ---
1 | Target + 0x008 | Primary / TensorCore | all four (preg, sreg, vmreg, vreg)
2 | Target + 0x138 | non-primary (preg+sreg only) | preg, sreg
3 | Target + 0x268 | non-primary (vmreg+vreg only) | vmreg, vreg

The stride 0x130 is the RegisterNumbering size (0x138 - 0x008 = 0x268 - 0x138 = 0x130). The decompile confirms three RegisterNumbering::Init calls at this+8, this+312 (0x138), and this+616 (0x268).

Algorithm

Each map is a freshly-built absl::flat_hash_map<RegisterType, pair<int count, string name>> with four entries (keys 1..4), handed to RegisterNumbering::Init. The per-class count comes from the Target register-count field block; the per-class name comes from the TargetEnvironment.

function InitRegisterNumbering(target):              // sub_1d614200
    env = target[0x940]                              // TargetEnvironment
    // four per-class name strings, copied from the environment:
    name_preg  = env[0x18]; name_sreg = env[0x20]
    name_vmreg = env[0x28]; name_vreg = env[0x30]

    // ---- Map 1: primary / TensorCore sequencer (all 4 classes) ----
    map = { kPreg:  (target[0x4a4], name_preg),       // *((DWORD*)target+297)
            kSreg:  (target[0x498], name_sreg),       // *((DWORD*)target+294)
            kVmreg: (target[0x4a0], name_vmreg),      // *((DWORD*)target+296)
            kVreg:  (target[0x49c], name_vreg) }      // *((DWORD*)target+295)
    RegisterNumbering::Init(target + 0x008, map)

    // ---- Map 2: preg+sreg-only sequencer ----
    map = { kPreg: (target[0x4c0], ...), kSreg: (target[0x4b4], ...),
            kVmreg: (0, ...), kVreg: (0, ...) }
    RegisterNumbering::Init(target + 0x138, map)

    // ---- Map 3: vmreg+vreg-only sequencer ----
    map = { kPreg: (0, ...), kSreg: (0, ...),
            kVmreg: (target[0x4bc], ...), kVreg: (target[0x4b8], ...) }
    RegisterNumbering::Init(target + 0x268, map)

The count-field offsets are byte-confirmed in the decompile via the *((_DWORD *)this + N) index arithmetic, where DWORD index N addresses byte offset 4*N: index 294 = 0x498 (Sreg), 295 = 0x49c (Vreg), 296 = 0x4a0 (Vmreg), 297 = 0x4a4 (Preg); the non-primary maps use 301 = 0x4b4, 302 = 0x4b8, 303 = 0x4bc, 304 = 0x4c0. The whole [0x498..0x4c0] block is zeroed by the Target constructor and populated from the chip-parts sequencer descriptor during Target::Init.
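The index-to-offset arithmetic and the way the three maps draw on the count block can be checked mechanically. The following is a minimal sketch rather than the library's code: the Target is modelled as a plain dict from byte offset to count, and the names COUNT_FIELD_DWORD_INDEX, byte_offset, and build_count_maps are hypothetical.

```python
# DWORD indices copied from the decompile, keyed by (map number, class).
# The dict name and the helper names below are hypothetical.
COUNT_FIELD_DWORD_INDEX = {
    (1, "preg"): 297, (1, "sreg"): 294, (1, "vmreg"): 296, (1, "vreg"): 295,
    (2, "preg"): 304, (2, "sreg"): 301,
    (3, "vmreg"): 303, (3, "vreg"): 302,
}
CLASSES = ("preg", "sreg", "vmreg", "vreg")


def byte_offset(dword_index):
    """*((_DWORD *)this + N) addresses byte offset 4 * N."""
    return 4 * dword_index


def build_count_maps(target_fields):
    """Build the three {class: count} maps from a {byte offset: count} dict.

    A class with no count field for a given map gets count 0, matching the
    partial maps 2 and 3.
    """
    maps = {}
    for map_no in (1, 2, 3):
        counts = {}
        for cls in CLASSES:
            index = COUNT_FIELD_DWORD_INDEX.get((map_no, cls))
            counts[cls] = 0 if index is None else target_fields[byte_offset(index)]
        maps[map_no] = counts
    return maps


# Print the DWORD index -> byte offset table for every count field.
for (map_no, cls), index in sorted(COUNT_FIELD_DWORD_INDEX.items()):
    print(f"map {map_no} {cls:5s} DWORD {index} -> {byte_offset(index):#x}")
```

Passing a dict that holds all eight offsets to build_count_maps yields one four-entry map per sequencer, with zeros standing in for the classes that sequencer lacks.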

Function Map

Function | Address | Role
--- | --- | ---
Target::InitRegisterNumbering | 0x1d614200 | Builds the 3 RegisterNumbering objects
Target::RegisterCount | 0x1d617120 | Read side of the same 4 count fields, (seq, type)-keyed
Target::SregCount | 0x1d6152c0 | mov 0x498(rdi),eax (confirms Sreg offset)
Target::VregCount | 0x1d6152e0 | mov 0x49c(rdi),eax (confirms Vreg offset)

Target::RegisterCount @ 0x1d617120 is the read-side accessor and cross-confirms the (seq, type) → offset map byte-for-byte: sequencer 0 maps {kPreg→0x4a4, kSreg→0x498, kVmreg→0x4a0, kVreg→0x49c}; sequencer 1 {kPreg→0x4c0, kSreg→0x4b4}; sequencer 2 {kVmreg→0x4bc, kVreg→0x4b8} — matching the three maps exactly, so each RegisterNumbering object is precisely one sequencer's register file.

NOTE — the per-class name strings are configurable per TargetEnvironment (env+0x18/0x20/0x28/0x30), defaulting to the {p, s, vm, v} mnemonics of RegisterTypeToMnemonic. A reimplementation must read the names from the environment, not hardcode them, although every shipped target uses the defaults.

GOTCHA — the second and third RegisterNumbering objects are partial — map 2 has only preg+sreg counts, map 3 only vmreg+vreg, with the other two classes set to count 0. The non-primary sequencer's RegisterNumbering therefore numbers a strict subset of register classes. The precise TpuSequencerType label of these two (a BarnaCore address-handler vs a SparseCore-tile sequencer) is inferred from the count subsets and the RegisterCount sequencer arms, not from a decoded sequencer-type table.


RegisterNumbering::Init — Numbering the Classes

Purpose

RegisterNumbering::Init @ 0x1d622520 is handed the flat_hash_map<RegisterType, (count, name)> for one sequencer and assigns a sequential arch_regno to every register, building both directions of the map.

Object layout

struct RegisterNumbering {                  // sizeof 0x130
    int32 total_registers_;                 // +0x00  prefix-sum of the 4 class counts
    InlinedBitVector<128> allocatable[5];   // +0x08  per-RegisterType masks, stride 0x18
    vector<pair<RegisterType,int>> idx_to_regno_;  // +0x80  arch_regno -> (type, regno)
    flat_hash_map<pair<RegisterType,int>, int> regno_to_idx_;  // +0x98  (type, regno) -> arch_regno
};

The per-type bit-vector sub-structs start at +0x08 with stride 0x18 (AddRegister computes &this[8 + 0x18*type]); five types occupy 0x08..0x80. The reverse vector is at +0x80 (int* index 32) and the forward map at +0x98 (int* index 38) — both byte-confirmed in AddRegister's push_back and insert targets.

Algorithm

function RegisterNumbering::Init(this, class_count_map):   // sub_1d622520
    this.total_registers_ = 1
    running = 1
    for type in kAllocatableRegisterTypes:        // {kPreg, kSreg, kVmreg, kVreg}, stride 4, bound 16
        running += class_count_map.at(type).count // map "at" lookup; .count = pair.first
        this.total_registers_ = running           // [this+0] = prefix-sum total
    // NOTE: total starts at 1 — arch_regno 0 is reserved below, so the count
    // includes the null sentinel.

    AddRegister(this, kNone=0, regno=0, arch_regno=0, is_pseudo=false)  // reserves arch_regno 0

    next_arch_regno = 1
    for type in kAllocatableRegisterTypes:        // SECOND pass, same order
        mnemonic = RegisterTypeToMnemonic(type)
        spec     = RangeSpec(...)                  // config-driven name/range filter
        for k in [0, class_count_map.at(type).count):
            name      = StrCat(mnemonic, FastIntToBuffer(k))  // "v3", "s12", ...
            is_pseudo = RangeSpec::Match(spec, k, name, 1)
            AddRegister(this, type, regno=k, arch_regno=next_arch_regno + k, is_pseudo)
        next_arch_regno += class_count_map.at(type).count

    // post-conditions (assertion strings in the binary):
    assert index_to_regno_[kNoRegister].first == RegisterType::kNone   // index_to_regno_ is the +0x80 reverse vector (idx_to_regno_ in the layout)
    assert GetMask(RegisterType::kNone).count() == 0
    assert total_registers_ == reg_num   // reg_num = final next_arch_regno

The assignment order is fixed by kAllocatableRegisterTypes @ 0xadf7f54 (four int32s = {1, 2, 3, 4}): the arch-register space is laid out as [sentinel] [Preg block] [Sreg block] [Vmreg block] [Vreg block], each block sized by its class count.
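The two-pass build can be sketched in a few lines. This is an illustrative model under stated assumptions, not the binary's implementation: it omits the RangeSpec pseudo filter and the per-type bit vectors, uses a tuple key instead of the packed 64-bit key, and the Python attribute names are hypothetical.

```python
from enum import IntEnum


class RegisterType(IntEnum):
    kNone = 0
    kPreg = 1
    kSreg = 2
    kVmreg = 3
    kVreg = 4


# Fixed class order, mirroring kAllocatableRegisterTypes = {1, 2, 3, 4}.
K_ALLOCATABLE = (RegisterType.kPreg, RegisterType.kSreg,
                 RegisterType.kVmreg, RegisterType.kVreg)
K_NO_REGISTER = 0


class RegisterNumbering:
    def __init__(self, class_counts):
        # Pass 1: prefix-sum starting at 1, so the total includes the sentinel.
        self.total_registers = 1 + sum(class_counts[t] for t in K_ALLOCATABLE)
        self.index_to_regno = []   # arch_regno -> (type, regno)
        self.regno_to_index = {}   # (type, regno) -> arch_regno

        # Reserve arch_regno 0 as the null sentinel.
        self._add(RegisterType.kNone, 0, K_NO_REGISTER)

        # Pass 2: same class order, sequential arch_regnos starting at 1.
        next_arch_regno = 1
        for reg_type in K_ALLOCATABLE:
            for k in range(class_counts[reg_type]):
                self._add(reg_type, k, next_arch_regno + k)
            next_arch_regno += class_counts[reg_type]

        # Post-conditions corresponding to the assertions in the pseudocode.
        assert self.index_to_regno[K_NO_REGISTER][0] == RegisterType.kNone
        assert self.total_registers == next_arch_regno

    def _add(self, reg_type, regno, arch_regno):
        assert arch_regno < self.total_registers
        # push_back only lands at index arch_regno if calls arrive in order.
        assert arch_regno == len(self.index_to_regno)
        self.index_to_regno.append((reg_type, regno))
        self.regno_to_index[(reg_type, regno)] = arch_regno
```

The extra `arch_regno == len(index_to_regno)` check in `_add` is an addition of this sketch: it makes explicit that a push_back-based reverse vector is only correct because arch_regnos are handed out densely and in order.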

RegisterNumbering::AddRegister

function AddRegister(this, type, regno, arch_regno, is_pseudo):  // sub_1d622bc0
    assert type <= 4               // "reg_class < kNumberRegisterTypes"
    assert arch_regno < this.total_registers_   // "reg_num < total_registers_"
    // 1. size the per-type bit vector at &this[8 + 0x18*type] to total_registers_ bits
    //    (the bit itself is only set in step 4)
    InlinedBitVector::resize(&this.allocatable[type], total_registers_)
    // 2. reverse map: push (type, regno) at index arch_regno
    this.idx_to_regno_.push_back({type, regno})           // [this+0x80]
    // 3. forward map: (regno<<32 | type) -> arch_regno
    this.regno_to_idx_.insert((regno << 32) | type, arch_regno)   // [this+0x98]
    // 4. if not the sentinel and not a pseudo, mark it allocatable: set its bit in the
    //    per-type mask and append it to the per-type allocatable list
    if type != kNone and not is_pseudo:
        set_bit(per_type_mask[type], arch_regno)          // the bit vector sized in step 1
        per_type_allocatable[type].push_back(arch_regno)
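The packed forward-map key and the pseudo handling can be illustrated with a small model. This is a sketch under assumptions: both key fields are treated as unsigned 32-bit values, each per-type mask is a Python int used as a bit set, and the names NumberingState, pack_key, and add_register are hypothetical.

```python
K_NUMBER_REGISTER_TYPES = 5  # kNone plus the four allocatable classes
K_NONE = 0


class NumberingState:
    def __init__(self, total_registers):
        self.total_registers = total_registers
        self.index_to_regno = []   # arch_regno -> (type, regno)
        self.regno_to_index = {}   # packed key -> arch_regno
        # One int per register type, used as a bit mask over arch_regnos.
        self.allocatable_mask = [0] * K_NUMBER_REGISTER_TYPES


def pack_key(reg_type, regno):
    """(regno << 32) | type, assuming both fit in 32 unsigned bits."""
    return ((regno & 0xFFFFFFFF) << 32) | (reg_type & 0xFFFFFFFF)


def unpack_key(key):
    """Inverse of pack_key: returns (reg_type, regno)."""
    return key & 0xFFFFFFFF, key >> 32


def add_register(state, reg_type, regno, arch_regno, is_pseudo):
    assert reg_type < K_NUMBER_REGISTER_TYPES
    assert arch_regno < state.total_registers
    # Every register is numbered in both directions, pseudo or not.
    state.index_to_regno.append((reg_type, regno))
    state.regno_to_index[pack_key(reg_type, regno)] = arch_regno
    # Only real (non-sentinel, non-pseudo) registers become allocatable.
    if reg_type != K_NONE and not is_pseudo:
        state.allocatable_mask[reg_type] |= 1 << arch_regno
```

A pseudo register therefore still resolves through both maps but never appears in its class's allocatable mask.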

Function Map

Function | Address | Role
--- | --- | ---
RegisterNumbering::Init | 0x1d622520 | Prefix-sum + two-pass assignment
RegisterNumbering::AddRegister | 0x1d622bc0 | Assigns one arch_regno, builds both maps
RegisterTypeToMnemonic | 0x1d640600 | Class prefix for the name (p/s/vm/v)
FastIntToBuffer | 0x211719e0 | Integer→ASCII for the regno suffix
RangeSpec::Match | 0x1d624c80 | Config-driven per-regno filter (sets is_pseudo)
~RegisterNumbering | 0x1d491e60 | Frees the bit-vector / map buffers (confirms 0x130 size)

QUIRK — the prefix-sum starts total_registers_ at 1 and the second pass assigns arch_regnos starting at 1, while AddRegister(kNone, 0, 0, false) claims arch_regno 0. So index 0 is always the kNoRegister sentinel, the final assigned count equals total_registers_, and the assertion total_registers_ == reg_num enforces it. A reimplementation that numbers from 0 will collide with the sentinel and fail this check.

NOTE: RangeSpec::Match does not decide whether a regno is numbered; every regno in [0, count) receives an arch_regno. It decides only whether the register is a pseudo, which is still numbered but excluded from the allocatable masks and lists. Its config source is a TargetEnvironment register name/range filter that can reserve or rename registers, and its result is the is_pseudo flag passed to AddRegister. The nominal class count may therefore exceed the usable (allocatable) count.


Per-Gen Register-File Sizes

The four count fields the numbering consumes are populated from the chip-parts sequencer descriptor per TpuVersion. The field offsets and the prefix-sum layout are byte-exact; the numeric counts below are the chip-parts values keyed by generation.

Generation (codename) | TpuVersion | SREG (0x498) | VREG (0x49c) | VMREG (0x4a0) | PREG (0x4a4) | Arch regs (approx)
--- | --- | --- | --- | --- | --- | ---
v2 / v3 / v4 (jellyfish / dragonfish / pufferfish) | kJellyfish=0, kDragonfish=1, kPufferfish=2 | 32 | 32 | 8 | 15 | ~88 (87 + 1 sentinel)
v5p (+v5e lite) / v6e / v7 (viperfish / ghostlite / 6acc60406) | kViperfish=3, kGhostlite=4, gen-5=5 | 32 | 64 | 16 | 14 | ~127 (126 + 1 sentinel)

The TpuVersion enum has six values: {kJellyfish=0, kDragonfish=1, kPufferfish=2, kViperfish=3, kGhostlite=4, <gen-5>=5} (the sixth, the v7 generation, has codename 6acc60406 in this binary). The total arch-register count is 1 + PREG + SREG + VMREG + VREG (the +1 is arch_regno 0). The assignment order is always Preg → Sreg → Vmreg → Vreg, so the Preg block occupies arch_regno [1, 1+PREG), the Sreg block [1+PREG, 1+PREG+SREG), and so on.
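As a worked example of the block layout, the half-open arch_regno range of each class follows directly from the counts. The sketch below uses the counts from the table above; the helper name block_ranges is hypothetical.

```python
def block_ranges(preg, sreg, vmreg, vreg):
    """Return ({class: (first, one_past_last)}, total) for one register file.

    arch_regno 0 is the sentinel, so the first block starts at 1.
    """
    ranges = {}
    base = 1
    for name, count in (("preg", preg), ("sreg", sreg),
                        ("vmreg", vmreg), ("vreg", vreg)):
        ranges[name] = (base, base + count)
        base += count
    return ranges, base


# v2 / v3 / v4 counts from the table.
print(block_ranges(preg=15, sreg=32, vmreg=8, vreg=32))
# ({'preg': (1, 16), 'sreg': (16, 48), 'vmreg': (48, 56), 'vreg': (56, 88)}, 88)

# v5p / v6e / v7 counts from the table.
print(block_ranges(preg=14, sreg=32, vmreg=16, vreg=64))
# ({'preg': (1, 15), 'sreg': (15, 47), 'vmreg': (47, 63), 'vreg': (63, 127)}, 127)
```

Under these counts the Sreg base moves from 16 to 15, the Vmreg base from 48 to 47, and the Vreg base from 56 to 63, so every block after Preg lands at a different arch_regno in the two generation groups.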

Count-field offsets (0x498/0x49c/0x4a0/0x4a4) | SregCount/VregCount + RegisterCount jump-table decode
Prefix-sum total, Preg→Sreg→Vmreg→Vreg block order | Init decompile
Numeric counts per generation | Chip-parts sequencer descriptor

QUIRK — VREG doubles from 32 (v2/v3/v4) to 64 (v5p/v6e/v7), VMREG doubles 8→16, while PREG actually shrinks 15→14. A reimplementer numbering registers for a v5+ target who assumes the v2-era file sizes will mis-place every Sreg, Vmreg, and Vreg arch_regno, because the block bases shift with the per-class counts.


Read Paths

ToArchRegno — reverse lookup

function ToArchRegno(this, arch_regno):              // sub_1275f580
    if arch_regno == 0: Fatal("reg_num != kNoRegister")
    if arch_regno >= this.total_registers_: Fatal("reg_num < total_registers_")
    if arch_regno >= this.idx_to_regno_.size: BUG()
    return this.idx_to_regno_[arch_regno]            // [this+0x80][arch_regno] -> (RegisterType, regno)

ToRegNum — forward lookup

function ToRegNum(this, type, regno):                // sub_1d5a9000
    if type == kNone: Fatal("register_type != RegisterType::kNone")
    key = (regno << 32) | type
    it = this.regno_to_idx_.find(key)                // [this+0x98]
    if not found: return 0
    assert it.value <= 255                            // "iter->second <= UINT8_MAX"
    return it.value                                   // arch_regno

ToArchRegString — printable name

ToArchRegString @ 0x1275e2a0 resolves a slot to its printable form: StrCat(RegisterTypeToMnemonic(type), FastIntToBuffer(regno)) after (type, regno) = ToArchRegno(arch_regno). So arch register arch_regno prints as "<p|s|vm|v><regno>" — e.g. v3, s12, vm5, p2.

Function | Address | Reads
--- | --- | ---
ToArchRegno | 0x1275f580 | [this+0x80] reverse vector
ToRegNum | 0x1d5a9000 | [this+0x98] forward map
ToArchRegString | 0x1275e2a0 | ToArchRegno + mnemonic

GOTCHA — ToRegNum returns 0 (the kNoRegister sentinel) on a miss, not a fatal error, and asserts the found arch_regno fits in a uint8. A virtual register whose (type, regno) was never numbered for this target silently resolves to "no register". ToArchRegno, by contrast, fatals on arch_regno == 0. The two directions are not symmetric in their error handling.
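The asymmetry between the two directions is easiest to see side by side. The following is a minimal sketch with several assumptions: fatal checks are modelled as a hypothetical FatalError exception, the forward map uses a tuple key, total_registers_ is taken to equal the reverse vector's length, and the default class mnemonics are hardcoded.

```python
K_NO_REGISTER = 0
UINT8_MAX = 255
MNEMONIC = {1: "p", 2: "s", 3: "vm", 4: "v"}  # default class prefixes


class FatalError(Exception):
    """Stands in for the binary's fatal check failures (hypothetical)."""


def to_reg_num(regno_to_index, reg_type, regno):
    """Forward lookup: a miss quietly yields the kNoRegister sentinel."""
    if reg_type == 0:
        raise FatalError("register_type != RegisterType::kNone")
    arch_regno = regno_to_index.get((reg_type, regno), K_NO_REGISTER)
    assert arch_regno <= UINT8_MAX, "iter->second <= UINT8_MAX"
    return arch_regno


def to_arch_regno(index_to_regno, arch_regno):
    """Reverse lookup: the sentinel and out-of-range values are fatal."""
    if arch_regno == K_NO_REGISTER:
        raise FatalError("reg_num != kNoRegister")
    if arch_regno >= len(index_to_regno):
        raise FatalError("reg_num < total_registers_")
    return index_to_regno[arch_regno]


def to_arch_reg_string(index_to_regno, arch_regno):
    """Printable name: class mnemonic followed by the regno."""
    reg_type, regno = to_arch_regno(index_to_regno, arch_regno)
    return f"{MNEMONIC[reg_type]}{regno}"


# Tiny numbering: sentinel, p0, s0, s1.
index_to_regno = [(0, 0), (1, 0), (2, 0), (2, 1)]
regno_to_index = {pair: i for i, pair in enumerate(index_to_regno)}

print(to_reg_num(regno_to_index, 2, 1))        # 3
print(to_reg_num(regno_to_index, 4, 7))        # 0: never numbered, no error
print(to_arch_reg_string(index_to_regno, 3))   # s1
```

A caller that round-trips through both functions should therefore test the forward result against 0 before handing it to the reverse lookup.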


Per-Opcode Variant Classification

Each XLU opcode ComputeXluOperations selects (see ResultFifo and ArchRegister Enums) is wrapped in a std::variant<TransposeTile, RpuOperation, XluControlOperation>. The variant is 0x48 bytes wide (stride confirmed by lea (idx,idx,8); shl 3 = ×72), with the discriminant byte at +0x40 (the emitter writes movb $0/$1/$2, 0x40(...)). The index is chosen by four byte-exact classifier predicates:

Classifier | Address | Predicate | Selects | Variant idx
--- | --- | --- | --- | ---
LloOpcodeUsesTranspose | 0x1d60bda0 | (op-0xa6)<2 or (op-0x154)<2 | {0xa6, 0xa7, 0x154, 0x155} | 0 TransposeTile
LloOpcodeUsesRpu | 0x1d60c2c0 | (op-0xf5)<0xd (plus a low-band bitmask, below) | {0xf5..0x101} | 1 RpuOperation
LloOpcodeIsRpuControl | 0x1d60c1e0 | (op-0x8b)<2 | {0x8b, 0x8c} | 2 XluControlOperation
LloOpcodeIsRpuResult | 0x1d60c420 | (op-0x14f)<2 | {0x14f, 0x150} | 2 XluControlOperation

Variant bodies

  • TransposeTile (index 0) is the only owning alternative. Layout: InlinedVector<ArchRegister,2> read-set at +0x00, InlinedVector<ArchRegister,2> written-set at +0x18, an i64 partner/tile field at +0x30, a packed i32 at +0x37, discriminant at +0x40. The destructor frees the two heap-backed register sets when their size exceeds the inline capacity of 2.
  • RpuOperation (index 1) is trivially copyable (≤0x28 bytes). It holds an LloValue* at +0x10 plus inline fusion-key metadata (RpuOperationMetadata: u16 opcode at +0x00, i64 pattern/segment key at +0x08, optional<i64> at +0x10 gated by [+0x18]==1). Two RpuOperations with equal metadata are fusion candidates.
  • XluControlOperation (index 2) is trivially destructible — an LloValue* at +0x00 plus inline control metadata, no owning vectors.

LloOpcodeUsesRpu has a secondary low-opcode arm: for op <= 0x3b it tests _bittest64(0xC40000000000000, op), a mask whose set bits are {0x36, 0x3a, 0x3b} (RPU ops outside the cross-lane reduce band). The primary (op-0xf5)<0xd arm covers the 13 reduce ops 0xf5..0x101.
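The four predicates and the low-band mask translate directly into range and bit tests. The sketch below is illustrative: the unsigned compares are modelled as explicit range checks, the Python function names are hypothetical, and the order in which variant_index consults the predicates is an assumption, since the page does not state it.

```python
# Low-band RPU mask: bits 0x36, 0x3a and 0x3b set (written as 0xC4 << 52).
LOW_BAND_RPU_MASK = 0xC4 << 52


def in_band(op, base, width):
    """Models the unsigned compare (op - base) < width."""
    return 0 <= op - base < width


def uses_transpose(op):
    return in_band(op, 0xA6, 2) or in_band(op, 0x154, 2)


def uses_rpu(op):
    if in_band(op, 0xF5, 0xD):          # the 13 reduce ops 0xf5..0x101
        return True
    # Secondary arm: bit test against the low-band mask for op <= 0x3b.
    return 0 <= op <= 0x3B and bool((LOW_BAND_RPU_MASK >> op) & 1)


def is_rpu_control(op):
    return in_band(op, 0x8B, 2)


def is_rpu_result(op):
    return in_band(op, 0x14F, 2)


def variant_index(op):
    """Return 0, 1 or 2 for an XLU opcode, or None if no classifier matches.

    The check order here is an assumption; the bands do not overlap, so the
    result does not depend on it.
    """
    if uses_transpose(op):
        return 0  # TransposeTile
    if uses_rpu(op):
        return 1  # RpuOperation
    if is_rpu_control(op) or is_rpu_result(op):
        return 2  # XluControlOperation
    return None


# Decode the mask back into its set bit positions.
print([hex(b) for b in range(64) if (LOW_BAND_RPU_MASK >> b) & 1])
# ['0x36', '0x3a', '0x3b']
```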


CrossXlu Data-Dependency Tracker

The per-op ArchRegister read/written sets ComputeXluOperations attaches to each variant are not used by the numbering itself — they drive a list-scheduler over the cross-lane ops. CrossXluOperationsDataDependencyTracker::Create @ 0x126cc9a0 walks the LloDependencyGraph in topological order, marks the nodes whose XLU-operation variant is in the input set, allocates a 0x330-byte tracker (operator new(0x330)), and connects each XLU op to the other XLU ops it has a true data dependency with.

function CrossXlu::Create(graph, xlu_ops, reverse):  // sub_126cc9a0
    nodes = graph.NodesInTopologicalOrder(reverse)
    // pass 1: mark nodes whose variant is in xlu_ops (flat_hash_set::find)
    for node in nodes:
        if xlu_ops.contains(node.value.variant): node.is_xlu = true
    tracker = new CrossXluOperationsDataDependencyTracker(region)  // 0x330 bytes
    // pass 2: per marked node, build cross-XLU edges
    for node in marked_nodes:
        tracker.internal_graph.AddUnsequencedNode(node.instruction)  // [tracker+0x100]
        for each def-use edge:
            if LloDependencyGraphNode::IsOperandOf(other):           // sub_14427d60
                add edge (producer -> consumer); ++consumer.in_edges
    return StatusOr<unique_ptr<...>>

The tracker constructor (0x126cda80) builds an internal LloDependencyGraph at tracker+0x100 and a LatencyTable (created from TpuVersion, the Target+0x398 field) at tracker+0xf8. Readiness is a standard in-edge-count test:

function XluOperationIsReady(tracker, op):            // sub_126cd920
    // dispatch on the variant discriminant at op[0x40]:
    switch op[0x40]:
        case 0: value = TransposeTile.read_set.last_element  // InlinedVector at op+0x08
        case 1: value = RpuOperation.lloValue                // op+0x10
        case 2: value = XluControlOperation.lloValue         // op+0x00
        default: Fatal("control != nullptr")
    node = tracker.internal_graph.GetNode(value)     // [tracker+0x100]
    return node.in_edge_count == 0                   // all predecessor XLU ops scheduled

RemoveScheduledXluOperation @ 0x126ccfa0 is the "fire" step: it removes a scheduled op and decrements its successors' in-edge counts. The edges are exactly the LloValue def-use chains: a producer XLU op whose written ArchRegister is read by a later XLU op becomes its predecessor, which is a read-after-write (RAW) dependency, and IsOperandOf is the check that confirms it. The consumers of the tracker are ComputeCombinablePairs (0x126d2480, fuses ops with equal RpuOperationMetadata), ReorderToShortenCriticalPath (0x126d3460), AssignXlu (0x126d3100), and AssignSourceBus (0x126d70e0).

Function | Address | Role
--- | --- | ---
CrossXlu...::Create | 0x126cc9a0 | Topo-mark → new tracker → per-node edge build
CrossXlu...::ctor | 0x126cda80 | Internal LloDependencyGraph + LatencyTable
XluOperationIsReady | 0x126cd920 | Readiness = in-edge count == 0
RemoveScheduledXluOperation | 0x126ccfa0 | List-scheduler fire step
LloDependencyGraphNode::IsOperandOf | 0x14427d60 | Confirms the def-use (RAW) data dependency

NOTE — the edge weights the dependency tracker feeds the critical-path reorder are the per-version op latencies in the LatencyTable (LatencyTable::Create @ 0x1c89fba0).
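The readiness test and the fire step amount to a standard in-edge-count scheduler. The following sketch shows that mechanism only, under stated assumptions: ops are plain strings given in topological order, the operand_of callback stands in for IsOperandOf, latencies are ignored, and the class and method names are hypothetical.

```python
from collections import defaultdict


class CrossOpTracker:
    def __init__(self, ops, operand_of):
        """ops: op names in topological order.

        operand_of(a, b) returns True when a's result is an operand of b.
        """
        self.successors = defaultdict(list)
        self.in_edges = {op: 0 for op in ops}
        for i, producer in enumerate(ops):
            for consumer in ops[i + 1:]:
                if operand_of(producer, consumer):
                    self.successors[producer].append(consumer)
                    self.in_edges[consumer] += 1

    def is_ready(self, op):
        """Ready when every predecessor has already been scheduled."""
        return self.in_edges[op] == 0

    def remove_scheduled(self, op):
        """Fire step: drop the op and release its successors."""
        for succ in self.successors.pop(op, []):
            self.in_edges[succ] -= 1
        del self.in_edges[op]


# Three ops: a feeds b and c, and b feeds c.
deps = {("a", "b"), ("a", "c"), ("b", "c")}
tracker = CrossOpTracker(["a", "b", "c"], lambda p, c: (p, c) in deps)

print([op for op in "abc" if tracker.is_ready(op)])   # ['a']
tracker.remove_scheduled("a")
print([op for op in "bc" if tracker.is_ready(op)])    # ['b']
tracker.remove_scheduled("b")
print(tracker.is_ready("c"))                          # True
```

The pairwise edge build here is quadratic in the number of ops, which is acceptable for a sketch; the real tracker walks def-use edges of the dependency graph instead.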


Component | Relationship
--- | ---
ArchRegister / ArchRegisterInstance (result-fifo / arch-register enums) | The physical-slot layer this numbering maps to (RegisterType, regno)
ComputeXluOperations (result-fifo / arch-register enums) | Builds the per-op ArchRegister read/written sets the tracker uses
Target register-count fields | The four per-class counts (0x498..0x4c0) the numbering prefix-sums

Cross-References

  • ResultFifo and ArchRegister Enums — the 50-member ArchRegister physical numbering and the 25 result FIFOs this numbering sits above
  • XLU Op Roster — the cross-lane opcode→factory table whose ops the variant classifiers tag
  • Slot: VPU — the vector-processing slot whose registers this numbering assigns
  • Slot: MXU — the matmul slot, its gain/latch matrix-register file distinct from these GPR classes
  • MC Emitter — the machine-code emitter that consumes arch_regno at encode time
  • opcode_info_big Record Format — the per-opcode descriptor holding the ArchRegister read/written lists
  • Bundle Model — the per-gen VLIW bundle these arch registers are encoded into