Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

MRB Chain Allocator

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, not stripped — every symbol below is a demangled C++ name). Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset; .data VMA − 0x400000 == file offset. Other versions will differ.

Abstract

MrbChainAllocator is the Stage-2 pass that decides which matrix-result buffer (MRB) FIFO entry each matmul accumulation chain occupies, and for how long. A TPU matmul does not write its result to a register; it pushes operands into the systolic array and the partial sum lands in one of a small pool of result-FIFO buffers (the MRB) many cycles later. When several matmuls accumulate into one running sum — a K-tiled reduction — they form an accumulation chain that must hold a single MRB entry from the chain's first matmul until its result is consumed. The allocator's job is to place every chain on the finite MRB pool over a program-order timeline, freeing each entry the instant its result retires so a later chain can reuse it.

The natural reference frame is linear-scan register allocation over a single physical register class, with one twist: the "registers" are a FIFO with a fixed retire latency, so eviction is keyed on time, not on liveness intervals. The allocator walks the chains in matmul issue order; at each step it advances a monotone clock (AdvanceTimeTo), which evicts every chain whose result became available before "now" and recycles that chain's MRB entry into a per-chunk free pool (ReleaseMrbReservation); it then reserves an entry for the current chain — either a fresh insert or, if the chain already holds an entry from an earlier matmul, a replace_data that extends the entry's unevictable-until time. A chain that would have to hold its entry past the available window is cut at the current time (SplitAccumulationChain) and its tail is deferred to be re-reserved later. This is the MRB analog of the MXU assignment bin-packer — but where the MXU packer runs a greedy min-makespan search over N interchangeable lanes, the MRB allocator is a deterministic program-order sweep with a time-keyed eviction pool, because the result FIFO is an ordered fixed-retire-latency resource rather than N parallel matmul lanes.

This page documents the timeline algorithm and the chain-assignment logic: the central chains_unevictable_until_ boost bimap (ProgramOrder-ordered on one side, chain-pointer-hashed on the other), the AssignMrbEntriesToChains driver loop, the AdvanceTimeTo eviction engine, the ReleaseMrbReservation per-chunk recycle pool, and the SplitAccumulationChain defer-tail mechanism. The per-step cost that advances the reservation end-times — ExtendMrbReservation's two parallel push_time + min(throughput, cap) and push_time + min(latency, cap) advances — is the cost cell consumed by this timeline; the throughput and edge terms are priced by the CycleTable (CycleTableInstruction) and LatencyBetween, each clamped from above by an option cap (+0x20/+0x28) and written into separate per-MXU timeline arrays, not summed. The second half of this page closes the public edge-latency surface: the optional LatencyBetween jitter (a flag-gated uniform-random perturbation, off by default) and the trace-instruction edge clamps.

For reimplementation, the contract is:

  • The chains_unevictable_until_ bimap relation type, and the two reservation arms (insert for a new chain, replace_data to extend an existing one).
  • The driver loop discipline: introsort the accumulations by matmul program order, then per accumulation Split → AdvanceTimeTo → FIFO-push rounding → Reserve → ExtendMrbReservation.
  • The eviction rule: advancing the clock frees every chain whose matres_program_order is strictly less than the new time, recycling each freed MrbEntry into the per-chunk free pool.
  • The split rule: a chain cannot be split in the past; the cut keeps the head's already-consumed accumulations and defers the tail in a btree_map keyed by program order.
  • The LatencyBetween wrapper: base edge → optional jitter (only if a BitGen* is installed) → trace-edge clamps.
Driver(anon)::AssignMrbEntriesToChains @0x10f4ac60 (Target&, CycleTable&, LatencyTable&, Span<const AccumulationChain>, int, MrbAccumulationOptions)
Per-step cost cellMrbChainAllocator::ExtendMrbReservation @0x10f58800 (two parallel push + min(throughput, cap) / push + min(latency, cap) timeline advances; priced by CycleTable/LatencyBetween)
Eviction engineMrbChainAllocator::AdvanceTimeTo(StrongInt<ProgramOrder>) @0x10f5e9e0
Recycle poolMrbChainAllocator::ReleaseMrbReservation(MrbEntry) @0x10f5f9e0
Split / defer-tailMrbChainAllocator::SplitAccumulationChain @0x10f598e0
Central structurechains_unevictable_until_ — boost bimap<set_of<StrongInt<ProgramOrder>>, unordered_set_of<AccumulationChainAfterSplit*>, with_info<AccumulationWithOriginalChain>>
Jitter wrapperLatencyTable::LatencyBetween(LloValue,LloValue) @0x1c89f820; base ctor @0x1c89f800
Source fileplatforms/xla/service/jellyfish/mxu_accumulation.cc (RetChecks at lines 1017, 1332, 1371)
ConfidenceCONFIRMED (byte-anchored) unless a row or callout says otherwise

The Central Data Structure

Purpose

The allocator must answer two questions in O(1)/O(log n) at every loop step: (a) which chain is the next to retire (so the clock can evict it), and (b) does this chain pointer already hold a reservation (so the reserve becomes an extend rather than a new entry)? A single Boost bimap answers both: one side is ordered by retire time, the other is hashed by chain pointer.

chains_unevictable_until_

The relation type is byte-exact, demangled from the final_erase_ instantiation @0x10f61d60 and re-confirmed inline in SplitAccumulationChain @0x10f598e0:

// boost::bimaps::relation::mutant_relation<...> — chains_unevictable_until_
bimap<
    set_of<            StrongInt<ProgramOrder>,         std::less<...> >,   // LEFT  — ORDERED by retire time
    unordered_set_of<  AccumulationChainAfterSplit*,    boost::hash<...> >, // RIGHT — HASHED by chain pointer
    with_info<         AccumulationWithOriginalChain >                      // INFO  — next_accumulation
>;
  • LEFT key = the chain's matres_program_order: the program-order time at which the chain's matrix result becomes available, i.e. the time until which its MRB entry is un-evictable. The LEFT index is an ordered (set_of) tree, so its front node is always the next chain to retire.
  • RIGHT key = the AccumulationChainAfterSplit* chain pointer, in an unordered_set_of hashed by boost::hash. A reserve looks the chain up here to decide insert-vs-extend.
  • INFO = an AccumulationWithOriginalChain (next_accumulation), the accumulation that follows the one currently reserving — carried alongside the relation rather than as a key.

NOTE — the bimap is the eviction index; it is distinct from the per-MXU chains_next_accumulating_at_[mxu_id] array that AdvanceTimeTo post-checks (RetCheck string "chains_next_accumulating_at_[mxu_id].left.begin()->first >= to", mxu_accumulation.cc:1371) and from the per-MXU mrb_entries_[mxu_id][curr_level] size-class block map (insert string "mrb_entries_[mxu_id][curr_level].insert(MrbBlock{.offset = 2 * mrb_block.offset + 1}).second"). Three structures coexist in one allocator: the time-ordered eviction bimap, the per-MXU next-accumulation index, and the per-MXU level-keyed block map. Only the first drives eviction; this page documents it.

Allocator Object Layout

Reconstructed from the field accesses across the four traced methods. Field roles are CONFIRMED; struct-name bindings for the option caps and the chunk arrays are INFERRED (no struct-layout string).

OffsetFieldMeaning
+0x00Target*the JF Target (vtable gates [+0x390] MRB-support, [+0x5e0] FIFO granule)
+0x08CycleTable*GetCyclesForThroughput source — the throughput axis of ExtendMrbReservation
+0x10LatencyTable*LatencyBetween source — the edge axis
+0x20s64 throughput capmin(throughput, cap) cell — copied from MrbAccumulationOptions
+0x28s64 latency capmin(latency, cap) cell — copied from MrbAccumulationOptions
+0x70s64 latest_matmul_monotone clock; AdvanceTimeTo/SplitAccumulationChain RetCheck guard
+0xd8..+0xe8chains_unevictable_until_the eviction bimap (LEFT-ordered head at +0xd8, RIGHT-hashed at +0xe0)
+0x7a0per-chunk free poolflat_hash_set<MrbEntry> of recycled entries, bucketed by chunk_id (ReleaseMrbReservation)
btree_map<long, unique_ptr<AccumulationChainAfterSplit>>the deferred split-off chain tails, keyed by program order

The +0x70 clock carries the canonical name latest_matmul_: both AdvanceTimeTo and SplitAccumulationChain read *(this+0x70) and name it latest_matmul in their RetCheck format strings (@0x10f5ea08, @0x10f59920).


AssignMrbEntriesToChains — the Driver

Purpose

AssignMrbEntriesToChains @0x10f4ac60 is the entry point. It receives the matmul accumulation chains for a sequence group as absl::Span<const AccumulationChain> (demangled signature, line 1), copies them into a mutable working form, constructs an MrbChainAllocator on the stack, and drives the per-accumulation loop. The whole function is the allocator's main.

Entry Point

AssignMrbEntriesToChains  @0x10f4ac60   ── (Target&, CycleTable&, LatencyTable&, Span<AccumulationChain>, int, MrbAccumulationOptions)
  ├─ copy Span<AccumulationChain> → working vector<AccumulationChainAfterSplit>
  ├─ introsort accumulations by matmul program order        @0x10f4b092  (gtl::OrderBy<Accumulation::program_order, Less>)
  ├─ construct MrbChainAllocator (stack)                    @0x10f4b113
  │     ├─ copy MrbAccumulationOptions+0x4..+0x14 → allocator+0x1c..+0x2c   (16-byte vmovups)
  │     └─ __size_returning_new per-MXU chunk arrays         @0x10f4b2cf  (count clamped ≥ 8)
  └─ per accumulation (in program order):
        SplitAccumulationChain    @0x10f4c984
        AdvanceTimeTo(now)        @0x10f4c9c3
        FIFO-push rounding        @0x10f4ca06
        Reserve (insert | replace) → ExtendMrbReservation  @0x10f4d9a8 / @0x10f4dfca

Algorithm

function AssignMrbEntriesToChains(target, cycles, latency, chains, num_mxus, opts):  // @0x10f4ac60
    // 1. Materialize a mutable working list of accumulations.
    work = copy(chains)                          // 0x650-byte AccumulationChain → 0x70-byte AccumulationChainAfterSplit

    // 2. Process chains in matmul *issue* order, so the clock only ever moves forward.
    introsort(work.accumulations,                // @0x10f4b092
              OrderBy(Accumulation::matmul_program_order, Less))   // ascending

    // 3. Build the allocator; copy the two ExtendMrbReservation caps out of the options.
    alloc = MrbChainAllocator(target, cycles, latency, opts, num_mxus)   // @0x10f4b113
    //   alloc.throughput_cap (+0x20) and alloc.latency_cap (+0x28) copied from MrbAccumulationOptions
    //   alloc.chunks[max(num_mxus, 8)] = __size_returning_new(...)      // @0x10f4b2cf

    // 4. Main loop — one iteration per accumulation, already in program order.
    for accum in work.accumulations:
        chain = accum.chain
        now   = accum.matmul_program_order

        // 4a. If a held chain cannot keep its entry to its result time, cut it now and defer the tail.
        if chain_must_yield_entry(chain, accum):
            SplitAccumulationChain(chain, accum)      // @0x10f4c984

        // 4b. Advance the monotone clock; this evicts + recycles every retired chain.
        AdvanceTimeTo(now)                            // @0x10f4c9c3

        // 4c. Size the result-FIFO push and pick the MXU instance.
        push = LloInstructionPushesToResultFifo(accum.matmul)            // @0x1d4f3600
        if target.vtable[+0x390]() > 0:               // MRB-support gate  (line 1871)
            granule = target.vtable[+0x5e0]()         // FIFO-push granule (byte) (line 1874)
            push    = ceil(push / granule) * granule  // round up to a granule multiple (idiv)
        size_class = bsr(push)                        // log2 size class   (_BitScanReverse, line 1887)
        mxu        = accum.matmul.unit_id() & 3        // MXU-instance index from matmul unit_id (line 1864/1716)

        // 4d. Reserve — new entry or extend existing — then price the step.
        it = alloc.chains_unevictable_until_.right.find(&chain)
        if it == end:                                  // first matmul of this chain
            alloc.chains_unevictable_until_.right.insert(
                { &chain, accum.matres_program_order, accum.next_accumulation })
            ExtendMrbReservation(chain, accum.next_accumulation)         // @0x10f4d9a8
        else:                                          // chain absorbs another matmul
            alloc.chains_unevictable_until_.right.replace_data(
                it, accum.matres_program_order)
            ExtendMrbReservation(chain, accum.next_accumulation)         // @0x10f4dfca

The Two Reserve Arms

The reserve is the inlined ReserveMrbEntry (a two-arm absl::Overload visitor, lambdas @0x222f5f10/@0x222f5f40). The arm is selected by whether the chain pointer is already in chains_unevictable_until_.right:

ArmConditionBimap opExtendMrbReservation siteSource string
new reservationchain absent (first matmul)right.insert({&chain, matres_program_order, next_accum})@0x10f4d9a8guarded by RetCheck("!chain->mrb_entry.has_value()") cc:1542 — the chain must not already hold an entry
extend reservationchain present (absorbs a matmul)right.replace_data(it, matres_program_order)@0x10f4dfca"chains_unevictable_until_.right.replace_data(it, curr_accumulation.accumulation.matres_program_order)" @0xa104fb1

QUIRK — the two arms differ only in how they touch the bimap, not in what they price. Both call ExtendMrbReservation to advance the chain's reservation end-time by the per-step cost. The replace_data arm overwrites the LEFT key (matres_program_order) so the eviction order tracks the latest matmul of a multi-matmul chain. A reimplementation that re-inserts instead of replace_data on the extend path would leave a stale duplicate in the ordered index and free the entry too early.

The FIFO-Push Rounding

Confirmed byte-exact in the driver (lines 1869–1887). The number of result-FIFO slots a matmul pushes is LloInstructionPushesToResultFifo(matmul) @0x1d4f3600; when the target supports MRB (gate vtable[+0x390]() > 0, line 1871), the push is rounded up to a multiple of a per-gen granule (vtable[+0x5e0](), a byte, line 1874) via an integer divide; bsr(push) then yields a log2 size class (with a RetCheck(absl::has_single_bit(entries_needed)) at cc:875 guarding power-of-two, and target_level < kLevelsInMrbAllocator). The MXU instance is accumulation.matmul->unit_id() & 3 — low 2 bits of the matmul's unit id, not its program order (line 1864; the unit id is read at line 1716 and RetCheck-guarded by "accumulation.matmul->unit_id().has_value()" at cc:1518).

NOTE — the Target vtable layer here is not the JellyfishTarget base @0x21cc6bc0; the named accessors for [+0x390] (MRB-support count) and [+0x5e0] (FIFO-push granule) were not resolved. The slot offsets, the >0 gate, and the round-up arithmetic are byte-exact; the accessor names are INFERRED.


AdvanceTimeTo — the Eviction Engine

Purpose

AdvanceTimeTo(StrongInt<ProgramOrder> new_time) @0x10f5e9e0 is what makes the allocator a timeline. Advancing the clock to the current matmul's program order frees every MRB entry whose result has already retired — that is, every chain whose matres_program_order is strictly less than new_time — recycling each freed entry into the per-chunk pool and erasing it from the eviction bimap.

Algorithm

function AdvanceTimeTo(new_time):                        // @0x10f5e9e0
    // 1. The clock is monotone: you may never advance backward.
    if this.latest_matmul_ /*+0x70*/ > new_time:         // line 128
        FATAL "MrbEntries must be reserved in program order "
              "(latest_matmul=%v, current_matmul=%v)"    // mxu_accumulation.cc:1332  @0xa0f5d91

    // 2. Evict every chain whose result became available before now.
    while front = chains_unevictable_until_.left.begin():    // ordered-by-ProgramOrder head  @+0xd8
        if front.matres_program_order >= new_time:           // line 175 (cmp [node-0x88], to)
            break                                            // nothing left retires before now
        chain = front.chain
        if front.complete /* relation-node byte *(node-8) != 1 → chain is done */:
            ReleaseMrbReservation(chain.mrb_entry)            // recycle  (lines 178-205)
            // chain+0x2c is RetChecked nonzero here (BUG() if 0) before reading mrb_entry @chain+0x20
        else: // *(node-8) == 1: chain is now "evictable, next Accumulation is ..." (deferred, not released)
        chains_unevictable_until_.final_erase_(front)         // remove from bimap  @0x10f61d60

    // 3. Post-condition: each per-MXU next-accumulation index head is now >= new_time.
    for mxu in 0 .. num_mxus:                                 // lines 706-739
        RetCheck(chains_next_accumulating_at_[mxu].left.begin()->first >= new_time)  // cc:1371

    this.latest_matmul_ = new_time                            // commit the clock  (line 754)

GOTCHA — the guard fires when latest_matmul_ > new_time (line 128), i.e. time must be non-decreasing (FATAL on current > new, success on current <= new). Because the driver feeds accumulations in introsorted program order, new_time never regresses and the guard never trips in normal operation; it is a structural invariant check, not a runtime branch. The eviction comparison in the loop, by contrast, is strict <: a chain whose result lands exactly at new_time is not yet retired and keeps its entry.

NOTE — the eviction loop branches on the relation-node "complete" byte at *(node−8): when it is ≠ 1 the chain is done, so its MRB entry is recycled via ReleaseMrbReservation (VLOG "… releasing MRB entry, as it is complete", cc:1355); when it is == 1 the chain still has a pending next-accumulation, so it is logged as "Chain holding … is now evictable, next Accumulation is …" (mxu_accumulation.cc:1347) and erased without a release. On the release path the chain's own *(chain+0x2c) validity byte is asserted nonzero (BUG() if 0) before the MrbEntry is read from chain+0x20/chain+0x28.


ReleaseMrbReservation — the Recycle Pool

Purpose

ReleaseMrbReservation(MrbEntry) @0x10f5f9e0 returns a freed result-FIFO entry to a per-chunk pool of available entries, so a later reservation can reuse it instead of allocating a new chunk slot. The MRB pool is a true FIFO with fixed retire latency; recycling preserves that ordering.

MrbEntry and the Pool

An MrbEntry is a packed { short chunk_id, int fifo_index, byte format }. The pool at this+0x7a0 is an Abseil flat_hash_set<MrbEntry> whose buckets are addressed by chunk_id, and each bucket holds an absl::linked_hash_set<MrbEntry> (insertion-ordered, so the recycle list itself is a FIFO).

function ReleaseMrbReservation(entry):                   // @0x10f5f9e0
    // 1. Address the per-chunk bucket: chunk_id selects a 0x40-byte sub-table.
    bucket = pool_base(this+0x7a0) + (entry.chunk_id << 6)    // line 33: (int16)entry << 6

    // 2. Hash the whole {chunk_id, fifo_index, format} with the absl CRC32 mixer.
    h = MixingHashState(kSeed @0x22042400)               // _mm_crc32_u64 over the 3 fields (lines 42-43)

    // 3. Insert a recycle node; on first use, grow the SOO table.
    node = operator new(0x20)                            // line 111
    node.entry  = entry                                  // {chunk_id, fifo_index, format}
    node.format = entry.format
    push_front(bucket.free_list /* head @+0x28 */, node) // lines 114-119
    ++bucket.count                                       // free-entry count @+0x38 (line 118)
    return node
    // VLOG "Freeing MRB entry " @0xa1b5c8d

QUIRK — the recycle node is operator new(0x20) and linked into a doubly-linked linked_hash_set per chunk, not pushed onto a plain free stack. The linked hash set both dedups (an entry cannot be freed twice into the same chunk) and preserves recycle order. A reimplementation that uses a bare LIFO free list will reuse entries in the wrong order and diverge from the binary's deterministic FIFO reuse.


SplitAccumulationChain — Cut and Defer

Purpose

SplitAccumulationChain(AccumulationChainAfterSplit&, AccumulationWithOriginalChain const&) @0x10f598e0 handles the contention case: a chain that cannot keep its MRB reservation up to the current matmul is cut at the split point. The head retains the accumulations consumed so far; the tail becomes a fresh chain deferred to be re-reserved later in program order.

Algorithm

function SplitAccumulationChain(chain, split_point):     // @0x10f598e0
    // 1. You cannot split in the past — the cut must be at or after the clock.
    if split_point.matmul_program_order /*accum+0x8*/ < this.latest_matmul_ /*+0x70*/:   // line 63
        FATAL "Cannot split an AccumulationChain in the past"   // mxu_accumulation.cc:1017  @0x84dcedc
        // RetCheck expr: "split_point.accumulation.matmul_program_order >= latest_matmul_"  @0x87b7fa6

    // 2. If the chain still occupies an MRB entry, free it — the split forces a wait.
    if chain.holds_mrb_entry /* *(chain+0x2c) == 1 */:                   // line 74
        chains_unevictable_until_.final_erase_(chain)                    // line 251
        ReleaseMrbReservation(chain.mrb_entry)                           // line 287
        // VLOG "Freeing MRB entry ..., since the occupying chain has been split at ..."  @0xa1daa56 (cc:1033)

    // 3. Cut the accumulation span at the split point.
    consumed   = split_point.matmul - chain.begin /*chain+0x8*/         // line 120
    CHECK(consumed <= chain.len /*chain+0x18*/)                         // "pos > size()" guard
    tail_begin = split_point.matmul                                     // line 119 (a4[13])
    tail_span  = chain.span_ptr /*chain+0x10*/ + 0x60 * consumed        // 0x60-byte stride
    tail_len   = chain.len - consumed
    chain.len  = consumed        /* the head keeps the already-consumed accumulations */   // line 128

    // 4. Defer the tail: store it as a heap-owned chain keyed by program order.
    tail = operator new(0x30) { begin=tail_begin, span=tail_span, len=tail_len, holds_mrb=false }  // line 195
    deferred_tails /* btree_map<long, unique_ptr<AccumulationChainAfterSplit>> */
        .insert({ split_point.matmul_program_order, tail })            // @0x10f5d2e0 (line 204)

GOTCHA — step 3 keeps the head short and emits the tail as the new deferred chain — the head's len is set to consumed (*(chain+0x18) = consumed, line 128), not to len − consumed; the heap-owned tail node gets len − consumed. The split point is the boundary: everything before it stays in place, everything from it onward is deferred and re-reserved when the clock later reaches split_point.matmul_program_order.

NOTE — the tail is owned by a btree_map<long, unique_ptr<AccumulationChainAfterSplit>> keyed by program order, so deferred remainders are walked back into the timeline in order. This is the one place the allocator allocates a chain on the heap; the original chains live in the driver's working vector.


Net Policy and the MXU Contrast

The allocator is a single forward sweep with three time-keyed effects per step: split-if-contended, advance-and-evict, reserve-or-extend. There is no makespan search and no backtracking. The per-step advance is ExtendMrbReservation's two parallel writes — push_time + min(CycleTableInstruction, cap@+0x20) into the per-MXU throughput timeline and push_time + min(LatencyBetween, cap@+0x28) into the per-MXU result-entry timeline — the throughput axis from the CycleTable, the edge axis from LatencyBetween, each clamped from above by an option cap (+0x20/+0x28). The two are not summed.

AspectMRB Chain Allocator (this page)MXU Assignment Bin-Packer
Resource shapeone ordered FIFO with fixed retire latencyN interchangeable MXU passes
Disciplinedeterministic program-order sweepgreedy min-makespan search
Evictiontime-keyed (AdvanceTimeTo evict-by-matres_program_order)n/a — places, never evicts
Scoringnone — placement is forced by timeLatchLatencyChangeAfterAdding per candidate
Contention handlingsplit + defer tail (SplitAccumulationChain)choose a different MXU pass
Shared cost inputsCycleTable matmul throughput + LatencyBetween edgesame

The structural difference is the point: the result FIFO is an ordered resource whose entries retire on a clock, so the optimal placement is forced by time and no search is needed. The MXU lanes are interchangeable, so a search over assignments pays off. They share the cost cells but not the allocation algorithm.


LatencyBetween — the Public Edge Wrapper

Purpose

LatencyTable::LatencyBetween(LloValue from, LloValue to) @0x1c89f820 is the public dispatcher for the read-after-write edge latency that every scheduler stage prices against — including the ExtendMrbReservation edge axis above. It wraps the per-gen LatencyBetweenInternal (the base model, documented on bundle-aware-cost) with two gen-invariant corrections: an optional random jitter and the trace-instruction edge clamps. This page owns those two corrections.

Algorithm

function LatencyBetween(from, to):                       // @0x1c89f820
    raw = this->vtable[+0x18](from, to)                  // base: per-gen LatencyBetweenInternal

    // --- OPTIONAL JITTER ---  (line 27)
    bitgen = this->jitter_bitgen  /* this+0x10 */
    if bitgen != null:                                   // null by default → skip
        raw += Uniform(bitgen, 0, 101)                   // bounds (0,101) passed to absl UniformDistributionWrapper<int>
                                                         // via DistributionCaller<BitGen> @0xfa7c9a0 (Randen PRNG)

    // --- TRACE-EDGE CLAMPS ---  (lines 33-52)
    if from.opcode == 0x84 /*trace-arg*/ and to.opcode == 0x84:
        cfg = AutoOr<int>::FromProtoOrDie(AutoProto)     // @0x10979760  ([from+0x10]→[+0x38]→deref→[+0xc78])
        edge = cfg.is_set ? cfg.value : 16               // default 16
        raw  = max(raw, edge)
    else if from.opcode == 0x82 /*set-tracemark*/ and (to.opcode - 0x82) <= 2:   // to ∈ {0x82,0x83,0x84}
        raw = max(raw, 2)                                // min-2 floor on tracemark → trace* edges

    return raw                                            // VLOG latency_table.cc:83 @0x878d9cb

The Jitter Knob

The jitter BitGen* lives at LatencyTable+0x10 and is initialized to 0 in the base constructor @0x1c89f800 (*(this+0x10) = 0, decompiled), and none of the per-gen derived constructors set it — so jitter is off by default. It is installed only when the command-line flag FLAGS_xla_jf_random_latency (flag name "xla_jf_random_latency" @0x84bcebc; flag global @0x223b47c8) is enabled.

KnobTypeDefaultDescription
xla_jf_random_latencyBOOL/BitGen installoff (+0x10 == null)When set, installs a Randen-AES BitGen* at LatencyTable+0x10; every edge then gains a uniform-random perturbation drawn with bounds (0, 101)

QUIRK — the jitter is a robustness/fuzzing knob, not a hardware latency term. With it on, the scheduler is exercised against latency variation so its decisions are stress-tested for sensitivity to edge-latency noise — it does not model any real silicon delay. A reimplementer pricing actual TPU cycles must leave +0x10 null; the deterministic LatencyBetweenInternal is the hardware model.

NOTE — the site that wires the flag to the BitGen* install was not byte-walked; the null default in the base ctor and the flag's name/global are CONFIRMED, but when the install happens (which factory path reads GetFlag(xla_jf_random_latency)) is not traced (LOW). The AutoOr<int> proto field for the 0x84/0x84 trace-arg edge is decoded structurally (@0x10979760); its default of 16 is byte-exact, but the proto field name is not pinned (LOW).

The Trace-Edge Clamps

The clamps keep profiling-marker instruction edges from collapsing onto each other in the schedule. The three opcodes (resolved from the opcode_string table @0x21cd0d60 via .rela.dyn RELATIVE-addend decode) are 0x82 = set-tracemark, 0x83 = trace, 0x84 = trace-arg.

from opcodeto opcodeFloor appliedSource
0x84 trace-arg0x84 trace-argmax(raw, AutoOr<int> default 16)line 36–46
0x82 set-tracemark0x82 / 0x83 / 0x84max(raw, 2)line 49–51

A trace-arg → trace-arg edge takes a configurable floor (default 16 cycles); a set-tracemark → any-trace edge takes a fixed min-2 floor. Both ensure inserted trace markers retain a minimum separation so they do not pile into one bundle, which would corrupt the emitted trace stream.


Function Map

FunctionAddressRole
(anon)::AssignMrbEntriesToChains0x10f4ac60driver: copy → introsort → allocator ctor → main loop
MrbChainAllocator::ExtendMrbReservation0x10f58800per-step cost cell (throughput and edge, each capped via min, written to separate per-MXU timelines)
MrbChainAllocator::AdvanceTimeTo0x10f5e9e0monotone clock + evict-by-matres_program_order
MrbChainAllocator::ReleaseMrbReservation0x10f5f9e0recycle a freed MrbEntry into the per-chunk pool
MrbChainAllocator::SplitAccumulationChain0x10f598e0cut a chain at the clock; defer the tail
bimap final_erase_0x10f61d60the chains_unevictable_until_ relation type (demangled)
btree_map<long, unique_ptr<…>>::insert0x10f5d2e0store the deferred split-off chain tail
LloInstructionPushesToResultFifo0x1d4f3600FIFO-push count, rounded to the per-gen granule
Target::vtable[+0x390]MRB-support count gate (> 0)
Target::vtable[+0x5e0]FIFO-push rounding granule (byte)
LatencyTable::LatencyBetween0x1c89f820base edge → optional jitter → trace clamps
LatencyTable::LatencyTable(TpuVersion)0x1c89f800base ctor — +0x10 = 0 (jitter off)
DistributionCaller<BitGen>::Impl<UniformDistributionWrapper<int>>0xfa7c9a0the Uniform(0,101) jitter draw (Randen PRNG)
AutoOr<int>::FromProtoOrDie0x10979760trace-arg 0x84/0x84 edge config (default 16)
LloOpcodeString0x1d631360opcode → name table @0x21cd0d60 (0x82/0x83/0x84)

Considerations

  • Program-order is the only ordering. Every guard (AdvanceTimeTo, SplitAccumulationChain) keys off matmul_program_order and the latest_matmul_ clock. The introsort at @0x10f4b092 is a precondition for correctness, not an optimization: if the accumulations are not in ascending matmul order, the monotone-clock RetCheck (cc:1332) fires.
  • Eviction is strict <, the clock guard is >. A chain whose matres_program_order == new_time is not evicted this step (its result has not retired); but the clock may advance to that time. The two comparisons use different relations deliberately.
  • The bimap and the per-MXU indices must stay consistent. AdvanceTimeTo post-checks every chains_next_accumulating_at_[mxu] front (cc:1371) after eviction. A reimplementation maintaining only the eviction bimap will pass the eviction loop but fail this invariant.
  • MrbChunkState / AccumulationChainAfterSplit layouts are functional. The 0x70-byte AccumulationChainAfterSplit body (+0x8 begin, +0x10 span ptr, +0x18 len, +0x20/+0x28 the MrbEntry, +0x2c "has mrb entry" byte) and the per-MXU MrbChunkState element are reconstructed from access patterns, not a struct-layout string (LOW on exact byte layout; the field roles are CONFIRMED from use).

NameRelationship
mxu-assignment-binpackerthe MXU sibling — greedy min-makespan over N lanes vs. this deterministic FIFO sweep
mxu-sequence-structthe MxuSequence/SequenceInfo record whose accumulation chains feed this allocator
mrb-fifo-msr-placementthe next layer — turns the reserved (chunk_id, fifo_index) MrbEntrys into physical FIFO/MSR addresses
bundle-aware-costowns the base LatencyBetweenInternal edge and MaxResourceCycles throughput this allocator prices against
jf-cycletablethe GetCyclesForThroughput cell that is the throughput axis of ExtendMrbReservation

Cross-References

  • TPU Scheduling Pipeline — Stage 2 placement of this allocator between the HLO scheduler and the bundle packer.
  • MXU Assignment Bin-Packer — the MXU sibling allocator; the contrast that explains why MRB needs no makespan search.
  • MxuSequence / SequenceInfo — the per-sequence record holding the accumulation chains this pass consumes.
  • MRB FIFO / MSR Placement — the downstream pass that materializes the reserved MrbEntrys into result-FIFO and MSR addresses.
  • Latch Assignment & Overrun — the latch-index commit that the bundle packer reads as a slot-legality input.
  • Bundle-Aware Cost — the base LatencyBetween/LatencyBetweenInternal edge axis and MaxResourceCycles throughput this allocator's cost cell uses.
  • JF CycleTable — the per-gen matmul throughput cycles that drive the ExtendMrbReservation recurrence.
  • MXU Latency Overview — the MXU occupancy reservation model that complements this result-buffer allocator.
  • MXU Slot — the LLO MXU instructions (matmul/matpush/matres) whose accumulation chains this allocator places.
  • LLO Opcode Enum — the opcode numbering behind the 0x82/0x83/0x84 trace-instruction identities.
  • Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
  • Index entry: Part VIII — Instruction Scheduling & Bundle Packing — back to index