MRB FIFO / MSR Placement

Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped — every symbol below is a demangled C++ name). Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset. Other versions differ.

Abstract

This is the physical-placement layer of Stage 2's matrix-resource assignment. The MRB chain allocator decides, on a program-order reservation timeline, which matrix-result-buffer (MRB) chunk and FIFO index each matmul accumulation chain occupies; it hands each chain a MrbEntry = {chunk_id, fifo_index, format}. Two MxuAssigner methods then turn those abstract reservations into the two concrete attributes the per-generation encoder serializes into the bundle word: AllocateMrbEntriesAsFifo (0x10f3ef80) assigns each matmul and each matres a physical result-FIFO slot address, and BounceBetweenMsrs (0x10f3fae0) assigns each MxuSequence a physical matrix-staging-register (MSR) bank, msra (0) or msrb (1). Both run once per MXU quadrant, dispatched from MxuAssigner::VisitRegion over an array<Span<MxuSequence>,4> (the four MXU instances of a region).

The familiar reference frame is a hardware ring buffer paired with a double-buffered staging latch. The result FIFO is a per-MXU circular buffer: AllocateMrbEntriesAsFifo maintains two running cursors per MXU — a write cursor (advanced on the matmul/push side) and a read cursor (advanced on the matres/pop side) — each advanced by the per-op entry count, rounded up to a write-block granule, and wrapped modulo the FIFO depth. The MSR bounce is the classic ping-pong of a weight-stationary array: while msra feeds one sequence's matmuls, the next sequence's gain matrix stages into msrb, then the banks swap; BounceBetweenMsrs realizes that by toggling a one-bit bank index at the top of every sequence and stamping it onto each op via set_matrix_staging_register. The one wrinkle is the LMR/XMR path: a vector-matmul-lmr (opcode 0xa5) reads its gain matrix directly from a load-matrix register rather than staging through an MSR, so the bounce is meaningless for it and the whole pass returns early.

This page documents the two placement functions byte-for-byte, the two struct layouts the allocator carries them on (MrbChunkState, which is actually two distinct per-chunk arrays, and the 0x70-byte AccumulationChainAfterSplit record), and the downstream consumer (LloInstruction::set_mrb_address → set_mrb_address_unrestricted, the matmul-family field at +0x42, read back by mrb_address). It does not re-derive the reservation timeline (that is the chain allocator) or the bundle bit positions of the MSR/MRB fields (those are the MXU slot and per-gen bundle pages).

For reimplementation, the contract is:

  • The FIFO discipline — two int32 cursor arrays sized by num_mxus, per-MXU select unit_id & 3, advance by MxuResultEntriesPushed/Popped, round up to MinMrbWriteBlockSize, wrap modulo ResultFifoEntryCount(kMrf0, version), committed by set_mrb_address.
  • The MSR bounce — the 0xa5 early-out pre-scan, the per-sequence msr ^= 1 toggle, and set_matrix_staging_register's per-op-family destination byte (+0x46/+0x44/+0x42/+0x41).
  • The two struct layouts — the 0x70-byte per-MXU timeline element vs the 0x40-byte per-chunk free-pool record, and the 0x70-byte AccumulationChainAfterSplit, all keyed by MrbEntry.chunk_id.
  • The per-gen accessor surface — six Target vtable slots that price the placement, and why the path is gen-gated off in v0.0.40 (every shipped TensorCore target returns MatmulResultBufferEntries() == 0).

| Item | Detail |
|---|---|
| FIFO placement | MxuAssigner::AllocateMrbEntriesAsFifo(Target const&, Span<unique_ptr<MxuSequence>>) @0x10f3ef80 |
| MSR bounce | MxuAssigner::BounceBetweenMsrs(Target const&, Span<unique_ptr<MxuSequence>>) @0x10f3fae0 |
| Per-quadrant dispatch | MxuAssigner::VisitRegion @0x10f3a640 → array<Span,4> lambdas $_5 @0x10f40a80, $_6 @0x10f40b60 |
| Physical FIFO write | LloInstruction::set_mrb_address(int) @0x1d4d93c0 → set_mrb_address_unrestricted(int) @0x1d4e9780 (field +0x42) |
| MSR write | LloInstruction::set_matrix_staging_register(Msr) @0x1d4d7d40 (fields +0x46/+0x44/+0x42/+0x41) |
| FIFO depth | ResultFifoEntryCount(kMrf0=8, TpuVersion) @0x1d631520; table @0xb53e240 = {16,16,16,48,224,256} |
| Msr enum | MsrToString @0x1d629720: 0 → "msra", 1 → "msrb" |
| Carried on | MrbChainAllocator per-MXU timeline @alloc+0xf0 (stride 0x70), per-chunk free pool @alloc+0x7a0 (stride 0x40) |
| Source | platforms/xla/service/jellyfish/mxu_assigner.cc, mxu_accumulation.cc, llo_instruction.cc |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |

The Per-Quadrant Dispatch

Purpose

Both placement functions are region-local and quadrant-local. MxuAssigner::VisitRegion (0x10f3a640) groups the region's MxuSequences by physical MXU into an array<Span<unique_ptr<MxuSequence>>,4> — one span per MXU instance — and runs that array through two InvokeObject lambdas: $_5 (0x10f40a80) calls AllocateMrbEntriesAsFifo four times (once per quadrant span, paired [rcx+0/0x10/0x20/0x30] pointer with [rcx+8/0x18/0x28/0x38] length); $_6 (0x10f40b60) calls BounceBetweenMsrs the same four times. Each lambda early-outs if a call returns a non-OK Status. The captured first word of each lambda is the Target& proxy.

Structure

MxuAssigner::VisitRegion                          0x10f3a640
  build array<Span<unique_ptr<MxuSequence>>, 4>   ── one span per MXU instance (quadrant)
  ├─ $_5  InvokeObject(AllocateMrbEntriesAsFifo)   0x10f40a80
  │     AllocateMrbEntriesAsFifo(target, quadrant[0])     0x10f40abe
  │     AllocateMrbEntriesAsFifo(target, quadrant[1])     0x10f40ad2
  │     AllocateMrbEntriesAsFifo(target, quadrant[2])     0x10f40ae6
  │     AllocateMrbEntriesAsFifo(target, quadrant[3])     0x10f40afc
  │       └─ early-out if Status != OK (1)
  └─ $_6  InvokeObject(BounceBetweenMsrs)          0x10f40b60
        BounceBetweenMsrs(target, quadrant[q]) × 4        0x10f40f91…

QUIRK — the FIFO cursors are per-MXU inside one quadrant call, not global. AllocateMrbEntriesAsFifo allocates its two cursor arrays at function entry and frees them at function return, so each of the four quadrant calls starts both cursors at zero. The arrays are sized num_mxus, and the per-element index comes from unit_id & 3 (the per-op MXU instance, HIBYTE(WORD[+0xb]) & 3), so within one call the cursors are independent per MXU. A reimplementation that hoists the cursors to region scope (carrying the FIFO write position across quadrants) will diverge — the binary resets them per call. Whether the hardware actually expects a fresh FIFO per quadrant region or a cross-region carry is not traced (the cursor lifetime is strictly intra-call here).
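
The dispatch shape and the cursor lifetime described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the binary's code: `run_per_quadrant`, `make_fifo_placer`, the string status `OK`, and the `(mxu, pushed)` tuples are all hypothetical names chosen for illustration, and the granule round-up and FIFO wrap are deliberately left out so that only the per-call reset is visible.

```python
OK = "OK"  # hypothetical stand-in for an OK Status


def run_per_quadrant(quadrants, place):
    """Call `place` once per quadrant span, stopping at the first non-OK status."""
    for span in quadrants:
        status = place(span)
        if status != OK:
            return status  # early-out, as each dispatch lambda is described to do
    return OK


def make_fifo_placer(num_mxus, log):
    """Build a placer whose cursors live only for the duration of one call."""
    def place(span):
        # Allocated at entry and dropped at return: every quadrant starts at zero.
        write_ptr = [0] * num_mxus
        for mxu, pushed in span:  # (MXU index, entries pushed) is an assumed shape
            log.append((mxu, write_ptr[mxu]))
            write_ptr[mxu] += pushed  # round-up and wrap omitted in this sketch
        return OK
    return place


log = []
quadrants = [[(0, 2), (0, 2)], [(0, 4)], [], []]
status = run_per_quadrant(quadrants, make_fifo_placer(4, log))
print(status, log)
```

The second quadrant's first op lands at address 0 again, which is the behaviour a region-scoped cursor would not reproduce.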


FIFO Placement — AllocateMrbEntriesAsFifo

Purpose

Walk one quadrant's MxuSequences in program order and assign every matmul and every matres a concrete result-FIFO slot address via set_mrb_address. The result FIFO is modelled as a per-MXU circular buffer of ResultFifoEntryCount(kMrf0, version) entries; matmuls push into it (advancing a write cursor) and matreses pop from it (advancing a separate read cursor).

Entry Point

AllocateMrbEntriesAsFifo(Target const& [rdi], Span<unique_ptr<MxuSequence>> [rsi=ptr, rdx=count])
  num_mxus      = Target[+0x4ac]                         ── 0x10f3ef9b  (movsxd; same field as LatchLhs)
  write_ptr[]   = operator new(4 * num_mxus); memset 0   ── 0x10f3efc4  ([rbp-0x48], matmul/push side)
  read_ptr[]    = operator new(4 * num_mxus); memset 0   ── 0x10f3efe0  ([rbp-0x68], matres/pop side)
  fifo_depth    = ResultFifoEntryCount(8, Target[+0x398])── 0x10f3f019
  for each MxuSequence in span: …

Algorithm

// MxuAssigner::AllocateMrbEntriesAsFifo @0x10f3ef80
//   a1 = Target const&  (the MxuAssigner `this` is dead; rdi carries the Target&)
//   span = Span<unique_ptr<MxuSequence>>  (one MXU quadrant)
function AllocateMrbEntriesAsFifo(target, span):
    num_mxus = target[+0x4ac]                       // CONFIRMED Target field (== LatchLhs num_mxus read)
    if num_mxus > 0:
        write_ptr = new int32[num_mxus]; memset(write_ptr, 0)   // matmul/push cursor, [rbp-0x48]
        read_ptr  = new int32[num_mxus]; memset(read_ptr,  0)   // matres/pop cursor,  [rbp-0x68]
    fifo_depth = ResultFifoEntryCount(kMrf0 /*=8*/, target[+0x398] /*DeepseaVersion*/)  // @0x1d631520

    for seq in span:
        RET_CHECK(!seq->matmuls.empty())            // mxu_assigner.cc:881  ([seq+0x50] count)
        // per-MXU quadrant select: bits 8..9 of LloValue WORD[+0xb], gated by the 0x400 "has mxu" bit
        seq_unit = (seq->matmuls[0].word[0xb] & 0x400) ? ((HIBYTE(word) & 3) | HAS_VALUE) : 0
        RET_CHECK(seq->matmuls[0]->unit_id().has_value())       // mxu_assigner.cc:882

        matres_index = 0
        for matmul in seq->matmuls:                 // [seq+0x48] ptr, [seq+0x50] count
            fmt  = matmul.matmul_data_format()      // @0x1d4e8440
            push = target.MxuResultEntriesPushed(matmul_op, fmt)    // vtbl[+0x5f0]
            unit = matmul->unit_id()
            RET_CHECK(unit.has_value())             // mxu_assigner.cc:890
            RET_CHECK(unit == seq_unit)             // mxu_assigner.cc:893 "same unit id"
            mxu  = unit & 3                          // per-MXU index into the cursor arrays

            // ---- THE PHYSICAL WRITE (matmul push side) ----
            set_mrb_address(matmul, write_ptr[mxu])             // @0x1d4d93c0  ← FIFO slot
            write_ptr[mxu] += push                              // advance by entries this matmul pushes
            g = target.MinMrbWriteBlockSize()                   // vtbl[+0x5e0], a byte (idiv divisor)
            write_ptr[mxu] = ceil_div(write_ptr[mxu], g) * g    // round UP to the write-block granule
            write_ptr[mxu] %= fifo_depth                        // wrap into the FIFO

            // ---- THE MATRES POP SIDE (only when this matmul pushes entries) ----
            if push > 0:
                base = read_ptr[mxu]
                acc  = 0
                while acc < push:                   // drain `push` worth of result entries
                    RET_CHECK(matres_index < seq->matreses.size())   // mxu_assigner.cc:902 / :913 (checked before indexing)
                    matres = seq->matreses[matres_index]        // [seq+0x60] ptr, [seq+0x68] count
                    RET_CHECK(matres->unit_id().has_value())         // mxu_assigner.cc:924
                    RET_CHECK(*matres->unit_id() == unit)            // mxu_assigner.cc:925
                    rel = target.MrbRelativeAddressFromOffset(matres_op, acc)   // vtbl[+0x5e8]
                    set_mrb_address(matres, (base + rel) % fifo_depth)          // @0x1d4d93c0
                    acc += target.MxuResultEntriesPopped(matres_op, fmt)        // vtbl[+0x5f8]
                    matres_index += 1
            read_ptr[mxu] = ceil_div((read_ptr[mxu] + push), g_pop) * g_pop % fifo_depth   // symmetric

        RET_CHECK(matres_index == seq->matreses.size())          // mxu_assigner.cc:949 "too many matreses"
    return OK

The round-up arithmetic is emitted as a signed idiv plus the standard "did we truncate toward zero" fixup, exactly as the decompiler renders it at 0x10f3f218:

// write_ptr[mxu] = ceil(write_ptr[mxu] / g) * g, for a signed dividend (0x10f3f1f8..0x10f3f234)
q = write_ptr[mxu] / g;                                  // idiv
write_ptr[mxu] = g * (q + ((write_ptr[mxu] > g*q) & (q >= 0)));   // bump quotient if there was a remainder
write_ptr[mxu] %= fifo_depth;                            // final wrap
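
To make the fixup concrete, here is a small Python sketch of the same arithmetic. It assumes a positive granule and a non-negative cursor for the final wrap (Python's `%` differs from C's on negative operands); `c_div` and `round_up_and_wrap` are hypothetical helper names, with `c_div` emulating the truncate-toward-zero behaviour of `idiv`.

```python
def c_div(a: int, b: int) -> int:
    """Integer division truncating toward zero, like x86 idiv."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def round_up_and_wrap(cursor: int, granule: int, fifo_depth: int) -> int:
    """Round `cursor` up to a multiple of `granule`, then wrap into the FIFO."""
    q = c_div(cursor, granule)
    # Bump the quotient only if truncation dropped a positive remainder.
    bump = int(cursor > granule * q and q >= 0)
    return (granule * (q + bump)) % fifo_depth


print(round_up_and_wrap(5, 4, 16))   # 5 rounds up to 8
print(round_up_and_wrap(14, 4, 16))  # 14 rounds up to 16, which wraps to 0
```

For non-negative cursors this agrees with the usual `ceil(cursor / granule) * granule` followed by the modulus.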

GOTCHA — the matmul cursor and the matres cursor are independent. write_ptr (matmul push) and read_ptr (matres pop) advance separately; the read cursor is not derived from the write cursor. The matres address is (read_base + MrbRelativeAddressFromOffset(op, acc)) % fifo_depth, where acc accumulates MxuResultEntriesPopped until it reaches the matmul's push count. A reimplementation that places matres results at the same slot as the producing matmul (assuming a single shared cursor) will mis-address every result-pop; the read/write skew is the result-FIFO occupancy model the chain allocator's reservation timeline prices.

The Six Target Accessors

AllocateMrbEntriesAsFifo reaches the per-generation geometry entirely through Target virtuals. These six slots resolve the formerly-inferred Target[+0x390]/[+0x5e0] slots that the chain allocator and the cost model left unnamed.

| Target vtable slot | vptr offset | Base impl | Role on this path |
|---|---|---|---|
| MatmulResultBufferEntries | +0x390 | 0x1d61da40 | MRB-support gate: set_mrb_address aborts unless > 0 |
| SupportsMatmulResultPack | +0x398 | 0x1d61da80 | sibling; not on this hot path |
| MinMrbWriteBlockSize | +0x5e0 | 0x1d490d80 (LogFatal) | the FIFO push round-up granule (idiv divisor) |
| MrbRelativeAddressFromOffset | +0x5e8 | 0x1d490dc0 (LogFatal) | (op, offset) → relative MRB address on the pop side |
| MxuResultEntriesPushed | +0x5f0 | 0x1d61e9c0 | entries a matmul pushes; advances write_ptr |
| MxuResultEntriesPopped | +0x5f8 | 0x1d61ea00 | entries a matres pops; accumulated into acc during the matres drain |

NOTE — Target[+0x4ac] is num_mxus, Target[+0x398] is the version. num_mxus (movsxd rcx,[rdi+0x4ac]) is read identically by MxuAssigner::LatchLhs (0x10f3b791, imul rcx, Target::ChunksPerTile()); Target[+0x398] is the tpu::TpuVersion (an unsigned in [0,6)) the FIFO-depth table is indexed by. Both are CONFIRMED cross-function.

FIFO Depth Table

ResultFifoEntryCount(ResultFifo fifo, TpuVersion ver) (0x1d631520) is a switch over the fifo argument; the matmul-result FIFO kMrf0 = 8 lands in the arm that returns table[ver] from .rodata at 0xb53e240. Read directly from the ELF (struct.unpack):

| TpuVersion ordinal | Generation | ResultFifoEntryCount(kMrf0, ver) |
|---|---|---|
| 0 | Jellyfish | 16 |
| 1 | Dragonfish | 16 |
| 2 | Pufferfish | 16 |
| 3 | Viperfish | 48 |
| 4 | Ghostlite | 224 |
| 5 | 6acc60406 (TPU7x) | 256 |

The version ordinals ≥ 6 take the LogMessageFatal("invalid platform type:") arm; the table has exactly six live rows. This is the modulus every cursor wraps at.
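
A reimplementation can carry the table as a constant. The sketch below assumes only the six values quoted above; `result_fifo_entry_count` is a hypothetical name covering just the kMrf0 arm, and a Python exception stands in for the fatal-log path.

```python
# Values as quoted from the table at 0xb53e240, indexed by TpuVersion ordinal.
RESULT_FIFO_DEPTH = (16, 16, 16, 48, 224, 256)


def result_fifo_entry_count(version: int) -> int:
    """Depth of the matmul-result FIFO (kMrf0 arm only) for a version ordinal."""
    if not 0 <= version < len(RESULT_FIFO_DEPTH):
        # Stand-in for the LogMessageFatal arm taken by ordinals >= 6.
        raise ValueError(f"invalid platform type: {version}")
    return RESULT_FIFO_DEPTH[version]


print(result_fifo_entry_count(3))  # Viperfish row
```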

MxuResultEntriesPushed / Popped — the Per-Op Entry Counts

The two count accessors are overridden per generation even on the gens that gate MRB off (below). The Viperfish overrides are the byte-exact reference:

// ViperfishTarget::MxuResultEntriesPushed @0x1d499ae0  (a2 = LloOpcode, a3 = MatmulDataFormat)
if a2 != 0xa5 /*vector-matmul-lmr*/:
    CHECK(a3-1 < 8)                                      // valid MatmulDataFormat
    return push_table[a3-1]   // @0xa2e6180 = {2,4,8,8,4,4,4,4}
else:
    CHECK((a3-2) in {0,3,4,5,6})                         // 0xa5 valid-format mask 0x79 (bits 0,3,4,5,6)
    return lmr_push_table[a3-2] // @0xb5307c0 = {2,0,0,1,1,1,1,0}

// ViperfishTarget::MxuResultEntriesPopped @0x1d499ba0
return 2 - MatmulDataFormatIsIntegral(fmt)              // @0x1d629240 : integral fmt → pop 1, float → pop 2

Both push tables were read directly from .rodata: 0xa2e6180 = {2,4,8,8,4,4,4,4} (indexed fmt-1) and 0xb5307c0 = {2,0,0,1,1,1,1,0} (the 0xa5 path, indexed fmt-2). The Jellyfish base (MxuResultEntriesPushed @0x1d492360) is the small fmt1→1, fmt2→2 rule; popped (@0x1d492420) is fmt1/2→1.
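
The Viperfish rules above translate directly into table lookups. The following Python sketch is illustrative only: the function names are hypothetical, exceptions stand in for the CHECK failures, and because the page does not list which formats MatmulDataFormatIsIntegral accepts, the integral test is passed in as a flag rather than guessed.

```python
VECTOR_MATMUL_LMR = 0xA5
PUSH_TABLE = (2, 4, 8, 8, 4, 4, 4, 4)      # indexed by fmt - 1
LMR_PUSH_TABLE = (2, 0, 0, 1, 1, 1, 1, 0)  # indexed by fmt - 2 (0xa5 path)
LMR_VALID_MASK = 0x79                      # bits 0, 3, 4, 5, 6


def entries_pushed(opcode: int, fmt: int) -> int:
    """Result entries one matmul pushes, following the Viperfish rule above."""
    if opcode != VECTOR_MATMUL_LMR:
        if not 0 <= fmt - 1 < len(PUSH_TABLE):
            raise ValueError(f"invalid MatmulDataFormat: {fmt}")
        return PUSH_TABLE[fmt - 1]
    idx = fmt - 2
    if not (0 <= idx < len(LMR_PUSH_TABLE) and (LMR_VALID_MASK >> idx) & 1):
        raise ValueError(f"format {fmt} is not valid for vector-matmul-lmr")
    return LMR_PUSH_TABLE[idx]


def entries_popped(fmt_is_integral: bool) -> int:
    """Result entries one matres pops: 1 for integral formats, 2 otherwise."""
    return 2 - int(fmt_is_integral)


print(entries_pushed(0x9B, 3), entries_popped(False))
```

With these numbers, a non-LMR matmul in format 3 pushes 8 entries, so the drain loop would consume four matreses at 2 entries each.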

Result Checks

AllocateMrbEntriesAsFifo is dense with RET_CHECKs (mxu_assigner.cc line numbers are the literal third arguments to RetCheckFailSlowPath):

| Line | Condition | Failure string |
|---|---|---|
| 881 | !sequence->matmuls.empty() | (the empty-matmuls guard at the top of each sequence) |
| 882 | sequence->matmuls[0]->unit_id().has_value() | sequence unit id present |
| 890 | matmul->unit_id().has_value() | per-matmul unit id present |
| 893 | unit_id == sequence_unit_id | "All matmuls in a sequence must have the same unit id." |
| 902 / 913 | matres_index < sequence->matreses.size() | matres index in range |
| 924 / 925 | matres->unit_id().has_value() / *matres->unit_id() == unit_id | matres belongs to the same MXU |
| 949 | matres_index == sequence->matreses.size() | "Sequence had too many matreses for the number of matmuls" |
| | (matres-drain under-run) | "Sequence had too few matreses for the number of matmuls" (@0x8547285/@0x854724c) |

MSR Bounce — BounceBetweenMsrs

Purpose

Assign each MxuSequence in the quadrant a matrix-staging-register (MSR) bank — msra (0) or msrb (1) — and stamp it onto every staging op. Consecutive sequences alternate banks so that the gain matrix for sequence n+1 can stage into one MSR while sequence n's matmuls drain from the other — the double-buffered weight-stationary ping-pong. The MSR each op carries is exactly the attribute the cost model's GetResourceUsage reads to charge the per-bank latch reservation.

Algorithm

// MxuAssigner::BounceBetweenMsrs @0x10f3fae0  (a1 = Target const&; span = quadrant)
function BounceBetweenMsrs(target, span):
    // ---- PRE-SCAN: abort if any sequence uses the LMR/XMR matmul path ----
    for seq in span:
        RET_CHECK(!seq->matmuls.empty())                // mxu_assigner.cc:1063  ([seq+0x50])
        for op in seq->matmuls:                         // [seq+0x48] ptr, [seq+0x50] count
            if op.opcode == 0xa5 /*vector-matmul-lmr*/:
                return OK                                // 0xa5 reads gains from LMR, bypasses the MSRs

    // ---- BOUNCE: toggle the bank at the top of every sequence ----
    msr = 1
    for seq in span:
        msr ^= 1                                         // alternate: msra(0) ↔ msrb(1)
        for latch in seq->latches:                       // [seq+0x18] ptr, [seq+0x20] count (0x8d..0x96)
            set_matrix_staging_register(latch, msr)      // @0x1d4d7d40
        // first matmul of the sequence also carries the bank
        set_matrix_staging_register(seq->matmuls[0], msr)   // [seq+0x48][0]
    return OK

NOTE — the bank is stamped onto the latch list and matmuls[0], not onto a matres. The bounce writes the bank onto the sequence's latch list ([seq+0x18]/[seq+0x20], the stationary-operand staging ops 0x8d..0x96) and onto matmuls[0] ([seq+0x48][0]). The [seq+0x48]/[seq+0x50] array is the matmuls vector, pinned by the RET_CHECK(!sequence->matmuls.empty()) string at mxu_assigner.cc:1063 guarding [seq+0x50]. The full MxuSequence sub-vector map (cross-checked against the MxuSequence struct deleter and SetLatchIndices) is: latches +0x18/+0x20, matmuls +0x48/+0x50, matreses +0x60/+0x68.

Why 0xa5 Aborts the Whole Pass

A vector-matmul-lmr (0xa5) reads its stationary operand from a load-matrix register (the Xmr space), not from one of the two MSR staging banks. The Xmr enum is {0 → "gmra", 1 → "lmr"} (XmrToString @0x1d629740), and XmrToMsr (@0x1d629780) returns OK only for gmra (0) — converting lmr (1) to an MSR is the error path "Cannot convert Xmr %s to Msr." (matrix_register.cc:100). So an 0xa5 op has no MSR to bounce, and the whole quadrant's bounce is skipped rather than partially applied.

QUIRK — one 0xa5 anywhere in the quadrant suppresses the bounce for all sequences. The pre-scan walks every matmul of every sequence first; a single 0xa5 makes BounceBetweenMsrs return before assigning any bank. A reimplementation that bounces sequence-by-sequence and only skips the 0xa5 sequences will assign MSR banks to the non-LMR sequences that the binary leaves unassigned.
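
The all-or-nothing behaviour is easiest to see in a small model. This Python sketch follows the pseudocode above; the dict-based op and sequence shapes, the `msr` key, and the boolean return value are assumptions made for illustration only.

```python
VECTOR_MATMUL_LMR = 0xA5


def bounce_between_msrs(sequences) -> bool:
    """Stamp alternating MSR banks; return False if the pass was skipped.

    `sequences` is a list of dicts with 'latches' and 'matmuls' lists of
    op dicts, a hypothetical shape standing in for MxuSequence.
    """
    # Pre-scan: a single 0xa5 anywhere suppresses the whole bounce.
    for seq in sequences:
        if not seq["matmuls"]:
            raise ValueError("sequence has no matmuls")
        if any(op["opcode"] == VECTOR_MATMUL_LMR for op in seq["matmuls"]):
            return False

    msr = 1
    for seq in sequences:
        msr ^= 1  # first sequence gets msra (0), the next msrb (1), and so on
        for latch in seq["latches"]:
            latch["msr"] = msr
        seq["matmuls"][0]["msr"] = msr  # only the first matmul carries the bank
    return True


def make_seq(opcode):
    return {"latches": [{"opcode": 0x8D}], "matmuls": [{"opcode": opcode}, {"opcode": opcode}]}


plain = [make_seq(0x9B) for _ in range(3)]
applied = bounce_between_msrs(plain)
banks = [seq["matmuls"][0]["msr"] for seq in plain]
print(applied, banks)
```

Replacing any one matmul with opcode 0xa5 leaves every op in the quadrant without a bank, including the sequences that contain no LMR op.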

set_matrix_staging_register — the Destination Bytes

The bank byte is written into a different per-op-family field depending on opcode. The dispatch is a cascade of range tests (LloOpcode is the 16-bit *a1):

// LloInstruction::set_matrix_staging_register(Msr) @0x1d4d7d40  (a2 = msr byte 0/1)
op = *a1;                                       // LloOpcode (uint16)
if      ((uint16)(op - 0x9b) <= 0xA):  a1[+0x46] = msr;   // 0x9b..0xa5  vector-matmul / -lmr
else if ((uint16)(op - 0x8d) <= 0x9):  a1[+0x44] = msr;   // 0x8d..0x96  vector-latch family (kVectorLatch*)
else if ((op & 0xFFFE) == 0xAA):       a1[+0x42] = msr;   // 0xaa/0xab   vector-load-lmr / -bf16
else if (op == 0xA8):                  a1[+0x41] = msr;   // 0xa8        vector-done-with-gains
else:  FATAL("msr unsupported for opcode: ");             // llo_instruction.cc:3414

| LloOpcode range | Op family | Destination field |
|---|---|---|
| 0x9b..0xa5 | vector-matmul / vector-matmul-lmr | +0x46 |
| 0x8d..0x96 | vector-latch family (kVectorLatch*) | +0x44 |
| 0xaa, 0xab | vector-load-lmr / -lmr-bf16 | +0x42 |
| 0xa8 | vector-done-with-gains | +0x41 |
| else | | FATAL |
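
The range tests in the cascade use the unsigned-subtract idiom, which can be checked against plain interval comparisons. In this Python sketch both function names are hypothetical, and `None` stands in for the FATAL path.

```python
from typing import Optional


def msr_field_offset(opcode: int) -> Optional[int]:
    """Destination byte offset for the MSR bank, written as plain ranges."""
    if 0x9B <= opcode <= 0xA5:
        return 0x46
    if 0x8D <= opcode <= 0x96:
        return 0x44
    if opcode in (0xAA, 0xAB):
        return 0x42
    if opcode == 0xA8:
        return 0x41
    return None  # stand-in for the FATAL "msr unsupported for opcode" path


def msr_field_offset_unsigned(opcode: int) -> Optional[int]:
    """Same dispatch using the 16-bit wraparound subtraction from the listing."""
    if ((opcode - 0x9B) & 0xFFFF) <= 0xA:
        return 0x46
    if ((opcode - 0x8D) & 0xFFFF) <= 0x9:
        return 0x44
    if (opcode & 0xFFFE) == 0xAA:
        return 0x42
    if opcode == 0xA8:
        return 0x41
    return None


print(hex(msr_field_offset(0xA5)), hex(msr_field_offset(0x8D)))
```

The two forms agree for every 16-bit opcode, so a reimplementation can use whichever reads better.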

The Msr Enum

MsrToString(Msr) (0x1d629720) writes a 4-byte string by integer arithmetic, confirming the enum is exactly two values:

// MsrToString @0x1d629720  (a2 = Msr ordinal)
*(uint32*)out = ((a2 != 0) << 24) + 0x6172736D;   // 0x6172736D == "msra" (LE: m,s,r,a)
// a2 != 0 adds 0x01000000 → last byte 'a'(0x61) → 'b'(0x62) ⇒ "msrb"
out[4] = 0; out.len = 4;

So Msr{0} = "msra", Msr{1} = "msrb". This is the bank index BounceBetweenMsrs toggles, and the bank the cost model's Target-level MsrA/MsrB resources (the latch ports) are charged against.
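
The integer trick can be reproduced with Python's `struct` module. This is a small check of the arithmetic only; `msr_to_string` is a hypothetical name, and the little-endian packing mirrors how the 32-bit store lays the bytes out in memory.

```python
import struct


def msr_to_string(msr: int) -> str:
    """Rebuild the 4-byte name from the same constant-plus-shift arithmetic."""
    word = (int(msr != 0) << 24) + 0x6172736D
    # A little-endian store places 0x6D ('m') first and the top byte last.
    return struct.pack("<I", word).decode("ascii")


print(msr_to_string(0), msr_to_string(1))
```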


Struct Layouts

The two placement functions read their MrbEntry inputs off the MrbChainAllocator's per-chunk state, and the split logic stores its tail records in a btree. Three byte-exact layouts, decoded from ReleaseMrbReservation (0x10f5f9e0) and SplitAccumulationChain (0x10f598e0):

MrbChunkState — Two Distinct Per-Chunk Arrays

The allocator carries two flat arrays keyed by MrbEntry.chunk_id, not one. They serve different purposes and have different strides:

(a) Per-MXU reservation-timeline array: MrbChainAllocator[+0xf0], allocated clamp(num_mxus, ≥8 when ver≥5) * 0x70 via __size_returning_new, stride 0x70, indexed chunk_id × 0x70 (SplitAccumulationChain 0x10f59941: v13 = base + 112 * chunk_id). Each element is its own boost::multi_index bimap (the chains_unevictable_until_ timeline of the chain allocator), i.e. the reservation timeline is per-MXU, not one global structure. The per-element ctor (r14 += 0x70 per iteration) initializes:

| Field | Offset | Value / role |
|---|---|---|
| multi_index self/sentinel | +0x00 | = &[+0x18] |
| node-header ptr | +0x10 | operator new(0xa8) |
| ordered-index head sentinel | +0x18 | = self |
| size | +0x28 | 0 |
| bucket count | +0x38 | 0x36 = 54 |
| bucket array | +0x40 | operator new(0x1b0) = 54 * 8 |
| max load factor | +0x48 | (int)(54 * 1.0f) (0x3f800000) |

(The boost prime-bucket sizes {53,97,193,389,769,…} are at 0xac092a0; max load factor 1.0f.)

(b) Per-chunk free-entry pool: MrbChainAllocator[+0x7a0], with an SOO bit at MrbChainAllocator[+0x798], a flat array of 0x40-byte records, indexed chunk_id × 0x40 (ReleaseMrbReservation 0x10f5fa18: v7 = (int16)chunk_id << 6). Each record is an absl::container_internal::raw_hash_set whose element is a linked_hash_set<MrbEntry> (policy @0x2181c730):

| Field | Offset | Role |
|---|---|---|
| capacity mask | +0x00 | hash-set capacity |
| size / SOO-state | +0x08 | element count + SOO flag |
| ctrl ptr | +0x10 | control-byte array |
| slots ptr | +0x18 | slot array |
| SOO inline slot | +0x20 | single-element fast path |
| free-list head | +0x28 | doubly-linked recycled-entry list |
| list count | +0x38 | recycled-entry count |

ReleaseMrbReservation hashes the MrbEntry (MixingHashState::kSeed CRC32 over {uint16 chunk_id, int fifo_index, byte format}), finds/inserts it in the chunk's set, then operator new(0x20)s a node {+0x10 = MrbEntry key, +0x18 = format byte} and pushes it onto the free-list ([record+0x28] head, ++[record+0x38]). This recycles a freed FIFO entry back into the per-chunk pool.

AccumulationChainAfterSplit (0x70 bytes)

When SplitAccumulationChain cuts a chain at the current program order, the tail is deferred as a unique_ptr<AccumulationChainAfterSplit> keyed by program order into a btree_set_container<btree<map_params_impl<long, unique_ptr<…>>>> (root at MrbChainAllocator[+0x70…], insert @0x10f5d2e0). The record:

| Field | Offset | Type | Meaning |
|---|---|---|---|
| span base | +0x00 | AccumulationWithOriginalChain* | tail span begin |
| split program order | +0x08 | s64 | compared vs alloc[+0x70] current_time (0x10f598ff) |
| span ptr | +0x10 | AccumulationWithOriginalChain* | cut cursor; += consumed * 0x60 (AWOC stride 0x60) |
| span len | +0x18 | s64 | -= consumed = current_order − begin |
| MrbEntry.chunk_id | +0x20 | s16 | read at 0x10f599e2 / AdvanceTimeTo |
| MrbEntry.fifo_index | +0x28 | s32 | (also at +0x24) |
| has-MRB-entry flag | +0x2c | byte | gates ReleaseMrbReservation (0x10f5990d/0x10f599d6) |
| matres program order | +0x68 | s64 | the bimap LEFT key (unevictable-until) |

The span cut is lea r14,[rax+rax*2]; shl r14,5 (0x10f59ab6) = consumed * 0x60, so the AccumulationWithOriginalChain element stride is 0x60. The btree node built at operator new(0x30) writes {+0x00 span_base, +0x08 latest_matmul/program_order, +0x10 span_ptr, +0x18 len}. (Source: mxu_accumulation.cc:1017 "Cannot split an AccumulationChain in the past", :1033 "Freeing MRB entry …".)
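
The three strides on this page all fall out of simple address arithmetic, which the following Python sketch spells out. The function names are hypothetical; the expressions mirror the quoted instruction sequences (`lea`/`shl`, `112 * chunk_id`, and `<< 6`).

```python
def awoc_byte_offset(consumed: int) -> int:
    """lea r14,[rax+rax*2]; shl r14,5  computes (3 * consumed) << 5."""
    return (consumed + consumed * 2) << 5


def timeline_element_offset(chunk_id: int) -> int:
    """Per-MXU timeline element: base + 112 * chunk_id, i.e. stride 0x70."""
    return 112 * chunk_id


def free_pool_record_offset(chunk_id: int) -> int:
    """Per-chunk free-pool record: chunk_id << 6, i.e. stride 0x40."""
    return chunk_id << 6


print(hex(awoc_byte_offset(1)), hex(timeline_element_offset(1)), hex(free_pool_record_offset(1)))
```

Since 3 × 32 = 96 = 0x60, the `lea`/`shl` pair is a multiply by the AccumulationWithOriginalChain element size.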


The Downstream Consumer — set_mrb_address

The FIFO slot AllocateMrbEntriesAsFifo computes is committed by LloInstruction::set_mrb_address(int) (0x1d4d93c0), which gates on MRB support, bounds-checks, then delegates to set_mrb_address_unrestricted:

// LloInstruction::set_mrb_address(this, addr) @0x1d4d93c0
target = this->parent()->target();
if (target.MatmulResultBufferEntries() <= 0)            // vtbl[+0x390] gate
    UpdateStatus("parent()->target() supports …");      // llo_instruction.cc:3660
RET_CHECK(addr < ResultFifoEntryCount(kMrf0, target.DeepseaVersion()))   // :3663
        // "Expected MRB address in the range 0-512, found: <addr>"
return set_mrb_address_unrestricted(this, addr);        // @0x1d4e9780

// set_mrb_address_unrestricted(this, addr) @0x1d4e9780
RET_CHECK(addr >= 0);                                   // :3642 "mrb_address >= 0"
if ((uint16)(op - 0x9b) < 0xB || (op & 0xFFFE) == 0x152) // matmul 0x9b..0xa5, or matres 0x152/0x153
    *(uint16*)(this + 0x42) = addr;                     // ← the physical MRB-address field
else
    FATAL("Unsupported opcode: <op>");                  // :3655

So the placed slot lands in a 16-bit field at LloInstruction+0x42, valid for the matmul family (0x9b..0xa5) and the matres family (0x152/0x153); mrb_address() (@0x1d4e8860) reads it back (lea ecx,[op-0x9b]; cmp cx,0xb). The per-generation Encoder*::EncodeBundle then serializes that field — and the MSR byte from set_matrix_staging_register — into the bundle word (see the MXU slot and encoder latch serialization pages for the bit positions).

GOTCHA: the error string says "range 0-512", but the actual bound is per-version. The hard bound is ResultFifoEntryCount(kMrf0, version), one of {16,16,16,48,224,256}, not a fixed 512. The "0-512" literal is a stale message; the runtime check compares against the per-gen table value. A reimplementation that hardcodes 512 will accept out-of-range FIFO addresses that the binary rejects, and one that hardcodes a smaller constant will instead reject valid addresses on the deeper-FIFO generations.
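
The gate, the per-version bound, and the opcode filter can be combined into one small model. This Python sketch is a simplification under stated assumptions: the dict-based op, the `mrb_entries` parameter (standing in for the MatmulResultBufferEntries() result), and the exception types are hypothetical, and the gate is modelled as a hard failure.

```python
RESULT_FIFO_DEPTH = (16, 16, 16, 48, 224, 256)  # per-version bound quoted above


def set_mrb_address(op: dict, addr: int, version: int, mrb_entries: int) -> None:
    """Model of the gate, bounds check, and field write described above."""
    if mrb_entries <= 0:
        raise RuntimeError("target does not support MRB")
    if addr >= RESULT_FIFO_DEPTH[version]:
        # The real message says "0-512"; the comparison uses the table value.
        raise ValueError(f"MRB address out of range: {addr}")
    if addr < 0:
        raise ValueError("mrb_address >= 0")
    opcode = op["opcode"]
    if 0x9B <= opcode <= 0xA5 or opcode in (0x152, 0x153):
        op["mrb_address"] = addr  # the 16-bit field at +0x42 in the real layout
    else:
        raise ValueError(f"Unsupported opcode: {opcode:#x}")


matmul = {"opcode": 0x9B}
set_mrb_address(matmul, 40, version=3, mrb_entries=1)
print(matmul["mrb_address"])
```

Address 300 on version ordinal 5 is below 512 but is still rejected, because the bound there is 256.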


Why the Path Is Gen-Gated Off in v0.0.40

NOTE — every TensorCore Target in this build returns MatmulResultBufferEntries() == 0. JellyfishTarget (0x1d4902c0), PufferfishTarget (0x1d493b…), ViperfishTarget (0x1d49ac60), and GhostliteTarget all return 0 (xor eax,eax; ret), and the base Target::MinMrbWriteBlockSize / MrbRelativeAddressFromOffset are LogFatal stubs. No subclass compiled into v0.0.40 overrides them to a live value. Consequently set_mrb_address's MatmulResultBufferEntries() > 0 gate fails and the chain allocator's "Cannot use MRB" guard fires — the FIFO-address path runs with MRB disabled. The arithmetic, the cursor recurrence, the accessor surface, and both struct layouts are byte-exact CONFIRMED; only the concrete numeric geometry (the live MinMrbWriteBlockSize granule and MrbRelativeAddressFromOffset map) for the gen that enables MRB is absent from this wheel — it awaits a newer-gen Target. Notably MxuResultEntriesPushed/Popped are overridden per-gen (the Viperfish {2,4,8,8,4,4,4,4} push table) even with MatmulResultBufferEntries == 0.


Confidence Summary

| Claim | Evidence |
|---|---|
| Two per-MXU int32 cursors (write/read), sized num_mxus = Target[+0x4ac], memset 0 | 0x10f3efc4/0x10f3efe0; num_mxus cross-checked vs LatchLhs 0x10f3b791 |
| Per-MXU select unit_id & 3, gated by the 0x400 "has mxu" bit on WORD[+0xb] | 0x10f3f069 |
| Advance by MxuResultEntriesPushed/Popped, round up to MinMrbWriteBlockSize, wrap mod ResultFifoEntryCount | idiv @0x10f3f218, mod @0x10f3f234; vtbl +0x5f0/+0x5f8/+0x5e0 |
| FIFO depth {16,16,16,48,224,256} per version | table @0xb53e240 (struct.unpack direct ELF read) |
| Physical write via set_mrb_address → set_mrb_address_unrestricted into +0x42 | 0x1d4d93c0 / 0x1d4e9780 decompiled |
| 0xa5 pre-scan aborts the whole quadrant bounce | 0x10f3fb34 (cmp …,165) |
| Per-sequence msr ^= 1 toggle from initial msr = 1 | 0x10f3fb60 |
| MSR byte written to +0x46/+0x44/+0x42/+0x41 by op family | set_matrix_staging_register 0x1d4d7d40 decompiled |
| Bounce stamps latches (+0x18) and matmuls[0] (+0x48), not matres | [seq+0x50] guarded by "!sequence->matmuls.empty()" :1063 |
| Msr{0}="msra", Msr{1}="msrb" | MsrToString 0x1d629720 arithmetic |
| Per-MXU timeline element stride 0x70; per-chunk free-pool record stride 0x40 | 0x10f59941 (×0x70), 0x10f5fa18 (<<6) |
| AccumulationChainAfterSplit 0x70-byte field map; AWOC stride 0x60 | SplitAccumulationChain 0x10f598e0; 0x10f59ab6 |
| Six Target accessors at +0x390/+0x398/+0x5e0/+0x5e8/+0x5f0/+0x5f8 | vtable @0x21cce6a0, .rela.dyn-resolved |
| MRB path gen-gated off (all shipped Targets return MatmulResultBufferEntries==0) | JellyfishTarget 0x1d4902c0 etc. (xor eax,eax; ret) |
| MrbRelativeAddressFromOffset is the matres-side relative-address map | vtbl +0x5e8 used at 0x10f3f313; base is LogFatal |
| Per-MXU cursors reset per quadrant call (no cross-region carry) | cursor new/free bracket the function body |

| Component | Relationship |
|---|---|
| MrbChainAllocator::SplitAccumulationChain 0x10f598e0 | source of the 0x70-byte split record + per-MXU timeline index |
| MrbChainAllocator::ReleaseMrbReservation 0x10f5f9e0 | the 0x40-byte free-pool record and MrbEntry recycle |
| MxuAssigner::VisitRegion 0x10f3a640 | the per-quadrant array<Span,4> dispatch of both functions |
| MxuAssigner::SetLatchIndices 0x10f3b4c0 / LatchLhs 0x10f3b5e0 (num_mxus read @0x10f3b791) | the LHS gain-matrix latch side (runs before this) |
| LloInstruction::mrb_address 0x1d4e8860 | reads back the +0x42 field this writes |
| ViperfishTarget::MxuResultEntriesPushed/Popped 0x1d499ae0/0x1d499ba0 | the per-gen push/pop counts the cursors advance by |

Cross-References

  • MRB Chain Allocator — the program-order reservation timeline that hands each chain the MrbEntry this page turns into a physical FIFO slot + MSR bank; owns the per-MXU boost::multi_index carried in the 0x70-byte timeline element.
  • MxuSequence / SequenceInfo — the per-sequence record whose latches (+0x18), matmuls (+0x48), and matreses (+0x60) these two functions index.
  • MXU Assignment Bin-Packer: AssignMxusForSequenceGroup, which builds the MxuSequences placed here.
  • Latch Assignment & Overrun — the SetLatchIndices/LatchLhs gain-latch side that runs before the FIFO/MSR placement in VisitRegion.
  • TPU Scheduling Pipeline — Stage 2's place in the four-stage scheduler stack; this page is the result-FIFO/MSR row of Stage 2.
  • MXU Slot — the bundle slot whose mrb_address (+0x42) and matrix-staging-register fields are serialized from the values placed here; the GainLatchMode MSR-A/MSR-B bank model.
  • Matprep / IAR / Latch — the latch ops (0x8d..0x96) the MSR bounce stamps.
  • LLO Opcode Enum — the 0x9b..0xa5 matmul, 0x8d..0x96 latch (kVectorLatch*), 0xaa/0xab load-lmr, 0xa8 done-with-gains, 0x152 matres numeric space.
  • MXU Latency Overview — the cost model that consumes the placed MSR bank (the MatpushModifier staging-register key) to charge per-bank reservation cycles.
  • MatmulMode & Modifiers — the MatmulDataFormat/GainLatchMode ordinals keying MxuResultEntriesPushed/Popped.
  • MXU Latency: VF — the Viperfish reservation matrix indexed by the bank/format the bounce assigns.
  • Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
  • Index entry: Part VIII — Instruction Scheduling & Bundle Packing — back to index