MRB FIFO / MSR Placement
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped, so every symbol below is a demangled C++ name). Section map: .text/.rodata VMA == file offset; .data.rel.ro VMA − 0x200000 == file offset. Other versions differ.
Abstract
This is the physical-placement layer of Stage 2's matrix-resource assignment. The MRB chain allocator decides, on a program-order reservation timeline, which matrix-result-buffer (MRB) chunk and FIFO index each matmul accumulation chain occupies — it hands each chain a MrbEntry = {chunk_id, fifo_index, format}. Two MxuAssigner methods then turn those abstract reservations into the two concrete attributes the per-generation encoder serializes into the bundle word: AllocateMrbEntriesAsFifo (0x10f3ef80) assigns each matmul and each matres a physical result-FIFO slot address, and BounceBetweenMsrs (0x10f3fae0) assigns each MxuSequence a physical matrix-staging-register (MSR) bank — msra (0) or msrb (1). Both run once per MXU quadrant, dispatched from MxuAssigner::VisitRegion over an array<Span<MxuSequence>,4> (the four MXU instances of a region).
The familiar reference frame is a hardware ring buffer paired with a double-buffered staging latch. The result FIFO is a per-MXU circular buffer: AllocateMrbEntriesAsFifo maintains two running cursors per MXU — a write cursor (advanced on the matmul/push side) and a read cursor (advanced on the matres/pop side) — each advanced by the per-op entry count, rounded up to a write-block granule, and wrapped modulo the FIFO depth. The MSR bounce is the classic ping-pong of a weight-stationary array: while msra feeds one sequence's matmuls, the next sequence's gain matrix stages into msrb, then the banks swap; BounceBetweenMsrs realizes that by toggling a one-bit bank index at the top of every sequence and stamping it onto each op via set_matrix_staging_register. The one wrinkle is the LMR/XMR path: a vector-matmul-lmr (opcode 0xa5) reads its gain matrix directly from a load-matrix register rather than staging through an MSR, so the bounce is meaningless for it and the whole pass returns early.
This page documents the two placement functions byte-for-byte, the two struct layouts the allocator carries them on (MrbChunkState — actually two distinct per-chunk arrays — and the 0x70-byte AccumulationChainAfterSplit record), and the downstream consumer (LloInstruction::set_mrb_address → set_mrb_address_unrestricted, the matmul-family field at +0x42, read back by mrb_address). It does not re-derive the reservation timeline (that is the chain allocator) or the bundle bit positions of the MSR/MRB fields (those are the MXU slot and per-gen bundle pages).
For reimplementation, the contract is:
- The FIFO discipline: two int32 cursor arrays sized by num_mxus, per-MXU select unit_id & 3, advance by MxuResultEntriesPushed/Popped, round up to MinMrbWriteBlockSize, wrap modulo ResultFifoEntryCount(kMrf0, version), committed by set_mrb_address.
- The MSR bounce: the 0xa5 early-out pre-scan, the per-sequence msr ^= 1 toggle, and set_matrix_staging_register's per-op-family destination byte (+0x46/+0x44/+0x42/+0x41).
- The two struct layouts: the 0x70-byte per-MXU timeline element vs the 0x40-byte per-chunk free-pool record, and the 0x70-byte AccumulationChainAfterSplit, all keyed by MrbEntry.chunk_id.
- The per-gen accessor surface: six Target vtable slots that price the placement, and why the path is gen-gated off in v0.0.40 (every shipped TensorCore target returns MatmulResultBufferEntries() == 0).
| Item | Symbol / location |
|---|---|
| FIFO placement | MxuAssigner::AllocateMrbEntriesAsFifo(Target const&, Span<unique_ptr<MxuSequence>>) @0x10f3ef80 |
| MSR bounce | MxuAssigner::BounceBetweenMsrs(Target const&, Span<unique_ptr<MxuSequence>>) @0x10f3fae0 |
| Per-quadrant dispatch | MxuAssigner::VisitRegion @0x10f3a640 → array<Span,4> lambdas $_5 @0x10f40a80, $_6 @0x10f40b60 |
| Physical FIFO write | LloInstruction::set_mrb_address(int) @0x1d4d93c0 → set_mrb_address_unrestricted(int) @0x1d4e9780 (field +0x42) |
| MSR write | LloInstruction::set_matrix_staging_register(Msr) @0x1d4d7d40 (fields +0x46/+0x44/+0x42/+0x41) |
| FIFO depth | ResultFifoEntryCount(kMrf0=8, TpuVersion) @0x1d631520; table @0xb53e240 = {16,16,16,48,224,256} |
| Msr enum | MsrToString @0x1d629720: 0 → "msra", 1 → "msrb" |
| Carried on | MrbChainAllocator per-MXU timeline @alloc+0xf0 (stride 0x70), per-chunk free pool @alloc+0x7a0 (stride 0x40) |
| Source | platforms/xla/service/jellyfish/mxu_assigner.cc, mxu_accumulation.cc, llo_instruction.cc |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
The Per-Quadrant Dispatch
Purpose
Both placement functions are region-local and quadrant-local. MxuAssigner::VisitRegion (0x10f3a640) groups the region's MxuSequences by physical MXU into an array<Span<unique_ptr<MxuSequence>>,4> — one span per MXU instance — and runs that array through two InvokeObject lambdas: $_5 (0x10f40a80) calls AllocateMrbEntriesAsFifo four times (once per quadrant span, paired [rcx+0/0x10/0x20/0x30] pointer with [rcx+8/0x18/0x28/0x38] length); $_6 (0x10f40b60) calls BounceBetweenMsrs the same four times. Each lambda early-outs if a call returns a non-OK Status. The captured first word of each lambda is the Target& proxy.
Structure
MxuAssigner::VisitRegion 0x10f3a640
build array<Span<unique_ptr<MxuSequence>>, 4> ── one span per MXU instance (quadrant)
├─ $_5 InvokeObject(AllocateMrbEntriesAsFifo) 0x10f40a80
│ AllocateMrbEntriesAsFifo(target, quadrant[0]) 0x10f40abe
│ AllocateMrbEntriesAsFifo(target, quadrant[1]) 0x10f40ad2
│ AllocateMrbEntriesAsFifo(target, quadrant[2]) 0x10f40ae6
│ AllocateMrbEntriesAsFifo(target, quadrant[3]) 0x10f40afc
│ └─ early-out if Status != OK (1)
└─ $_6 InvokeObject(BounceBetweenMsrs) 0x10f40b60
BounceBetweenMsrs(target, quadrant[q]) × 4 0x10f40f91…
QUIRK: the FIFO cursors are per-MXU inside one quadrant call, not global.
AllocateMrbEntriesAsFifo allocates its two cursor arrays at function entry and frees them at function return, so each of the four quadrant calls starts both cursors at zero. The arrays are sized num_mxus, and the per-element index comes from unit_id & 3 (the per-op MXU instance, HIBYTE(WORD[+0xb]) & 3), so within one call the cursors are independent per MXU. A reimplementation that hoists the cursors to region scope (carrying the FIFO write position across quadrants) will diverge: the binary resets them per call. Whether the hardware actually expects a fresh FIFO per quadrant region or a cross-region carry is not traced (the cursor lifetime is strictly intra-call here).
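The dispatch shape and the per-call cursor lifetime can be sketched in Python. This is a toy model, not the binary's code: the string status, the dict-shaped target, and the names run_per_quadrant and make_fifo_pass are all hypothetical, chosen only to show the early-out and the fact that cursors allocated inside the call start at zero for every quadrant.

```python
from typing import Callable, List, Sequence

OK = "OK"  # hypothetical stand-in for an OK Status


def run_per_quadrant(place: Callable[[dict, Sequence], str],
                     target: dict,
                     quadrants: Sequence[Sequence]) -> str:
    """Invoke `place` once per quadrant span, stopping at the first non-OK status."""
    if len(quadrants) != 4:
        raise ValueError("expected one span per MXU instance (4 quadrants)")
    for span in quadrants:
        status = place(target, span)
        if status != OK:
            return status  # early-out: later quadrants are not visited
    return OK


def make_fifo_pass(start_cursors: List[List[int]]) -> Callable[[dict, Sequence], str]:
    """Build a toy placement pass that records the cursor state it starts from."""
    def place(target: dict, span: Sequence[int]) -> str:
        write_ptr = [0] * target["num_mxus"]  # allocated and zeroed on every call
        start_cursors.append(list(write_ptr))
        for pushed in span:                    # each item: entries pushed on MXU 0
            write_ptr[0] += pushed
        return OK
    return place


seen: List[List[int]] = []
status = run_per_quadrant(make_fifo_pass(seen), {"num_mxus": 4},
                          [[2], [4, 4], [8], [1]])
```

Every entry recorded in seen is all zeros, which is the behaviour a region-scoped cursor would not reproduce.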
FIFO Placement — AllocateMrbEntriesAsFifo
Purpose
Walk one quadrant's MxuSequences in program order and assign every matmul and every matres a concrete result-FIFO slot address via set_mrb_address. The result FIFO is modelled as a per-MXU circular buffer of ResultFifoEntryCount(kMrf0, version) entries; matmuls push into it (advancing a write cursor) and matreses pop from it (advancing a separate read cursor).
Entry Point
AllocateMrbEntriesAsFifo(Target const& [rdi], Span<unique_ptr<MxuSequence>> [rsi=ptr, rdx=count])
num_mxus = Target[+0x4ac] ── 0x10f3ef9b (movsxd; same field as LatchLhs)
write_ptr[] = operator new(4 * num_mxus); memset 0 ── 0x10f3efc4 ([rbp-0x48], matmul/push side)
read_ptr[] = operator new(4 * num_mxus); memset 0 ── 0x10f3efe0 ([rbp-0x68], matres/pop side)
fifo_depth = ResultFifoEntryCount(8, Target[+0x398])── 0x10f3f019
for each MxuSequence in span: …
Algorithm
// MxuAssigner::AllocateMrbEntriesAsFifo @0x10f3ef80
// a1 = Target const& (the MxuAssigner `this` is dead; rdi carries the Target&)
// span = Span<unique_ptr<MxuSequence>> (one MXU quadrant)
function AllocateMrbEntriesAsFifo(target, span):
num_mxus = target[+0x4ac] // CONFIRMED Target field (== LatchLhs num_mxus read)
if num_mxus > 0:
write_ptr = new int32[num_mxus]; memset(write_ptr, 0) // matmul/push cursor, [rbp-0x48]
read_ptr = new int32[num_mxus]; memset(read_ptr, 0) // matres/pop cursor, [rbp-0x68]
fifo_depth = ResultFifoEntryCount(kMrf0 /*=8*/, target[+0x398] /*DeepseaVersion*/) // @0x1d631520
for seq in span:
RET_CHECK(!seq->matmuls.empty()) // mxu_assigner.cc:881 ([seq+0x50] count)
// per-MXU quadrant select: bits 8..9 of LloValue WORD[+0xb], gated by the 0x400 "has mxu" bit
seq_unit = (seq->matmuls[0].word[0xb] & 0x400) ? ((HIBYTE(word) & 3) | HAS_VALUE) : 0
RET_CHECK(seq->matmuls[0]->unit_id().has_value()) // mxu_assigner.cc:882
matres_index = 0
for matmul in seq->matmuls: // [seq+0x48] ptr, [seq+0x50] count
fmt = matmul.matmul_data_format() // @0x1d4e8440
push = target.MxuResultEntriesPushed(matmul_op, fmt) // vtbl[+0x5f0]
unit = matmul->unit_id()
RET_CHECK(unit.has_value()) // mxu_assigner.cc:890
RET_CHECK(unit == seq_unit) // mxu_assigner.cc:893 "same unit id"
mxu = unit & 3 // per-MXU index into the cursor arrays
// ---- THE PHYSICAL WRITE (matmul push side) ----
set_mrb_address(matmul, write_ptr[mxu]) // @0x1d4d93c0 ← FIFO slot
write_ptr[mxu] += push // advance by entries this matmul pushes
g = target.MinMrbWriteBlockSize() // vtbl[+0x5e0], a byte (idiv divisor)
write_ptr[mxu] = ceil_div(write_ptr[mxu], g) * g // round UP to the write-block granule
write_ptr[mxu] %= fifo_depth // wrap into the FIFO
// ---- THE MATRES POP SIDE (only when this matmul pushes entries) ----
if push > 0:
base = read_ptr[mxu]
acc = 0
while acc < push: // drain `push` worth of result entries
RET_CHECK(matres_index < seq->matreses.size())     // mxu_assigner.cc:902 / :913 (checked before indexing)
matres = seq->matreses[matres_index]               // [seq+0x60] ptr, [seq+0x68] count
RET_CHECK(matres->unit_id().has_value()) // mxu_assigner.cc:924
RET_CHECK(*matres->unit_id() == unit) // mxu_assigner.cc:925
rel = target.MrbRelativeAddressFromOffset(matres_op, acc) // vtbl[+0x5e8]
set_mrb_address(matres, (base + rel) % fifo_depth) // @0x1d4d93c0
acc += target.MxuResultEntriesPopped(matres_op, fmt) // vtbl[+0x5f8]
matres_index += 1
read_ptr[mxu] = ceil_div((read_ptr[mxu] + push), g_pop) * g_pop % fifo_depth // symmetric
RET_CHECK(matres_index == seq->matreses.size()) // mxu_assigner.cc:949 "too many matreses"
return OK
The round-up arithmetic is emitted as a signed idiv plus the standard "did we truncate toward zero" fixup, exactly as the decompiler renders it at 0x10f3f218:
// write_ptr[mxu] = ceil(write_ptr[mxu] / g) * g, for a signed dividend (0x10f3f1f8..0x10f3f234)
q = write_ptr[mxu] / g; // idiv
write_ptr[mxu] = g * (q + ((write_ptr[mxu] > g*q) & (q >= 0))); // bump quotient if there was a remainder
write_ptr[mxu] %= fifo_depth; // final wrap
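A minimal Python sketch of the same round-up, assuming a positive granule and a positive FIFO depth. Python's // floors rather than truncates, so the helper idiv (a name introduced here) emulates the truncate-toward-zero behaviour of the x86 instruction. The granule values in any call are illustrative only, since no live MinMrbWriteBlockSize exists in this build.

```python
def idiv(a: int, b: int) -> int:
    """Signed division truncating toward zero, as the x86 idiv instruction does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def round_up_and_wrap(cursor: int, granule: int, fifo_depth: int) -> int:
    """Round `cursor` up to a multiple of `granule`, then wrap into the FIFO."""
    q = idiv(cursor, granule)
    # Bump the quotient when truncation dropped a positive remainder.
    bump = int(cursor > granule * q and q >= 0)
    return (granule * (q + bump)) % fifo_depth
```

For a non-negative cursor the fixup is an ordinary ceiling division: a cursor of 5 with a granule of 4 lands on 8, and a cursor that rounds up to exactly the FIFO depth wraps to 0.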
GOTCHA: the matmul cursor and the matres cursor are independent.
write_ptr (matmul push) and read_ptr (matres pop) advance separately; the read cursor is not derived from the write cursor. The matres address is (read_base + MrbRelativeAddressFromOffset(op, acc)) % fifo_depth, where acc accumulates MxuResultEntriesPopped until it reaches the matmul's push count. A reimplementation that places matres results at the same slot as the producing matmul (assuming a single shared cursor) will mis-address every result-pop; the read/write skew is the result-FIFO occupancy model the chain allocator's reservation timeline prices.
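The two-cursor recurrence can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the binary's implementation: the Op and Seq classes are hypothetical, the per-op entry counts are supplied directly instead of through Target virtuals, rel_addr defaults to the identity map (the real MrbRelativeAddressFromOffset is a LogFatal stub in this build), and the pop side is assumed to round with the same granule as the push side.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Op:
    unit: int              # MXU unit id (optional already unwrapped)
    entries: int           # entries pushed (matmul) or popped (matres)
    mrb_address: int = -1  # filled in by the placement


@dataclass
class Seq:
    matmuls: List[Op]
    matreses: List[Op]


def ceil_to(x: int, g: int) -> int:
    """Round non-negative x up to a multiple of g (g > 0)."""
    return -(-x // g) * g


def allocate_as_fifo(seqs: List[Seq], num_mxus: int, fifo_depth: int, granule: int,
                     rel_addr: Callable[[int], int] = lambda acc: acc
                     ) -> Tuple[List[int], List[int]]:
    write_ptr = [0] * num_mxus  # matmul/push cursors
    read_ptr = [0] * num_mxus   # matres/pop cursors
    for seq in seqs:
        if not seq.matmuls:
            raise ValueError("sequence has no matmuls")
        seq_unit = seq.matmuls[0].unit
        matres_index = 0
        for mm in seq.matmuls:
            if mm.unit != seq_unit:
                raise ValueError("all matmuls in a sequence must share a unit id")
            mxu = mm.unit & 3
            mm.mrb_address = write_ptr[mxu]
            write_ptr[mxu] = ceil_to(write_ptr[mxu] + mm.entries, granule) % fifo_depth
            if mm.entries > 0:
                base, acc = read_ptr[mxu], 0
                while acc < mm.entries:  # drain this matmul's results
                    if matres_index >= len(seq.matreses):
                        raise ValueError("too few matreses for the matmuls")
                    mr = seq.matreses[matres_index]
                    if mr.unit != mm.unit:
                        raise ValueError("matres is on a different MXU")
                    mr.mrb_address = (base + rel_addr(acc)) % fifo_depth
                    acc += mr.entries
                    matres_index += 1
                # Assumption: the pop side uses the same granule as the push side.
                read_ptr[mxu] = ceil_to(read_ptr[mxu] + mm.entries, granule) % fifo_depth
        if matres_index != len(seq.matreses):
            raise ValueError("too many matreses for the matmuls")
    return write_ptr, read_ptr


seq = Seq(matmuls=[Op(1, 4), Op(1, 4)], matreses=[Op(1, 2) for _ in range(4)])
cursors = allocate_as_fifo([seq], num_mxus=4, fifo_depth=16, granule=4)
```

With these inputs each matmul pushes four entries and is drained by two matreses that pop two entries each, so the matres addresses step through the FIFO at offsets relative to the read base instead of reusing the matmul's slot.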
The Six Target Accessors
AllocateMrbEntriesAsFifo reaches the per-generation geometry entirely through Target virtuals. These six slots resolve the formerly-inferred Target[+0x390]/[+0x5e0] slots that the chain allocator and the cost model left unnamed.
| Target vtable slot | vptr offset | Base impl | Role on this path |
|---|---|---|---|
| MatmulResultBufferEntries | +0x390 | 0x1d61da40 | MRB-support gate: set_mrb_address aborts unless > 0 |
| SupportsMatmulResultPack | +0x398 | 0x1d61da80 | sibling; not on this hot path |
| MinMrbWriteBlockSize | +0x5e0 | 0x1d490d80 (LogFatal) | the FIFO push round-up granule (idiv divisor) |
| MrbRelativeAddressFromOffset | +0x5e8 | 0x1d490dc0 (LogFatal) | (op, offset) → relative MRB address on the pop side |
| MxuResultEntriesPushed | +0x5f0 | 0x1d61e9c0 | entries a matmul pushes; advances write_ptr |
| MxuResultEntriesPopped | +0x5f8 | 0x1d61ea00 | entries a matres pops; advances matres_index |
NOTE: Target[+0x4ac] is num_mxus, Target[+0x398] is the version.
num_mxus (movsxd rcx,[rdi+0x4ac]) is read identically by MxuAssigner::LatchLhs (0x10f3b791, imul rcx, Target::ChunksPerTile()); Target[+0x398] is the tpu::TpuVersion (an unsigned in [0,6)) the FIFO-depth table is indexed by. Both are CONFIRMED cross-function.
FIFO Depth Table
ResultFifoEntryCount(ResultFifo fifo, TpuVersion ver) (0x1d631520) is a switch over the fifo argument; the matmul-result FIFO kMrf0 = 8 lands in the arm that returns table[ver] from .rodata at 0xb53e240. Read directly from the ELF (struct.unpack):
| TpuVersion ordinal | Generation | ResultFifoEntryCount(kMrf0, ver) |
|---|---|---|
| 0 | Jellyfish | 16 |
| 1 | Dragonfish | 16 |
| 2 | Pufferfish | 16 |
| 3 | Viperfish | 48 |
| 4 | Ghostlite | 224 |
| 5 | 6acc60406 (TPU7x) | 256 |
The version ordinals ≥ 6 take the LogMessageFatal("invalid platform type:") arm; the table has exactly six live rows. This is the modulus every cursor wraps at.
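As a rough illustration, the lookup and the table read can be written in Python. The depths are the values listed above; the 4-byte little-endian entry width passed to struct.unpack is an assumption about the table encoding, and both function names are made up for this sketch.

```python
import struct

# Depths from the table above, indexed by TpuVersion ordinal 0..5.
MRF0_DEPTH = (16, 16, 16, 48, 224, 256)


def result_fifo_entry_count_mrf0(version: int) -> int:
    """Return the kMrf0 FIFO depth; ordinals >= 6 take the fatal arm in the binary."""
    if not 0 <= version < len(MRF0_DEPTH):
        raise ValueError(f"invalid platform type: {version}")
    return MRF0_DEPTH[version]


def decode_depth_table(blob: bytes, count: int = 6) -> tuple:
    """Decode `count` entries from raw .rodata bytes.

    Assumption: each entry is a little-endian 32-bit integer.
    """
    return struct.unpack(f"<{count}i", blob[: 4 * count])


# Round-trip on synthetic bytes; a real read would slice the ELF at the table offset.
synthetic = struct.pack("<6i", *MRF0_DEPTH)
decoded = decode_depth_table(synthetic)
```

Because .rodata VMA equals file offset in this build, a real read would seek to the table's VMA in the file and pass the bytes at that position to decode_depth_table.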
MxuResultEntriesPushed / Popped — the Per-Op Entry Counts
The two count accessors are overridden per generation even on the gens that gate MRB off (below). The Viperfish overrides are the byte-exact reference:
// ViperfishTarget::MxuResultEntriesPushed @0x1d499ae0 (a2 = LloOpcode, a3 = MatmulDataFormat)
if a2 != 0xa5 /*vector-matmul-lmr*/:
CHECK(a3-1 < 8) // valid MatmulDataFormat
return push_table[a3-1] // @0xa2e6180 = {2,4,8,8,4,4,4,4}
else:
CHECK((a3-2) in {0,3,4,5,6}) // 0xa5 valid-format mask 0x79 (bits 0,3,4,5,6)
return lmr_push_table[a3-2] // @0xb5307c0 = {2,0,0,1,1,1,1,0}
// ViperfishTarget::MxuResultEntriesPopped @0x1d499ba0
return 2 - MatmulDataFormatIsIntegral(fmt) // @0x1d629240 : integral fmt → pop 1, float → pop 2
Both push tables were read directly from .rodata: 0xa2e6180 = {2,4,8,8,4,4,4,4} (indexed fmt-1) and 0xb5307c0 = {2,0,0,1,1,1,1,0} (the 0xa5 path, indexed fmt-2). The Jellyfish base (MxuResultEntriesPushed @0x1d492360) is the small fmt1→1, fmt2→2 rule; popped (@0x1d492420) is fmt1/2→1.
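The Viperfish rules above translate into a short Python sketch. The table contents and the 0x79 mask come from the text; the function names are hypothetical, and the integral-format predicate is passed in as a boolean because the body of MatmulDataFormatIsIntegral is not reproduced here.

```python
VECTOR_MATMUL_LMR = 0xA5
PUSH_TABLE = (2, 4, 8, 8, 4, 4, 4, 4)      # indexed by fmt - 1
LMR_PUSH_TABLE = (2, 0, 0, 1, 1, 1, 1, 0)  # indexed by fmt - 2
LMR_VALID_MASK = 0x79                      # bits 0, 3, 4, 5, 6 of (fmt - 2)


def vf_entries_pushed(opcode: int, fmt: int) -> int:
    """Entries a matmul pushes on Viperfish for a given MatmulDataFormat ordinal."""
    if opcode != VECTOR_MATMUL_LMR:
        if not 0 <= fmt - 1 < 8:
            raise ValueError(f"invalid MatmulDataFormat: {fmt}")
        return PUSH_TABLE[fmt - 1]
    idx = fmt - 2
    if not (0 <= idx < 8 and (LMR_VALID_MASK >> idx) & 1):
        raise ValueError(f"format {fmt} is not valid for vector-matmul-lmr")
    return LMR_PUSH_TABLE[idx]


def vf_entries_popped(fmt_is_integral: bool) -> int:
    """Entries a matres pops: 1 for an integral format, 2 for a float format."""
    return 2 - int(fmt_is_integral)
```

The mask test makes the 0xa5 path reject the formats whose table entries are zero, except for the final slot, which the mask excludes as well.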
Result Checks
AllocateMrbEntriesAsFifo is dense with RET_CHECKs (mxu_assigner.cc line numbers are the literal third arguments to RetCheckFailSlowPath):
| Line | Condition | Failure string |
|---|---|---|
| 881 | !sequence->matmuls.empty() | (the empty-matmuls guard at the top of each sequence) |
| 882 | sequence->matmuls[0]->unit_id().has_value() | sequence unit id present |
| 890 | matmul->unit_id().has_value() | per-matmul unit id present |
| 893 | unit_id == sequence_unit_id | "All matmuls in a sequence must have the same unit id." |
| 902 / 913 | matres_index < sequence->matreses.size() | matres index in range |
| 924 / 925 | matres->unit_id().has_value() / *matres->unit_id() == unit_id | matres belongs to the same MXU |
| 949 | matres_index == sequence->matreses.size() | "Sequence had too many matreses for the number of matmuls" |
| — | (matres-drain under-run) | "Sequence had too few matreses for the number of matmuls" (@0x8547285/@0x854724c) |
MSR Bounce — BounceBetweenMsrs
Purpose
Assign each MxuSequence in the quadrant a matrix-staging-register (MSR) bank — msra (0) or msrb (1) — and stamp it onto every staging op. Consecutive sequences alternate banks so that the gain matrix for sequence n+1 can stage into one MSR while sequence n's matmuls drain from the other — the double-buffered weight-stationary ping-pong. The MSR each op carries is exactly the attribute the cost model's GetResourceUsage reads to charge the per-bank latch reservation.
Algorithm
// MxuAssigner::BounceBetweenMsrs @0x10f3fae0 (a1 = Target const&; span = quadrant)
function BounceBetweenMsrs(target, span):
// ---- PRE-SCAN: abort if any sequence uses the LMR/XMR matmul path ----
for seq in span:
RET_CHECK(!seq->matmuls.empty()) // mxu_assigner.cc:1063 ([seq+0x50])
for op in seq->matmuls: // [seq+0x48] ptr, [seq+0x50] count
if op.opcode == 0xa5 /*vector-matmul-lmr*/:
return OK // 0xa5 reads gains from LMR, bypasses the MSRs
// ---- BOUNCE: toggle the bank at the top of every sequence ----
msr = 1
for seq in span:
msr ^= 1 // alternate: msra(0) ↔ msrb(1)
for latch in seq->latches: // [seq+0x18] ptr, [seq+0x20] count (0x8d..0x96)
set_matrix_staging_register(latch, msr) // @0x1d4d7d40
// first matmul of the sequence also carries the bank
set_matrix_staging_register(seq->matmuls[0], msr) // [seq+0x48][0]
return OK
NOTE: the bank is stamped onto the latch list and matmuls[0], not onto a matres.
The bounce writes the bank onto the sequence's latch list ([seq+0x18]/[seq+0x20], the stationary-operand staging ops 0x8d..0x96) and onto matmuls[0] ([seq+0x48][0]). The [seq+0x48]/[seq+0x50] array is the matmuls vector, pinned by the RET_CHECK(!sequence->matmuls.empty()) string at mxu_assigner.cc:1063 guarding [seq+0x50]. The full MxuSequence sub-vector map (cross-checked against the MxuSequence struct deleter and SetLatchIndices) is: latches +0x18/+0x20, matmuls +0x48/+0x50, matreses +0x60/+0x68.
Why 0xa5 Aborts the Whole Pass
A vector-matmul-lmr (0xa5) reads its stationary operand from a load-matrix register (the Xmr space), not from one of the two MSR staging banks. The Xmr enum is {0 → "gmra", 1 → "lmr"} (XmrToString @0x1d629740), and XmrToMsr (@0x1d629780) returns OK only for gmra (0) — converting lmr (1) to an MSR is the error path "Cannot convert Xmr %s to Msr." (matrix_register.cc:100). So an 0xa5 op has no MSR to bounce, and the whole quadrant's bounce is skipped rather than partially applied.
QUIRK: one 0xa5 anywhere in the quadrant suppresses the bounce for all sequences.
The pre-scan walks every matmul of every sequence first; a single 0xa5 makes BounceBetweenMsrs return before assigning any bank. A reimplementation that bounces sequence-by-sequence and only skips the 0xa5 sequences will assign MSR banks to the non-LMR sequences that the binary leaves unassigned.
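A compact Python model of the pre-scan and the toggle, assuming a hypothetical dict-based representation of sequences and ops (the keys "latches", "matmuls", "opcode" and "msr" are illustrative, not fields recovered from the binary):

```python
VECTOR_MATMUL_LMR = 0xA5


def bounce_between_msrs(seqs: list) -> bool:
    """Assign alternating MSR banks; return False if the pass bailed out early."""
    # Pre-scan: a single 0xa5 anywhere suppresses the bounce for the whole span.
    for seq in seqs:
        if not seq["matmuls"]:
            raise ValueError("sequence has no matmuls")
        for op in seq["matmuls"]:
            if op["opcode"] == VECTOR_MATMUL_LMR:
                return False
    msr = 1
    for seq in seqs:
        msr ^= 1  # first sequence gets msra (0), the next msrb (1), and so on
        for latch in seq["latches"]:
            latch["msr"] = msr
        seq["matmuls"][0]["msr"] = msr  # only the first matmul carries the bank
    return True


def make_seq(matmul_opcodes: list) -> dict:
    """Build a toy sequence with one latch op and the given matmul opcodes."""
    return {"latches": [{"opcode": 0x8D}],
            "matmuls": [{"opcode": o} for o in matmul_opcodes]}


plain = [make_seq([0x9B, 0x9B]) for _ in range(3)]
assigned = bounce_between_msrs(plain)
banks = [s["matmuls"][0]["msr"] for s in plain]

mixed = [make_seq([0x9B]), make_seq([0xA5]), make_seq([0x9B])]
mixed_assigned = bounce_between_msrs(mixed)
```

In the mixed case no op receives a bank at all, including the two sequences that contain no 0xa5 matmul.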
set_matrix_staging_register — the Destination Bytes
The bank byte is written into a different per-op-family field depending on opcode. The dispatch is a cascade of range tests (LloOpcode is the 16-bit *a1):
// LloInstruction::set_matrix_staging_register(Msr) @0x1d4d7d40 (a2 = msr byte 0/1)
op = *a1; // LloOpcode (uint16)
if ((uint16)(op - 0x9b) <= 0xA): a1[+0x46] = msr; // 0x9b..0xa5 vector-matmul / -lmr
else if ((uint16)(op - 0x8d) <= 0x9): a1[+0x44] = msr; // 0x8d..0x96 vector-latch family (kVectorLatch*)
else if ((op & 0xFFFE) == 0xAA): a1[+0x42] = msr; // 0xaa/0xab vector-load-lmr / -bf16
else if (op == 0xA8): a1[+0x41] = msr; // 0xa8 vector-done-with-gains
else: FATAL("msr unsupported for opcode: "); // llo_instruction.cc:3414
| LloOpcode range | Op family | Destination field |
|---|---|---|
| 0x9b..0xa5 | vector-matmul / vector-matmul-lmr | +0x46 |
| 0x8d..0x96 | vector-latch family (kVectorLatch*) | +0x44 |
| 0xaa, 0xab | vector-load-lmr / -lmr-bf16 | +0x42 |
| 0xa8 | vector-done-with-gains | +0x41 |
| else | — | FATAL |
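The range cascade can be checked with a small Python function that returns the destination offset for an opcode. The function name is hypothetical; the unsigned 16-bit wraparound of the subtractions is emulated with a mask so that opcodes below a range start fail the comparison, as they do in the binary.

```python
def msr_field_offset(opcode: int) -> int:
    """Return the LloInstruction byte offset that receives the MSR bank."""
    op = opcode & 0xFFFF
    if ((op - 0x9B) & 0xFFFF) <= 0xA:   # 0x9b..0xa5 vector-matmul / -lmr
        return 0x46
    if ((op - 0x8D) & 0xFFFF) <= 0x9:   # 0x8d..0x96 vector-latch family
        return 0x44
    if (op & 0xFFFE) == 0xAA:           # 0xaa / 0xab vector-load-lmr
        return 0x42
    if op == 0xA8:                      # vector-done-with-gains
        return 0x41
    raise ValueError(f"msr unsupported for opcode: {op:#x}")
```

Opcodes in the gaps, such as 0x97 or 0xa6, fall through every test and reach the error arm.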
The Msr Enum
MsrToString(Msr) (0x1d629720) writes a 4-byte string by integer arithmetic, confirming the enum is exactly two values:
// MsrToString @0x1d629720 (a2 = Msr ordinal)
*(uint32*)out = ((a2 != 0) << 24) + 0x6172736D; // 0x6172736D == "msra" (LE: m,s,r,a)
// a2 != 0 adds 0x01000000 → last byte 'a'(0x61) → 'b'(0x62) ⇒ "msrb"
out[4] = 0; out.len = 4;
So Msr{0} = "msra", Msr{1} = "msrb". This is the bank index BounceBetweenMsrs toggles, and the bank the cost model's Target-level MsrA/MsrB resources (the latch ports) are charged against.
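The integer trick is easy to reproduce in Python, which makes the byte order explicit. The function name msr_to_string is only a label for this sketch.

```python
import struct


def msr_to_string(msr: int) -> str:
    """Rebuild the 4-byte name the way the binary does, via one 32-bit store."""
    word = ((msr != 0) << 24) + 0x6172736D  # nonzero ordinal bumps the top byte
    return struct.pack("<I", word).decode("ascii")
```

Packing little-endian places 0x6D ('m') first and the top byte last, so adding 0x01000000 changes only the final character from 'a' to 'b'.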
Struct Layouts
The two placement functions read their MrbEntry inputs off the MrbChainAllocator's per-chunk state, and the split logic stores its tail records in a btree. Three byte-exact layouts, decoded from ReleaseMrbReservation (0x10f5f9e0) and SplitAccumulationChain (0x10f598e0):
MrbChunkState — Two Distinct Per-Chunk Arrays
The allocator carries two flat arrays keyed by MrbEntry.chunk_id, not one. They serve different purposes and have different strides:
(a) Per-MXU reservation-timeline array — MrbChainAllocator[+0xf0], allocated clamp(num_mxus, ≥8 when ver≥5) * 0x70 via __size_returning_new, stride 0x70, indexed chunk_id × 0x70 (SplitAccumulationChain 0x10f59941: v13 = base + 112 * chunk_id). Each element is its own boost::multi_index bimap (the chains_unevictable_until_ timeline of the chain allocator) — i.e. the reservation timeline is per-MXU, not one global structure. The per-element ctor (r14 += 0x70 per iteration):
| Field | Offset | Value / role |
|---|---|---|
| multi_index self/sentinel | +0x00 | = &[+0x18] |
| node-header ptr | +0x10 | operator new(0xa8) |
| ordered-index head sentinel | +0x18 | = self |
| size | +0x28 | 0 |
| bucket count | +0x38 | 0x36 = 54 |
| bucket array | +0x40 | operator new(0x1b0) = 54 * 8 |
| max load factor | +0x48 | (int)(54 * 1.0f) (0x3f800000) |
(The boost prime-bucket sizes {53,97,193,389,769,…} are at 0xac092a0; max load factor 1.0f.)
(b) Per-chunk free-entry pool — MrbChainAllocator[+0x7a0] with an SOO bit at MrbChainAllocator[+0x798], a flat array of 0x40-byte records, indexed chunk_id × 0x40 (ReleaseMrbReservation 0x10f5fa18: v7 = (int16)chunk_id << 6). Each record is an absl::container_internal::raw_hash_set whose element is a linked_hash_set<MrbEntry> (policy @0x2181c730):
| Field | Offset | Role |
|---|---|---|
| capacity mask | +0x00 | hash-set capacity |
| size / SOO-state | +0x08 | element count + SOO flag |
| ctrl ptr | +0x10 | control-byte array |
| slots ptr | +0x18 | slot array |
| SOO inline slot | +0x20 | single-element fast path |
| free-list head | +0x28 | doubly-linked recycled-entry list |
| list count | +0x38 | recycled-entry count |
ReleaseMrbReservation hashes the MrbEntry (MixingHashState::kSeed CRC32 over {uint16 chunk_id, int fifo_index, byte format}), finds/inserts it in the chunk's set, then operator new(0x20)s a node {+0x10 = MrbEntry key, +0x18 = format byte} and pushes it onto the free-list ([record+0x28] head, ++[record+0x38]). This recycles a freed FIFO entry back into the per-chunk pool.
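The two indexing rules differ only in stride, which a short Python sketch makes concrete. The base addresses are plain parameters here and the function names are hypothetical; the int16 narrowing before the shift mirrors the (int16)chunk_id << 6 expression quoted above.

```python
TIMELINE_STRIDE = 0x70  # per-MXU reservation-timeline element
POOL_STRIDE = 0x40      # per-chunk free-entry pool record


def to_int16(value: int) -> int:
    """Narrow to a signed 16-bit value, as the (int16) cast does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def timeline_element_addr(timeline_base: int, chunk_id: int) -> int:
    """Address of array (a)'s element: base + 112 * chunk_id."""
    return timeline_base + TIMELINE_STRIDE * chunk_id


def pool_record_addr(pool_base: int, chunk_id: int) -> int:
    """Address of array (b)'s record: base + ((int16)chunk_id << 6)."""
    return pool_base + (to_int16(chunk_id) << 6)
```

Mixing the strides up is an easy reimplementation error, since both arrays are keyed by the same MrbEntry.chunk_id.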
AccumulationChainAfterSplit (0x70 bytes)
When SplitAccumulationChain cuts a chain at the current program order, the tail is deferred as a unique_ptr<AccumulationChainAfterSplit> keyed by program order into a btree_set_container<btree<map_params_impl<long, unique_ptr<…>>>> (root at MrbChainAllocator[+0x70…], insert @0x10f5d2e0). The record:
| Field | Offset | Type | Meaning |
|---|---|---|---|
| span base | +0x00 | AccumulationWithOriginalChain* | tail span begin |
| split program order | +0x08 | s64 | compared vs alloc[+0x70] current_time (0x10f598ff) |
| span ptr | +0x10 | AccumulationWithOriginalChain* | cut cursor; += consumed * 0x60 (AWOC stride 0x60) |
| span len | +0x18 | s64 | -= consumed = current_order − begin |
| MrbEntry.chunk_id | +0x20 | s16 | read at 0x10f599e2 / AdvanceTimeTo |
| MrbEntry.fifo_index | +0x28 | s32 | (also at +0x24) |
| has-MRB-entry flag | +0x2c | byte | gates ReleaseMrbReservation (0x10f5990d/0x10f599d6) |
| matres program order | +0x68 | s64 | the bimap LEFT key (unevictable-until) |
The span cut is lea r14,[rax+rax*2]; shl r14,5 (0x10f59ab6) = consumed * 0x60, so the AccumulationWithOriginalChain element stride is 0x60. The btree node built at operator new(0x30) writes {+0x00 span_base, +0x08 latest_matmul/program_order, +0x10 span_ptr, +0x18 len}. (Source: mxu_accumulation.cc:1017 "Cannot split an AccumulationChain in the past", :1033 "Freeing MRB entry …".)
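The field map and the span cut can be cross-checked with Python's struct module. This is a sketch under assumptions: only the fields listed in the table are named, every other byte (including the +0x24 region the table mentions) is treated as padding, and the names RECORD and cut_span are hypothetical.

```python
import struct

# '<' disables alignment so the offsets are exactly those written here:
#   Q  +0x00 span base        q  +0x08 split program order
#   Q  +0x10 span ptr         q  +0x18 span len
#   h  +0x20 chunk_id         6x padding up to +0x28
#   i  +0x28 fifo_index       B  +0x2c has-MRB-entry flag
#   59x padding up to +0x68   q  +0x68 matres program order
RECORD = struct.Struct("<QqQqh6xiB59xq")
AWOC_STRIDE = 0x60


def cut_span(span_ptr: int, span_len: int, consumed: int) -> tuple:
    """Advance the cut cursor the way lea [rax+rax*2]; shl 5 does."""
    byte_advance = (consumed + consumed * 2) << 5  # consumed * 3 * 32 == consumed * 0x60
    return span_ptr + byte_advance, span_len - consumed


raw = RECORD.pack(0x1000, 42, 0x1000, 5, 3, 9, 1, 77)
new_ptr, new_len = cut_span(0x1000, 5, 2)
```

The format string sums to 0x70 bytes, matching the record size, and the lea/shl pair is just a multiplication by the 0x60 element stride.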
The Downstream Consumer — set_mrb_address
The FIFO slot AllocateMrbEntriesAsFifo computes is committed by LloInstruction::set_mrb_address(int) (0x1d4d93c0), which gates on MRB support, bounds-checks, then delegates to set_mrb_address_unrestricted:
// LloInstruction::set_mrb_address(this, addr) @0x1d4d93c0
target = this->parent()->target();
if (target.MatmulResultBufferEntries() <= 0) // vtbl[+0x390] gate
UpdateStatus("parent()->target() supports …"); // llo_instruction.cc:3660
RET_CHECK(addr < ResultFifoEntryCount(kMrf0, target.DeepseaVersion())) // :3663
// "Expected MRB address in the range 0-512, found: <addr>"
return set_mrb_address_unrestricted(this, addr); // @0x1d4e9780
// set_mrb_address_unrestricted(this, addr) @0x1d4e9780
RET_CHECK(addr >= 0); // :3642 "mrb_address >= 0"
if ((uint16)(op - 0x9b) < 0xB || (op & 0xFFFE) == 0x152) // matmul 0x9b..0xa5, or matres 0x152/0x153
*(uint16*)(this + 0x42) = addr; // ← the physical MRB-address field
else
FATAL("Unsupported opcode: <op>"); // :3655
So the placed slot lands in a 16-bit field at LloInstruction+0x42, valid for the matmul family (0x9b..0xa5) and the matres family (0x152/0x153); mrb_address() (@0x1d4e8860) reads it back (lea ecx,[op-0x9b]; cmp cx,0xb). The per-generation Encoder*::EncodeBundle then serializes that field — and the MSR byte from set_matrix_staging_register — into the bundle word (see the MXU slot and encoder latch serialization pages for the bit positions).
GOTCHA: the error string says "range 0-512" but the actual bound is per-version.
The hard bound is ResultFifoEntryCount(kMrf0, version), i.e. {16,16,16,48,224,256}, not a fixed 512. The "0-512" literal is a stale message; the runtime check compares against the per-gen table value. A reimplementation that hardcodes 512 will accept out-of-range FIFO addresses that the binary rejects (it would reject valid addresses only if it clamped to a bound lower than the per-gen depth).
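The gate, the per-version bound, and the opcode filter can be sketched together in Python. The dict-shaped instruction, the mrb_entries parameter, and the use of exceptions for every failure are simplifications introduced here; the binary reports the failed gate through its status mechanism and the unsupported opcode through a fatal log.

```python
MRF0_DEPTH = (16, 16, 16, 48, 224, 256)  # per-version kMrf0 depth from the table


def set_mrb_address_unrestricted(instr: dict, addr: int) -> None:
    """Store the address into the modelled 16-bit field for matmul/matres opcodes."""
    if addr < 0:
        raise ValueError("mrb_address >= 0")
    op = instr["opcode"] & 0xFFFF
    if ((op - 0x9B) & 0xFFFF) < 0xB or (op & 0xFFFE) == 0x152:
        instr["mrb_address"] = addr & 0xFFFF  # models the field at +0x42
    else:
        raise ValueError(f"Unsupported opcode: {op:#x}")


def set_mrb_address(instr: dict, addr: int, version: int, mrb_entries: int) -> None:
    """Gate on MRB support, bound by the per-version depth, then delegate."""
    if mrb_entries <= 0:
        raise RuntimeError("target does not support the MRB")
    if not addr < MRF0_DEPTH[version]:
        raise ValueError(f"MRB address out of range for version {version}: {addr}")
    set_mrb_address_unrestricted(instr, addr)


matmul = {"opcode": 0x9B}
set_mrb_address(matmul, 47, version=3, mrb_entries=1)
```

On version ordinal 3 the depth is 48, so address 47 is accepted and 48 is rejected, even though both sit well inside the "0-512" range quoted by the message.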
Why the Path Is Gen-Gated Off in v0.0.40
NOTE: every TensorCore Target in this build returns MatmulResultBufferEntries() == 0.
JellyfishTarget (0x1d4902c0), PufferfishTarget (0x1d493b…), ViperfishTarget (0x1d49ac60), and GhostliteTarget all return 0 (xor eax,eax; ret), and the base Target::MinMrbWriteBlockSize / MrbRelativeAddressFromOffset are LogFatal stubs. No subclass compiled into v0.0.40 overrides them to a live value. Consequently set_mrb_address's MatmulResultBufferEntries() > 0 gate fails and the chain allocator's "Cannot use MRB" guard fires: the FIFO-address path runs with MRB disabled. The arithmetic, the cursor recurrence, the accessor surface, and both struct layouts are byte-exact CONFIRMED; only the concrete numeric geometry (the live MinMrbWriteBlockSize granule and MrbRelativeAddressFromOffset map) for the gen that enables MRB is absent from this wheel and awaits a newer-gen Target. Notably, MxuResultEntriesPushed/Popped are overridden per-gen (the Viperfish {2,4,8,8,4,4,4,4} push table) even with MatmulResultBufferEntries == 0.
Confidence Summary
| Claim | Evidence |
|---|---|
| Two per-MXU int32 cursors (write/read), sized num_mxus = Target[+0x4ac], memset 0 | 0x10f3efc4/0x10f3efe0; num_mxus cross-checked vs LatchLhs 0x10f3b791 |
| Per-MXU select unit_id & 3, gated by the 0x400 "has mxu" bit on WORD[+0xb] | 0x10f3f069 |
| Advance by MxuResultEntriesPushed/Popped, round up to MinMrbWriteBlockSize, wrap mod ResultFifoEntryCount | idiv @0x10f3f218, mod @0x10f3f234; vtbl +0x5f0/+0x5f8/+0x5e0 |
| FIFO depth {16,16,16,48,224,256} per version | table @0xb53e240 (struct.unpack direct ELF read) |
| Physical write via set_mrb_address → set_mrb_address_unrestricted into +0x42 | 0x1d4d93c0 / 0x1d4e9780 decompiled |
| 0xa5 pre-scan aborts the whole quadrant bounce | 0x10f3fb34 (cmp …,165) |
| Per-sequence msr ^= 1 toggle from initial msr = 1 | 0x10f3fb60 |
| MSR byte written to +0x46/+0x44/+0x42/+0x41 by op family | set_matrix_staging_register 0x1d4d7d40 decompiled |
| Bounce stamps latches (+0x18) and matmuls[0] (+0x48), not matres | [seq+0x50] guarded by "!sequence->matmuls.empty()" :1063 |
| Msr{0}="msra", Msr{1}="msrb" | MsrToString 0x1d629720 arithmetic |
| Per-MXU timeline element stride 0x70; per-chunk free-pool record stride 0x40 | 0x10f59941 (×0x70), 0x10f5fa18 (<<6) |
| AccumulationChainAfterSplit 0x70-byte field map; AWOC stride 0x60 | SplitAccumulationChain 0x10f598e0; 0x10f59ab6 |
| Six Target accessors at +0x390/+0x398/+0x5e0/+0x5e8/+0x5f0/+0x5f8 | vtable @0x21cce6a0, .rela.dyn-resolved |
| MRB path gen-gated off (all shipped Targets return MatmulResultBufferEntries==0) | JellyfishTarget 0x1d4902c0 etc. (xor eax,eax; ret) |
| MrbRelativeAddressFromOffset is the matres-side relative-address map | vtbl +0x5e8 used at 0x10f3f313; base is LogFatal |
| Per-MXU cursors reset per quadrant call (no cross-region carry) | cursor new/free bracket the function body |
Related Components
| Component | Relationship |
|---|---|
| MrbChainAllocator::SplitAccumulationChain 0x10f598e0 | source of the 0x70-byte split record + per-MXU timeline index |
| MrbChainAllocator::ReleaseMrbReservation 0x10f5f9e0 | the 0x40-byte free-pool record and MrbEntry recycle |
| MxuAssigner::VisitRegion 0x10f3a640 | the per-quadrant array<Span,4> dispatch of both functions |
| MxuAssigner::SetLatchIndices 0x10f3b4c0 / LatchLhs 0x10f3b5e0 (num_mxus read @0x10f3b791) | the LHS gain-matrix latch side (runs before this) |
| LloInstruction::mrb_address 0x1d4e8860 | reads back the +0x42 field this writes |
| ViperfishTarget::MxuResultEntriesPushed/Popped 0x1d499ae0/0x1d499ba0 | the per-gen push/pop counts the cursors advance by |
Cross-References
- MRB Chain Allocator: the program-order reservation timeline that hands each chain the MrbEntry this page turns into a physical FIFO slot + MSR bank; owns the per-MXU boost::multi_index carried in the 0x70-byte timeline element.
- MxuSequence / SequenceInfo: the per-sequence record whose latches (+0x18), matmuls (+0x48), and matreses (+0x60) these two functions index.
- MXU Assignment Bin-Packer: AssignMxusForSequenceGroup, which builds the MxuSequences placed here.
- Latch Assignment & Overrun: the SetLatchIndices/LatchLhs gain-latch side that runs before the FIFO/MSR placement in VisitRegion.
- TPU Scheduling Pipeline: Stage 2's place in the four-stage scheduler stack; this page is the result-FIFO/MSR row of Stage 2.
- MXU Slot: the bundle slot whose mrb_address (+0x42) and matrix-staging-register fields are serialized from the values placed here; the GainLatchMode MSR-A/MSR-B bank model.
- Matprep / IAR / Latch: the latch ops (0x8d..0x96) the MSR bounce stamps.
- LLO Opcode Enum: the 0x9b..0xa5 matmul, 0x8d..0x96 latch (kVectorLatch*), 0xaa/0xab load-lmr, 0xa8 done-with-gains, 0x152 matres numeric space.
- MXU Latency Overview: the cost model that consumes the placed MSR bank (the MatpushModifier staging-register key) to charge per-bank reservation cycles.
- MatmulMode & Modifiers: the MatmulDataFormat/GainLatchMode ordinals keying MxuResultEntriesPushed/Popped.
- MXU Latency: VF, the Viperfish reservation matrix indexed by the bank/format the bounce assigns.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part VIII, Instruction Scheduling & Bundle Packing