Hardware Loop-Counter
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.
Abstract
A TPU loop is counted in one of two places, and the choice is per-sequencer, not per-program. On BarnaCore (and the Jellyfish / Pufferfish AddressHandler) the loop is a true hardware loop: a dedicated silicon register holds the trip count, the hardware decrements and tests it, and the bundle stream contains explicit loop-setup / loop-start / loop-end ops but no software induction variable, no compare, and no branch. The AddressHandler hardware loop is live only through Pufferfish (v4): the JellyfishTarget and PufferfishTarget overrides are real, the ViperfishTarget (v5p) override is a dead stub (below), and Ghostlite (v6e) and 6acc60406 (TPU7x) have no override at all. On the TensorCore and SparseCore the loop is a software back-edge built from the scalar-ALU op set (an ADDri IV update, a CMPxx, and a BRcond), with the hardware Loop Counter (LCC) register mirroring the iteration count so the body can read it (for iteration-dependent addressing) without keeping its own counter. The LCC drives nothing; it is a readable snapshot.
The loop-counter register is named LCC throughout the binary, a 64-bit value read as two 32-bit halves and unified with the global time counter (GTC) under one enum (CycleCounterType { kLCC, kGTC }). It is not an allocatable register — there is no LCC class in the register file alongside S/V/M/P0..P31; like GTC it is a special control register read into a scalar destination. The number of addressable LCC registers is the per-generation count this page pins: Jellyfish exposes none on its TensorCore (loops live only in the AddressHandler Loop slot); Pufferfish exposes two (LCC0, LCC1) through an indexed read-register enum; Viperfish, Ghostlite, and Trillium expose one implicit counter through a pair of dedicated dest-only opcodes.
This page documents the loop counter as a reimplementation target: the per-generation count and read mechanism, the encoding of the two loop mechanisms (hardware-counted vs software-counted-with-readback), the AddressHandler single-active-loop builder with its preheader / minimum-length constraints, the two LLVM PipelinerLoopInfo subclasses that prove the hardware-vs-software split, and how the sequencer drives the back-edge. The op inventory and the lane it occupies are on the Sequencer Slot page; this page owns the counter itself.
For reimplementation, the contract is:
- LCC is a 64-bit special control register, read as low+high 32-bit halves into a 5-bit-selected scalar destination; max representable trip count 2^64 − 1. It is not in the allocatable register file.
- The count per generation: JF TC = 0 (AddressHandler Loop slot only), PF = 2 (LCC0/LCC1 via the Tcs/Bcs read-register enum), V5+ = 1 implicit (via ReadRegisterLcc{Low,High}, a dest-only opcode with no index operand).
- Two mechanisms: hardware-counted (BarnaCore / AddressHandler: getIVUpdate / getCmp are NULL, getTripCount is −1) vs software-counted (TC / SparseCore: real IV/cmp/trip-count MIs), selected by analyzeLoopForTPUPipelining on a subtarget feature bit plus the terminator opcode.
- The AddressHandler hardware loop is single-active (one loop_start_ field, not a stack), body length ≥ 2 instructions, with a mandatory non-loop preheader instruction.
- Nesting is constrained: one active hardware loop per sequencer; PF's two LCC registers permit reading a depth-2 nest; outer loops are software back-edges.
| Item | Value |
|---|---|
| Counter register | LCC ("Loop Counter"), 64-bit, read as lo+hi 32-bit halves |
| Counter enum | LloInstruction::CycleCounterType { kLCC, kGTC } (shares the read datapath with GTC) |
| PF read mechanism | indexed TcsReadRegister / BcsReadRegister enum (LCC0, LCC1, …) |
| V5+ read mechanism | ReadRegisterLccLow / ReadRegisterLccHigh — dest-only opcodes, no index |
| V5+ dest encoder | TensorCoreScalarAlu0Compact_ReadRegisterLccLow::set_dest @ 0x1f62ff60 → BitCopy(buf,467,&dest,0,5) |
| HW-vs-SW selector | TPUInstrInfo::analyzeLoopForTPUPipelining @ 0x13b804c0 (feature bit [subtarget+0x158]&1 + terminator opcode) |
| HW loop-info class | (anon)::TPUBarnaCorePipelinerLoopInfo — getIVUpdate=NULL @ 0x13b86560, getTripCount=−1 @ 0x13b865a0 |
| SW loop-info class | (anon)::TPUSparseCorePipelinerLoopInfo — getTripCount=[this+0x28] @ 0x13b86260 |
| AddressHandler builder | AddressHandlerProgramBuilder::BeginLoop @ 0xfa90d40 / EndLoop @ 0xfa91300 |
| AddressHandler per-gen insert | JellyfishTarget::InsertAddressHandlerLoop @ 0x1d490e00 (live) / PufferfishTarget @ 0x1d495340 (live) / ViperfishTarget @ 0x1d49b980 (__noreturn stub) |
| BarnaCore HW opcodes | bcLOOP_SETUP (0x194), bcLOOP_START (0xf8a), bcLOOP_END (0x193) |
| LLO loop kinds | LloLoopKindProto { LOOP_KIND_NONE, LOOP_KIND_WHILE, LOOP_KIND_DOWHILE } |
The LCC Register and Its Width
The hardware loop counter is named LCC ("Loop Counter") everywhere in the binary and is a 64-bit value. It is never read whole: the program reads it as two 32-bit halves (rdreg.lcc.lo + rdreg.lcc.hi on the TensorCore; ReadRegisterLccLow + ReadRegisterLccHigh on V5+), exactly mirroring the global time counter (GTC), which is read the same lo+hi way. The LLO IR unifies the two under a single enum — LloInstruction::CycleCounterType { kLCC, kGTC } — and they share the read-register datapath. LCC is the per-loop iteration counter; GTC is the global time counter.
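The lo+hi readback can be sketched in Python as follows. This is a minimal illustration, assuming the two halves have already been read into scalar values; combine_lcc and split_lcc are hypothetical helper names, not symbols from the binary.

```python
MASK32 = 0xFFFFFFFF


def combine_lcc(lo: int, hi: int) -> int:
    """Join the low and high 32-bit halves into one 64-bit counter value."""
    if not (0 <= lo <= MASK32 and 0 <= hi <= MASK32):
        raise ValueError("each half must fit in 32 bits")
    return (hi << 32) | lo


def split_lcc(value: int) -> tuple[int, int]:
    """Split a 64-bit counter value into (lo, hi) 32-bit halves."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in 64 bits")
    return value & MASK32, value >> 32
```

With both halves saturated the combined value is 2^64 − 1, the maximum representable trip count stated in the contract.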
The crucial reimplementation fact: LCC is not an allocatable register. The TPU register file has S / V / M / P0..P31 / XRF / DRF / ERF / SFRF / V2SF / CB / VAGG classes and no LCC class. LCC, like GTC, is a special control register that the sequencer reads into a scalar destination S0..S31. Each half lands in one 32-bit scalar register; the destination is a 5-bit-selected scalar index, confirmed byte-exact from the V5+ encoder:
// gfc::isa::TensorCoreScalarAlu0Compact_ReadRegisterLccLow::set_dest(unsigned) @ 0x1f62ff60
int dest = a2; // the scalar destination index
return BitCopy(buf, /*dst_bit=*/467, &dest, /*src_bit=*/0, /*width=*/5); // 5-bit dest, abs bit 467
BitCopy(buf, 467, &dest, 0, 5) writes the 5-bit destination at absolute bundle bit 467 (0x1d3) — and it is the only field this op writes. All bit positions on this page are LSB-first (bit 0 = least-significant bit of byte 0), matching the universal BitCopy(dst, dst_bit, src, src_bit, nbits) packer (0x1fa0a900) and the convention pinned in Bundle Model §bit-numbering. This is the same 5-bit dest at bit 467 that the Sequencer Slot call/return-address encoder writes. There is no loop-counter-index operand, which is the structural proof that V5+ exposes a single implicit counter (see Per-Generation Count). The maximum trip count the counter can represent is 2^64 − 1; the 64-bit readback (lo+hi) is fixed, while whether the silicon down-counter is the full 64 bits or narrower is not separable from the binary.
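The LSB-first packing described above can be sketched in Python. This is an illustrative model only: bit_copy mirrors the BitCopy(dst, dst_bit, src, src_bit, nbits) argument order, while the 64-byte buffer size and the helper names set_dest / get_dest are assumptions chosen for the example, not the real bundle width or API.

```python
DEST_BIT = 467    # absolute bundle bit of the scalar destination field
DEST_WIDTH = 5    # 5-bit selector for S0..S31


def bit_copy(dst: bytearray, dst_bit: int, src: int, src_bit: int, nbits: int) -> None:
    """Copy nbits of src (from src_bit upward) into dst at dst_bit, LSB-first."""
    for i in range(nbits):
        bit = (src >> (src_bit + i)) & 1
        byte, off = divmod(dst_bit + i, 8)
        dst[byte] = (dst[byte] & ~(1 << off) & 0xFF) | (bit << off)


def set_dest(bundle: bytearray, dest: int) -> None:
    """Write the 5-bit scalar destination index at bit 467."""
    if not 0 <= dest < (1 << DEST_WIDTH):
        raise ValueError("dest must be a 5-bit scalar index")
    bit_copy(bundle, DEST_BIT, dest, 0, DEST_WIDTH)


def get_dest(bundle: bytearray) -> int:
    """Read the 5-bit destination back out of the bundle."""
    value = 0
    for i in range(DEST_WIDTH):
        byte, off = divmod(DEST_BIT + i, 8)
        value |= ((bundle[byte] >> off) & 1) << i
    return value
```

Bit 467 falls at byte 58, bit 3, so the 5-bit field occupies the top five bits of that byte and never straddles a byte boundary.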
NOTE: the LLVM-generic hardware-loop-counter-bitwidth cl::opt ("Set the loop counter bitwidth") belongs to the target-independent HardwareLoops pass, not to the silicon LCC. The TPU LCC is fixed at 64-bit per the lo/hi read structure; do not conflate the two.
Per-Generation Count and Read Mechanism
The "count" is the number of addressable loop-counter registers a program can read. Two distinct register-naming conventions exist in the binary, and they reveal the count directly.
Pufferfish (v4) — TWO loop counters, explicitly indexed. The Pufferfish read-register set is an enum, TcsReadRegister (TensorCore Sequencer) and BcsReadRegister (BarnaCore Sequencer), whose values name two distinct loop counters and two distinct time counters:
TcsReadRegister: BcsReadRegister:
TCS_READ_REGISTER_LCC0 BCS_READ_REGISTER_LCC0
TCS_READ_REGISTER_LCC1 BCS_READ_REGISTER_LCC1
TCS_READ_REGISTER_GTC0 BCS_READ_REGISTER_GTC0
TCS_READ_REGISTER_GTC1 BCS_READ_REGISTER_GTC1
TCS_READ_REGISTER_TAG_REGISTER BCS_READ_REGISTER_TAG_REGISTER
TCS_READ_REGISTER_TRACEMARK_REG BCS_READ_REGISTER_TRACEMARK_REG / FSR / HDR
The active counter is selected by the reg enum field of the scalar read-register op, not by a dedicated opcode — which is why no pxc::isa::*ReadRegisterLcc* function exists in the binary (the nm set has none). A reader of LCC0 vs LCC1 is choosing an enum value, not a different instruction.
Viperfish (v5p) / Ghostlite (v6e) / 6acc60406 (TPU7x) — ONE loop counter, implicit. V5+ replaced the indexed read with two dedicated opcodes per engine: ReadRegisterLccLow reads LCC[31:0], ReadRegisterLccHigh reads LCC[63:32], each into a scalar dest. The encoders exist for vxc, gxc::glc, and gxc::gfc on both ScalarAlu0 and ScalarAlu1 and on the SparseCore. As shown above, the op's only operand is the 5-bit destination — there is no index field, so the program can address exactly one (implicit) loop counter per sequencer.
Jellyfish (v2) / Dragonfish (v3) — NO LCC read on the TensorCore. No jellyfish::isa::*ReadRegisterLcc* symbol exists and no Jellyfish read-register enum names LCC; the JF TensorCore has only the cycle-counter reads. Jellyfish exposes a hardware loop only through the AddressHandler Loop slot (below), never via an LCC read on the TensorCore.
| Gen | TC LCC regs | SC/BCS LCC regs | Read mechanism |
|---|---|---|---|
| Jellyfish (v2) | 0 | n/a (BCAH Loop slot) | cycle-counter only on TC |
| Dragonfish (v3) | 0 (alias JF) | n/a | inherits Jellyfish |
| Pufferfish (v4) | 2 (LCC0/LCC1) | 2 (BCS LCC0/LCC1) | Tcs/Bcs read-register enum |
| Viperfish (v5p) | 1 (implicit) | 1 (implicit) | ReadRegisterLcc{Low,High} (dest-only) |
| Ghostlite (v6e) | 1 (implicit) | 1 (implicit) | ReadRegisterLcc{Low,High} (dest-only) |
| 6acc60406 (TPU7x) | 1 (implicit) | 1 (implicit) | ReadRegisterLcc{Low,High} (dest-only) |
GOTCHA: Pufferfish genuinely lacks the V5+ ReadRegisterLcc{Low,High} opcode form, but it still has two LCC registers: it reads them through the indexed TcsReadRegister / BcsReadRegister enum instead. A reimplementation that drives off the V5+ opcode shape will miss the PF loop counters entirely.
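The per-generation table can be captured as a small lookup, sketched here in Python. The type and function names (ReadMechanism, LccInfo, tc_lcc_read) are hypothetical; the counts and mechanism strings come from the table above. How Pufferfish selects the lo/hi half of an indexed counter is not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class ReadMechanism(Enum):
    NONE = "cycle-counter only on TC"
    INDEXED_ENUM = "Tcs/Bcs read-register enum"
    DEST_ONLY_OPCODE = "ReadRegisterLcc{Low,High}"


@dataclass(frozen=True)
class LccInfo:
    tc_lcc_regs: int          # addressable LCC registers on the TensorCore
    mechanism: ReadMechanism


LCC_BY_GENERATION = {
    "Jellyfish": LccInfo(0, ReadMechanism.NONE),
    "Dragonfish": LccInfo(0, ReadMechanism.NONE),
    "Pufferfish": LccInfo(2, ReadMechanism.INDEXED_ENUM),
    "Viperfish": LccInfo(1, ReadMechanism.DEST_ONLY_OPCODE),
    "Ghostlite": LccInfo(1, ReadMechanism.DEST_ONLY_OPCODE),
    "6acc60406": LccInfo(1, ReadMechanism.DEST_ONLY_OPCODE),
}


def tc_lcc_read(generation: str, index: int = 0) -> list[str]:
    """Return the op or enum names used to read TensorCore LCC number `index`."""
    info = LCC_BY_GENERATION[generation]
    if not 0 <= index < info.tc_lcc_regs:
        raise ValueError(f"{generation} has no TensorCore LCC{index}")
    if info.mechanism is ReadMechanism.INDEXED_ENUM:
        return [f"TCS_READ_REGISTER_LCC{index}"]
    return ["ReadRegisterLccLow", "ReadRegisterLccHigh"]
```

Keying the dispatch on the mechanism rather than on the opcode shape avoids the Pufferfish pitfall described in the GOTCHA.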
The Two Loop Mechanisms
Which mechanism a loop uses is decided by the sequencer, and the split is proven by the two — and only two — LLVM PipelinerLoopInfo subclasses. There is no TensorCorePipelinerLoopInfo; the TensorCore software loop uses the SparseCore (software) info.
llvm::TPUPipelinerLoopInfo (base)
├─ (anon)::TPUBarnaCorePipelinerLoopInfo — hardware-counted (counter is in silicon)
└─ (anon)::TPUSparseCorePipelinerLoopInfo — software-counted (IV in the scalar register file)
The BarnaCore subclass returns nulls for every software-loop component — the clearest possible evidence that the hardware does the counting:
// (anon)::TPUBarnaCorePipelinerLoopInfo — decoded byte-exactly
getIVUpdate() @ 0x13b86560 -> return 0; // NULL — no software induction variable
getCmp() @ 0x13b86580 -> return 0; // NULL — no software compare
getTripCount() @ 0x13b865a0 -> return -1; // -1 = "no software trip count"
adjustTripCount @ 0x13b86500 -> ret; // no-op
The SparseCore subclass returns real machine instructions for each:
// (anon)::TPUSparseCorePipelinerLoopInfo — decoded byte-exactly
getIVUpdate() @ 0x13b86220 -> return *((void**)this + 3); // [this+0x18] = ADDri MI
getCmp() @ 0x13b86240 -> return *((void**)this + 2); // [this+0x10] = CMP/BRcond MI
getTripCount() @ 0x13b86260 -> return *((void**)this + 5); // [this+0x28] = trip-count MI
The selection happens in analyzeLoopForTPUPipelining (0x13b804c0). It first tests a subtarget feature bit ([*subtarget + 0x158] & 1 — byte offset 0x158 = 344; the BarnaCore hardware-loop mode); then it walks the loop's terminators looking for one of three loop-terminator opcodes (325, 403, 328). On a match it reads a per-loop HW-mode marker byte at [loop_header + 0x122] (= 290); when that byte is 1 it allocates the BarnaCore info via the operator new(8u) / vtable = off_2192D4E8 arm (mid-function, reached only after the terminator match); otherwise it detects an IV and compare and builds the 48-byte SparseCore info (operator new(0x30u)):
// llvm::TPUInstrInfo::analyzeLoopForTPUPipelining @ 0x13b804c0 (decoded byte-exactly)
subtarget = loop_header->parent_subtarget; // [a3+4]→[+32]
if ((subtarget->vtable_field[0x158] & 1) == 0) return 0; // not pipelineable on this target
for (term in loop terminators) { // walk the terminator chain
if (term.opcode == 325 || term.opcode == 403 || term.opcode == 328) {
if (loop_header.flags[+0x122] == 1) // BarnaCore HW-mode marker byte == 1
return new TPUBarnaCorePipelinerLoopInfo; // operator new(8); no IV/cmp/trip
// else: find predicate operand, detect IV update + compare (analyzeIVUpdateforPipelining)
...
return new TPUSparseCorePipelinerLoopInfo(iv, cmp, ...); // operator new(0x30)
}
}
NOTE: the three terminator opcodes (325, 403, 328) and the predicate-operand opcode 540 are the raw MC opcode integers analyzeLoopForTPUPipelining matches; their symbolic names were not resolved (the integers are byte-exact from the decompile; the names are not). The function additionally special-cases opcode 540 when extracting the predicate operand index.
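The selection logic can be modeled in a few lines of Python. This is a sketch under stated assumptions: the class and function names are hypothetical, the IV / compare / trip-count detection that the decompile elides is replaced by pre-supplied placeholder strings, and returning None when no terminator matches is an assumption about the fall-through path.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

LOOP_TERMINATOR_OPCODES = frozenset({325, 403, 328})  # raw MC opcode integers


class HardwareLoopInfo:
    """Models the BarnaCore info: every software-loop component is absent."""

    def get_iv_update(self): return None
    def get_cmp(self): return None
    def get_trip_count(self): return -1


@dataclass
class SoftwareLoopInfo:
    """Models the SparseCore info: real machine instructions for each part."""
    iv_update: str
    cmp: str
    trip_count: str

    def get_iv_update(self): return self.iv_update
    def get_cmp(self): return self.cmp
    def get_trip_count(self): return self.trip_count


def analyze_loop(
    feature_bit: bool,
    terminator_opcodes: Sequence[int],
    hw_mode_marker: int,
    iv_update: str = "ADDri",
    cmp: str = "CMPxx",
    trip_count: str = "trip-count MI",
) -> Optional[Union[HardwareLoopInfo, SoftwareLoopInfo]]:
    if not feature_bit:
        return None                      # not pipelineable on this target
    for opcode in terminator_opcodes:
        if opcode in LOOP_TERMINATOR_OPCODES:
            if hw_mode_marker == 1:      # per-loop HW-mode marker byte
                return HardwareLoopInfo()
            return SoftwareLoopInfo(iv_update, cmp, trip_count)
    return None                          # assumed: no matching terminator
```

The marker byte is only consulted after a terminator match, which matches the mid-function placement of the BarnaCore allocation arm.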
So a BarnaCore loop body is: bcLOOP_SETUP (load the count) → bcLOOP_START (mark the body head; one value operand = the bound) → body → bcLOOP_END (the hardware decrement+test+branch-back terminator). A TC/SC loop body is: preheader (MOV/ADDri to init the IV) → body → ADDri (IV += stride) → CMPxx → BRcond back to the head, with the LCC register mirroring the count for in-body reads.
QUIRK: the BarnaCore loop index is itself a readable value, distinct from the LCC snapshot. The intrinsics bc.extractvalue.loopindex (read the current index) and bc.insertvalue.loopindex (seed it) expose the live counter that bcLOOP_END decrements, a different datapath from the ReadRegisterLcc mirror used on the software-counted engines. A reimplementer must not assume "read the loop count" maps to the same op on both mechanisms.
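The two op sequences described above can be laid side by side in a short Python sketch. The function names are hypothetical and the op names are used as plain strings; nothing here models encoding or timing.

```python
SOFTWARE_COUNTER_OPS = {"ADDri", "CMPxx", "BRcond"}


def hardware_loop_ops(body: list[str]) -> list[str]:
    """BarnaCore shape: setup, start, body, hardware back-edge terminator."""
    return ["bcLOOP_SETUP", "bcLOOP_START", *body, "bcLOOP_END"]


def software_loop_ops(body: list[str]) -> list[str]:
    """TC/SC shape: preheader IV init, body, IV update, compare, branch."""
    return ["MOV/ADDri (init IV)", *body, "ADDri", "CMPxx", "BRcond"]


def has_software_counter(ops: list[str]) -> bool:
    """True when the stream carries an explicit IV update, compare, or branch."""
    return any(op in SOFTWARE_COUNTER_OPS for op in ops)
```

A hardware-counted stream contains none of the three software-counter ops, which is the property the NULL-returning BarnaCore loop info reflects.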
How the Sequencer Drives the Loop
The loop-control op always occupies the scalar sequencer slot — lane 0 of the scalar-ALU sub-bundle, the only lane that can mutate the program counter (see Sequencer Slot). The two mechanisms place different ops there.
Software loop (TC / SparseCore). The back-edge is a BRcond (the conditional branch); the IV update (ADDri) and the compare (CMPxx) are scalar-ALU ops in the same lane family. The branch target is a signed 20-bit field in immediate slot 0, not inside the sequencer slot bytes; the absolute-vs-relative distinction is purely an opcode discriminator over that one field. The CMPxx family spans signed and unsigned comparisons in register-immediate (CMPxxri) and register-register (CMPxxrr) forms — the immediate form encodes a compile-time-constant trip bound, the register form a dynamic / SPU-computed bound. Because the back-edge is a branch, the loop body's last bundle is a branch-terminator bundle, which interacts with the branch-delay-slot packing.
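A signed 20-bit field of the kind described above can be packed and unpacked as sketched below. The sketch assumes a two's-complement representation, which the source does not state explicitly, and the helper names are hypothetical.

```python
BRANCH_BITS = 20
BRANCH_MASK = (1 << BRANCH_BITS) - 1
BRANCH_MIN = -(1 << (BRANCH_BITS - 1))      # -524288
BRANCH_MAX = (1 << (BRANCH_BITS - 1)) - 1   #  524287


def encode_branch_target(value: int) -> int:
    """Pack a signed target into a 20-bit field (two's complement assumed)."""
    if not BRANCH_MIN <= value <= BRANCH_MAX:
        raise ValueError("target does not fit in a signed 20-bit field")
    return value & BRANCH_MASK


def decode_branch_target(field: int) -> int:
    """Sign-extend a 20-bit field back to a Python int."""
    field &= BRANCH_MASK
    if field & (1 << (BRANCH_BITS - 1)):
        return field - (1 << BRANCH_BITS)
    return field
```

Whether the decoded value is treated as absolute or relative is decided by the opcode, not by the field, so the same pair of helpers would serve both forms.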
Hardware loop (BarnaCore). bcLOOP_SETUP (opcode 0x194) loads the trip count into the loop register in the preamble; bcLOOP_START (0xf8a) marks the body head and carries one value operand (the bound — a register-materialized value that may originate from an immediate via SETUP); bcLOOP_END (0x193) is the back-edge terminator the hardware uses to decrement, test, and branch back. A BarnaCore channel-scalar slot carries a loop bit and a branch bit, and setting both is an encoding error — the diagnostic "Invalid Barnacore Channel Instruction with both Loop and Branch bits set" enforces that a slot is either a loop control or a branch, never both.
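The loop-bit / branch-bit exclusion can be expressed as a one-function validator. This is a sketch: the two bits are modeled as booleans because their positions in the slot are not pinned here, and check_channel_scalar_slot and EncodingError are hypothetical names. Only the diagnostic string is taken from the binary.

```python
class EncodingError(ValueError):
    """Raised when a slot requests mutually exclusive controls."""


def check_channel_scalar_slot(loop_bit: bool, branch_bit: bool) -> str:
    """Classify a channel-scalar slot, rejecting loop+branch together."""
    if loop_bit and branch_bit:
        raise EncodingError(
            "Invalid Barnacore Channel Instruction with both Loop and Branch bits set"
        )
    if loop_bit:
        return "loop"
    if branch_bit:
        return "branch"
    return "plain"
```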
The trip count enters the LLO loop as a canonical index space: LloLoopProto.LoopIndexSpaceProto = { start, limit, step }. The limit may be a constant or a dynamic value; the binary tracks num_dynamic_loop_bounds_ (each dynamic bound occupying two slots, offset < num_dynamic_loop_bounds_ * 2), and a loop with num_dynamic_loop_bounds_ == 0 is fully static. The XLA-side WhileLoopBackendConfig { KnownTripCount, KnownInitStep, KnownInductionVariable } propagates the trip count down into this index space. The LLO pass (anon)::UpdateLoopCounter(LloRegion*) walks the region, finds the loop, and materializes the trip counter into the loop's carried tuple so the hardware LCC or the software IV can be seeded; it gives up with UnknownLoopCountComplexCFG when the CFG is too tangled to identify a single trip counter.
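For a fully static loop, the trip count implied by a { start, limit, step } index space can be computed as sketched below. The half-open interval convention (limit excluded) and the handling of negative steps are assumptions for illustration; the source does not specify them.

```python
def trip_count(start: int, limit: int, step: int) -> int:
    """Iterations of a counted loop over [start, limit) with the given step."""
    if step == 0:
        raise ValueError("step must be non-zero")
    if step > 0:
        span = limit - start
    else:
        span, step = start - limit, -step
    if span <= 0:
        return 0
    return -(-span // step)   # ceiling division
```

For example, start=0, limit=10, step=3 visits 0, 3, 6, 9, giving four iterations under these assumptions.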
Zero-trip is handled at the loop-kind level, not by the counter: LloLoopKindProto = { LOOP_KIND_NONE, LOOP_KIND_WHILE, LOOP_KIND_DOWHILE }. WHILE tests before the body and can execute zero times; DOWHILE tests after and runs at least once. The MLIR property verify_non_zero_trip asserts a loop runs ≥ 1 iteration so it can lower as DOWHILE / a hardware-counted loop without a guard. Small constant-trip loops bypass the counter entirely — "Loops with a constant trip count smaller than this value will not use the count register" — and are unrolled or peeled instead.
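The difference between the two loop kinds shows up only at a trip count of zero, as this small Python model illustrates. LoopKind and run_loop are hypothetical names; the model counts body executions and nothing else.

```python
from enum import Enum


class LoopKind(Enum):
    NONE = 0
    WHILE = 1
    DOWHILE = 2


def run_loop(kind: LoopKind, trip_count: int) -> int:
    """Return how many times the body executes for the given loop kind."""
    executed = 0
    i = 0
    if kind is LoopKind.WHILE:
        while i < trip_count:        # test before the body
            executed += 1
            i += 1
    elif kind is LoopKind.DOWHILE:
        while True:                  # test after the body
            executed += 1
            i += 1
            if not i < trip_count:
                break
    return executed
```

A zero-trip WHILE runs the body zero times while a zero-trip DOWHILE runs it once, which is why a non-zero-trip guarantee is needed before lowering without a guard.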
The AddressHandler Hardware Loop (Jellyfish / Pufferfish)
Jellyfish has no TensorCore LCC, so its hardware loop lives entirely in the BarnaCore AddressHandler sequencer, built by AddressHandlerProgramBuilder::BeginLoop (0xfa90d40) / EndLoop (0xfa91300). The builder tracks a single loop-region field, loop_start_ (at this+0x18, the 7th int), with a kNoLoopActive = −1 sentinel — not a stack — which is the structural proof that AddressHandler loops cannot software-nest. The CHECKs are byte-exact:
// AddressHandlerProgramBuilder::BeginLoop @ 0xfa90d40 (decoded)
CHECK(loop_start_ == kNoLoopActive); // -1; fails if a loop is already active (line 903)
loop_start_ = instructions_.size(); // record the body head
CHECK(loop_start_ >= 1); // "Code must start with one non-loop instruction" (line 905)
// AddressHandlerProgramBuilder::EndLoop @ 0xfa91300 (decoded)
CHECK(loop_start_ != kNoLoopActive); // (line 910)
CHECK(loop_start_ >= 1); // "Code must start with one non-loop instruction" (line 911)
CHECK(loop_start_ < instructions_.size()); // (line 912)
loop_length = instructions_.size() - loop_start_;
CHECK(loop_length >= 2); // "Jellyfish spec requires that loop must have at least two instructions" (line 915)
insn = instructions_[loop_start_ - 1]; // the preheader instruction
CHECK(!insn.scalar.loop_start); // (line 917)
insn.scalar.loop_start = 1;
insn.scalar.loop_count = loop_length - 1; // body length minus the preheader slot
loop_start_ = kNoLoopActive; // reset; loop is closed
So an AddressHandler loop requires a mandatory non-loop preheader instruction (loop_start_ >= 1) and a body of at least two instructions (loop_length >= 2). EndLoop stamps the loop-start flag and the loop-count into the preheader instruction record, then resets loop_start_. The identical Pufferfish check string — "Pufferfish spec requires that loop must have at least two instructions" — confirms the same minimum-length rule carries to v4.
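The builder contract above can be modeled in Python to check a reimplementation against it. This is a sketch, not the real class: CHECK failures are modeled as exceptions, the Instruction record and emit method are hypothetical, and only the checks and the loop_count = loop_length − 1 stamp follow the decoded logic.

```python
from dataclasses import dataclass

K_NO_LOOP_ACTIVE = -1


@dataclass
class Instruction:
    name: str
    loop_start: bool = False
    loop_count: int = 0


class AddressHandlerBuilderModel:
    def __init__(self) -> None:
        self.instructions: list[Instruction] = []
        self.loop_start = K_NO_LOOP_ACTIVE    # single field, not a stack

    def emit(self, name: str) -> None:
        self.instructions.append(Instruction(name))

    def begin_loop(self) -> None:
        if self.loop_start != K_NO_LOOP_ACTIVE:
            raise RuntimeError("a loop is already active")
        if len(self.instructions) < 1:
            raise RuntimeError("Code must start with one non-loop instruction")
        self.loop_start = len(self.instructions)   # record the body head

    def end_loop(self) -> None:
        if self.loop_start == K_NO_LOOP_ACTIVE:
            raise RuntimeError("no loop is active")
        if self.loop_start >= len(self.instructions):
            raise RuntimeError("loop body is empty")
        loop_length = len(self.instructions) - self.loop_start
        if loop_length < 2:
            raise RuntimeError("loop must have at least two instructions")
        pre = self.instructions[self.loop_start - 1]   # preheader record
        if pre.loop_start:
            raise RuntimeError("preheader already marks a loop")
        pre.loop_start = True
        pre.loop_count = loop_length - 1
        self.loop_start = K_NO_LOOP_ACTIVE             # loop is closed
```

A program of one preheader followed by a two-instruction body ends with the preheader carrying loop_start set and loop_count equal to 1.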
The loop body is not an offset field — the per-generation Target::InsertAddressHandlerLoop overrides write a count into a dedicated BarnaCoreAddressHandlerScalarSlot_Loop proto sub-message. JellyfishTarget::InsertAddressHandlerLoop (0x1d490e00) re-checks program_in_loop.bundles_size() >= 2 (the same "Jellyfish spec requires that loop must have at least two instructions" string, target_jellyfish.h:90), then default-constructs the Loop sub-message and stores loop_count = bundles − 1 (*(loop + 24) = v30 − 1) — the body-bundle count minus the preheader, identical to the loop_count = loop_length − 1 that EndLoop stamps. PufferfishTarget::InsertAddressHandlerLoop (0x1d495340) is the same shape with the "Pufferfish spec requires that loop must have at least two instructions" string and the same bundles − 1 count write.
GOTCHA: the diagnostics "loop end is out of range or not a positive multiple of 2" / "loop start is out of range or not a negative multiple of 2" belong to LLVM's bundled ARM backend ((anon)::ARMAsmParser::matchAndEmitInstruction @ 0x15185a20, the Armv8.1-M low-overhead-loop WLS/LE validation), not to any TPU InsertAddressHandlerLoop path. The TPU AddressHandler loop carries an iteration count (bundles − 1), not a signed even byte-offset; there is no even-multiple constraint in the TPU encode path.
The AddressHandler loop persists only through Pufferfish (v4): the JellyfishTarget and PufferfishTarget overrides are both live with real proto construction. The ViperfishTarget::InsertAddressHandlerLoop override (0x1d49b980) exists but is a __noreturn stub that fatals with "Deepsea version not supported" (target_viperfish.h:320), so the AddressHandler-style hardware loop is dropped at Viperfish (v5p); there is no Ghostlite or 6acc60406 override at all.
| Element | BarnaCore / AddressHandler (HW loop) | TC / SparseCore (SW loop) |
|---|---|---|
| Loop begin | bcLOOP_SETUP (load count) + bcLOOP_START (1 bound operand); BCAH BeginLoop sets loop_start_ | preheader: init scalar IV (MOV / ADDri) |
| Body length | loop_count = bundles − 1 in the ScalarSlot_Loop proto; body ≥ 2 instr | implicit (basic-block span) |
| Loop end | bcLOOP_END — HW decrement+test+branch-back | ADDri + CMPxx + BRcond back-edge |
| Counter | dedicated HW loop register (1; LCC0/LCC1 on PF) | scalar-reg IV; HW LCC mirrors count (readable) |
| Trip source | bound operand (reg / immediate via SETUP) | CMPxxri (imm) or CMPxxrr (reg, dynamic) |
| Live index read | bc.extractvalue.loopindex | ReadRegisterLcc{Low,High} / indexed Tcs/Bcs enum |
| Nesting | single active loop (BeginLoop CHECK loop_start_ == kNoLoopActive) | software IVs nest freely; only innermost HW-counted |
Nesting Model
Hardware-loop nesting is intentionally bounded:
- AddressHandler / BarnaCore: one active hardware loop at a time, enforced by the single loop_start_ field and the BeginLoop CHECK above. There is no loop-counter stack at this level, so there are no nested AddressHandler / BarnaCore hardware loops.
- The LLVM-generic option strings "force-nested-hardware-loop" / "nested hardware-loops not supported" confirm the default TPU HardwareLoops path does not support nested hardware loops; the common case is one hardware-counted innermost loop with outer loops handled as software back-edges.
- Pufferfish's two LCC registers (LCC0, LCC1) allow reading two distinct loop counters, supporting at most a depth-2 nest where inner and outer each have a distinct counter, selected by the read-register enum value. Whether LCC0/LCC1 correspond to (outer, inner) nest levels or (TC-issued, BC-issued) loops is not traced here.
- V5+ has a single implicit LCC, so only the innermost hardware-counted loop's count is directly readable; outer loops use software IVs.
- The LLO loop region (LloRegionMember::kLoop) can nest in the IR (a kLoop member's sub-region may contain another kLoop), but only the innermost gets the hardware counter; the compiler flattens, unrolls, or software-counts the rest.
What Is Not Yet Pinned
- The bcLOOP field bit positions. The trip-count immediate field of bcLOOP_SETUP and the body-offset fields of bcLOOP_START / bcLOOP_END within the BarnaCore bundle are routed by the LLVM MC encoder (TPUMCCodeEmitter); the per-opcode InstBits records are not byte-decoded here. The ops and their roles are pinned; the exact field widths are not.
- The silicon counter width. The 64-bit readback (lo+hi) is fixed; whether the down-counter is the full 64 bits or narrower is not separable from the binary.
- PF LCC0 vs LCC1 assignment policy. Two counters exist; which compiler pass picks LCC0 vs LCC1, and whether the choice tracks nest level or issuing engine, is not traced here.
- The three loop-terminator MC opcodes (325, 403, 328) and the predicate opcode 540. The integers analyzeLoopForTPUPipelining matches are byte-exact; their symbolic LLVM-MC names (which of bcLOOP_END / branch terminators they correspond to) are not individually resolved here.
Cross-References
- Sequencer Slot: the lane-0 scalar slot the loop-control ops (BRcond, bcLOOP_*, ReadRegisterLcc) occupy, and the JF software-bundle-index loop vs the V5+ LCC read.
- Immediate Slot: immediate slot 0, where the signed-20-bit branch / back-edge target and the loop bound land.
- Bundle Model: the per-generation bundle widths and the codec keyed by (TpuVersion, TpuSequencerType) that hosts these slots.