Loop Tiling & Unrolling
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions will differ. .text addresses are virtual; for this binary, .text VMA == file offset 0xe63c000.
Abstract
There is no single "loop optimizer" in the TPU compiler. What a reader coming from LLVM expects — one pass that tiles a loop nest and unrolls the inner loop by a cost-driven factor — is split across three unrelated mechanisms operating on three different IRs and gated by three independent flags. This page owns the parts of that machinery that are decisions about loop shape: the TileKind rule that stamps each shape's layout with a compact-or-SparseCore tile tag (TpuTilingAssignment::GetCopyTileKind, 0x13dd0ca0), the LoopConfig / LoopUnrollConfig protos that carry the per-loop-dimension unroll directives a SparseCore kernel was given (decoded field-by-field), and the HLO while-loop pipeline unroller (WhileLoopPipelineUnroller, 0x12ee2200) that turns a software-pipelineable while into a depth-stage chained-call body.
The thing called "tiling" here is not a GPU/Triton tile size. On the TPU TensorCore path it is a 16-bit TileKind tag written into every Shape::Layout — for a copy, the pair (input_kind, output_kind) with each kind in {Default=0, SparseCore=3} — that the rest of the backend reads to know whether a buffer keeps a real 2-D tile or is SparseCore-laid-out. The genuinely cost-ranked MXU operand-window tiling lives on the dot/conv → MXU lowering page; the SparseCore tile-index address expansion (ExpandTiledMemRefs) lives on Tile-Index Expansion. The LLVM-side modulo scheduler that finds the initiation interval for the loops these passes leave behind is on Bundle Modulo Scheduling. This page does not re-derive any of those; it owns the TileKind decision, the LoopConfig schema, and the unroll/pipeline transforms.
For reimplementation, the contract is:
- TileKind is a layout tag, not a tile size. TpuTilingAssignment walks the module after layout assignment and stamps each shape's Layout with a TileKind. For a kCopy it computes a packed (input,output) pair from four TransferSizeUtil layout predicates; every other opcode inherits the layout-assignment tile via HardwareLayout::GetDefaultLayout. There is no rich per-opcode tile-shape solver on this path.
- The unroll factor a SparseCore loop carries lives in the LoopConfig proto. Field 3 (unrolled_loops, repeated LoopUnrollConfig) keys each entry on a loop_dim (field 1) and carries either an explicit unroll_factor or an "auto" sentinel (a oneof). The normalizer resolves "auto" by dividing the loop bound by the vectorizing_shape (field 4) element count, with an exact-divisibility CHECK.
- The per-arch SparseCore copy unroll factor is a two-template switch. CustomLoopUnrollPolicy<5> (template constant tpu::TpuVersion 5 = 6acc60406, marketing "Ironwood") vs <3> (template constant tpu::TpuVersion 3 = viperfish, the fallback for every version that is not 6acc60406), dispatched on Target::tpu_version. The <3> template emits 16 (elementwise) / 8 (structured); <5> emits transpose {16,8}, general 32/16, or MD-vectorizing 16/pack.
- While-loop unrolling and software-pipelining are independent, additive transforms. WhileLoopUnroller (full / double-buffer / auto) and WhileLoopPipelineUnroller (loop-carry-depth pipelining) are gated by two separate env knobs and can both run on the same module. The pipeliner clones the body into depth chained call stages and decrements the trip count by depth-1.
| Item | Symbol / address |
|---|---|
| TileKind stamp (HLO) | xla::jellyfish::TpuTilingAssignment : HloPassInterface; name() returns "tiling-assignment" |
| TileKind decision | TpuTilingAssignment::GetCopyTileKind @ 0x13dd0ca0 (decompiled, byte-anchored) |
| TileKind driver | VerifyOrAssignTiling @ 0x10922a20; RunImpl @ 0x13dd10a0; Verify @ 0x13dd2900 |
| Post-fusion special tiling | TpuPostFusionTilingAssignment @ RunImpl 0x13dd85a0; name() returns "post-fusion-tiling-assignment" |
| LoopConfig proto | xla::jellyfish::LoopConfig (serializer 0x1d6eade0); LoopUnrollConfig (serializer 0x1d6f2680) |
| Unroll arithmetic | LoopConfigWrapper::GetNormalizedUnrollFactor @ 0x13d6c1c0 (decompiled) |
| Per-arch SC copy factor | GetCustomLoopUnrollPolicy @ 0x13916ec0; CustomLoopUnrollPolicy<(tpu::TpuVersion)5>::GetConfig<(HloOpcode)44> @ 0x13916fe0; <(tpu::TpuVersion)3> @ 0x139173a0 |
| While-loop unroller | xla::WhileLoopUnroller; IsLoopUnrollable @ 0x12ee8620; name() returns "while_loop_unroller" |
| Pipeline unroller | xla::WhileLoopPipelineUnroller::RunImpl @ 0x12ee2200; ComputeWhileLoopPipelineDepth @ 0x12ee0fc0; name() returns "while_loop_pipeline_unroller" |
| SC window selector | WindowUnrollFactorSelector::Select @ 0x1385c360; name() returns "window-unroll-factor-selector" |
| Pipeline host | PostOptimizationPipeline @ 0x1093fd40 (HLO loop passes); RunBackendWithBufferAssignment @ 0x13070bc0 (SC selector) |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
The three "loop" mechanisms, kept apart
The word "tiling" and the word "unrolling" each name three different things in this binary. A reimplementer must keep them separate or build the wrong solver.
| Concern | Mechanism (this page) | IR it acts on | Decision style |
|---|---|---|---|
| TileKind layout tag | TpuTilingAssignment stamps Default or SparseCore per shape | HLO Shape::Layout | correctness/layout predicate, not a cost model |
| SparseCore loop unroll factor | LoopConfig.unrolled_loops (written by WindowUnrollFactorSelector + CustomLoopUnrollPolicy); applied by MLIR LoopUnrollPass | jellyfish HLO → SC MLIR scf.for | per-arch table + scratchpad-bounded greedy fit |
| HLO while-loop unroll / pipeline | WhileLoopUnroller, WhileLoopPipelineUnroller | HLO kWhile | trip-count/body-size thresholds; loop-carry depth |
Three things this page deliberately does not own, with their homes:
- The cost-ranked MXU operand-window tiling (IterateThroughWindowConfigs, priced in MXU cycles, bounded by VMEM) — that is the only genuinely cost-driven tiling search, and it is the convolution path. See Dot / Conv → MXU Lowering.
- The SparseCore tile-index address expansion (ExpandTiledMemRefs/expandTiledIndices) — the per-tile memref index algebra. See Tile-Index Expansion.
- The bundle modulo scheduler — the LLVM-side initiation-interval (II) search that software-pipelines the LLO bundle stream the SC loop unroller leaves behind. This is a different "pipelining" from the HLO while-loop pipelining on this page. See Bundle Modulo Scheduling.
NOTE — "tiling" the TensorCore assigns is a tag, not a size. The HLO TpuTilingAssignment pass does not solve for a tile shape. The actual tile geometry (Default 128×16 / X64 / X128) is produced by layout assignment and stored in the Layout; this pass only writes the TileKind enum that says which family of tile a shape uses, so the lowering legalizer can read it back. A reimplementer who builds a tile-size search here is building the wrong pass.
TileKind: the layout stamp
Driver and mode (VerifyOrAssignTiling, 0x10922a20)
xla::jellyfish::DeepseaCompilerBase::VerifyOrAssignTiling(const Target&, HloModule*) is the single entry that decides whether — and how — TileKind tags are committed. It reads a tri-state int at TpuCompilationEnvironment + 0xDFC (3580 decimal; the flag xla_tpu_verify_or_assign_tiling_before_lowering, which has two additional suffixed variants ...lowering3 / ...lowering8 present as separate flag strings):
absl::Status DeepseaCompilerBase::VerifyOrAssignTiling(
const Target& target, HloModule* module) {
int mode = *(int*)(GetTpuCompEnv(target) + 3580); // env + 0xDFC
if (mode == 1) { // VERIFY
TpuTilingAssignment pass(target, /*ctor_bool=*/false); // ctor arg = 0
return pass.Verify(module, /*exec_threads=*/{}); // deepsea_compiler_base.cc:3053
}
if (mode == 2) { // ASSIGN
TpuTilingAssignment pass(target, /*ctor_bool=*/true); // ctor arg = 1
return pass.Run(module, /*exec_threads=*/{}).status(); // deepsea_compiler_base.cc:3056
}
return absl::OkStatus(); // mode == 0: leave tiling as-is
}
The TpuTilingAssignment(const Target&, bool) ctor (0x13dd1080) selects verify-vs-assign internally; the decompile shows mode == 1 constructs with bool = 0 and calls Verify(), while mode == 2 constructs with bool = 1 and calls Run(). So the ctor bool is not a "verify_only" flag — 1 is the assign/Run path. The manual override is one tri-state int: 0 skip, 1 verify, 2 assign.
The per-copy TileKind algorithm (GetCopyTileKind, 0x13dd0ca0)
This is the one TileKind rule recovered byte-for-byte. For a kCopy, it returns a StatusOr packing a 16-bit TileKind = (input_kind | (output_kind << 8)), each kind in {Default=0, SparseCore=3}. The input side reads operand(0)'s shape; the output side reads copy.shape(); both run the identical four-predicate chain.
StatusOr<uint16_t> GetCopyTileKind(const HloInstruction& copy, const Target& target) {
TransferSizeUtil* tu = target.transfer_size_util(); // Target + 0x3B8 (qword 119)
// `topo` below stands for the target's TpuTopology argument; its load is elided in this listing.
uint16_t in_kind, out_kind;
// --- INPUT side: operand(0) shape ---
const Shape& in = copy.operand(0)->shape();
int ms = in.layout().memory_space(); // Shape + 0x138 (byte 312)
const Shape& real_in = (ms == kVmem /*3*/ || ms == kSmem /*5*/)
? in.tuple_element_or_self() : in;
if (real_in.layout().has_minor_to_major() // Shape + 0x130 (byte 304) == 1
&& in.layout().minor_to_major().size() >= 2 // layout()[9] >= 2
&& !tu->HasLinearLayout(in)
&& !tu->HasSparseCoreLayout(topo, in)
&& !tu->HasPadless2ndMinorLayout(topo, in)
&& !tu->HasLarge2ndMinorLayout(topo, in)) {
in_kind = 0; // Default (compact 2-D tile)
} else if (tu->HasSparseCoreLayout(topo, in)) {
in_kind = 3; // SparseCore tile
} else {
return InvalidArgument(
"Input shape does not have compact or sparse core layout."); // :66
}
// --- OUTPUT side: copy.shape() (same chain) ---
const Shape& out = copy.shape();
if (/*…same four-predicate chain on out…*/) {
out_kind = 0;
} else if (tu->HasSparseCoreLayout(topo, out)) {
out_kind = 0x300; // = 3 << 8
} else {
return InvalidArgument(
"Output shape does not have compact or sparse core layout."); // :77
}
return in_kind | out_kind; // stored at *((uint16_t*)result + 4); status word = OK(1)
}
The structure is verbatim from the decompile (0x13dd0ca0): input kind v18 ∈ {0,3}, output kind v23 ∈ {0,0x300}, result v18 | v23 written at this+8. VMEM(3)/SMEM(5) operands dereference the tuple-element shape (v9 = *(v7+8)) before the has-bit check. The four TransferSizeUtil layout predicates are the gate — they decide whether a side keeps a real 2-D tile (kind 0) or is SparseCore (kind 3):
| Predicate | Address | Meaning |
|---|---|---|
| HasLinearLayout(Shape) | 0x1d6af220 | untiled/linear layout (host-transfer boundary) |
| HasSparseCoreLayout(TpuTopology, Shape) | 0x110b7440 | SparseCore tiling |
| HasPadless2ndMinorLayout(TpuTopology, Shape) | 0x1d6af3e0 | no-pad 2nd-minor |
| HasLarge2ndMinorLayout(TpuTopology, Shape) | 0x1d6af2e0 | "special" / large 2nd-minor |
GOTCHA — the copy TileKind is a pair, and a mixed pair is legal. A kCopy whose operand is compact and whose result is SparseCore-laid-out returns 0 | 0x300 = 0x0300; the all-SparseCore case is 0x0303; the all-compact case is 0x0000. A reimplementer who collapses TileKind to a single per-shape value will fail to round-trip the copies that re-tile across the compact↔SparseCore boundary.
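The per-side classification and the packed pair can be sketched as follows. This is a minimal illustration, assuming only what the decompile above shows: each side is 0 (Default) or 3 (SparseCore), the input kind sits in the low byte and the output kind in the high byte. The names `SideFacts`, `ClassifySide`, `PackCopyTileKind`, `InputKind` and `OutputKind` are hypothetical, and the booleans stand in for the has-bit, the rank test and the four TransferSizeUtil predicates.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for the per-side facts the rule consults.
struct SideFacts {
  bool has_minor_to_major;
  int rank;  // minor_to_major().size()
  bool linear, sparse_core, padless_2nd_minor, large_2nd_minor;
};

enum Kind : uint8_t { kDefault = 0, kSparseCore = 3 };

// Returns the side's kind, or nullopt for the InvalidArgument case
// ("... does not have compact or sparse core layout.").
constexpr std::optional<Kind> ClassifySide(const SideFacts& s) {
  if (s.has_minor_to_major && s.rank >= 2 && !s.linear && !s.sparse_core &&
      !s.padless_2nd_minor && !s.large_2nd_minor) {
    return kDefault;  // keeps a real compact 2-D tile
  }
  if (s.sparse_core) return kSparseCore;
  return std::nullopt;
}

// Input kind in the low byte, output kind in the high byte.
constexpr uint16_t PackCopyTileKind(Kind in, Kind out) {
  return static_cast<uint16_t>(in | (out << 8));
}
constexpr Kind InputKind(uint16_t packed) {
  return static_cast<Kind>(packed & 0xFF);
}
constexpr Kind OutputKind(uint16_t packed) {
  return static_cast<Kind>(packed >> 8);
}
```

Keeping the two halves separately addressable is the point of the sketch: a compact-to-SparseCore copy packs to 0x0300 and unpacks back to the same two kinds.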
Every other opcode inherits the layout-assignment tile
RunImpl (0x13dd10a0) does not compute an (input,output) pair for non-kCopy ops. It walks the result shape with ShapeUtil::ForEachMutableSubshape (the $_4/$_5 visitors at 0x13dd26a0 / 0x13dd27e0) and stamps HardwareLayout::GetDefaultLayout(result_shape) into each leaf's mutable_layout(). The only opcodes with special handling are:
- kCopy — the (input,output) pair above.
- kOutfeed (opcode 80) — uses mutable_outfeed_shape().
- tuple-result ops (e.g. kReduce-class) — recurse the tuple via the subshape visitor, stamping each leaf's default tile.
- async-SparseCore (async_execution_thread == "sparsecore") — reads the SparseCoreConfig from the async-chain start's backend config and stamps HardwareLayout::FromProto(config) — the T8 / SC-tiling opt-in path (IsT8CustomKernelInstruction, 0x13dd0e80).
A dtype gate guards the general stamp: element_type must be in the supported-tiled mask 0x2FFF91FFE (or a small extra set), else the instruction must be fused (CHECK instruction->IsFused(), tpu_tiling_assignment.cc:233) and is skipped.
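A bitmask gate of this kind is usually a shift-and-test. The sketch below uses the mask value quoted above; reading bit i as "element type with enum value i is supported" is an assumption of this sketch, the "small extra set" is not modeled, and `InSupportedTiledMask` is a hypothetical name.

```cpp
#include <cstdint>

// Mask value from the text. Interpreting bit i as "element type i passes the
// gate" is an assumption of this sketch, not something the decompile names.
constexpr uint64_t kSupportedTiledMask = 0x2FFF91FFEULL;

constexpr bool InSupportedTiledMask(unsigned element_type) {
  return element_type < 64 &&
         ((kSupportedTiledMask >> element_type) & 1ULL) != 0;
}
```

Under that reading, values 0, 13, 14, 15, 17 and 18 fall outside the mask, which would be consistent with 15 and 18 reappearing in the separate extra set quoted for the copy policy's dtype check.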
NOTE — there is no rich per-HLO-opcode TileKind divergence. The only "rules" are the four above; everything else preserves the layout-assignment-chosen tile via GetDefaultLayout. This is why TileKind is documented as a stamp, not a solver.
The post-fusion pass TpuPostFusionTilingAssignment (RunImpl 0x13dd85a0) runs much later, after copy insertion, and forward-propagates "special tiling" (the HasLarge2ndMinorLayout family, gated by xla_tpu_enable_large_2nd_minor_layout[_for_x{4,8,16}]) from producers that CanProduceSpecialTiling (0x13dd7760) to consumers that AcceptsSpecialTiling (0x13dd6580), subject to the entry in/out set and alias constraints. The per-opcode predicate set is named but its producer→consumer rule is not decompiled line-by-line (see Confidence table).
The LoopConfig / LoopUnrollConfig protos
A SparseCore kernel carries its loop-tiling and unroll directives in a LoopConfig proto attached per fusion. The WindowUnrollFactorSelector writes it; LoopConfigWrapper reads it back; the MLIR LoopUnrollPass applies it. Both schemas were decoded field-by-field from their _InternalSerialize wire emitters (field number = tag byte >> 3; offsets are struct byte offsets in the generated C++ message; has-bits in _has_bits_).
xla::jellyfish::LoopConfig (serializer 0x1d6eade0)
| Fld | Wire tag | Type | Struct off | Name (inferred) | Semantics |
|---|---|---|---|---|---|
| 1 | 0x0A | repeated int64 (packed) | RepeatedField +0x18, cnt int +0x1C, data ptr +0x20; has-bit &1 | loop_bounds | per-dim trip / index space |
| 2 | 0x10 | int64 | +0x58; has-bit &8 | (scalar) | aux scalar (trip/total) |
| 3 | 0x1A | repeated message | RepeatedPtrField +0x30, cnt int +0x38; has-bit &2 | unrolled_loops | repeated LoopUnrollConfig |
| 4 | 0x22 | repeated int64 (packed) | RepeatedField +0x40, cnt int +0x44, data ptr +0x48; has-bit &4 | vectorizing_shape | per-dim native vector shape |
xla::jellyfish::LoopUnrollConfig (serializer 0x1d6f2680)
| Fld | Wire tag | Type | Struct off | Name (inferred) | Semantics |
|---|---|---|---|---|---|
| 1 | 0x08 | int64 | +0x18 | loop_dim | the dim this entry keys on (join key); has-bit _has_[0]&1 |
| 2 | 0x10 | int64 (1 byte stored) | +0x30 | auto_kind | "auto/full" sentinel — oneof case 2 (shared union slot) |
| 3 | 0x18 | int64 | +0x30 | unroll_factor | explicit factor — oneof case 3 (shared union slot) |
| 4 | 0x20 | bool | +0x28 | pipeline_remainder | remainder-loop pipelining flag; has-bit _has_[0]&4 |
| 5 | 0x28 | int64 | +0x20 | (aux) | auxiliary value; has-bit _has_[0]&2 |
Fields 2 and 3 form a oneof: their payloads share the same union slot at +0x30, and the discriminator oneof_case is the dword at +0x38 (2=auto, 3=explicit). This is confirmed twice: the serializer at 0x1d6f2680 reads case *((uint32*)this+14) (byte +0x38) and writes either field-2 byte +0x30 or field-3 qword +0x30; and GetLoopUnrollFactor (0x13d6c100) reads the copied-out oneof_case from +0x38 and the payload from +0x30. loop_dim (field 1) is the join key: GetLoopUnrollConfig(dim) (0x13d6c080) linearly scans unrolled_loops for the entry whose field-1 == dim.
NOTE — the proto field names are inferred from wire-format reverse engineering, not symbol strings. The tags, types, and offsets are byte-anchored from the serializers; the human-readable names (loop_bounds, unrolled_loops, vectorizing_shape, loop_dim, unroll_factor) are the most consistent reading of the surrounding code and CHECK strings. MEDIUM confidence on the names; CONFIRMED on the layout.
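The wire-tag column in both tables follows the standard protobuf single-byte tag rule for field numbers below 16: tag = (field << 3) | wire_type. A minimal sketch for re-deriving or checking the column (the helper names are hypothetical):

```cpp
#include <cstdint>

// The two protobuf wire types that appear in these messages.
enum WireType : uint8_t { kVarint = 0, kLengthDelimited = 2 };

// Single-byte tag; valid for field numbers 1..15.
constexpr uint8_t MakeTag(uint32_t field, WireType wt) {
  return static_cast<uint8_t>((field << 3) | wt);
}
constexpr uint32_t FieldNumber(uint8_t tag) { return tag >> 3; }
constexpr WireType TagWireType(uint8_t tag) {
  return static_cast<WireType>(tag & 7);
}
```

Packed repeated scalars and embedded messages both use the length-delimited wire type, which is why LoopConfig fields 1, 3 and 4 all have a low nibble of 2 or A.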
The unroll-factor arithmetic
GetNormalizedUnrollFactor (0x13d6c1c0)
This is the algorithm that turns a LoopUnrollConfig entry into a concrete factor for a given loop bound. Decompiled (0x13d6c1c0):
StatusOr<int64_t> LoopConfigWrapper::GetNormalizedUnrollFactor(
absl::Span<const long> bounds, long dim) const {
const LoopUnrollConfig* cfg = GetLoopUnrollConfig(dim); // scan unrolled_loops by loop_dim
if (!cfg) return 1; // no entry → factor 1
int64_t raw = (cfg->oneof_case() == 2) ? (-(uint8_t)cfg->auto_kind() | 1) // auto sentinel
: (cfg->oneof_case() == 3) ? cfg->unroll_factor() // explicit
: 1;
if (raw != -1) return raw; // explicit / non-auto factor is final
// --- AUTO path (raw == -1): divide loop bound by the VECTORIZING SHAPE ---
int vs_size = loop_config_.vectorizing_shape_size(); // field 4 count, env-of-this + 0x44
if (vs_size == 0)
return InvalidArgument("Vectorizing shape missing"); // 25-char
if (vs_size >= 2)
return InvalidArgument(
"Vectorizing shape has too many dimensions: %d", vs_size); // 45-char fmt
int64_t loop_bound = bounds[dim];
int vectorizing_shape = loop_config_.vectorizing_shape(0)[8]; // the dim count
CHECK(loop_bound % vectorizing_shape == 0) // loop_config_wrapper.cc:358
<< loop_bound << " % " << vectorizing_shape;
return loop_bound / vectorizing_shape;
}
The decompile confirms this exactly: the linear scan over unrolled_loops (at a2+48, count a2+56) matching loop_dim (*v7+3), the oneof read (v36 == 2 → -(uint8)v35 | 1; == 3 → v35; else 1), the != -1 short-circuit, and the auto path reading vectorizing_shape_size (*(a2+68)), the two error strings, bounds[dim], vectorizing_shape(0)[8], and the loop_bound % vectorizing_shape == 0 CHECK at line 358 with LogMessageFatal.
GOTCHA — the "auto" divisor is vectorizing_shape, not the unroll-config count. The normalized factor for an auto-tagged loop is loop_bound / vectorizing_shape[0].dim_count, requiring exact divisibility. A reimplementer who divides by unrolled_loops_size (a plausible misreading) will produce wrong factors and miss the "loop_bound % vectorizing_shape == 0" invariant. The supporting helpers: GetLoopUnrollFactor (0x13d6c100) returns the raw factor (oneof 3 → unroll_factor; oneof 2 → -(uint8)auto_kind); GetLoopPipelineRemainder (0x13d6c4e0) returns (remainder_byte | 0x100) when set, default 0x101.
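A standalone sketch of the normalization, for readers who want to exercise the arithmetic without the proto classes. The plain structs below are hypothetical mirrors of the two messages, and the sketch takes the single vectorizing_shape entry as the divisor, which is how the worked example on this page uses it; errors are modeled with exceptions and the fatal CHECK with `assert`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical plain-struct mirror of one LoopUnrollConfig entry.
struct UnrollEntry {
  int64_t loop_dim;  // field 1: join key
  int oneof_case;    // 2 = auto sentinel, 3 = explicit factor, other = unset
  int64_t payload;   // shared slot: auto_kind byte or unroll_factor
};

struct LoopConfigSketch {
  std::vector<UnrollEntry> unrolled_loops;  // field 3
  std::vector<int64_t> vectorizing_shape;   // field 4
};

// Raw factor as read from the oneof; -1 marks "auto".
inline int64_t RawFactor(const UnrollEntry& e) {
  if (e.oneof_case == 2) {
    return -static_cast<int64_t>(static_cast<uint8_t>(e.payload)) | 1;
  }
  if (e.oneof_case == 3) return e.payload;
  return 1;
}

inline int64_t NormalizedUnrollFactor(const LoopConfigSketch& cfg,
                                      const std::vector<int64_t>& bounds,
                                      int64_t dim) {
  const UnrollEntry* entry = nullptr;
  for (const UnrollEntry& e : cfg.unrolled_loops) {  // linear scan by loop_dim
    if (e.loop_dim == dim) { entry = &e; break; }
  }
  if (entry == nullptr) return 1;  // no entry: factor 1
  const int64_t raw = RawFactor(*entry);
  if (raw != -1) return raw;       // explicit or non-auto factor is final
  if (cfg.vectorizing_shape.empty())
    throw std::invalid_argument("Vectorizing shape missing");
  if (cfg.vectorizing_shape.size() >= 2)
    throw std::invalid_argument("Vectorizing shape has too many dimensions");
  const int64_t bound = bounds.at(static_cast<size_t>(dim));
  const int64_t divisor = cfg.vectorizing_shape[0];  // assumption: see lead-in
  assert(bound % divisor == 0);    // mirrors the fatal CHECK
  return bound / divisor;
}
```

With an auto entry on a dimension of bound 128 and a vectorizing shape of [16], the function yields 8; an explicit entry returns its factor unchanged, and a dimension with no entry yields 1.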
The per-arch SparseCore copy policy
GetCustomLoopUnrollPolicy(SmallVector<long,6> bounds, HloInstruction, Target) (0x13916ec0) dispatches on Target::tpu_version (Target + 0x398):
LoopUnrollPolicy GetCustomLoopUnrollPolicy(const SmallVector<long,6>& bounds,
const HloInstruction& hlo,
const Target& target) {
int version = target.tpu_version(); // Target + 0x398 (920)
const HloInstruction* copy = lowering_util::GetCopyInstruction(hlo);
if (!copy) return {}; // empty policy
LoopUnrollPolicy p =
(version == 5) // tpu::TpuVersion 5 = 6acc60406 ("Ironwood")
? CustomLoopUnrollPolicy<5>::GetConfig<kCopy>(bounds, *copy, target)
: CustomLoopUnrollPolicy<3>::GetConfig<kCopy>(bounds, *copy, target); // every other version
// CHECK each returned unroll_dimension ∈ [0, bounds.size()) (perf_utils.cc:151/152)
return p;
}
The C++ tpu::TpuVersion enum is the proto enum minus one — kJellyfish=0, kDragonfish=1, kPufferfish=2, kViperfish=3, kGhostlite=4, k6acc60406=5 — so the dispatch literal 5 is 6acc60406 (the TPU_VERSION_* proto descriptor numbers these 1..6, confirming the −1 offset). There are exactly two arch templates: <(tpu::TpuVersion)5> (6acc60406, marketing "Ironwood") taken only when version == 5, and <(tpu::TpuVersion)3> (template constant viperfish) used as the fallback for every other version (jellyfish through ghostlite).
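The enum offset and the two-way dispatch can be written down compactly. In this sketch the enumerator names and values are taken from the paragraph above, while `FromProtoNumber` and `PolicyTemplateFor` are hypothetical helpers.

```cpp
// C++-side enum values as listed above (proto descriptor number minus one).
enum class TpuVersion : int {
  kJellyfish = 0,
  kDragonfish = 1,
  kPufferfish = 2,
  kViperfish = 3,
  kGhostlite = 4,
  k6acc60406 = 5,
};

// Proto descriptor numbers run 1..6; the C++ enum is offset by one.
constexpr TpuVersion FromProtoNumber(int proto_number) {
  return static_cast<TpuVersion>(proto_number - 1);
}

// Which CustomLoopUnrollPolicy template constant the dispatch selects.
constexpr int PolicyTemplateFor(TpuVersion v) {
  return v == TpuVersion::k6acc60406 ? 5 : 3;
}
```

Note that viperfish itself and every older or intermediate version land on the same `<3>` body; only one value takes the other branch.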
CustomLoopUnrollPolicy<(tpu::TpuVersion)3>::GetConfig<kCopy> (0x139173a0, the fallback template) — decompiled, byte-anchored:
LoopUnrollPolicy CustomLoopUnrollPolicy</*tpu::TpuVersion*/3>::GetConfig<kCopy>(
const SmallVector<long,6>& bounds, const HloInstruction& hlo, const Target&) {
CHECK(hlo.opcode() == HloOpcode::kCopy); // perf_utils.cc:43
int rank = bounds.size();
bool elementwise = lowering_util::IsElementwiseCopy(hlo);
// ONE entry on the innermost dim: dim = rank - 1, factor = 8 * elementwise + 8
return { { /*dim=*/rank - 1, /*factor=*/ (elementwise ? 16 : 8) } };
}
The decompile shows *(v5+8) = 8 * IsElementwiseCopy + 8 and *(v5) = v3 - 1 — i.e. factor 16 for an elementwise copy, 8 for a structured copy, on the innermost dim.
CustomLoopUnrollPolicy<(tpu::TpuVersion)5>::GetConfig<kCopy> (0x13916fe0, 6acc60406 / "Ironwood") — decompiled byte-for-byte:
LoopUnrollPolicy CustomLoopUnrollPolicy</*tpu::TpuVersion*/5>::GetConfig<kCopy>(
const SmallVector<long,6>& bounds, const HloInstruction& hlo, const Target& target) {
CHECK(hlo.opcode() == HloOpcode::kCopy); // perf_utils.cc:76
int rank = bounds.size(); int inner = bounds.back();
PrimitiveType dtype = hlo.shape().element_type();
// dtype acceptance mask 0x2FFF91FFE ∪ {0x20,0x21,15,18}; else FATAL (primitive_util.h:757)
// TRANSPOSE-COPY special case: sub-word packed dtype (mask 0x910) &&
// IsMinorTransposeCopy(hlo) && rank >= 2 (CHECK "rank > 1", perf_utils.cc:81) && inner < 32
if (sub_word_packed(dtype) && lowering_util::IsMinorTransposeCopy(hlo)
&& rank >= 2 && inner < 32) {
return { {rank-1, 16}, {rank-2, 8} }; // TWO entries
}
// GENERAL case
bool pred1 = TransferSizeUtil::ShouldPackPREDAsSingleBit(topo, hlo.shape());
int pack = TransferSizeUtil::ElementPackingFactor(dtype, pred1);
CHECK(target.SupportsSparseCore()); // target.h:1709
int scs_tc = target.topology()->sc_tile_count(); // topology + 148
int subl = target.SublaneCount();
LoopConfigWrapper w = LoopConfigWrapper::Create(hlo, rank, scs_tc, pack, subl);
int factor = lowering_util::IsMDVectorizingShape(pack, target, /*…*/)
? 16 / pack
: ((inner % (32 * scs_tc)) == 0 ? 32 : 16);
return { {rank-1, factor} }; // ONE entry
}
| Template | Case | Unroll factor |
|---|---|---|
| <3> (fallback: every version != 6acc60406) | elementwise copy | 16 |
| <3> | structured copy | 8 |
| <5> (6acc60406 / "Ironwood") | narrow transpose copy (inner < 32) | {16, 8} (two dims) |
| <5> | MD-vectorizing shape | 16 / pack |
| <5> | general, inner % (32·scs_tc) == 0 | 32 |
| <5> | general, otherwise | 16 |
NOTE — the template constants are not marketing chip names. CustomLoopUnrollPolicy<5> is the C++ template parameter tpu::TpuVersion 5 = 6acc60406 (the proto descriptor numbers it TPU_VERSION_6acc60406 = 6, so the C++ enum is proto−1); <3> is the viperfish constant used as the catch-all for every version the dispatch does not route to <5>. Both GetConfig<kCopy> bodies are decompiled byte-for-byte: <3> at 0x139173a0 (8 * IsElementwiseCopy + 8 → 16/8), <5> at 0x13916fe0 (transpose {16,8}; general 16 * (inner % (32·scs_tc) == 0) + 16 → 32/16; MD-vectorizing 16 / pack). CONFIRMED.
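The factor table can be checked with a self-contained sketch. All type and function names here are hypothetical; `narrow_transpose` bundles the sub-word-dtype and minor-transpose tests, `md_vectorizing` stands in for IsMDVectorizingShape, and the fallback entry is placed on the innermost dimension index (rank - 1), which is this sketch's reading of "on the innermost dim".

```cpp
#include <cstdint>

struct UnrollDim { int64_t dim; int64_t factor; };
struct Policy { int count; UnrollDim entries[2]; };  // at most two entries

// Fallback (<3>) body: one entry, 16 for elementwise copies, else 8.
constexpr Policy FallbackCopyPolicy(int64_t rank, bool elementwise) {
  return {1, {{rank - 1, elementwise ? 16 : 8}, {0, 0}}};
}

// Hypothetical bundle of the facts the <5> body consults.
struct V5Inputs {
  int64_t rank;
  int64_t inner;          // innermost loop bound
  bool narrow_transpose;  // sub-word dtype && minor-transpose copy
  bool md_vectorizing;
  int64_t pack;           // element packing factor
  int64_t scs_tc;         // SparseCore tile count
};

constexpr Policy V5CopyPolicy(const V5Inputs& in) {
  if (in.narrow_transpose && in.rank >= 2 && in.inner < 32) {
    return {2, {{in.rank - 1, 16}, {in.rank - 2, 8}}};  // two dims
  }
  const int64_t factor =
      in.md_vectorizing ? 16 / in.pack
                        : (in.inner % (32 * in.scs_tc) == 0 ? 32 : 16);
  return {1, {{in.rank - 1, factor}, {0, 0}}};
}
```

For example, a rank-3 general copy with inner bound 128 and four tiles gets factor 32, while the same copy with inner bound 96 drops to 16.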
SparseCore window unroll: scratchpad-bounded greedy
WindowUnrollFactorSelector is the HLO pass that picks the gather/scatter window unroll factor and writes the CustomLoopUnrollPolicy result into the LoopConfig.unrolled_loops proto. It is added in the SparseCore backend's own HLO sub-pipeline by RunBackendWithBufferAssignment (0x13070bc0) as AddPass<WindowUnrollFactorSelector, Target const*, long> where the long is FLAGS_xla_sc_tiles; it is one of the "two late-running annotation passes immediately before lowering" (CHECK pipeline.PassesSize() == 2, sparse_core_compiler.cc:599).
Select(instr, bool) (0x1385c360):
- Recognize gather-offload / scatter-offload custom fusions; for an offloaded op, log "But this is an offloaded op. So, we will not find an unroll factor." and bail (no factor).
- Extract the inner kGather/kScatter and classify the access pattern (IsSublane/IsElement/IsLane/IsChunkGather/Scatter).
- Read the SC scratchpad budget by mode: tile → MaxTileScratchpadSizeInBytes; SCS → MaxScsScratchpadSizeInBytes; loop-fusion → FusionEmitter::GetReservedScratchpadBytes (or GetReservedSmemBytes).
- Pick the largest candidate factor that fits — a greedy resource fit, not a roofline cost.
- For each returned CustomLoopUnrollPolicy entry, build a LoopUnrollConfig (field-1 = loop_dim, oneof-3 = unroll_factor) and Add it to the parent LoopConfig.unrolled_loops.
The per-candidate fit test, ChunkGatherWindowSizeFitsInScratchpad(target, instr, factor) (0x1385c240):
window_elems = Product(GetSliceSizesTiledPadding(instr))
bytes = window_elems * ByteSizeOfPrimitiveType(dtype)
sized = 8 * ((bytes >> 2) + 1) * factor // (floor(bytes / 4) + 1) 4-byte words × 8 sublanes × factor; an exact multiple of 4 still gains one word
return lowering_util::FitsInScratchPad(target, sized)
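The fit test and the greedy pick can be sketched as below. Two things are assumptions of this sketch rather than recovered facts: the candidate set (powers of two up to a caller-supplied cap) and modeling FitsInScratchPad as a plain comparison against a byte budget. The function names are hypothetical.

```cpp
#include <cstdint>

// Scratchpad bytes for one candidate factor, following the formula above.
constexpr int64_t SizedBytes(int64_t window_elems, int64_t bytes_per_elem,
                             int64_t factor) {
  const int64_t bytes = window_elems * bytes_per_elem;
  return 8 * ((bytes >> 2) + 1) * factor;
}

// Greedy pick: the largest power-of-two candidate that fits the budget.
// Returns 0 when even factor 1 does not fit.
constexpr int64_t LargestFittingFactor(int64_t window_elems,
                                       int64_t bytes_per_elem,
                                       int64_t budget_bytes,
                                       int64_t max_factor) {
  int64_t best = 0;
  for (int64_t f = 1; f <= max_factor; f *= 2) {
    if (SizedBytes(window_elems, bytes_per_elem, f) <= budget_bytes) best = f;
  }
  return best;
}
```

For a 1024-element window of 2-byte elements, one unit of factor costs 4104 bytes, so a 64 KiB budget admits factor 8 but not 16.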
After the proto is written, the MLIR LoopUnrollPass::runOnOperation (0x1352ca20) walks every scf.for (walk<scf::ForOp>, pre-order) and applies the factor; the VectorUnrollPass then splits wide vector ops to native lane width. These MLIR-side passes are summarized here only to close the data path; their bodies belong to the SparseCore lowering pages.
HLO while-loop unrolling
Two open-source passes run in PostOptimizationPipeline (0x1093fd40), the last HLO pipeline before the latency-hiding scheduler. They are gated independently and can both run on the same module.
WhileLoopUnroller — full / double-buffer / auto
Gated by *(TpuCompEnv + 4904) != 0 (a pointer to the xla_while_loop_unroll_count). When added (AddPass<WhileLoopUnroller, long, bool> @ 0x1096ee60), it builds a 0x30-byte object:
| Offset | Value | Meaning |
|---|---|---|
| +8 | *(env + 4904) | unroll_count |
| +16 | 0 | wrap_in_trip_count_remainder (hard-false from this site) |
| +24 | 64 | kUnrollTripCountThreshold (full-unroll trip cap) |
| +32 | 800 | kUnrollInstructionCountThreshold (body-size cap) |
| +40 | 10000 | kUnrollExpandFactorThreshold (trip × instrs expand cap) |
The four modes (DebugOptions strings): WHILE_LOOP_UNROLLING_NO_UNROLL (disabled), _DOUBLE_BUFFER (factor 2), _FULL_UNROLL (needs static trip ≤ 64), _AUTO_UNROLL (factor 2 only if the body contains a collective). Failure strings: "Cannot unroll while loop. The trip count is greater than the threshold: … Threshold: " and "Cannot unroll while loop. Too many instructions in the body: ".
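The three caps combine into a simple admission test. The sketch below is illustrative only: the inclusive comparisons follow the "≤" reading used elsewhere on this page, the check order follows the order of the failure strings, and `UnrollThresholds`, `UnrollVerdict` and `CheckFullUnroll` are hypothetical names.

```cpp
#include <cstdint>

// Defaults as stored in the pass object at +24 / +32 / +40.
struct UnrollThresholds {
  int64_t trip_count = 64;
  int64_t instruction_count = 800;
  int64_t expand_factor = 10000;
};

enum class UnrollVerdict {
  kOk,
  kTripCountTooLarge,
  kBodyTooLarge,
  kExpandTooLarge,
};

// Assumption: each cap is inclusive, and expansion is trip * instructions.
constexpr UnrollVerdict CheckFullUnroll(int64_t trip_count,
                                        int64_t body_instructions,
                                        UnrollThresholds t = UnrollThresholds{}) {
  if (trip_count > t.trip_count) return UnrollVerdict::kTripCountTooLarge;
  if (body_instructions > t.instruction_count) return UnrollVerdict::kBodyTooLarge;
  if (trip_count * body_instructions > t.expand_factor)
    return UnrollVerdict::kExpandTooLarge;
  return UnrollVerdict::kOk;
}
```

The expand cap is the binding one for mid-sized loops: a 64-trip loop passes the trip cap, yet a 200-instruction body pushes the product to 12800 and the loop is rejected.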
IsLoopUnrollable(HloInstruction*) (0x12ee8620) — the 9-step legality gate (src while_loop_unroller.cc, decompiler-confirmed at the named address):
- opcode == kWhile (0x82) — line 1222.
- single loop-carried tuple (operands().size() == 1) — line 1225.
- no control predecessors — line 1238 ("…due to control dependency: ").
- while_body and while_condition contain none of the Send/Recv family {kSend,kRecv,kSendDone,kRecvDone} — line 1252 ("…because it contains a send/recv node: ").
- operand(0) opcode == kTuple (0x81) — line 1259.
- while_condition->HasSideEffect() == false — line 1269.
- GetLoopInductionVarTupleIdx succeeds — line 1277.
- HloEvaluator::Evaluate(IV init) succeeds — line 1287.
- MatchTrivialLoopTripCount succeeds — lines 1295/1299.
On all-pass it stores {while, init_value, trip_count, iv_tuple_idx, is_unrollable=true}.
NOTE — the exact 4-opcode flat_hash_set backing arrays are not byte-decoded. IDA mislabels the rodata adjacency (unk_AE07CA8/unk_AE07CAC) as ASCII; the family is fixed as Send/Recv by the diagnostic string, not by decoding the initializer list. HIGH confidence on the family, LOW on the byte-exact opcode list.
WhileLoopPipelineUnroller — software-pipelining
Gated by EnablePipelinedLoopUnrolling(env) (0x1d6b71a0), which reads an AutoProto at TpuCompEnv + 752 (xla_tpu_enable_pipelined_loop_unrolling): "set" iff (~AutoOr<bool>::FromProtoOrDie(proto) & 0x101) == 0. In PostOptimizationPipeline it is wrapped by TpuAnnotateTraceableLoops(true) before and (false) after.
ComputeWhileLoopPipelineDepth(const HloInstruction&) (0x12ee0fc0) — the loop-carry depth = number of pipeline stages = number of iterations a value lives before consumption. It CHECKs kWhile (line 44, "while_instruction.opcode() == HloOpcode::kWhile") and that the while-body root's shape is a tuple (line 52, CHECK(while_root->shape().IsTuple()), "While Instruction has not been canonicalized to have a tuple shape"), then walks the root tuple's operands: a kGetTupleElement (0x40) reading parameter(0) at a tuple index ≠ i is a carry edge (slot rotation), recorded in a flat_hash_map<int64,int64> (the swiss-table SIMD probe is visible as the _mm_crc32_u64/vpcmpeqb inner loops). A deque-BFS over the carry-edge graph plus a binary-GCD reduction over chain lengths yields the depth. Depth < 2 ⇒ the caller skips the loop.
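The carry-edge detection step can be illustrated in isolation. The sketch models the body root as a list in which entry i is the tuple index that root operand i reads through a get-tuple-element of parameter(0), or empty if the operand is computed some other way. It stops at the edge map; the BFS and the reduction that turn chain lengths into a depth are not reproduced here, and `CarryEdges` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// root_sources[i]: tuple index read by root operand i via
// get-tuple-element(parameter(0)), or nullopt for any other producer.
inline std::map<int64_t, int64_t> CarryEdges(
    const std::vector<std::optional<int64_t>>& root_sources) {
  std::map<int64_t, int64_t> edges;  // output slot -> input slot it rotates from
  for (size_t i = 0; i < root_sources.size(); ++i) {
    const std::optional<int64_t>& src = root_sources[i];
    if (!src.has_value()) continue;  // computed value: not a pure carry
    assert(*src >= 0);               // tuple indices are non-negative
    if (*src != static_cast<int64_t>(i)) {
      edges[static_cast<int64_t>(i)] = *src;  // slot rotation: a carry edge
    }
  }
  return edges;
}
```

A slot that forwards its own index is a loop-invariant pass-through and contributes no edge, so only genuine rotations feed the depth computation.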
RunImpl (0x12ee2200) — the transform. For each loop with depth >= 2, it clones the body into depth chained call stages and decrements the trip count by depth-1:
StatusOr<bool> WhileLoopPipelineUnroller::RunImpl(HloModule* module, threads) {
for (HloInstruction* loop : while_loops_with_depth_ge_2) {
int64_t depth = ComputeWhileLoopPipelineDepth(*loop);
VLOG(1) << "Unrolling: " << loop->name() << " unroll_factor: " << depth; // :129
// New outer body "%s.unrolled_%dx": chain `depth` clones of the body as calls.
HloComputation::Builder b(Format("%s.unrolled_%dx", body->name(), depth));
HloInstruction* cur = b.AddInstruction(Parameter(0, loop->shape(), "input_tuple"));
HloComputation* outer = module->AddEmbeddedComputation(b.Build());
for (int64_t i = 0; i < depth; ++i) {
HloComputation* stage = module->AddEmbeddedComputation(
body->Clone(Format("unrolled_%dx_step_%d", depth, i)));
cur = outer->AddInstruction(Call(loop->shape(), {cur}, stage));
}
outer->set_root_instruction(cur);
HloComputation* new_cond = module->AddEmbeddedComputation(
cond->Clone(Format("unrolled_%dx", depth)));
HloInstruction* nw = loop->parent()->AddInstruction(
While(loop->shape(), new_cond, outer, loop->mutable_operand(0)));
// Lift depth-1 iterations into the implicit prologue/epilogue.
Status s = WhileUtil::IncrementWhileLoopTripCount(*nw, /*increment=*/ 1 - depth); // :176
nw->set_while_body(outer);
if (s.ok()) RETURN_IF_ERROR(loop->ReplaceOperandWith(0, nw));
else VLOG(1) << "Failed to unroll: " << loop->name(); // :178
}
RETURN_IF_ERROR(FlattenCallGraph().Run(module, threads)); // :188 — inline the stages
RETURN_IF_ERROR(/*follow-on pass*/.Run(module, threads)); // :190
return changed;
}
The pipelining semantics: the body B becomes depth chained call stages B0→B1→…→B_{depth-1} inside one outer body. Because each stage consumes the previous stage's output tuple, the rotated loop-carry values are produced in stage k and consumed in stage k+1 within one outer iteration — depth original iterations are in flight simultaneously. The trip count drops by depth-1 (IncrementWhileLoopTripCount(1 - depth), helper 0x1e3ae7c0): the first depth-1 fills are the implicit prologue and the last depth-1 are the implicit drain, folded into the chained-call structure. FlattenCallGraph then inlines the stages, so the residual counted loop body of depth × original size is what the LLVM modulo scheduler (Bundle Modulo Scheduling) sees and overlaps across iterations.
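The bookkeeping of the transform reduces to two small formulas, sketched here with hypothetical helper names: the trip count is incremented by 1 - depth, and the flattened body holds depth copies of the original.

```cpp
#include <cstdint>

// Trip count after pipelining: the increment applied is (1 - depth).
constexpr int64_t PipelinedTripCount(int64_t trip_count, int64_t depth) {
  return trip_count + (1 - depth);
}

// Instruction count of the outer body once the chained stages are flattened,
// ignoring the parameter and call glue.
constexpr int64_t UnrolledBodySize(int64_t original_instructions,
                                   int64_t depth) {
  return depth * original_instructions;
}
```

A depth of 1 leaves both quantities unchanged, which matches the caller skipping loops whose depth is below 2.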
GOTCHA — unroll and pipeline are not mutually exclusive. There is no single "unroll-or-pipeline" switch. Small static-trip loops get full/double/auto unrolled (knob A: env+4904); loops with a genuine loop-carry depth ≥ 2 additionally get software-pipelined (knob B: EnablePipelinedLoopUnrolling). A loop can be candidate-checked by both passes; once unrolled into straight-line code it no longer matches the pipeliner's kWhile check, so the order (unroll → pipeline) is what makes them compose rather than conflict.
Cost-model interaction
On the TPU path, the loop transforms on this page are constrained, not cost-ranked:
- SparseCore window unroll — greedy: the largest factor that fits the scratchpad (FitsInScratchPad). A resource fit, not a roofline.
- HLO TileKind — a correctness/layout decision driven by TransferSizeUtil predicates, not a cost model.
- While-loop unrolling — gated by trip-count (≤ 64) / body-size (≤ 800) / expand (≤ 10000) thresholds.
- Pipeline depth — a structural property (loop-carry distance), computed exactly.
The one genuinely cost-driven tiling search is the convolution MXU window tiling (MXU cycles + VMEM fit) — and it is not on this page; see Dot / Conv → MXU Lowering.
Worked example: a SparseCore gather loop on 6acc60406 ("Ironwood")
Given a SparseCore custom-fusion that gathers a window into VMEM inside while (i < 512), on a 6acc60406 target (tpu::TpuVersion 5, the only version that takes the <5> copy policy), window slice sizes [1, 8, 128] (BF16, 2 B), loop-carry rotation depth 3:
- TileKind (`TpuTilingAssignment`, post-fusion): each VMEM buffer's layout already carries its compact tile; the fusion's `kCopy` outputs get `GetCopyTileKind` → `0x0000` (compact in, compact out), or `0x0303` if SparseCore-laid-out.
- Window unroll (`WindowUnrollFactorSelector`): `window_elems = 1·8·128 = 1024`; `bytes = 2048`; per-factor size `= 8·((2048>>2)+1)·f = 4104·f`. Pick the largest `f` with `4104·f ≤ S` (e.g. `S = 64 KiB` → `f ≤ 15` → `f = 8`). The `CustomLoopUnrollPolicy<5>` copy factor for BF16 (`pack=1`): MD-vectorizing → `16/1 = 16`; else `inner=128`, `32·scs_tc` (say `scs_tc=4` → 128) → `128 % 128 == 0` → 32. The selector clamps the copy unroll to what the scratchpad allows.
- `LoopConfig` written: `unrolled_loops += { loop_dim = inner, unroll_factor = f }`; `vectorizing_shape = [16]`. `GetNormalizedUnrollFactor`: explicit → `f`; auto (`-1`) → `bounds[inner] / vectorizing_shape = 128 / 16 = 8`.
- MLIR `LoopUnrollPass` unrolls the `scf.for` over the window by the factor; remainder = `GetLoopPipelineRemainder`.
- HLO pipelining (if `xla_tpu_enable_pipelined_loop_unrolling`): `ComputeWhileLoopPipelineDepth = 3` ⇒ pipeline. Body cloned 3× as `unrolled_3x_step_{0,1,2}`, chained as calls; new trip `= 512 − 2 = 510`; `FlattenCallGraph` inlines the stages. The 510-iteration counted loop feeds the hardware loop counter; the modulo scheduler overlaps the 3 stages.
Result: window tile 1×8×128 (BF16), copy unroll 32 (or scratchpad-clamped 8), scf.for unrolled ×8, while-loop software-pipelined depth 3 (trip 510), inner loop modulo-scheduled.
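The window-unroll arithmetic in this example can be checked with a short script. This is a sketch under stated assumptions and not the selector's code: the function names are hypothetical, and the candidate set (powers of two up to 32) is an assumption chosen to reproduce the `f ≤ 15` → `f = 8` step. Only the fit formula `8·((bytes>>2)+1)·f` and the auto-factor rule with its divisibility check come from this page.

```python
from typing import Optional, Sequence


def scratchpad_bytes_needed(window_bytes: int, factor: int) -> int:
    """Fit-test size from this page: 8 * ((bytes >> 2) + 1) * f."""
    return 8 * ((window_bytes >> 2) + 1) * factor


def largest_fitting_factor(
    window_bytes: int,
    scratchpad_bytes: int,
    candidates: Sequence[int] = (1, 2, 4, 8, 16, 32),  # assumed candidate set
) -> Optional[int]:
    """Greedy choice: the largest candidate whose footprint fits the scratchpad."""
    fitting = [
        f for f in candidates
        if scratchpad_bytes_needed(window_bytes, f) <= scratchpad_bytes
    ]
    return max(fitting) if fitting else None


def normalized_unroll_factor(unroll_factor: int, loop_bound: int, vector_width: int) -> int:
    """Explicit factors pass through; -1 (auto) becomes loop_bound / vector_width."""
    if unroll_factor != -1:
        return unroll_factor
    if loop_bound % vector_width != 0:
        # Stands in for the divisibility CHECK noted in the Confidence Summary.
        raise ValueError("loop bound must be divisible by the vectorizing shape")
    return loop_bound // vector_width


window_bytes = 1 * 8 * 128 * 2                                 # [1, 8, 128] BF16 -> 2048
per_factor = scratchpad_bytes_needed(window_bytes, 1)          # 4104
chosen = largest_fitting_factor(window_bytes, 64 * 1024)       # 8
auto_factor = normalized_unroll_factor(-1, 128, 16)            # 8
```

Factor 16 would need 65664 bytes, just over the 65536-byte scratchpad assumed here, which is why the example lands on 8.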
Manual override flags
| Flag | Effect |
|---|---|
| `xla_tpu_verify_or_assign_tiling_before_lowering` | tri-state: 0=skip, 1=verify, 2=assign (env `+0xDFC`); `...3`/`...8` per-arch variants |
| `xla_tpu_enable_untiled_layout` / `xla_tpu_untiled_layout_for_1` | allow linear (untiled) layout (`TpuTilingRewriter`) |
| `xla_tpu_experimental_enable_small_minor_tiling` | enable small-minor special tiling |
| `xla_tpu_enable_large_2nd_minor_layout[_for_x{4,8,16}]` | special / large-2nd-minor tiling (post-fusion) |
| `xla_sc_tiles` | SparseCore tile count (the `WindowUnrollFactorSelector` `long` arg) |
| `xla_while_loop_unroll_count` | `WhileLoopUnroller` unroll factor / trip bound (env `+4904` pointer; gate A) |
| `xla_tpu_enable_pipelined_loop_unrolling` | enable `WhileLoopPipelineUnroller` (env `+752` AutoProto; gate B) |
| `xla_sc_disable_remainder_loop_pipelining` / `xla_sc_max_pipelining_stages` | SC remainder-loop pipelining disable / stage cap |
| `xla_tpu_scatter_partial_unroll_factor` / `xla_tpu_unroll_strided_remote_dma` | scatter / strided-DMA loop unroll |
NOTE: the `xla_gpu_*`/`xla_cpu_*` tiling/unroll flags in this binary are not on the TPU path. `SymbolicTileAnalysis`, `TiledHloSchedule`, `xla_cpu_matmul_tiling_*`, `xla_gpu_max_kernel_unroll_factor`, etc. ship in `libtpu.so` but are driven only by the `xla::cpu::`/`xla::gpu::` emitters. A reimplementer should ignore them for TPU codegen.
Confidence Summary
| Claim | Evidence |
|---|---|
| `TileKind` is a `(input,output)` 16-bit pair packed by `GetCopyTileKind` | decompiled 0x13dd0ca0: `v18\|v23`, error lines 66/77, mem-space byte 312, has-bit byte 304 |
| Four `TransferSizeUtil` predicates gate compact-vs-SparseCore | called in `GetCopyTileKind`; addrs 0x1d6af220/110b7440/1d6af3e0/1d6af2e0 |
| Non-`kCopy` ops inherit layout via `GetDefaultLayout`; only `kOutfeed`/tuple/async-SC are special | `RunImpl` 0x13dd10a0, subshape visitors 0x13dd26a0/27e0, dtype CHECK `:233` |
| `VerifyOrAssignTiling` tri-state at env `+0xDFC` (3580); mode 1→ctor(0)+Verify, mode 2→ctor(1)+Run | decompiled 0x10922a20; `deepsea_compiler_base.cc:3053/3056` |
| `LoopConfig`/`LoopUnrollConfig` field layout (tags, offsets, oneof) | wire serializers 0x1d6eade0/0x1d6f2680 |
| Auto unroll factor = `loop_bound / vectorizing_shape` with divisibility CHECK | decompiled 0x13d6c1c0; CHECK `loop_config_wrapper.cc:358` |
| `<3>` (viperfish, fallback) SC copy factor 16 (elementwise) / 8 (structured) | decompiled 0x139173a0: `8*elementwise + 8` |
| `<5>` (6acc60406) SC copy factors (transpose {16,8}; general 32/16; MD 16/pack) | decompiled 0x13916fe0 byte-for-byte |
| `tpu::TpuVersion` C++ enum = proto−1; dispatch `version==5` ⇒ 6acc60406 | `GetCustomLoopUnrollPolicy` 0x13916ec0; `TPU_VERSION_*` proto descriptor numbers 1..6 |
| Window selector picks largest scratchpad-fitting factor; writes `LoopConfig` | `Select` 0x1385c360, fit test 0x1385c240 (`8·((bytes>>2)+1)·f`) |
| While-loop unroll thresholds 64/800/10000; gate at env `+4904` | `AddPass` 0x1096ee60 object fields; `IsLoopUnrollable` 0x12ee8620 |
| `IsLoopUnrollable` 9-step gate; Send/Recv family forbidden | 0x12ee8620, src lines 1222–1299 |
| Pipeline depth = loop-carry rotation distance; depth ≥ 2 to pipeline | `ComputeWhileLoopPipelineDepth` 0x12ee0fc0 |
| Pipeliner clones body into `depth` chained calls; trip −= depth−1 | `RunImpl` 0x12ee2200; `IncrementWhileLoopTripCount(1-depth)` 0x1e3ae7c0 |
| Unroll and pipeline gated independently, can both run | two gates in `PostOptimizationPipeline` 0x1093fd40 (`env+4904`, `EnablePipelinedLoopUnrolling` 0x1d6b71a0) |
| `TpuPostFusionTilingAssignment` propagates special tiling | `RunImpl` 0x13dd85a0, `AcceptsSpecialTiling` 0x13dd6580, `CanProduceSpecialTiling` 0x13dd7760 |
Cross-References
- The TPU Compiler: Part V orientation; where these loop passes sit in `PostOptimizationPipeline` (Phase 1) relative to layout assignment, fusion, and the scheduler.
- Fusion Patterns: the fusion that runs before `TpuPostFusionTilingAssignment` re-tiles its outputs; the copies the post-fusion stamp re-tiles are introduced here and by copy insertion.
- Dot / Conv → MXU Lowering: the cost-ranked MXU operand-window tiling (`IterateThroughWindowConfigs`), the second meaning of "tiling" this page is careful not to claim.
- tpu → LLO Lowering: the legalizer that reads the `TileKind` tag this page stamps and the loop the unroller leaves behind.
- Tile-Index Expansion: the SparseCore per-tile memref index algebra (`ExpandTiledMemRefs`), the address side of SC tiling.
- Bundle Modulo Scheduling: the LLVM-side initiation-interval search that software-pipelines the LLO bundle stream of the loops left after HLO unroll/pipeline; the third meaning of "pipelining".
- Binary: `extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so` (build-id `89edbbe81c5b328a958fe628a9f2207d`)
- Index entry: Part V, Compiler: Lowering & Optimization Passes / Fusion, dot/conv, tiling