SC Backend Pipeline

Addresses, symbols, and pass ordering apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d); the binary is shipped with full C++ symbols (not stripped). Other versions differ.

Abstract

When XLA lowers a SparseCore offload computation, the dense tpu-dialect IR is run through a fixed back-end pass pipeline before it becomes encodable SC bundles. That pipeline is built and driven by CustomKernelEmitter::RunPasses (0x13202780). This page documents the pipeline as a contract: the exact ordered list of twelve passes, the per-pass MLIR factory entry point and namespace, how each is attached (top-level addPass vs func-nested nest), the device-id remap pass that runs second-to-last, and the MEGACORE barrier — the even/odd core-pair cross-core synchronization the pipeline's final lowering pass emits whenever the part is a two-core SparseCore megachip.

The single most important structural fact is that RunPasses does not build one PassManager holding twelve passes. It builds and runs twelve separate single-pass PassManagers in sequence, each carrying exactly one pass, each with its own IR-print gating closure, and each driven independently through jellyfish::RunPass. This matters for reimplementation: there is no shared pass-manager analysis cache across the twelve stages, IR dumps are gated per-stage, and the order is hard-wired in the call sequence, not assembled from a registry. The LLVM reader should hold the analogy loosely — this is a hand-rolled equivalent of running twelve mlir::PassManager::run calls back-to-back on the same ModuleOp, not a single populated pipeline.

The second structural fact is that the MEGACORE barrier is not a barrier kind. EmitScsBarrier (the SC-side barrier dispatcher reached from the last pass) only dispatches GLOBAL and CUSTOM; the MEGACORE enum value is illegal there (it RetChecks). The megacore even/odd partner sync is an inner code path that both the GLOBAL and the CUSTOM lowerings run whenever Target::LogicalDevicesPerChip(SparseCore) == 2. "Megacore" names the hardware topology (a two-core megachip), not a barrier the producer selects.

For reimplementation, the pipeline contract is:

  • Twelve single-pass managers, fixed order. ConvertIntegerMemrefs → InferMemRefLayout → Canonicalizer → CanonicalizeOperations → CanonicalizeMemorySpace → TilingPropagation → mosaic_sc::InferVectorLayout → mosaic_sc::InsertRelayout → mosaic_sc::ApplyVectorLayout → CSE → LogicalToPhysicalDeviceId → LowerToMlo. The order is a vector-layout-then-lower sequence; reordering the layout trio (7/8/9) or moving LogicalToPhysicalDeviceId after LowerToMlo breaks the build.
  • Passes 7/8/9 are mlir::mosaic_sc, not mlir::tpu. The SC vector-layout trio (InferVectorLayout, InsertRelayout, ApplyVectorLayout) lives in the mosaic_sc namespace. Pass 3 is the generic mlir::createCanonicalizerPass; passes 4/5 are the tpu-specific canonicalizers.
  • The device-id remap must run immediately before LowerToMlo. LogicalToPhysicalDeviceIdPass (pass 11) rewrites the device_id operand of tpu::EnqueueDMAOp / tpu::SemaphoreSignalOp from a logical id to a physical core id (i32(CoreIndex * stride + TileId) for SC). It must run before the barrier-lowering pass so the sync ops LowerToMlo emits target the correct physical cores.
  • MEGACORE is a topology gate, not a barrier type. The even/odd partner sync (partner = local core id XOR 1) is gated by LogicalDevicesPerChip(SC) == 2 inside both the GLOBAL and CUSTOM barrier arms. On a single-core part (LDPC(SC) == 1) the partner path is suppressed (it RetChecks) and only the local sync remains.

| Field | Value |
|---|---|
| Pipeline builder | xla::tpu::sparse_core::CustomKernelEmitter::RunPasses @ 0x13202780 |
| Per-pass driver | jellyfish::RunPass @ 0x14514d60; mlir::PassManager ctor @ 0x1cb700a0; attach via OpPassManager::addPass @ 0x1cb6c000 / ::nest @ 0x1cb6d3e0 |
| Pass count / shape | 12 passes, each in its own single-pass PassManager (not one 12-pass pipeline) |
| Last pass | createLowerToMloPass @ 0x1322adc0; SC → MLO lowering; collective op → EmitScsBarrier |
| Device-id remap | pass 11 createLogicalToPhysicalDeviceIdPass @ 0x132d5720; body runOnOperation @ 0x132d5ec0; GetPhysicalCoreID @ 0x132d8de0 |
| SC barrier dispatch | CollectiveEmitterBase::EmitScsBarrier @ 0x13352500; dispatches GLOBAL(1) / CUSTOM(3) only |
| Megacore gate | Target::LogicalDevicesPerChip(SparseCore) == 2; partner = OffloadFactory::ToPartnerGlobalCoreId @ 0x133e66e0 (core XOR 1) |

The Twelve-Pass Pipeline

RunPasses(ModuleOp module, bool, bool, BarrierConfig const* bc) saves the BarrierConfig* (it will be threaded into the last pass), then runs twelve passes in order. Each pass goes through the same four-step idiom:

  1. construct a fresh mlir::PassManager (ctor 0x1cb700a0) with a name and a Nesting;
  2. install the IR-print "should-print" closure (the per-stage VLOG/dump gate);
  3. attach the one pass via OpPassManager::addPass (top-level, operates on the ModuleOp) or OpPassManager::nest<func::FuncOp>(...).addPass (func-nested, operates per-function);
  4. run it via jellyfish::RunPass(pm, module, ...).

Between passes RunPasses reads the HLO BackendConfig and the SparseCoreConfig defaults plus the per-element layout flags (ShouldEnableLarge2ndMinorLayoutForX16/X8/X4) to fill in pass options — most visibly the tiling span handed to InferMemRefLayout and the InferVectorLayoutPassOptions / ApplyVectorLayoutPassOptions.

tpu-dialect SparseCore module (post region→sequencer outlining)
   │
   ▼  RunPasses @0x13202780  — TWELVE single-pass PassManagers, in order:
   │
   1  createConvertIntegerMemrefsPass        addPass   int-memref legalization
   2  createInferMemRefLayoutPass            nest      tile/layout inference (X16/X8/X4 flags) (per-func)
   3  createCanonicalizerPass                addPass   generic greedy canonicalization
   4  createCanonicalizeOperationsPass       nest      tpu op canonicalization      (per-func)
   5  createCanonicalizeMemorySpacePass      nest      memory-space canonicalization (per-func)
   6  createTilingPropagationPass            nest      propagate {N,2} tiling        (per-func)
   7  mosaic_sc::createInferVectorLayoutPass nest      SC vector layout inference    (per-func)
   8  mosaic_sc::createInsertRelayoutPass    nest      insert layout-conversion ops  (per-func)
   9  mosaic_sc::createApplyVectorLayoutPass nest      apply chosen vector layouts   (per-func)
   10 createCSEPass                          addPass   common-subexpression elim.
   11 createLogicalToPhysicalDeviceIdPass    nest      device_id → physical core id  (per-func)  ── §Device-Id Remap
   12 createLowerToMloPass                   addPass   SC → MLO lowering (LAST)                  ── reaches EmitScsBarrier
   │
   ▼
MLO IR  ─→  SC bundle emission (per-engine SCS/TAC/TEC codecs)

Per-pass table

The twelve create*Pass calls appear in exactly this order; passes 7/8/9 are mosaic_sc-qualified; the addPass/nest split matches the call sites one-for-one.

| # | Factory (VMA) | Namespace | Attach | mlir::Pass subclass / role |
|---|---|---|---|---|
| 1 | createConvertIntegerMemrefsPass @ 0x132bca60 | mlir::tpu | addPass | ConvertIntegerMemrefsPass: integer-memref legalization |
| 2 | createInferMemRefLayoutPass @ 0x132c0f00 | mlir::tpu | nest | InferMemRefLayoutPass: tile/layout inference; opts = (int, Span<long const> tiling, TpuTilingFlags) built from the X16/X8/X4 layout flags |
| 3 | createCanonicalizerPass @ 0x1c941920 | mlir (generic) | addPass | Canonicalizer: generic greedy canonicalization |
| 4 | createCanonicalizeOperationsPass @ 0x132bb260 | mlir::tpu | nest | CanonicalizeOperationsPass: tpu op canonicalization |
| 5 | createCanonicalizeMemorySpacePass @ 0x132a2280 | mlir::tpu | nest | CanonicalizeMemorySpacePass: memory-space canonicalization |
| 6 | createTilingPropagationPass @ 0x132e0900 | mlir::tpu | nest | TilingPropagationPass: propagate {N,2} tiling; opts = (array<long,2>, bool) |
| 7 | createInferVectorLayoutPass @ 0x132ecf60 | mlir::mosaic_sc | nest | InferVectorLayoutPass: SC vector-layout inference; opts = InferVectorLayoutPassOptions |
| 8 | createInsertRelayoutPass @ 0x132eff80 | mlir::mosaic_sc | nest | InsertRelayoutPass: insert layout-conversion ops |
| 9 | createApplyVectorLayoutPass @ 0x132e57e0 | mlir::mosaic_sc | nest | ApplyVectorLayoutPass: apply chosen layouts; opts = ApplyVectorLayoutPassOptions |
| 10 | createCSEPass @ 0x1c93e180 | mlir (generic) | addPass | CSE: common-subexpression elimination |
| 11 | createLogicalToPhysicalDeviceIdPass @ 0x132d5720 | mlir::tpu | nest | LogicalToPhysicalDeviceIdPass: remap DMA/semaphore device_id → physical core id (see §Device-Id Remap) |
| 12 | createLowerToMloPass @ 0x1322adc0 (LAST) | mlir::tpu | addPass | LowerToMloPass: SC → MLO lowering; the collective op lowers into EmitScsBarrier |
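
The ordering contract in the table can be written down as data and checked mechanically. The sketch below is only an illustration in Python: the Stage record, the PIPELINE list, and contract_violations are hypothetical names introduced here, not symbols from libtpu. The factory names, namespaces, and attach modes are copied from the table above.

```python
from typing import NamedTuple

class Stage(NamedTuple):
    index: int
    factory: str    # factory name as listed in the per-pass table
    namespace: str  # C++ namespace of the factory
    attach: str     # "addPass" (module level) or "nest" (per func::FuncOp)

PIPELINE = [
    Stage(1, "createConvertIntegerMemrefsPass", "mlir::tpu", "addPass"),
    Stage(2, "createInferMemRefLayoutPass", "mlir::tpu", "nest"),
    Stage(3, "createCanonicalizerPass", "mlir", "addPass"),
    Stage(4, "createCanonicalizeOperationsPass", "mlir::tpu", "nest"),
    Stage(5, "createCanonicalizeMemorySpacePass", "mlir::tpu", "nest"),
    Stage(6, "createTilingPropagationPass", "mlir::tpu", "nest"),
    Stage(7, "createInferVectorLayoutPass", "mlir::mosaic_sc", "nest"),
    Stage(8, "createInsertRelayoutPass", "mlir::mosaic_sc", "nest"),
    Stage(9, "createApplyVectorLayoutPass", "mlir::mosaic_sc", "nest"),
    Stage(10, "createCSEPass", "mlir", "addPass"),
    Stage(11, "createLogicalToPhysicalDeviceIdPass", "mlir::tpu", "nest"),
    Stage(12, "createLowerToMloPass", "mlir::tpu", "addPass"),
]

LAYOUT_TRIO = [
    "createInferVectorLayoutPass",
    "createInsertRelayoutPass",
    "createApplyVectorLayoutPass",
]

def contract_violations(stages):
    """Return human-readable violations of the ordering contract (empty if none)."""
    names = [s.factory for s in stages]
    problems = []
    if len(stages) != 12:
        problems.append("expected twelve stages")
    # The layout trio must appear as three consecutive stages in Infer/Insert/Apply order.
    positions = [names.index(n) if n in names else -1 for n in LAYOUT_TRIO]
    if -1 in positions or positions != list(range(positions[0], positions[0] + 3)):
        problems.append("layout trio must run Infer, InsertRelayout, Apply back to back")
    # The device-id remap must be the stage immediately before the final lowering.
    if names[-2:] != ["createLogicalToPhysicalDeviceIdPass", "createLowerToMloPass"]:
        problems.append("device-id remap must immediately precede LowerToMlo")
    return problems

print(contract_violations(PIPELINE))
```

A reimplementation could run a check of this kind over its own stage list as a cheap guard against accidental reordering.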

The per-pass construction idiom

Every one of the twelve stages is built with the same fixed sequence of calls — the body of RunPasses is, in effect, twelve copies of this template differing only in the factory and the attach mode:

// for each pass p in the twelve:
pm = mlir::PassManager(ctx, name, Nesting)        // 0x1cb700a0 — a fresh single-pass manager
pm.enableIRPrinting( $_0 should-print closure )   // 0x132048e0 — the per-stage IR-dump gate
op = pm.<addPass | nest<func::FuncOp>().addPass>( create<P>Pass(opts) )   // 0x1cb6c000 / 0x1cb6d3e0
jellyfish::RunPass(pm, module, dump_prefix, ...)  // 0x14514d60 — run this one pass over the module
// pm destructed before the next stage
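
The practical consequence of the template, a fresh manager per stage with nothing carried over, can be modelled in a few lines of Python. This is a toy under stated assumptions: ToyPassManager, run_stages, and make_pass are hypothetical stand-ins and do not mirror the MLIR API, and taking the dump before the pass runs is an arbitrary choice made for the illustration.

```python
class ToyPassManager:
    """Hypothetical stand-in for a single-pass manager; not the MLIR API."""

    def __init__(self, name):
        self.name = name
        self.passes = []
        self.analysis_cache = {}  # lives only as long as this manager

    def add_pass(self, fn):
        self.passes.append(fn)

    def run(self, module):
        for fn in self.passes:
            fn(module, self.analysis_cache)


def run_stages(module, stages, should_print=lambda name: False):
    """Run each (name, pass) pair in its own manager; return gated IR dumps."""
    dumps = []
    for name, fn in stages:
        pm = ToyPassManager(name)      # fresh manager: empty analysis cache
        pm.add_pass(fn)                # exactly one pass per manager
        if should_print(name):         # per-stage dump gate
            dumps.append((name, list(module)))
        pm.run(module)
        # pm is dropped here, so its cache never reaches the next stage
    return dumps


def make_pass(tag, seen):
    def run(module, cache):
        seen.append(len(cache))  # how much analysis state was carried in
        cache[tag] = True        # pretend to compute an analysis
        module.append(tag)       # pretend to transform the module
    return run


seen = []
stages = [(f"stage{i}", make_pass(f"stage{i}", seen)) for i in range(1, 13)]
module = []
dumps = run_stages(module, stages, should_print=lambda n: n == "stage3")
print(seen)   # every stage starts with an empty cache
print(dumps)  # only the gated stage produced a dump
```

With a single shared manager the cache would grow from stage to stage; in this model every stage observes an empty cache.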

Several of the twelve carry a non-trivial options struct or argument list, and RunPasses materializes those between passes by reading the surrounding configuration:

  • Pass 2 (InferMemRefLayout) — the Span<long const> tiling and TpuTilingFlags are built from Target::ShouldEnableLarge2ndMinorLayoutForX16 / X8 / X4 (0x1d6b6920 / 0x1d6b6860 / 0x1d6b6840); the layout flags differ per generation and per element width, so this option block is the per-gen seam in the pipeline.
  • Passes 7 and 9 (InferVectorLayout, ApplyVectorLayout): InferVectorLayoutPassOptions / ApplyVectorLayoutPassOptions are filled from the SparseCoreConfig defaults (SparseCoreConfig_globals_ @ 0x223a99c8) and the HLO BackendConfig (0xf58e6c0).
  • Pass 11 (LogicalToPhysicalDeviceId) — its optional<DeviceAssignment> argument is sourced from HloModuleConfig::static_device_assignment() (0x10fb7f60); the ChipTopology and the trailing bool come from the emitter's target. See §Device-Id Remap.

NOTE — twelve managers, not one. RunPasses constructs a new mlir::PassManager for each pass and tears it down before the next. There is no carried analysis state and no fused nesting: each nest-attached pass is wrapped in its own manager whose top level is the ModuleOp and whose single nested pass runs over every func::FuncOp. A reimplementation that builds one PassManager and addPasses all twelve will behave equivalently for correctness but will not reproduce the per-stage IR-dump gating or the per-stage RunPass timing the binary emits, and folds analysis caches the binary deliberately discards.

GOTCHA: the layout trio is order-sensitive. InferVectorLayout (7) annotates the chosen vector layouts, InsertRelayout (8) materializes conversion ops where neighbors disagree, and ApplyVectorLayout (9) rewrites ops to their chosen layout. Running ApplyVectorLayout before InsertRelayout leaves layout mismatches with no conversion op to bridge them; running either before InferVectorLayout rewrites against an unset layout. The three are a strict three-phase sequence, all mosaic_sc, all func-nested.


Device-Id Remap (Pass 11)

Pass 11, LogicalToPhysicalDeviceIdPass, is the reason the pipeline can emit cross-core barriers that hit the right hardware. It runs immediately before the final LowerToMlo lowering and rewrites the device_id operand on the DMA/semaphore ops from a logical device index to a physical SparseCore core id. The pass name strings are read from .rodata: kPassName = "LogicalToPhysicalDeviceIdPass", kArgumentName = "logical-to-physical-device-id".

Construction

The factory (0x132d5720, signature (optional<DeviceAssignment>, ChipTopology, bool)) heap-allocates the pass, memcpys the optional<DeviceAssignment> (setting its present-bit) and the ChipTopology into the pass object, and zero-inits the per-CoreType topology maps. In RunPasses the DeviceAssignment source is HloModuleConfig::static_device_assignment() (0x10fb7f60), which feeds the pass-11 construction.

runOnOperation (0x132d5ec0)

The body reads Target::CoresPerChip(TpuCoreType), Target::LogicalDevicesPerChip(TpuCoreType), and TpuChipConfig::Megachip() to build a FlatHashMap<CoreType, CoreTopology> from the copied ChipTopology + DeviceAssignment, then walks each func::FuncOp (mlir::detail::walk, primary callback 0x132d71c0). The callback matches ops carrying a (target_core_type, device_id) operand pair and rewrites the device_id:

for each matched op in func:
    phys = GetPhysicalCoreID(builder, loc, op.getTargetCoreType())   // §below
    op.getDeviceIdMutable().assign(phys)                             // MutableOperandRange::assign

The decompiled callback (0x132d71c0) confirms it handles tpu::EnqueueDMAOp and tpu::SemaphoreSignalOp (both via getTargetCoreType / getDeviceIdMutable), calls GetPhysicalCoreID at four sites, and commits via MutableOperandRange::assign. The pass may also insertArguments to thread a physical-core-index function argument.

GetPhysicalCoreID (0x132d8de0)

This is the logical→physical materializer. For CoreType == SparseCore it emits the MLIR op chain (decompile lines 177–183, in order):

ci   = sparse_core::CoreIndexOp::create(b, loc)            // current SC core index
k    = arith::ConstantIndexOp::create(b, loc, stride)      // per-tile physical stride
mul  = arith::MulIOp::create(b, loc, ci, k)
tid  = sparse_core::TileIdOp::create(b, loc)               // tile id within the core
add  = arith::AddIOp::create(b, loc, mul, tid)
i32  = Builder::getI32Type()
phys = arith::IndexCastOp::create(b, loc, i32, add)        // → physical_core_id : i32
                                                           //   = i32(CoreIndex * stride + TileId)

For the non-SC (TensorCore) path it returns llo::CoreIndexOp::create(b, loc) directly. The stride is selected by walking the ChipTopology core-entry array (per-core role tag), and its literal value depends on the chip-config proto.
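
The arithmetic of the SC path, and its effect on the matched ops, can be modelled as follows. This is a hedged sketch: physical_core_id and remap_device_ids are hypothetical helper names, the stride of 16 in the usage lines is a placeholder (the real value comes from the chip-config proto), and ops are represented as plain dictionaries rather than MLIR operations.

```python
def physical_core_id(core_index, tile_id, stride):
    """Model of i32(CoreIndex * stride + TileId)."""
    flat = core_index * stride + tile_id
    flat &= 0xFFFFFFFF  # narrowing an index value to 32 bits truncates
    return flat - (1 << 32) if flat & 0x80000000 else flat  # reinterpret as signed


# The two op kinds the text says the callback handles.
REMAPPED_OPS = {"EnqueueDMAOp", "SemaphoreSignalOp"}


def remap_device_ids(ops, core_index, tile_id, stride):
    """Return a copy of ops with device_id replaced on the matched op kinds."""
    phys = physical_core_id(core_index, tile_id, stride)
    out = []
    for op in ops:
        op = dict(op)
        if op["op"] in REMAPPED_OPS:
            op["device_id"] = phys  # stands in for getDeviceIdMutable().assign(phys)
        out.append(op)
    return out


ops = [
    {"op": "EnqueueDMAOp", "device_id": 0},
    {"op": "SemaphoreSignalOp", "device_id": 0},
    {"op": "SomeOtherOp", "device_id": 0},
]
# stride=16 is an assumed placeholder, not a value read from the binary.
print(remap_device_ids(ops, core_index=1, tile_id=3, stride=16))
```

In this model, core index 1 and tile 3 map to physical id 19; the unmatched op keeps its original operand.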

NOTE — why this must precede LowerToMlo. The barrier lowering in pass 12 turns the collective op into hardware sync (SemaphoreSignal / SyncAdd) whose target is the device_id operand. If those operands still held logical ids, the emitted barrier would signal the wrong physical cores. Running the remap as pass 11 guarantees every DMA/semaphore op carries a physical core id by the time LowerToMlo consumes it.


The MEGACORE Barrier

The last pass, LowerToMloPass, lowers the SC collective op; for a barrier it reaches CollectiveEmitterBase::EmitScsBarrier (0x13352500). The megacore even/odd core-pair sync lives inside the two barrier arms that emitter dispatches; it is not a separate dispatch case.

Dispatch: EmitScsBarrier (0x13352500)

The emitter reads the BarrierConfig from the HLO BackendConfig (defaulting to BarrierConfig_globals_ when absent), then switches on barrier_type at offset +0x20 (decompiled as *((_DWORD*)cfg + 8)):

| barrier_type | Dispatch | SFLAG-number source |
|---|---|---|
| 1 GLOBAL | EmitGlobalBarrier @ 0x13352820 | GetSyncFlagForBarrierId(reserved id) (see Barrier → SFLAG Binding) |
| 3 CUSTOM | EmitCustomBarrierFromConfig @ 0x13352cc0 → EmitCustomBarrierStart @ 0x13352fc0 | GetSyncFlagForBarrierId(colored id) |
| 0 / 2 / 4 (incl. MEGACORE=4) | RetCheckFailSlowPath (source line 0x5f) | n/a (illegal at SC kernel emission) |

The decompiled body confirms this exactly: v11 == 1 calls EmitGlobalBarrier, v11 == 3 calls EmitCustomBarrierFromConfig, and the else arm RetCheckFails with the message "backend_config.barrier_config().barrier_type() == jellyfish::BarrierType::CUSTOM".
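
The dispatch shape is small enough to restate as a sketch. The Python below is illustrative only: emit_scs_barrier is a hypothetical model that returns the name of the emitter it would call, the enum carries only the three values named on this page, and a ValueError stands in for the RetCheck failure.

```python
from enum import IntEnum


class BarrierType(IntEnum):
    # Only the values named in the text; the names for 0 and 2 are not given there.
    GLOBAL = 1
    CUSTOM = 3
    MEGACORE = 4


def emit_scs_barrier(barrier_type):
    """Return the name of the emitter the dispatch would reach."""
    if barrier_type == BarrierType.GLOBAL:
        return "EmitGlobalBarrier"
    if barrier_type == BarrierType.CUSTOM:
        return "EmitCustomBarrierFromConfig"
    # Everything else, MEGACORE included, is illegal at SC kernel emission.
    raise ValueError(f"illegal barrier_type at SC kernel emission: {int(barrier_type)}")


print(emit_scs_barrier(BarrierType.GLOBAL))
print(emit_scs_barrier(BarrierType.CUSTOM))
try:
    emit_scs_barrier(BarrierType.MEGACORE)
except ValueError as err:
    print("rejected:", err)
```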

GOTCHA — MEGACORE is not a barrier kind. BarrierType::MEGACORE(4) reaching EmitScsBarrier is a hard error (RetCheck). The enum value is a producer-side annotation, not something the SC kernel emitter consumes. The actual megacore behaviour is a topology-gated partner cross-core sync (ToPartnerGlobalCoreId + partner SyncAddOp) shared by both the GLOBAL and CUSTOM arms whenever LogicalDevicesPerChip(SC) == 2; the explicit even/odd Predicated/Not leader/partner predication is materialized in EmitGlobalBarrier (the GLOBAL arm), while the CUSTOM start emitter carries the partner-add without the Predicated wrapping. A reimplementer must not model "megacore barrier" as a distinct lowering branch.

Megacore split: EmitGlobalBarrier (0x13352820)

When the part is a two-core megachip, EmitGlobalBarrier builds an even/odd leader/partner predicate and emits the barrier as two mutually-exclusive predicated regions. The decompiled body shows the arithmetic in order:

ChipCount();  ldpc = LogicalDevicesPerChip(SC);          // total SC cores = ChipCount * ldpc
// even/odd LSB selector:
Shli(x, 1); And(...);  Compare(eq)  ── is_even predicate (LSB == 0 ⇒ leader)
lowering_util::Assert(...)
core_index    = GetCoreIndex()
chip          = DivU(core_index, CoresPerChip(SC))
gcid          = ToGlobalCoreId(chip, ...)
pred          = Compare(gcid, ...)                       ── leader/partner predicate
Predicated(pred,      $_0)   ── EVEN / LEADER  arm  @0x13355b60
Predicated(Not(pred), $_1)   ── ODD  / PARTNER arm  @0x13355f80

The two arms

Each arm is a __policy_func thunk taking an OpBuilder& and returning a Status. Decompile op-create counts (verified per arm):

| Arm | SyncWaitOp | local SyncAddOp | partner SyncAddOp | scf::ForOp |
|---|---|---|---|---|
| $_0 even / leader (0x13355b60) | 1 | 1 | 1 (inside the ForOp) | 1 |
| $_1 odd / partner (0x13355f80) | 1 | 1 | 1 | 0 |

Each arm:

  • emits sparse_core::SyncWaitOp::create (0x14618fa0) tagged via setAttr with a getStringAttr twine of value 259 (= 0x103 = IciDim::kCoresOnChip) — i.e. the wait is explicitly marked a cross-core (cores-on-chip) wait. The literal 259 is read directly from the decompiled thunk (v46 = 259);
  • emits the simple local sparse_core::SyncAddOp::create (0x14611f40) — the local sync increment;
  • emits the megacore-partner sparse_core::SyncAddOp::create variant (0x146120c0, the longer (…CoreTypeAttr, bool) signature) targeting the partner core, with the cross-core sflag offset computed via SubsliceToFullSlice (0x133e79a0). In the leader arm this partner add is wrapped in an scf::ForOp (0x17866d60) over LogicalDevicesPerChip(SC).

The 0x103/kCoresOnChip tag ties this barrier to the same megacore knob the SC tensor-split uses — see Megacore Even/Odd Split.

The partner: ToPartnerGlobalCoreId (0x133e66e0)

The cross-core SyncAdd targets the partner core, computed by OffloadFactory::ToPartnerGlobalCoreId. The decompiled body is unambiguous:

RetCheck( LogicalDevicesPerChip(SC) == 2 )               // megacore-only; else RetCheck line 0x50
local = GetCoreIndex()
chip  = DivUIOp(local, CoresPerChip(SC))
gcid  = ToGlobalCoreId(chip, ...)
return XOrIOp(gcid, ConstantIndexOp(1))                  // flip the low bit ⇒ even↔odd partner

The partner is the current core's id with its low bit flipped (XOR 1): core 0 ↔ core 1 within the megachip pair. The body flattens first (ToGlobalCoreId(chip)) and then XORs the resulting global id with 1. On failure of the LDPC(SC) == 2 RetCheck, the slow path logs "GetPartnerGlobalCoreId() is only available for 2 logical devices per chip configurations." — there is no partner unless the part is a two-core megachip.
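
The gate and the bit flip can be captured in a few lines. This is a minimal sketch, assuming the global core id has already been flattened (the arguments of ToGlobalCoreId are elided in the text and are not reconstructed here); partner_global_core_id is a hypothetical name and a ValueError stands in for the RetCheck.

```python
def partner_global_core_id(global_core_id, logical_devices_per_chip):
    """Model of the partner computation: gate on LDPC == 2, then flip the low bit."""
    if logical_devices_per_chip != 2:
        raise ValueError(
            "partner core id is only available for 2 logical devices per chip"
        )
    return global_core_id ^ 1  # even <-> odd within the pair


print(partner_global_core_id(0, 2))  # partner of core 0
print(partner_global_core_id(1, 2))  # partner of core 1
```

Because XOR with 1 is its own inverse, applying the function twice returns the original core id, which is what makes the pairing symmetric.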

Worked rendezvous — what the two arms accomplish

Concretely, take a megachip with SC cores {0, 1} (LDPC(SC) == 2). Core 0 is even (leader, runs $_0); core 1 is odd (partner, runs $_1). Each core's emitted program does, in MLIR terms:

core 0 ($_0 leader):                       core 1 ($_1 partner):
  SyncWait(local sflag, tag=kCoresOnChip)    SyncWait(local sflag, tag=kCoresOnChip)
  SyncAdd(local sflag)                       SyncAdd(local sflag)
  for d in 0..LDPC(SC):                       SyncAdd( partner = ToPartnerGlobalCoreId()
    SyncAdd( partner = ToPartnerGlobalCoreId()              = (1) XOR 1 = core 0 )
            = (0) XOR 1 = core 1 )

The partner of core 0 is 0 XOR 1 = 1; the partner of core 1 is 1 XOR 1 = 0. Each core increments the other core's sync flag (the cross-core SyncAdd via the SubsliceToFullSlice addressing into the partner's sflag window) and waits on its own (the kCoresOnChip-tagged SyncWait), so the two cores rendezvous before either proceeds. This structure is entirely absent on a single-core part: there the partner adds and the even/odd predication are suppressed, and the barrier degenerates to a purely local SyncAdd/SyncWait.
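
The rendezvous idea (signal the partner's flag, wait on your own) can be simulated with two threads. The sketch below is a toy model only: SyncFlags, barrier, and run_pair are hypothetical names, the local SyncAdd is omitted, and the op ordering and wait threshold are assumptions chosen so the toy terminates, not semantics recovered from the binary.

```python
import threading


class SyncFlags:
    """Toy per-core sync-flag counters shared by both cores."""

    def __init__(self, cores):
        self._values = [0] * cores
        self._cv = threading.Condition()

    def add(self, core, amount=1):
        with self._cv:
            self._values[core] += amount
            self._cv.notify_all()

    def wait_at_least(self, core, threshold, timeout=5.0):
        with self._cv:
            return self._cv.wait_for(
                lambda: self._values[core] >= threshold, timeout
            )


def barrier(core, flags, events):
    events.append(("arrive", core))
    flags.add(core ^ 1)           # cross-core add into the partner's flag
    flags.wait_at_least(core, 1)  # wait until the partner has signalled us
    events.append(("leave", core))


def run_pair():
    flags = SyncFlags(2)
    events = []
    threads = [
        threading.Thread(target=barrier, args=(c, flags, events)) for c in (0, 1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_pair()
print([kind for kind, _ in events])
```

Whatever the thread interleaving, neither core can record a "leave" before both have recorded an "arrive", which is the property the paired cross-core adds are there to provide.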

The CUSTOM arm shares the partner cross-core sync

EmitCustomBarrierStart (0x13352fc0, reached from the CUSTOM(3) dispatch via the color-id closure thunk 0x13355420) calls ToPartnerGlobalCoreId, GetSyncFlagForBarrierId (the colored barrier id → SFLAG number), and the megacore-partner SyncAddOp variant (0x146120c0, twice — once for the local side, once cross-core via SubsliceToFullSlice 0x133e79a0) — the same partner cross-core sync the GLOBAL arm uses. The cross-core SFLAG addressing source is the only first-order difference between GLOBAL and CUSTOM (reserved id vs colored id). Note that EmitCustomBarrierStart itself does not call Predicated/Not: unlike EmitGlobalBarrier, the even/odd predication is not materialized inside the CUSTOM start emitter (it carries the partner-add machinery but not the explicit leader/partner predicate split). What is shared between the two arms is the ToPartnerGlobalCoreId partner core computation and the partner SyncAddOp, not the Predicated wrapping.

GOTCHA — single-core parts have no partner path. On a part where LogicalDevicesPerChip(SC) == 1, ToPartnerGlobalCoreId RetChecks, so the partner-add and the even/odd predication never fire — only the local SyncAdd / SyncWait remain. A reimplementer must gate the entire partner machinery on LDPC(SC) == 2, not emit it unconditionally.


How the SC Pipeline Contrasts with the TensorCore Stack

The TensorCore scheduling stack is three stages over two IRs (HLO latency-hiding scheduler → MXU/MRB assignment → LLO bundle packer), priced by a bundle cost model. The SC back-end pipeline is a different animal: a flat twelve-pass MLIR pipeline over the tpu/mosaic_sc dialects ending in an MLO lowering, with no separate resource-assignment stage and no cost-model-priced reordering. The SC pipeline's only "scheduling-adjacent" decisions are vector layout (passes 7–9) and the per-engine outlining that precedes this pipeline (see Region → Sequencer Outliner). The two pipelines meet only at the chip boundary: the SC barrier this pipeline emits synchronizes against the TensorCore through the shared sync-flag pool, and the device-id remap (pass 11) is what makes those cross-engine signals land on the right physical cores.


Cross-References