SC ↔ MXU Handshake

Every op name, symbol, operand count, assertion string, and literal on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d; build libtpu_lts_20260413_b_RC00, LLVM SHA 8918319853fbdf9e6f6cb69e96848f913a22bc31). Other versions differ.

Abstract

This page documents the cross-engine boundary between the SparseCore (SC) co-processor and the TensorCore (TC) systolic MXU: the handshake that lets a recommender / DLRM model feed gathered embedding rows out of SC and into a dense matmul. SC is good at exactly what the MXU cannot do (index-driven gather of single HBM rows, atomic FP-add scatter); the MXU is good at exactly what SC cannot do (128×128 weight-stationary dense matmul). An XlaSparseDenseMatmul is the fusion of those two: SC gathers the per-sample embedding rows, and the MXU multiplies them against the dense MLP weights. The two engines run asynchronously, so the entire boundary reduces to three concerns: how the data crosses (SC writes a tile into TC's VMEM operand pool), how the engines synchronize (the SFLAG sync-flag protocol with TC-side tpu.sem_wait), and how the MXU binds the result (the contracting-depth contract: the SC tile lands as the MXU moving operand, reduced over the dense MxuContractingSize, not over any sparse contracting dimension).

This page documents only the boundary. The intra-SC gather datapath (the Stream slot, the IndirectStream form, the per-element address formula) is owned by Stream Gather/Scatter; the TC MXU op family (vlatch / vmatprep / vmatmul / vmatres) is owned by MXU Slot; the SC on-chip geometry and the bump allocator are owned by SparseCore Architecture. Here we connect them. The three handoff channels (HBM bulk, VMEM-targeted DMA, SFLAG control plane) and the lowering symbols that emit each are the recoverable surface; the byte-layout of the routing key (DeviceAndCoreIds) and the per-gen SFLAG-address bit-encoding are the open items.

For reimplementation, the contract is:

  • There is exactly ONE direct SC→VMEM data route and ONE direct VMEM→SC route. Of the 40 mlir::sparse_core::tpu_dma_<src>_to_<dst>_sc_<form> ops, only three name vmem: tpu_dma_hbm_to_vmem_sc_general (the SC→MXU feed), tpu_dma_vmem_to_hbm_sc_general (the MXU result writeback), and tpu_dma_vmem_to_vmem_sc_general (VMEM→VMEM, intra-TC). All three are general-form only (16 operands, full descriptor). Every other SC↔TC transfer transits HBM. A reimplementer must route the embedding-feed tile through tpu_dma_hbm_to_vmem_sc_general.
  • The sync is a writer-count SFLAG, waited on from the TC side. SC raises a sync flag on tile completion; TC's tpu.sem_wait (SemaphoreWaitOp, 2 operands) advances when the flag's writer count reaches the threshold. The flag lives in the global sflag subspace for cross-CORE coordination; the cross-sub-engine flags (sflag_scs, sflag_tile) are SC-internal and never cross to TC directly.
  • The MXU consumes the SC tile as the dense moving operand. The matmul reduces over Target::MxuContractingSize (128 base, 256 on the Ghostlite class). MxuSparseContractingSize is 0 on every generation — the sparse-MXU contracting path is unused in this build, so the embedding-feed matmul runs the ordinary dense contracting datapath. The SC-vs-MXU split is engine-level, not a sparse contracting mode.
  • The pipeline is double-buffered. OffloadFactory::AllocateSflag(builder, /*set*/false, /*count*/2) allocates one flag per buffer: SC fills buffer A while the MXU drains buffer B, then they flip. This is what hides the SC gather latency behind MXU compute.

| Aspect | Symbol / detail |
|---|---|
| HLO boundary marker | XlaSparseDenseMatmulOp (+ Grad / CSR / optimizer-fused / minibatch / megachip variants) |
| SC→MXU data route | mlir::sparse_core::tpu_dma_hbm_to_vmem_sc_general (16 operands; general form only) |
| MXU→SC writeback | mlir::sparse_core::tpu_dma_vmem_to_hbm_sc_general (16 operands) |
| DMA lowering | LowerMemrefToMlo::lowerEnqueueDma (0x135105a0) → sc_tpu.dma_general_start + sc_tpu.dma_wait |
| kStream vs kDma | xla::tpu::sparse_core::GetTransferKind (0x1351b140; (Target&, src, dst, 4×bool)) |
| TC-side sync ops | tpu.sem_wait / tpu.sem_signal / tpu.sem_read / tpu.fetch_and_add_sync |
| SFLAG subspaces | sflag (cross-CORE) · sflag_scs (SCS scope) · sflag_tile (TEC/tile scope) |
| Cross-sub-engine sync | OffloadFactory::SyncScsWithTec (0x133e9260) / SyncTecWithScs (0x133e8fe0) |
| MXU contracting depth | Target::MxuContractingSize = 128 base / 256 Ghostlite class; MxuSparseContractingSize = 0 (unused) |
| Double-buffer | OffloadFactory::AllocateSflag(b, false, 2): one flag per buffer |

The Three Handoff Channels

Purpose

SC and TC share three physical media: HBM (the global table store), TC's VMEM (the MXU operand feed), and the SFLAG register file (the control plane). The compiler picks a channel per transfer; the choice is what determines whether the handshake is a slow global round-trip or a fast direct operand feed.

The Channels

Channel 1 — HBM bulk (slow, global):
  SC stream engine  --STREAM_OPCODE_GATHER / _SCATTER_FLOAT_ADD-->  HBM
  TC                --tpu.enqueue_dma (DMA_MEMORY_ID_VMEM)------->  reads HBM into VMEM
  sync: SC stream op carries a completion sflag; TC tpu.sem_wait on the same flag.

Channel 2 — VMEM-targeted DMA (fast, programmed):       <-- THE SC->MXU FEED
  SC  --tpu_dma_hbm_to_vmem_sc_general (16 operands)----------->  TC VMEM
  the destination sflag (TC side) AND the source DMA-done sflag (SC side)
  are both carried in the operand vector.

Channel 3 — SFLAG-only control plane (no data):
  SCS  --SetSyncFlag / AddSyncFlag / AtomicRemoteWriteSetDone-->  shared SFLAG
  TC   --tpu.sem_wait / tpu.fetch_and_add_sync---------------->  keyed on same SFLAG addr
  remote address encoded per-gen by EncodeRemoteSyncFlagAddress{JfDf,Pufferfish,Viperfish}
  and xla::ghostlite::dma_utils::EncodeRemoteSyncFlagAddress.

The Canonical Embedding-Feed Fastpath

There is exactly one direct SC→VMEM route and one TC→SC-via-VMEM route — verified by enumerating all 40 sparse_core::tpu_dma_* ops in the decompile, of which only three name vmem:

tpu_dma_hbm_to_vmem_sc_general    16 operands   ← SC produces, MXU consumes
tpu_dma_vmem_to_hbm_sc_general    16 operands   ← MXU result writeback
tpu_dma_vmem_to_vmem_sc_general   16 operands   ← VMEM→VMEM intra-TC

So the canonical SC→MXU path is:

SC TEC  ->  TILE_SPMEM  ->  SPMEM  ->  HBM  ->  VMEM  ->  MXU       (full aggregation)
SC TEC  ->  TILE_SPMEM  ->  HBM    ->  VMEM  ->  MXU                (direct, skips SPMEM)

The short form skips the SPMEM aggregation stage when a per-tile result is destined straight for MXU consumption; there is no spmem_to_vmem route, so SPMEM-resident tiles always transit HBM before reaching VMEM.

NOTE — the absence of simple and single_strided forms for the VMEM routes is structural, not an omission: *_to_vmem_sc_general and vmem_to_*_sc_general exist only in the general (16-operand) form. SC↔TC VMEM transfers always carry a full multi-dim DMA descriptor. A reimplementer cannot emit a "simple" 8-operand DMA across the SC↔TC boundary — the geometry of an embedding tile (rows × embedding_dim, possibly strided per logical replica) always needs the full descriptor.
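
The route restriction above can be written down as a small lookup. This is a minimal C++17 sketch: the `Mem` and `DmaForm` enums and the `VmemRouteOp` helper are hypothetical names invented for illustration, and only the three returned op names come from the enumeration on this page.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical enums for illustration; they are not symbols from libtpu.
enum class Mem { kHbm, kVmem, kSpmem, kTileSpmem };
enum class DmaForm { kSimple, kSingleStrided, kGeneral };

// True when either endpoint is TC VMEM, i.e. the transfer crosses the SC<->TC boundary.
constexpr bool TouchesVmem(Mem src, Mem dst) {
  return src == Mem::kVmem || dst == Mem::kVmem;
}

// Returns the op name for a VMEM-touching route, or nullopt when no such op
// exists (the form is not general, or the route must be staged through HBM).
inline std::optional<std::string> VmemRouteOp(Mem src, Mem dst, DmaForm form) {
  if (!TouchesVmem(src, dst)) return std::nullopt;     // not a boundary transfer
  if (form != DmaForm::kGeneral) return std::nullopt;  // only the 16-operand form exists
  if (src == Mem::kHbm && dst == Mem::kVmem) return "tpu_dma_hbm_to_vmem_sc_general";
  if (src == Mem::kVmem && dst == Mem::kHbm) return "tpu_dma_vmem_to_hbm_sc_general";
  if (src == Mem::kVmem && dst == Mem::kVmem) return "tpu_dma_vmem_to_vmem_sc_general";
  return std::nullopt;  // e.g. SPMEM -> VMEM: no such op, stage through HBM first
}
```

A caller that gets `nullopt` for an SPMEM-resident tile would first emit the SPMEM→HBM leg and then retry with `Mem::kHbm` as the source, which matches the full-aggregation path shown earlier.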

Function Map

| Function | Address | Role |
|---|---|---|
| LowerMemrefToMlo::lowerEnqueueDma | 0x135105a0 | tpu.enqueue_dma → sc_tpu.dma_{simple,general,single_strided}_start + sc_tpu.dma_wait |
| LowerMemrefToMlo::lowerEnqueueIndirectDma | 0x13511da0 | tpu.enqueue_indirect_dma → indirect-stream ops |
| LowerMemrefToMlo::lowerWaitDma | 0x135135e0 | tpu.wait_dma2 → sc_tpu.dma_wait |
| LowerMemrefToMlo::getVerifiedDmaShapes | 0x13509dc0 | extract + verify the DMA shapes |
| LowerMemrefToMlo::checkShapeTileAlignment | 0x135053a0 | verify shape-tile alignment for cross-engine DMA |
| LowerMemrefToMlo::convertDeviceIdFromSubsliceToFullSlice | 0x13505a40 | logical → physical core id for the routing key |

kStream vs kDma — Choosing the Channel

Purpose

Before lowering a transfer, the compiler classifies it as a stream (SC stream engine, high random-access bandwidth, the gather/scatter datapath) or a DMA (the regular DMA engine, high contiguous bandwidth, the cross-engine VMEM handoff). The SC→MXU feed is a kDma: it is a contiguous tile crossing the engine boundary, not an indexed gather.

The Classifier

// xla::tpu::sparse_core::GetTransferKind  @ 0x1351b140
//   demangled tail "...MemorySpaceES8_bbbb" == (Target&, MemorySpace, MemorySpace, b, b, b, b)
TransferKind GetTransferKind(
    const xla::jellyfish::Target& target,
    mlir::sparse_core::MemorySpace src_mem,
    mlir::sparse_core::MemorySpace dst_mem,
    bool is_indirect,
    bool is_atomic,
    bool is_remote,
    bool is_tile_scope);   //  -> TransferKind::kStream | TransferKind::kDma

Decision rule (from the assertion-string anchors transfer_kind == TransferKind::kStream and == TransferKind::kDma, both present in the lowering decompile):

| Transfer | Kind | Why |
|---|---|---|
| Indirect (gather / scatter), id-list driven | kStream | stream engine has higher random-access bandwidth |
| Atomic (scatter-add into HBM) | kStream | the in-HBM FP-add the MXU cannot do |
| Embedding-style HBM traffic | kStream | single-row random access |
| Contiguous bulk transfer | kDma | sequential, the regular DMA engine |
| Cross-engine SC↔TC via VMEM | kDma | the operand-feed tile is contiguous |
| Host (IOVA) transfer | kDma | host staging |

GOTCHA — the getTransferKind<EnqueueDMAOp> (0x135114a0) and getTransferKind<WaitDMA2Op> (0x135145e0) member-templates are per-op-kind classifiers that delegate to the free GetTransferKind; the lowering then CHECKs that the chosen kind matches the producing source path (the two transfer_kind == TransferKind::k* asserts). A reimplementer who emits a VMEM-targeted transfer as a kStream will trip that CHECK — the SC↔TC VMEM feed must be classified kDma. The two channels also use distinct HBM controller policies (random-access prefetch for SC stream, sequential-stream prefetch for TC DMA), so the classification is not cosmetic.
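
The decision table can be restated as a small classifier. This is a hedged sketch, not the body of GetTransferKind, which was not recovered: the enum values, the function names, and above all the precedence order (VMEM and host checks before the indirect / atomic checks) are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical enums; the real MemorySpace enumerators are not listed on this page.
enum class MemorySpace { kHbm, kVmem, kSpmem, kTileSpmem, kHost };
enum class TransferKind { kStream, kDma };

// Assumed restatement of the decision table. The ordering of the checks is a guess.
constexpr TransferKind ClassifyTransfer(MemorySpace src, MemorySpace dst,
                                        bool is_indirect, bool is_atomic) {
  // Cross-engine VMEM feed and host staging: the regular DMA engine.
  if (src == MemorySpace::kVmem || dst == MemorySpace::kVmem) return TransferKind::kDma;
  if (src == MemorySpace::kHost || dst == MemorySpace::kHost) return TransferKind::kDma;
  // Index-driven or atomic traffic goes to the stream engine.
  if (is_indirect || is_atomic) return TransferKind::kStream;
  // Everything else is treated as contiguous bulk.
  return TransferKind::kDma;
}

// Mirrors the intent of the lowering-side CHECK: a VMEM-touching transfer must be kDma.
constexpr bool VmemFeedIsConsistent(MemorySpace src, MemorySpace dst, TransferKind kind) {
  const bool touches_vmem = src == MemorySpace::kVmem || dst == MemorySpace::kVmem;
  return !touches_vmem || kind == TransferKind::kDma;
}
```

Running `VmemFeedIsConsistent` on every emitted transfer before lowering is a cheap way to catch the misclassification described in the GOTCHA at the point where it is introduced.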


The SFLAG Sync Handshake

Purpose

The data channel moves bytes; the SFLAG channel moves permission. Because SC and TC run asynchronously, the MXU must not latch a VMEM tile until SC has finished writing it, and SC must not overwrite a buffer until the MXU has finished reading it. Both directions are sync flags — a small counter the producer increments and the consumer waits on.

The SFLAG Memory Model

SFLAG is not one register file; it is segmented at the sub-engine level into three subspaces, confirmed by the verifier assertion that a flag's memory space is one of exactly these three:

sflag_memory_space == mlir::sparse_core::MemorySpace::sflag        // global, cross-CORE (SC<->TC)
                   || mlir::sparse_core::MemorySpace::sflag_tile    // per-tile, TEC scope
                   || mlir::sparse_core::MemorySpace::sflag_scs     // per-SCS scope

| MemorySpace | Scope | Crosses to TC? |
|---|---|---|
| MemorySpace::sflag | global SFLAG pool (cross-engine, cross-core) | yes: this is the SC↔TC handshake flag |
| MemorySpace::sflag_tile | per-tile (TEC sub-engine scope) | no (SC-internal) |
| MemorySpace::sflag_scs | per-SCS (SCS sub-engine scope) | no (SC-internal) |

The region scoping is verifier-enforced, byte-confirmed by two assertion strings present in mlir::sparse_core::MloModuleVerifier::Verify(mlir::memref::StoreOp) (0x146a9da0):

"on smem_tile/sflag_tile is only allowed inside a tile_task or an execute sequencer function"
"on smem_scs/sflag_scs is not allowed inside a tile_task or an execute sequencer function"

So a flag that crosses to the TC must live in the global sflag subspace. The tile-scope and SCS-scope flags are confined to their sub-engine and need an explicit address-space cast (tpu_addrspacecast_sflag_tile_sflag_scs, etc.) even to be shared between SC sub-engines, let alone to reach TC.
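
The two scoping rules can be expressed as a single predicate. This is a minimal C++17 sketch: the enum, the `CheckStoreScope` helper, and its boolean parameter (a stand-in for whatever region query the real verifier performs) are hypothetical. Only the two diagnostic strings are taken from the assertion strings quoted above.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical enum covering the spaces named in the two assertion strings.
enum class MemorySpace { kSflag, kSflagTile, kSflagScs, kSmemTile, kSmemScs };

// Only the global subspace is visible across the SC<->TC boundary.
constexpr bool CrossesToTc(MemorySpace space) { return space == MemorySpace::kSflag; }

// Returns the diagnostic for an illegal store, or nullopt when the store is legal.
// `in_tile_task_or_execute_seq` is an assumed input describing the enclosing region.
inline std::optional<std::string> CheckStoreScope(MemorySpace space,
                                                  bool in_tile_task_or_execute_seq) {
  const bool tile_scoped =
      space == MemorySpace::kSflagTile || space == MemorySpace::kSmemTile;
  const bool scs_scoped =
      space == MemorySpace::kSflagScs || space == MemorySpace::kSmemScs;
  if (tile_scoped && !in_tile_task_or_execute_seq)
    return "on smem_tile/sflag_tile is only allowed inside a tile_task or an execute sequencer function";
  if (scs_scoped && in_tile_task_or_execute_seq)
    return "on smem_scs/sflag_scs is not allowed inside a tile_task or an execute sequencer function";
  return std::nullopt;  // global sflag, or a correctly scoped tile / SCS access
}
```

The global `sflag` space passes in either region, which is consistent with it being the only subspace usable for the cross-CORE handshake.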

TC-Side Ops That Target SC Flags

The TC sequencer (TpuSequencer) issues four mlir::tpu ops that participate in the handshake. All four symbols (SemaphoreWaitOp, SemaphoreSignalOp, SemaphoreReadOp, FetchAndAddSyncOp) are present in the decompile:

| TC op | MLIR op | Operands | Role |
|---|---|---|---|
| tpu.sem_wait | SemaphoreWaitOp | 2 | wait on (sflag_ptr, threshold); the MXU-feed gate |
| tpu.sem_signal | SemaphoreSignalOp | ≥2 (variadic, AttrSizedOperandSegments) | increment a flag, optionally targeting a remote SC partition |
| tpu.sem_read | SemaphoreReadOp | 1 → 1 result | read a flag value without blocking |
| tpu.fetch_and_add_sync | FetchAndAddSyncOp | 3 (NOperands<3>) → 1 result | atomic add, returns old value (tile counters) |

Both tpu.enqueue_dma (EnqueueDMAOp) and tpu.sem_signal (SemaphoreSignalOp) carry a getRemoteDeviceAndSparseCoreIds<T>() specialization — confirming that both DMA and semaphore can address a remote SparseCore by partition id via the DeviceAndCoreIds routing key.

The expandTPUFetchAndAddSync helper (0x134e60c0, in ExpandTiledMemRefsPass) lowers a TC-side tpu.fetch_and_add_sync into an sc_tpu.fetch_and_add when its destination flag is in SC memory — i.e. the SC side is a passive recipient of TC's atomic-add primitive.
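
The "atomic add, returns old value" semantics are the same as a standard fetch-and-add. The sketch below models a flag word with `std::atomic`; the `SyncFlag` struct, the 32-bit width, and the helper names are assumptions for illustration and say nothing about the hardware encoding.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy model of one sync-flag word. The 32-bit width is an assumption.
struct SyncFlag {
  std::atomic<std::uint32_t> value{0};
};

// Atomic add that returns the value held before the add,
// the behaviour this page attributes to tpu.fetch_and_add_sync.
inline std::uint32_t FetchAndAdd(SyncFlag& flag, std::uint32_t delta) {
  return flag.value.fetch_add(delta, std::memory_order_acq_rel);
}

// Non-blocking read, the analogue of tpu.sem_read.
inline std::uint32_t Read(const SyncFlag& flag) {
  return flag.value.load(std::memory_order_acquire);
}
```

Because each caller receives a distinct old value, the primitive can hand out unique slots from a shared counter, which is one plausible reading of the "tile counters" role in the table.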

The Handshake Sequence

forward (SC produces, MXU consumes):
  1. SC TEC vector-store completes the tile in VMEM (or HBM->VMEM DMA done).
  2. SC sync_set / set_done_bit raises the completion flag in the global `sflag` subspace.
  3. TC tpu.sem_wait(flag, threshold) advances; the MXU latches the tile (matprep.subr).
  4. MXU runs; on result writeback, TC issues sem_signal on the read-done flag.
  5. SC sem_read / sync_wait sees the read-done flag and is free to reuse the buffer.

When the pipeline is balanced, the wait is non-blocking: SC tile N's sync_set completes at TC cycle T+1, TC's sem_wait at T+1 unblocks immediately, the MXU latch for tile N runs at T+2, and SC tile N+1's gather has already started at T+1 — so tile N+1 is resident by the time the MXU asks for it.
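
The forward and reverse flags combine into a two-buffer ping-pong. The following is a deterministic, single-threaded simulation of that protocol, written as a sketch: the `BufferFlags` struct, the use of two monotonic counters per buffer (a completion counter and a read-done counter), and the choice to let the producer attempt two fills per consumer step are all modelling assumptions, not recovered behaviour.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Per-buffer counters (assumed model): the producer bumps `filled`,
// the consumer bumps `drained`. Equal counters mean the buffer is free.
struct BufferFlags {
  std::uint32_t filled = 0;   // raised by the producer (SC) after each fill
  std::uint32_t drained = 0;  // raised by the consumer (TC) after each read
};

// Pushes `num_tiles` tiles through two buffers and returns the tile ids
// in the order the consumer observed them.
inline std::vector<int> RunPingPong(int num_tiles) {
  std::array<BufferFlags, 2> flags{};
  std::array<int, 2> buffer{{-1, -1}};
  std::vector<int> consumed;
  int next_fill = 0;
  int next_drain = 0;
  while (next_drain < num_tiles) {
    // Producer: up to two fill attempts per step, so it can run one tile ahead.
    for (int attempt = 0; attempt < 2 && next_fill < num_tiles; ++attempt) {
      const int b = next_fill % 2;
      if (flags[b].filled != flags[b].drained) break;  // buffer not yet read: back-pressure
      buffer[b] = next_fill++;
      ++flags[b].filled;  // analogue of raising the completion flag
    }
    // Consumer: reads a buffer only after its completion counter has advanced.
    const int b = next_drain % 2;
    if (flags[b].filled > flags[b].drained) {
      consumed.push_back(buffer[b]);
      ++flags[b].drained;  // analogue of signalling the read-done flag
      ++next_drain;
    }
  }
  return consumed;
}
```

In this model the producer is blocked exactly when both buffers hold unread tiles, and the consumer never sees a tile out of order or a buffer that was overwritten before being read.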

Cross-Sub-Engine Sync (SC-Internal, Feeding the Boundary)

Before SC can raise the cross-CORE flag, its own sub-engines must agree the tile is done. The collective-offload emitter does this through OffloadFactory:

| Helper | Address | Role |
|---|---|---|
| OffloadFactory::SyncScsWithTec(builder, value, CoreKind) | 0x133e9260 | SCS waits on a flag set by TEC (body issues a SyncWait) |
| OffloadFactory::SyncTecWithScs(builder, v1, v2, CoreKind) | 0x133e8fe0 | TEC waits on a flag set by SCS |
| OffloadFactory::AllocateSflag(builder, bool set, long count) | 0x133e8320 | allocate an SFLAG region (count + initial state) |
| OffloadFactory::StartLocalDma(...) | 0x133eb3a0 | same-chip DMA with src/completion sflags |
| OffloadFactory::DmaWait(builder, sflag_idx, wait_size, ...) | 0x133e8180 | wait for a previously issued DMA |

CoreKind parametrizes which sub-engine the helper operates on:

| CoreKind | Meaning |
|---|---|
| CoreKind::kScs | SparseCore Scalar sub-engine |
| CoreKind::kTec | Tile-Execute sub-engine |
| CoreKind::kTac | Tile-Access sub-engine (Viperfish / Ghostlite only) |

NOTE: on TPU7x (6acc60406, the gfc namespace) there is no TAC, so the SCS↔TEC sync collapses to a single primitive: sc_tpu.tile_wait_scs_smem (TileWaitScsSmemOp, 3 operands = {sflag_ptr, expected_value, smem_buf}; NOperands<3u> confirmed). Viperfish and Ghostlite both still carry the full TAC ISA. On TPU7x the TEC waits until SCS has written a designated value to SMEM, then reads SMEM and continues; SCS does the address computation that TAC used to do. This is the TAC-replacement primitive; it does not change the cross-CORE (SC↔TC) handshake, only the SC-internal access→execute synchronization. See TAC Engine and the per-gen section below.


The Contracting-Depth Binding

Purpose

Once the SC tile is resident in VMEM and the sync flag has cleared, the MXU consumes it. The binding question is: what dimension does the matmul reduce over, and is the SC's sparsity visible to the MXU at all? The answer is the cleanest part of the boundary: the SC tile enters the MXU as the ordinary dense moving operand, reduced over the dense MxuContractingSize. The MXU never sees "sparse" — the sparsity was fully resolved by the SC gather upstream.

What the MXU Latches

The MXU side runs the standard op family (MXU Slot): a vlatch loads the stationary weight tile, a vmatprep.subr / .mubr stages the SC-produced VMEM tile as the moving operand into MATPUSH_TARGET_MSRA / MSRB, vmatmul clocks the systolic array, and vmatres drains the accumulator. The activation tile is read from VMEM by the TC's LoadActivationsChunk path; the embedding-lookup result is that activation tile.

SC-produced VMEM tile  --(matprep.subr/.mubr)-->  MSRA/MSRB  --(vmatmul, K steps)-->  accumulator
                                                                    ^
                                                            dense MLP weights (vlatch, stationary)

The Contracting Constants

The contracting depth is a per-codename C++ literal in the Target subclass — not a chip_parts field. The byte-exact values (owned by SparseCoreTarget (Target+0x948), reproduced here as the consumer):

| MXU constant | v3 Jelly | v4 Puffer | v5p Viperfish | v6e Ghostlite | TPU7x (6acc60406) | Source |
|---|---|---|---|---|---|---|
| MxuContractingSize | 128 | 128 | 128 | 256 | 256 | base 0x1D490060; Ghostlite 0x1D497840 |
| MxuNoncontractingSize | 128 | 128 | 128 | 256 | 256 | base 0x1D490080; Ghostlite 0x1D497860 |
| MxuSparseContractingSize | 0 | 0 | 0 | 0 | 0 | base 0x1D4900C0; no override |
| MxuContractingSizeIsDoubled(mode) | false | false | predicate | predicate | predicate | base 0x1D4900A0; VF 0x1D49AA60; GL 0x1D497880 |

Three facts a reimplementer must encode:

  1. The embedding-feed matmul reduces over the dense contracting dimension, MxuContractingSize: 128 on Jellyfish through Viperfish, 256 on the Ghostlite class. TPU7x/6acc60406 reuses the Ghostlite value; there is no separate Tpu7xTarget, and only GhostliteTarget overrides MxuContractingSize to 256. The reduction depth is the embedding-dim (or a tile thereof), packed exactly like any other dense activation.
  2. MxuSparseContractingSize is 0 on every generation — no Target subclass overrides it. The sparse-MXU contracting path is unused in this build. The SC/MXU division of labor is at the engine level (SC gathers, MXU does dense matmul), not at the level of a sparse contracting mode inside the MXU. A reimplementer should not look for a sparse-MXU latch on the embedding feed.
  3. The doubling predicate is orthogonal to sparsity. Both ViperfishTarget and GhostliteTarget override MxuContractingSizeIsDoubled with the identical predicate (mode - 22) < 4u — true for raw GainLatchMode ∈ {22,23,24,25}, the 4-bit packed-nibble (S4/U4) matmul modes that pack two nibbles into one physical systolic row (doubling effective contracting depth). This is a dtype-packing feature of the dense MXU and applies to embedding-feed matmuls only insofar as the activation dtype is 4-bit — it is not an SC-specific path.
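
The predicate `(mode - 22) < 4u` is the usual single-compare range check: in unsigned arithmetic, any mode below 22 wraps to a very large value and fails the comparison. A minimal sketch follows. The function names are hypothetical, and treating "doubled" as a multiplication of the base depth by two is an illustrative reading of the description above rather than a recovered formula.

```cpp
#include <cassert>
#include <cstdint>

// True for raw mode values 22..25 inclusive. For mode < 22 the unsigned
// subtraction wraps around, so the result is far larger than 4.
constexpr bool ContractingSizeIsDoubled(std::uint32_t mode) {
  return (mode - 22u) < 4u;
}

// Assumed interpretation: the effective reduction depth doubles in those modes.
// `base` would be 128 or 256 depending on the generation.
constexpr std::uint32_t EffectiveContractingDepth(std::uint32_t base, std::uint32_t mode) {
  return ContractingSizeIsDoubled(mode) ? base * 2u : base;
}

static_assert(ContractingSizeIsDoubled(22) && ContractingSizeIsDoubled(25), "range ends");
static_assert(!ContractingSizeIsDoubled(21) && !ContractingSizeIsDoubled(26), "just outside");
static_assert(!ContractingSizeIsDoubled(0), "wraps to a large unsigned value");
```

The same idiom with a signed `mode` would need two comparisons; the cast to unsigned is what lets one compare cover both bounds.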

GOTCHA — the SC SparseCoreLaneCount (Target::SparseCoreLaneCount, 0x0F7906E0 reading the per-gen target descriptor; the runtime tensorflow::GetSparseCoreLaneCount is flag-overridable, so the exact per-gen integer is not a fixed literal) is the vector width SC reduces over, and is not the MXU contracting depth. They are independent: SC's lane count is the TEC vector engine's SIMD width for the gather/reduce upstream; the MXU contracting size (128 / 256) is the systolic row depth the dense matmul reduces over downstream. The handoff is a reshape boundary — SC produces a [rows × embedding_dim] tile in its lane geometry; the MXU re-tiles it into [contracting × noncontracting] for the systolic array. Conflating the two will mis-size the VMEM staging buffer. See SparseCore Architecture for the SC lane geometry and MXU Slot for the systolic tiling.
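
To make the sizing point concrete, the sketch below computes how many contracting steps an embedding dimension needs and what depth a staging buffer would have if it were rounded up to whole steps. Everything here is an assumption for illustration: the `ContractingPlan` struct and `PlanContracting` helper are hypothetical, and whether the real compiler pads or masks a partial final step is not established on this page. The point is only that the computation uses the MXU contracting size and never the SC lane count.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for positive integers.
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
  return (a + b - 1u) / b;
}

// Hypothetical result type for the sketch.
struct ContractingPlan {
  std::uint32_t k_steps;       // passes over the contracting dimension
  std::uint32_t padded_depth;  // embedding_dim rounded up to whole steps (assumed padding)
};

// Sizes the reduction by the MXU contracting size (128 or 256), not the SC lane count.
constexpr ContractingPlan PlanContracting(std::uint32_t embedding_dim,
                                          std::uint32_t mxu_contracting_size) {
  return ContractingPlan{
      CeilDiv(embedding_dim, mxu_contracting_size),
      CeilDiv(embedding_dim, mxu_contracting_size) * mxu_contracting_size};
}
```

For example, under these assumptions an embedding dimension of 200 takes two steps at a contracting size of 128 but a single step at 256.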


HLO-Level Boundary Markers

Purpose

The handshake is created at the HLO level: an XlaSparseDenseMatmul custom-call is the single op that, after decomposition, becomes an SC-side gather computation plus a TC-side dense matmul plus the SFLAG sync between them. These are the op names a reimplementer's front-end must emit to trigger the boundary.

The Custom Calls

| HLO op | What it means |
|---|---|
| XlaSparseDenseMatmulOp (TF op-kernel) / tf.XlaSparseDenseMatmulWithCsrInput | single-step embedding lookup + dense matmul (the base boundary marker) |
| XlaSparseDenseMatmulGradOp (TF op-kernel) | backward: TC computes the gradient, SC scatter-adds into the table |
| tf.XlaSparseDenseMatmulWithCsrInput / …WithStaticBufferSize | base, with CSR (or static-buffer-size) sparse-input format |
| tf.XlaSparseDenseMatmulGradWith{Sgd,Adam,Ftrl,Adagrad,AdagradMomentum}AndCsrInput[Op] | backward + optimizer fused |
| SparseDenseMatmulWithMinibatchingOp (internal HLO) | mini-batch variant (subdivides a batch into TILE_SPMEM-fitting windows); decomposed by sparse-dense-matmul-with-minibatching-op-decomposer |
| tf.XlaSparseDenseMatmulCustomCombinerOnTc[Grad]With…CsrInput / SparseDenseMatmulCustomCombinerTcCombiner{,Megachip}Op | TC-side custom combiner (and mega-chip cross-chip form) |
| GatherMulScatterSparseDenseMatmulOp (internal HLO) | gather-mul-scatter fused form |
| XlaLocalSparseDenseMatmulOp (TF op-kernel) / tf.XlaLocalSparseDenseMatmul | single-device variant (no SPMD partitioning) |

Each is rewritten by its decomposer (the sparse_dense_matmul_* family) into a tuple of {SC-side gather computation, TC-side dense matmul, cross-engine SFLAG sync}. The EmbeddingsPass / EmbeddingBackwardPass recognize the embedding-lookup HLO pattern and produce this custom-call; the EmbeddingDataFormattingDecomposer sets sc.core_type = "sparse" on the SC computation. After decomposition the SC half is partitioned by SparseCoreHierarchicalSpmdPartitioner (with PadSparseCoreProgramInputs / UnPadSparseCoreProgramOutputs aligning the shard boundary) and handed to the SC-MLO pipeline. See SC Backend Pipeline for the pass sequence and SparseCore vs Neuron MatmultSparse for the cross-vendor comparison.

NOTE — the SC↔TC boundary marker is the HLO custom-call, not an MLIR op. By the time the program reaches sc_tpu.* (the SC mid-level IR) the two halves are already separate computations; the only thing tying them together is the SFLAG-keyed sync inserted at the boundaries. A reimplementer working at the MLIR level sees two independent programs — the coupling is the sync-flag use-def chain, which SyncFlagVerifierPass validates.


Partition Assignment — Which SC Feeds Which MXU

Purpose

A chip has several SCs and fewer TCs; the routing key on each TC-side DMA / semaphore selects which SC partition it addresses. The ratio is fixed per generation.

The Ratio

The central constant is xla::jellyfish::lowering_util::SparseCoreCountPerTensorCore(tpu::TpuTopology const*) (0x1C6CB760) = sparse_core_count_per_chip / tensor_core_count_per_chip, guarded by two CHECKs: "sparse_core_count_per_chip >= tensor_core_count_per_chip" and "sparse_core_count_per_chip % tensor_core_count_per_chip == 0". Together they guarantee an integer ratio, which is 4:1 on every SC-bearing generation:

| Gen | TCs/chip | SCs/chip | SCs per TC |
|---|---|---|---|
| Viperfish | 2 | 8 | 4 |
| Ghostlite | 2 | 8 | 4 |
| TPU7x (6acc60406) | 1 | 4 | 4 |
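
The ratio and its two guards can be sketched as follows. The `ChipCounts` struct is a hypothetical stand-in for the two topology fields, exceptions stand in for CHECK failures, and the positivity guard on the divisor is an addition of this sketch that is not among the recovered assertion strings.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the two TpuTopology fields the helper reads.
struct ChipCounts {
  int sparse_core_count_per_chip;
  int tensor_core_count_per_chip;
};

// Integer SC-per-TC ratio. The two logic_error messages reuse the CHECK strings
// quoted above; the invalid_argument guard is this sketch's own addition.
inline int SparseCoreCountPerTensorCore(const ChipCounts& c) {
  if (c.tensor_core_count_per_chip <= 0)
    throw std::invalid_argument("tensor_core_count_per_chip must be positive");
  if (c.sparse_core_count_per_chip < c.tensor_core_count_per_chip)
    throw std::logic_error("sparse_core_count_per_chip >= tensor_core_count_per_chip");
  if (c.sparse_core_count_per_chip % c.tensor_core_count_per_chip != 0)
    throw std::logic_error("sparse_core_count_per_chip % tensor_core_count_per_chip == 0");
  return c.sparse_core_count_per_chip / c.tensor_core_count_per_chip;
}
```

With the per-chip counts from the table, both the 8-SC / 2-TC and the 4-SC / 1-TC configurations evaluate to 4, while a hypothetical 3-SC / 2-TC chip would fail the divisibility guard.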

The 4-SCs-per-TC ratio is preserved across all SC-bearing gens. The four SCs jointly produce the embedding-tile rows consumed by one MXU sequence; each SC's read_register_sparse_core_id (SCS scalar opcode) returns its per-chip id, and IfLocalSparseCore (0x1C6CD000) emits a conditional that runs only when the current SC id matches the designated target (the per-partition gate in SPMD-replicated SC kernels).

The TC-side DeviceAndCoreIds struct (the (logical_device_id, logical_core_id) pair) on each tpu.enqueue_dma / tpu.sem_signal selects the SC; convertDeviceIdFromSubsliceToFullSlice (0x13505a40) maps the logical id to the physical id, and the dialect-level LogicalToPhysicalDeviceIdPass performs the conversion before lowering to Mlo.

GOTCHA — the canonical assignment (SC 0→MXU lane 0, SC 1→lane 1, …) is a convention, not a hard wiring — it is the DeviceAndCoreIds value the compiler fills in. A reimplementer must populate the routing key explicitly; there is no implicit SC↔MXU affinity in the hardware. The Jellyfish / Dragonfish / Pufferfish generations have no SparseCore (the presence gate Target::SupportsSparseCore returns false), so this entire boundary is absent on them.


Per-Generation Variations

| Aspect | Viperfish (vfc) | Ghostlite (glc) | TPU7x / 6acc60406 (gfc) |
|---|---|---|---|
| TAC sub-engine | yes | yes | no |
| SC-internal access→execute sync | SCS→TAC→TEC three-way | SCS→TAC→TEC three-way | SCS→TEC two-way via tile_wait_scs_smem |
| SCs per chip / per TC | 8 / 4 | 8 / 4 | 4 / 4 |
| DMA channels | 8 SC + 12 TC | 8 SC + 12 TC | 4 SC + 12 TC |
| MxuContractingSize | 128 | 256 | 256 (reuses GhostliteTarget) |
| Remote-sflag encoder | EncodeRemoteSyncFlagAddressViperfish | xla::ghostlite::dma_utils::EncodeRemoteSyncFlagAddress | same Ghostlite helper |
| Dual-channel sync (AddBothSyncFlag …) | yes | yes | no |
| Yieldable sync family | 12 ops | 12 ops | no |

The cross-CORE handshake protocol is identical across the three SC gens — global sflag, tpu.sem_wait, double-buffer. The deltas are: (1) the gfc generation (TPU7x / 6acc60406) drops TAC and folds its address/DMA-issue role into TEC, replacing the three-way SC-internal sync with tile_wait_scs_smem; (2) both Viperfish and Ghostlite carry the dual-channel sync family (Add/SetBothSyncFlag, Add/SetOtherSyncFlag) and the 12 yieldable-sync ops (SparseCoreScalarMisc_YieldableSync{Done,NotDone,Equal,NotEqual,Greater,Less,…}), both of which gfc drops; (3) the remote-SFLAG-address bit-encoding is per-gen, dispatched through a RemoteSyncFlagEncoderRegistry keyed on TpuVersion.

NOTE — CoreKind::kTac is still a registered enum value on TPU7x for binary compatibility, but the gfc emitter never produces it — confirmed by zero gfc::isa::SparseCoreTac* symbols (against the full TAC ISA surface in both vfc::isa::SparseCoreTac* and glc::isa::SparseCoreTac*). A reimplementer targeting TPU7x must route the gather-issue and access→execute sync through TEC + tile_wait_scs_smem, not through a TAC path.


Bandwidth and Back-Pressure at the Boundary

SC↔TC and SC↔HBM traffic share the on-chip HBM controller, arbitrated on a credit basis. The relevant primitives:

| Mechanism | Symbol / op | Role |
|---|---|---|
| Per-channel DMA credit | sc_tpu.set_dma_credit | TC has 12 DMA channels; SC has 8 on Viperfish / Ghostlite and 4 on TPU7x; credits prevent monopolizing the HBM bus |
| Throttle on sflag range | sc_tpu.set_dma_throttle_sflag_range | back-pressure the SC stream engine when HBM is saturated |
| Cost-model bandwidth share | backend_config_util::GetSparseCoreHbmBandwidthAdjustmentFactor | per-HLO-op fractional HBM-bandwidth budget |
| Concurrent-offload limit | FLAGS_xla_tpu_sparse_core_offload_queuing_overlap_limit | max SC-offload tasks in flight (per-gen tuned) |
| Channel policy | chip-config protos (*_chip_configs_legacy_sparse_core.binarypb) | random-access prefetch for SC, sequential-stream for TC |

The two engines use distinct HBM MMU page-prefetcher policies — random-access for SC (single-row gather), sequential-stream for TC (contiguous tile fetch) — over the same DRAM. This is why the kStream / kDma classification matters beyond the bundle slot: it also selects the HBM channel policy.
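
Credit-based arbitration in general can be modelled with a bounded counter: a transfer takes a credit when issued and returns it on completion, and an empty pool is the back-pressure signal. The class below is a toy model only. The `DmaCreditPool` name, the one-credit-per-transfer granularity, and the interface are assumptions; the actual semantics of sc_tpu.set_dma_credit were not recovered.

```cpp
#include <cassert>
#include <cstdint>

// Toy credit counter. Units and granularity are assumptions of this sketch.
class DmaCreditPool {
 public:
  explicit DmaCreditPool(std::uint32_t credits)
      : available_(credits), capacity_(credits) {}

  // Takes one credit if any is left. A false return means the issuer must stall.
  bool TryAcquire() {
    if (available_ == 0) return false;
    --available_;
    return true;
  }

  // Returns a credit when a transfer completes; never exceeds the initial capacity.
  void Release() {
    if (available_ < capacity_) ++available_;
  }

  std::uint32_t available() const { return available_; }

 private:
  std::uint32_t available_;
  std::uint32_t capacity_;
};
```

Giving each engine its own pool sized to its channel count is one way to keep a burst of gathers from starving the contiguous tile fetches that share the same controller.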


Error Handling at the Boundary

| Failure | Mechanism | Detail |
|---|---|---|
| MXU starvation (TC waits for SC) | tpu.sem_wait finite timeout | FLAGS_xla_tpu_debug_sc_sflag_wait_timeout_ms; on timeout TC issues a controlled halt + writes a diagnostic record naming the source HLO instruction |
| SC stall | SCS Halt opcode | controlled halt with a status code; no interrupt: the host polls the sequencer status register; the SDC checker keeps running |
| Sync-flag use-def error | SyncFlagVerifierPass (compile-time) | every wait must have a matching producer; every producer must count a threshold for ≥1 consumer |
| Wait-stat telemetry | SflagWaitInstrumentationPass | inserts wait-statistics counters at SC↔TC boundaries (xla_tpu_collect_sflag_wait_stats_filter) |
| SDC mismatch | TPU_MEMORY_RESERVATION_TYPE_SPARSE_CORE_SDC_CHECKER_REPORT_SYNC_FLAG | dedicated SFLAG range the SDC checker sets on a row-checksum mismatch |

The verifier can be opted out per-kernel via the sc.disable_before_use_sync_flag_verification / sc.disable_after_use_sync_flag_verification HLO attributes (for hand-written kernels). xla_sc_conservative_sflag_aliasing forces the verifier to disallow any SFLAG-range aliasing — a debugging mode for stuck pipelines.
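
The use-def rule stated for SyncFlagVerifierPass (every wait has a producer, every producer has at least one consumer) reduces to two set-membership checks. The sketch below is a simplified model: the `SyncFlagUses` struct, the integer flag ids, and the diagnostic wording are hypothetical, and the real pass additionally reasons about thresholds and aliasing, which this sketch ignores.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical flattened view of a program's sync-flag traffic.
struct SyncFlagUses {
  std::vector<int> producers;  // flag ids raised by some op
  std::vector<int> waits;      // flag ids some op waits on
};

// Returns one diagnostic per violation; an empty result means every wait has
// a producer and every producer has at least one waiter.
inline std::vector<std::string> VerifySyncFlags(const SyncFlagUses& uses) {
  const std::set<int> produced(uses.producers.begin(), uses.producers.end());
  const std::set<int> waited(uses.waits.begin(), uses.waits.end());
  std::vector<std::string> errors;
  for (int flag : waited) {
    if (produced.count(flag) == 0)
      errors.push_back("wait on flag " + std::to_string(flag) + " has no producer");
  }
  for (int flag : produced) {
    if (waited.count(flag) == 0)
      errors.push_back("producer of flag " + std::to_string(flag) + " has no consumer");
  }
  return errors;
}
```

A wait with no producer is the compile-time counterpart of the runtime sem_wait timeout in the table above: catching it here avoids a stuck pipeline later.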


Limits and Open Items

| Item | Status |
|---|---|
| 40 tpu_dma_*_sc_* ops; only 3 name vmem; general-form only | enumerated in decompile |
| GetTransferKind signature (Target&, MemorySpace, MemorySpace, 4×bool) | demangled symbol + kStream/kDma asserts |
| SFLAG 3-subspace model + region-scoping assertions | assertion strings read |
| TC-side Semaphore{Wait,Signal,Read}Op + FetchAndAddSyncOp | symbols present |
| SyncScsWithTec / SyncTecWithScs / AllocateSflag(.,false,2) double-buffer | function bodies located; SyncWait in body |
| MxuContractingSize 128/256; MxuSparseContractingSize = 0 | per-codename literals (see target descriptor) |
| DeviceAndCoreIds byte layout (routing key) | recovered as optional<>; field widths not decomposed |
| Per-gen SFLAG remote-address bit-encoding | encoders located; bit composition not decoded |
| EmbeddingsPassType enum value list | parameter of GetLogicalReplicaInfo; values not exposed as strings |
| TileTaskOp access-vs-execute region content rules | op present (MloModuleVerifier::Verify(TileTaskOp) 0x146aca00); exact region count and per-region legality predicate not cleanly decompiled |
| Exact EnqueueDMAOp::getPriority() value space | getter present; arbitration rules not traced |

Cross-References

  • SparseCore Architecture — the SC geometry, the four-tier memory model, and the SparseCoreTarget (Target+0x948) the contracting binding is read from.
  • Stream Gather/Scatter — the intra-SC indirect-DMA datapath (the IndirectStream slot, the per-element address formula) that produces the tile this page hands to the MXU.
  • MXU Slot — the TC MXU op family (vlatch / vmatprep / vmatmul / vmatres), the 128×128 systolic array, and the GainLatchMode doubling modes this page's contracting binding feeds.
  • MATPREP / IAR Latch Slot — the moving-operand staging slot the SC-produced VMEM tile enters through.
  • SparseCoreTarget (Target+0x948) — the byte-exact MxuContractingSize / MxuSparseContractingSize / doubling-predicate table this page consumes.
  • SC Backend Pipeline — the SC-MLO pass pipeline (and the HLO sparse_dense_matmul_* decomposers) that builds the two halves of the boundary.
  • SC Core Selection — how a computation is assigned to a physical SparseCore partition (the DeviceAndCoreIds routing).
  • getSequencerType — the SCS/TAC/TEC selection that picks the engine issuing each transfer.
  • TAC Engine — the Tile-Access sub-engine absent on TPU7x (6acc60406); the tile_wait_scs_smem replacement.
  • SparseCore vs Neuron MatmultSparse — cross-vendor comparison of the gather-then-matmul boundary.
  • SparseCore Overview — the navigational entry for Part IX.
  • Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
  • Index entry: Part IX — SparseCore & BarnaCore / SparseCore cross-cutting — back to index