ICI Handoff
ICI (Inter-Chip Interconnect) is the dedicated copper / optical fabric that connects TPU chips within a pod. ICI bringup is tpunetd's responsibility; Megascale's bootstrap rendezvous consumes the result of that bringup but never touches ICI itself. This page documents the explicit data-flow handoff and what each side reads / writes.
Three pieces of state that flow tpunetd → Megascale
- Chip coordinates. Each chip in the pod is assigned a coordinate tuple (x, y, z, …) that the ICI fabric uses for routing. Coordinates are set by tpunetd's SessionControl/SetChipCoordinates (or SetGlobalChipId for unique pod-wide identification). Workers read their assigned coordinates via SessionControl/GetChipCoordinates and incorporate them into the TpuTopologyArgsProto that goes into the Megascale GetMultiSliceTopologyRequest.tpu_topology_args field.
- Routing tables. tpunetd's SetRoutingTable writes per-link next-hop tables for every chip; the daemon programs them into the ICI link hardware. Megascale's bootstrap does not transmit routing tables; they are entirely a within-slice concern. Megascale only cares that ICI is up so the resulting TpuTopologyArgsProto is consistent across hosts of the same slice.
- Host network bindings. The HostNetworkAddress entries in the Megascale request carry the host's MEGASCALE_PORT-derived address plus an interface_name resolved over the host's DCN interface (the HostNetworkAddress message has exactly four fields: address, interface_name, numa_node, host_name_for_debugging). These are NOT ICI; they are the DCN-side bindings used for cross-slice traffic over normal TCP/gRPC. The ICI fabric serves the in-slice chip-to-chip traffic; DCN serves the slice-to-slice host-to-host traffic.
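As a rough illustration of the host network bindings item, the four HostNetworkAddress fields can be modelled as a plain struct. The field names come from the message described above; the C++ types, the "ip:port" address format, and the MakeDcnBinding helper are assumptions made for this sketch, not the generated protobuf class.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical plain-struct mirror of the HostNetworkAddress message.
// Field names follow the text above; the types are assumed.
struct HostNetworkAddress {
  std::string address;                  // DCN endpoint ("ip:port" format assumed)
  std::string interface_name;           // DCN interface the address resolves over
  std::int32_t numa_node = 0;           // NUMA node of that interface (type assumed)
  std::string host_name_for_debugging;  // human-readable label only
};

// Hypothetical helper: builds a DCN-side binding from an IP and a port.
inline HostNetworkAddress MakeDcnBinding(const std::string& ip,
                                         std::uint16_t port,
                                         const std::string& iface,
                                         std::int32_t numa_node,
                                         const std::string& debug_name) {
  HostNetworkAddress out;
  out.address = ip + ":" + std::to_string(port);
  out.interface_name = iface;
  out.numa_node = numa_node;
  out.host_name_for_debugging = debug_name;
  return out;
}
```

Nothing in this struct describes an ICI link, which matches the point above: the binding is purely a DCN-side record.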
Sequencing within a single host bringup
host bringup
│
▼
tpunetd_client::TpunetdClient::Init
│ ConnectToTpunetd over /var/google/services/tpunetd/user.socket
│
▼
tpunetd_client::TpunetdControl::StartSession
│ passes per-host DriverInterface ChipLocation vector
│
▼
tpunetd daemon:
│ CreateNetwork (allocate NetworkContextId for this slice)
│ ConfigureIci (program per-chip ICI link config)
│ EnableIciDataLink (turn on the high-speed links)
│ WaitForDataLinkUp (block until every link reaches L1 sync)
│ SetChipCoordinates / SetGlobalChipId (per-chip assignment)
│ SetRoutingTable (per-chip per-link next-hop tables)
│ SetGtcConfiguration / WaitForGtcReset (global time counter sync)
│
▼
tpunetd_client::SessionMaster::BroadcastBarrier
│ Notify + WaitForReady across all hosts of this slice
│ Ensures every host completed StartSession before any
│ host's runtime continues
│
▼
Local runtime: build TpuTopologyArgsProto from
- chip coordinates (GetChipCoordinates result)
- ICI dimensions (TPU_PROCESS_BOUNDS, TPU_HOST_BOUNDS)
- per-host stride (TPU_CHIPS_PER_HOST_BOUNDS)
- twist factors (TPU_TOPOLOGY_ALT, TPU_TOPOLOGY_WRAP)
│
▼
xla::megascale::runtime::CommunicationBackend::DiscoverTopologyAndAddressBindings(
slice_id, *that TpuTopologyArgsProto*, host_id, num_slices)
Only at the bottom of this sequence does Megascale's bootstrap
run. By then ICI has been brought up, the chip coordinates are
final, and TpuTopologyArgsProto faithfully describes the local
slice.
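When reading bringup logs, it can help to check that the daemon-side steps appear in the order the diagram lists them. The sketch below assumes the listed order is the execution order and that a log can be reduced to a list of step names; DocumentedOrder and FollowsDocumentedOrder are hypothetical helpers, not tpunetd APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Daemon-side steps in the order the diagram above lists them.
inline const std::vector<std::string>& DocumentedOrder() {
  static const std::vector<std::string> kOrder = {
      "CreateNetwork",       "ConfigureIci",       "EnableIciDataLink",
      "WaitForDataLinkUp",   "SetChipCoordinates", "SetRoutingTable",
      "SetGtcConfiguration", "WaitForGtcReset"};
  return kOrder;
}

// Returns true when the documented steps occur in `trace` as a subsequence,
// that is, in the documented relative order. Unrelated entries are skipped.
inline bool FollowsDocumentedOrder(const std::vector<std::string>& trace) {
  const std::vector<std::string>& order = DocumentedOrder();
  std::size_t next = 0;  // index of the next documented step we expect
  for (const std::string& step : trace) {
    if (next < order.size() && step == order[next]) ++next;
  }
  return next == order.size();
}
```

A trace that ends before WaitForGtcReset, or that programs routing tables before the links are configured, fails the check.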
How the coordinator uses the handoff
TopologyCoordinator::ProcessRequest (0x1cf524c0) compares
every inbound topology_args against the first one it saw for
the same slice_id:
MessageDifferencer diff;
diff.set_message_field_comparison(EQUIVALENT);
if (!diff.Compare(req.topology_args,
*cached_topology_args_for_slice[slice_id])) {
return MakeErrorImpl<3>( // INVALID_ARGUMENT — hard reject, not a log
"Received topology that differs from previously "
"registered topology at same sliceID. ...");
}
The comparison is a hard reject that returns a non-OK Status
(MakeErrorImpl<3> = INVALID_ARGUMENT), not a log-and-continue — see
Topology Exchange for the byte-level
derivation.
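The first-seen-wins behaviour described above can be sketched without protobuf. In this minimal model, TopologyArgs is a hypothetical stand-in for TpuTopologyArgsProto, plain equality replaces MessageDifferencer, and SliceTopologyCache is an invented name; only the status code 3 (INVALID_ARGUMENT) is taken from the text.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for TpuTopologyArgsProto; the real fields differ.
struct TopologyArgs {
  int x;
  int y;
  int z;
};

inline bool operator==(const TopologyArgs& a, const TopologyArgs& b) {
  return a.x == b.x && a.y == b.y && a.z == b.z;
}

const int kOk = 0;
const int kInvalidArgument = 3;  // the <3> in MakeErrorImpl<3>

class SliceTopologyCache {
 public:
  // The first request for a slice registers its topology. Every later
  // request for that slice must match it or is rejected outright.
  int Process(int slice_id, const TopologyArgs& args) {
    std::pair<std::map<int, TopologyArgs>::iterator, bool> result =
        cache_.insert(std::make_pair(slice_id, args));
    if (result.second) return kOk;                 // first sighting
    if (result.first->second == args) return kOk;  // consistent repeat
    return kInvalidArgument;                       // hard reject
  }

 private:
  std::map<int, TopologyArgs> cache_;
};
```

A rejected request leaves the cached topology untouched, so the first registration stays the reference for every later host of that slice.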
Where the comparison matters:
- Chip count per slice. Two hosts of the same slice reporting different TpuDimensionsProto (rows × cols × depth) indicates tpunetd assigned different chip coordinates on those hosts, typically a sign that tpunetd's WaitForDataLinkUp timed out and the daemon proceeded with a partial slice.
- Host bounds. If two hosts disagree on per-host chip count, the runtime cannot correctly schedule collectives that partition across hosts. This still surfaces through the same tpu_topology_args comparison above: the "Received topology that differs..." error string at rodata VA 0x9b27486 is returned as an INVALID_ARGUMENT Status.
- Twist factors / topology mode. When MEGASCALE_TOPOLOGY or FLAGS_megascale_force_use_dcn_topology_from_flags injects a synthesised topology, it must match the one tpunetd actually programmed.
Why ICI is silent here
The Megascale rendezvous never transports ICI link state.
The MultiSliceTopologyInfo message inside
GetMultiSliceTopologyResponse.serialized_topology_info carries:
- slice_info (repeated SliceInfo), where each SliceInfo has only slice_id and tpu_topology_args (chip layout): the agreed-upon layout, not the link state itself.
- address_mappings (repeated NetworkAddressMapping), one entry per (slice_id, host_id) pair; each carries addresses (repeated HostNetworkAddress, the DCN endpoints).
- incarnation_id.
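The three top-level fields listed above can be pictured with plain structs. This is only a sketch: the field names follow the text, while the C++ types, the slice_id and host_id members on NetworkAddressMapping, the string placeholders, and the BuildInfo helper are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirrors of the messages named in the text.
struct SliceInfo {
  std::int32_t slice_id;
  std::string tpu_topology_args;  // chip layout, kept opaque here
};

struct NetworkAddressMapping {
  std::int32_t slice_id;               // key fields assumed
  std::int32_t host_id;
  std::vector<std::string> addresses;  // DCN endpoints, simplified to strings
};

struct MultiSliceTopologyInfo {
  std::vector<SliceInfo> slice_info;
  std::vector<NetworkAddressMapping> address_mappings;
  std::uint64_t incarnation_id;  // type assumed
};

// Builds one SliceInfo per slice and one mapping per (slice_id, host_id).
inline MultiSliceTopologyInfo BuildInfo(int num_slices, int hosts_per_slice,
                                        std::uint64_t incarnation_id) {
  MultiSliceTopologyInfo info;
  info.incarnation_id = incarnation_id;
  for (int s = 0; s < num_slices; ++s) {
    SliceInfo slice;
    slice.slice_id = s;
    slice.tpu_topology_args = "<opaque chip layout>";
    info.slice_info.push_back(slice);
    for (int h = 0; h < hosts_per_slice; ++h) {
      NetworkAddressMapping m;
      m.slice_id = s;
      m.host_id = h;
      m.addresses.push_back("placeholder-" + std::to_string(s) + "-" +
                            std::to_string(h));
      info.address_mappings.push_back(m);
    }
  }
  return info;
}
```

None of the structs has a member for link tables, routing tables, or link health.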
No ICI link tables, no routing tables, no link health bits travel
through Megascale. ICI's health is monitored by tpunetd's
SessionMaster::CheckSessionHeartbeat (0x1ffa6180), which
detects link failures; the failing path runs through
SessionMaster::HandleFailingSession(SessionState), and the
SessionState enum's failure value is SESSION_STATE_FAILING
(see tpunetd Relationship). Such failures
propagate up to PJRT through the tpunetd path, NOT the Megascale path.
The architectural separation means:
- A single bad ICI link makes the slice fail; tpunetd detects it and surfaces it as a session failure.
- A bad DCN link between two slices makes the multi-slice job fail; Megascale's HeartBeat detects it and surfaces it as a heartbeat failure.
The two detection systems run independently — both can fire, neither blocks the other.
What about TpuTopologyArgsProto?
This proto is the only structured artefact that crosses the
tpunetd → Megascale boundary. It is defined in
platforms/deepsea/software/superpod/routing/common/proto/topology.proto
(CONFIRMED — that path is a rodata string). The reflection
metadata in the binary exposes its scalar fields variant and
chip_config_name plus a nested SubSlice message
(tpu.TpuTopologyArgsProto.variant,
tpu.TpuTopologyArgsProto.chip_config_name,
tpu.TpuTopologyArgsProto.SubSlice are all CONFIRMED rodata
descriptor strings).
INFERRED — the structural geometry (chip count per ICI axis,
per-host bounds, process bounds, and twisted-torus twist / wrap /
alt parameters) is populated from the tpunetd-programmed topology
and the TPU_PROCESS_BOUNDS, TPU_HOST_BOUNDS,
TPU_CHIPS_PER_HOST_BOUNDS, TPU_TOPOLOGY_ALT,
TPU_TOPOLOGY_WRAP env vars (all CONFIRMED as binary strings).
These drive the chip layout the compiler uses to route collectives
correctly.
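To make the role of the bounds variables concrete, the sketch below parses two of them and derives a chip count. It assumes the values are comma-separated positive integers such as "2,2,1" and that the per-axis chip count is the host bound times the chips-per-host bound; neither assumption is confirmed by the binary. ParseBounds and TotalChips are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parses a comma-separated list such as "2,2,1". Returns an empty vector
// if a token is empty, contains a non-digit, or is zero. A trailing comma
// is tolerated, and overflow is not checked in this sketch.
inline std::vector<int> ParseBounds(const std::string& text) {
  std::vector<int> out;
  std::stringstream ss(text);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (token.empty()) return std::vector<int>();
    int value = 0;
    for (std::size_t i = 0; i < token.size(); ++i) {
      if (token[i] < '0' || token[i] > '9') return std::vector<int>();
      value = value * 10 + (token[i] - '0');
    }
    if (value == 0) return std::vector<int>();
    out.push_back(value);
  }
  return out;
}

// Multiplies host bounds by chips-per-host bounds on every axis (assumed
// relationship). Returns 0 when the inputs are empty or differ in rank.
inline long TotalChips(const std::vector<int>& host_bounds,
                       const std::vector<int>& chips_per_host) {
  if (host_bounds.empty() || host_bounds.size() != chips_per_host.size()) {
    return 0;
  }
  long total = 1;
  for (std::size_t i = 0; i < host_bounds.size(); ++i) {
    total *= static_cast<long>(host_bounds[i]) * chips_per_host[i];
  }
  return total;
}
```

In practice the strings would come from std::getenv("TPU_HOST_BOUNDS") and std::getenv("TPU_CHIPS_PER_HOST_BOUNDS"), with a null check before parsing.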
When --megascale_force_use_dcn_topology_from_flags=true is set,
the runtime constructs TpuTopologyArgsProto entirely from the
MEGASCALE_TOPOLOGY env-var override, bypassing tpunetd's
contribution. This is a testing / simulation path; production
deployments leave it off and let tpunetd drive the proto.
When the handoff goes wrong
Three concrete scenarios:
- tpunetd partially failed WaitForDataLinkUp. The daemon continued with a degraded TpuTopologyArgsProto reflecting the smaller working chip set. The Megascale coordinator sees consistent (but wrong) topology across all hosts of the slice. Bootstrap succeeds but downstream compiles will fail to schedule the missing chips.
- Two hosts of the same slice raced and got different chip coordinates. Megascale's MessageDifferencer catches the mismatch and ProcessRequest returns the "Received topology that differs..." MakeErrorImpl<3> (INVALID_ARGUMENT) Status. The second host's GetMultiSliceTopology RPC fails with that error; the rendezvous does not converge for that slice, so bootstrap returns a non-OK Status to the runtime rather than silently running with a mismatched chip set.
- A host restarted after the slice was already in steady state and reconnected with a different incarnation_id. TopologyCoordinator::ProcessRequest (0x1cf524c0) returns the "Received incarnation ID that is different from previous incarnation ID..." MakeErrorImpl<3> (INVALID_ARGUMENT) Status (rodata VA 0x9c14456). Separately, the inlined LogUniqueIds helper (folded into Communicator::Create at 0x1cca9aa0) tracks the three last_ids slots under unique_id_mutex and logs when any slot's id changes. tpunetd's SessionMaster independently detects the restart via heartbeat and may transition the session to SESSION_STATE_FAILING. When both paths fire, the operator sees the Megascale error and the tpunetd session failure and can correlate them.
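For triage, the two Megascale-side rejections in these scenarios can be told apart by their message text. The sketch below matches on the two error-string prefixes quoted in this page; the HandoffFailure enum and the Classify function are hypothetical, and the integer status code is assumed to follow the usual convention where 0 means OK.

```cpp
#include <cassert>
#include <string>

// Hypothetical triage categories for a bootstrap Status.
enum class HandoffFailure {
  kNone,                  // bootstrap returned OK
  kTopologyMismatch,      // hosts of one slice disagree on chip layout
  kIncarnationMismatch,   // a host came back with a new incarnation_id
  kOther                  // non-OK, but not one of the two strings above
};

// Classifies a (status code, message) pair by the error-string prefixes
// quoted in this page.
inline HandoffFailure Classify(int status_code, const std::string& message) {
  if (status_code == 0) return HandoffFailure::kNone;
  if (message.find("Received topology that differs") != std::string::npos) {
    return HandoffFailure::kTopologyMismatch;
  }
  if (message.find("Received incarnation ID that is different") !=
      std::string::npos) {
    return HandoffFailure::kIncarnationMismatch;
  }
  return HandoffFailure::kOther;
}
```

A kNone result does not rule out the first scenario: a partial WaitForDataLinkUp failure yields a consistent topology, so bootstrap returns OK and only the tpunetd side or later compiles reveal the problem.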
The combined diagnostic surface is therefore: tpunetd answers "is ICI healthy on this host?", Megascale answers "do all hosts of all slices agree on the chip layout?", and the ErrorReporter / RapidEye chain (see Error Aggregator) answers "what was the worst-case combination of failures across the fleet?".
Cross-References
- tpunetd Relationship — the per-slice daemon that owns ICI bring-up before Megascale starts
- Bootstrap › Overview — where the ICI→DCN handoff sits in the lifecycle
- Failure Handling — the Megascale-side failure path that runs alongside the tpunetd one
- Error Aggregator — the fleet-wide root-cause chain referenced above
- ICI › Overview — the in-slice optical torus tier this page hands off from