tpunetd Relationship
tpunetd is the per-host TPU control daemon. Its responsibilities
overlap with Megascale's in name only: tpunetd handles
within-slice chip and fabric setup, while Megascale's
MegaScaleTransport handles across-slice address-table
exchange. The two run sequentially and on different transports.
This page describes the data flow between them; for tpunetd's own internals see tpunetd Protocol.
Layering
+----------------------------------------------------------------+
| XLA Megascale Runtime |
| (xla::megascale::runtime::CommunicationBackend) |
| |
| DiscoverTopologyAndAddressBindings(slice_id, args, |
| host_id, num_slices) |
| │ |
| ▼ |
| GetMultiSliceTopology gRPC over MEGASCALE_PORT |
| Coordinator (one process per job) |
+──────────────────────────────────────────────────────────────────+
▲ reads tpu_topology_args (TpuTopologyArgsProto) and the
│ address_mapping (NetworkAddressMapping) derived from the
│ chip coordinates that tpunetd published earlier
│
+──────────────────────────────────────────────────────────────────+
| tpunetd_client (superpod::tpunetd_client::TpunetdClient) |
| |
| TpunetdClient::Init |
| ConnectToTpunetd → Unix socket /var/google/services/... |
| SessionMaster::Create → spawns per-peer SessionWorker |
| stubs over TCP within this slice |
| |
| Issued RPCs (subset libtpu uses): |
| StartSession / StopSession / StatSession |
| CheckSessionHealth / GetChipCoordinates |
| GrantSessionPermission / GetCoreDump |
+──────────────────────────────────────────────────────────────────+
▼ gRPC over Unix socket
+──────────────────────────────────────────────────────────────────+
| tpunetd daemon (out-of-process, owns the chips) |
| |
| ICI fabric setup: CreateNetwork, UpdateTopology, |
| ConfigureIci, EnableIciDataLink, WaitForDataLinkUp |
| Routing table install: SetRoutingTable |
| Global time counter: SetGtcConfiguration, WaitForGtcReset |
| Chip coordinate assignment: SetChipCoordinates, |
| SetGlobalChipId |
+──────────────────────────────────────────────────────────────────+
Sequence: full Megascale job bringup
The dependency runs strictly tpunetd → MegaScaleTransport. The
runtime enforces the order by sequencing the calls inside the
plugin's do_init:
1. tpunetd_client connect. TpunetdClient::Create(topology, TpuType, Options) followed by TpunetdClient::Init(opts, retry_count) opens the Unix socket to the local tpunetd daemon.
2. Session bringup. Either SessionControl/StartSession (when --bypass_vbar_control_service=true) or VBARControl/StartSession (default) registers this host's chips. The daemon assigns chip coordinates, programs ICI routing tables, configures the global time counter, brings up ICI data links, and signals back when the session is ready.
3. In-slice rendezvous (tpunetd_client::BroadcastBarrier). The tpunetd_client::BroadcastBarrier class (BroadcastNotification / BroadcastWaitForReady, gated by SyncWithTimeout) drives the superpod.tpunetd_client.proto.TpuNetworkSessionBarrier/Notify and .../WaitForReady RPCs across every host in the slice. This is a peer-to-peer barrier; tpunetd itself is not involved.
4. TpuTopologyArgsProto extraction. Once the in-slice rendezvous is complete, the runtime queries SessionControl/GetChipCoordinates to fetch the assigned coordinates, builds the per-slice tpu::TpuTopologyArgsProto (chip dimensions, host bounds, per-process bounds, twist factors), and stashes it.
5. MegaScaleTransport bringup. CommunicationBackend::Create constructs the backend; transport_factory() produces a GrpcTransport; GrpcTransport::Init binds the MEGASCALE_PORT server. The coordinator process additionally instantiates TopologyCoordinator.
6. Cross-slice rendezvous. DiscoverTopologyAndAddressBindings(local_slice_id, args /* from step 4 */, local_host_id, num_slices) runs. This is the entry point documented elsewhere in this section.
7. Steady state. After Megascale's rendezvous completes, StartHeartBeat() runs on MegaScaleTransport. The runtime becomes ready to handle PJRT executable launches; tpunetd's SessionMaster::CheckSessionHeartbeat runs in parallel on the tpunetd transport.
There is no path where Megascale starts before tpunetd. Setting
enable_megascale_topology=false bypasses steps 5 and 6 entirely
(single-slice mode); tpunetd still runs. The reverse case,
Megascale without tpunetd, does not happen in the binary.
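The ordering constraint above can be modeled in a few lines. This is a minimal sketch, not the runtime's code: the Step enum, bringup_plan, and check_order are hypothetical names introduced for illustration, and the only behavior taken from this page is the step order and the effect of enable_megascale_topology=false.

```python
from __future__ import annotations

from enum import IntEnum


class Step(IntEnum):
    """Bringup steps, numbered as in the sequence above (names are illustrative)."""
    TPUNETD_CONNECT = 1
    SESSION_BRINGUP = 2
    IN_SLICE_RENDEZVOUS = 3
    TOPOLOGY_ARGS_EXTRACTION = 4
    MEGASCALE_TRANSPORT_BRINGUP = 5
    CROSS_SLICE_RENDEZVOUS = 6
    STEADY_STATE = 7


# Steps that are skipped in single-slice mode (enable_megascale_topology=false).
MEGASCALE_STEPS = {Step.MEGASCALE_TRANSPORT_BRINGUP, Step.CROSS_SLICE_RENDEZVOUS}


def bringup_plan(enable_megascale_topology: bool) -> list[Step]:
    """Return the ordered steps that run for the given flag value."""
    return [
        step for step in Step
        if enable_megascale_topology or step not in MEGASCALE_STEPS
    ]


def check_order(executed: list[Step]) -> None:
    """Raise if a Megascale step ran before the tpunetd-side steps finished."""
    seen = set()
    for step in executed:
        if step in MEGASCALE_STEPS and Step.TOPOLOGY_ARGS_EXTRACTION not in seen:
            raise RuntimeError(f"{step.name} ran before tpunetd bringup completed")
        seen.add(step)
```

In this model, bringup_plan(False) still contains steps 1 through 4, which mirrors the statement that tpunetd runs even in single-slice mode.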
Data flow through the boundary
The Megascale request (GetMultiSliceTopologyRequest, whose proto
descriptor lives at file offset 0xbf81634) carries the following
state, two pieces of which are derived from tpunetd:
| Field of GetMultiSliceTopologyRequest | Derived from tpunetd RPC |
|---|---|
| tpu_topology_args (field 2, tpu.TpuTopologyArgsProto) | SessionControl/GetChipCoordinates plus the local TpuTopologyArgs construction in the runtime (uses chip layout fields populated by SetChipCoordinates). |
| address_mapping (field 1, xla.megascale.runtime.NetworkAddressMapping): carries slice_id, host_id, and repeated HostNetworkAddress entries (each a single address string plus interface_name, numa_node, host_name_for_debugging). | Not directly from tpunetd: the address value comes from local resolution governed by the megascale_port / megascale_port_name flags, but the binding depends on tpunetd having assigned the slice_id / host_id chip coordinates that determine which network the host uses. |
| incarnation_id (field 3, int) | Process-local; independent of tpunetd. |
The coordinator validates the tpu_topology_args field by running
proto2::util::MessageDifferencer::Compare over every inbound
request (see Topology Exchange). The
implication: two hosts in the same slice that completed tpunetd
bringup with different chip coordinates will be detected at this
boundary — even though tpunetd's own StartSession would have
been consistent. This is one of the failure modes that the drift
warnings at rodata 0x9b27486 are designed to surface.
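The consistency check can be illustrated with plain dataclasses. This is a sketch under stated assumptions: TopologyArgs and TopologyRequest are hypothetical stand-ins for the real protos with made-up field names, dataclass equality stands in for proto2::util::MessageDifferencer::Compare, and comparing each request against the first one seen for its slice is an assumption about how the comparison is grouped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyArgs:
    """Stand-in for tpu.TpuTopologyArgsProto; field names are illustrative."""
    chip_dims: tuple
    host_bounds: tuple


@dataclass(frozen=True)
class TopologyRequest:
    """Stand-in for GetMultiSliceTopologyRequest."""
    slice_id: int
    host_id: int
    topology_args: TopologyArgs
    incarnation_id: int = 0


def find_topology_drift(requests):
    """Return (slice_id, host_id) pairs whose topology_args differ from the
    first request seen for the same slice."""
    reference = {}  # slice_id -> first TopologyArgs seen for that slice
    drifted = []
    for req in requests:
        expected = reference.setdefault(req.slice_id, req.topology_args)
        if req.topology_args != expected:  # equality plays the role of Compare
            drifted.append((req.slice_id, req.host_id))
    return drifted
```

A host that reports different arguments than its slice peers shows up in the returned list, which is the situation the drift warnings are meant to surface.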
What tpunetd does NOT do
- No cross-slice address exchange. tpunetd is per-host; its peer fanout (SessionMaster::ExecuteOnAllWorkers) is bounded to peers within the same slice. Multi-slice address tables are Megascale's domain.
- No cross-slice barrier. BroadcastBarrier is per-slice; multi-slice synchronization uses MegaScaleTransport.Barrier.
- No PJRT integration. tpunetd is invisible to the JAX/TF client; PJRT talks to libtpu, libtpu talks to tpunetd. The client never sees a tpunetd.* proto.
- No support for non-Borg / non-Cloud paths. The kTpunetdSupportedTpuVersions whitelist limits tpunetd to certain TPU generations; older single-host setups fall back to the in-process SliceBuilder family, which has its own per-host rendezvous (different from both tpunetd and Megascale).
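The per-slice scope of the barrier can be pictured with a toy model. This sketch uses Python's threading.Barrier as a stand-in for the RPC-based BroadcastBarrier; run_slice_barriers and its parameters are hypothetical, and threads stand in for hosts. Each slice gets its own barrier object, so hosts only ever wait on peers in the same slice.

```python
import threading


def run_slice_barriers(hosts_per_slice, timeout_s=5.0):
    """Run one independent barrier per slice and return the hosts that passed.

    hosts_per_slice maps a slice id to the number of hosts in that slice.
    """
    passed = []
    lock = threading.Lock()

    def host(slice_id, host_id, barrier):
        # Blocks only on peers that share this barrier, i.e. the same slice.
        # A timeout raises BrokenBarrierError and the host never reports in.
        barrier.wait(timeout=timeout_s)
        with lock:
            passed.append((slice_id, host_id))

    threads = []
    for slice_id, num_hosts in hosts_per_slice.items():
        barrier = threading.Barrier(num_hosts)  # one barrier per slice
        for host_id in range(num_hosts):
            t = threading.Thread(target=host, args=(slice_id, host_id, barrier))
            threads.append(t)
            t.start()
    for t in threads:
        t.join()
    return sorted(passed)
```

Nothing in this model synchronizes one slice with another; that role belongs to a separate cross-slice mechanism, which on this page is MegaScaleTransport.Barrier.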
Operational consequences
- Operator-visible startup log on a multi-slice TPU pod:
  - tpunetd_client side: "Running in Cloud, using TpunetdClient" or "Creating tpunetdclient for worker ...".
  - tpunetd_client SessionMaster: peer connection logs.
  - Per-slice barrier completes (no specific log; the SessionMaster proceeds silently).
  - CommunicationBackend ctor side: per-process FLAGS log line.
  - On the coordinator only: "Megascale Topology Coordinator started for <N>".
  - Per-process: periodic "MegaScale Topology Discovery in progress. Missing hosts..." until completion.
  - Once complete: "MegaScale Topology Discovery completed."
- Operator-visible failure on a stuck slice: the per-slice tpunetd SessionMaster::CheckSessionHeartbeat is the first to notice a stalled host; it emits "Session is failing due to the following chips having zero as chip id". If the slice never finishes tpunetd bringup, Megascale's per-host requests never arrive, and the coordinator reports the missing slice in its periodic status log.
The two systems are therefore complementary diagnostics: tpunetd sees per-chip failures; Megascale sees per-slice / per-host absences.
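As a rough triage aid, the log strings quoted above can be matched against a captured startup log. This is a sketch, not a supported tool: classify, triage, and the returned labels are hypothetical, and the only inputs taken from this page are the quoted log substrings.

```python
# Log substrings quoted on this page, mapped to the subsystem that emits them.
LOG_MARKERS = [
    ("Running in Cloud, using TpunetdClient", "tpunetd"),
    ("Creating tpunetdclient for worker", "tpunetd"),
    ("Session is failing due to the following chips having zero as chip id", "tpunetd"),
    ("Megascale Topology Coordinator started for", "megascale"),
    ("MegaScale Topology Discovery in progress", "megascale"),
    ("MegaScale Topology Discovery completed", "megascale"),
]


def classify(line):
    """Return 'tpunetd', 'megascale', or None for a single log line."""
    for marker, subsystem in LOG_MARKERS:
        if marker in line:
            return subsystem
    return None


def triage(lines):
    """Guess where bringup stands from which known markers appear."""
    text = "\n".join(lines)
    if "chips having zero as chip id" in text:
        return "tpunetd: per-chip failure"
    if "MegaScale Topology Discovery completed" in text:
        return "bringup complete"
    if "MegaScale Topology Discovery in progress" in text:
        return "megascale: waiting on missing hosts"
    if any(classify(line) == "tpunetd" for line in lines):
        return "tpunetd: bringup in progress or stalled"
    return "unknown"
```

The checks run from most to least specific, so a per-chip failure message takes precedence over any Megascale progress lines in the same capture.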
Cross-References
- tpunetd Protocol: the daemon's wire surface this page relates Megascale to
- Bootstrap › Overview: where the tpunetd dependency sits in the lifecycle
- ICI Handoff: the ICI-up → DCN-rendezvous boundary between the two systems
- Fleet Metadata › Slice Shape: the slice geometry tpunetd's GetChipCoordinates feeds into