Worker Registration
Every Megascale worker — including the coordinator's own
host-process — registers itself with the coordinator by issuing a
single GetMultiSliceTopology gRPC. The request carries a single
NetworkAddressMapping (the worker's (slice_id, host_id)
identity plus its locally bound network endpoints), the per-slice
TpuTopologyArgsProto, and an incarnation_id that lets the
coordinator detect silent restarts.
The request
Wire path:
/xla.megascale.runtime.MegaScaleTransport/GetMultiSliceTopology
(string at rodata 0x84b9e72).
Message: xla.megascale.runtime.GetMultiSliceTopologyRequest,
package xla.megascale.runtime, syntax proto3. The serialized
FileDescriptorProto lives in descriptor_table_protodef_0ymGVhRfurb
(symbol at 0xbf814f0); the request's descriptor text begins at VA
0xbf81634.
CONFIRMED — field numbers, labels and types decoded byte-for-byte from the descriptor blob at
0xbf81634: address_mapping = 18 01 20 01 28 0b, tpu_topology_args = 18 02 20 01 28 0b, incarnation_id = 18 03 20 01 28 03 (type 3 = INT64).
| Field # | Name | Label | Type | Meaning |
|---|---|---|---|---|
| 1 | address_mapping | optional | xla.megascale.runtime.NetworkAddressMapping | This process's (slice_id, host_id) plus the MegaScaleTransport bind point(s). |
| 2 | tpu_topology_args | optional | tpu.TpuTopologyArgsProto | The per-slice topology proto (chip layout, ICI dimensions). Must agree across all hosts of the same slice. |
| 3 | incarnation_id | optional | int64 | Per-process id. Allows the coordinator to detect process restarts within the same (slice_id, host_id) slot. |
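The byte triples quoted above are the number, label and type entries of a serialized FieldDescriptorProto (tags 0x18, 0x20 and 0x28 are fields 3, 4 and 5 with varint wire type). As a minimal sketch of how such a fragment can be decoded by hand, the helper below reads tag/value varint pairs. FieldMeta, ReadVarint and DecodeFieldMeta are hypothetical names introduced here for illustration; they are not part of the binary or of the protobuf library.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical holder for the three scalars quoted in the CONFIRMED note.
struct FieldMeta {
  std::uint64_t number = 0;  // FieldDescriptorProto.number (field 3, tag 0x18)
  std::uint64_t label = 0;   // FieldDescriptorProto.label  (field 4, tag 0x20)
  std::uint64_t type = 0;    // FieldDescriptorProto.type   (field 5, tag 0x28)
};

// Reads one base-128 varint starting at pos and advances pos past the
// bytes it consumed. Returns nullopt on a truncated or over-long varint.
inline std::optional<std::uint64_t> ReadVarint(
    const std::vector<std::uint8_t>& buf, std::size_t& pos) {
  std::uint64_t value = 0;
  for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
    const std::uint8_t byte = buf[pos++];
    value |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) return value;
  }
  return std::nullopt;
}

// Decodes a run of varint-typed fields. Any other wire type is rejected,
// because this sketch only targets the short fragments shown above.
inline std::optional<FieldMeta> DecodeFieldMeta(
    const std::vector<std::uint8_t>& buf) {
  FieldMeta meta;
  std::size_t pos = 0;
  while (pos < buf.size()) {
    const auto tag = ReadVarint(buf, pos);
    if (!tag || (*tag & 0x7) != 0) return std::nullopt;  // wire type 0 only
    const auto value = ReadVarint(buf, pos);
    if (!value) return std::nullopt;
    switch (*tag >> 3) {
      case 3: meta.number = *value; break;
      case 4: meta.label = *value; break;
      case 5: meta.type = *value; break;
      default: break;  // other varint fields are ignored
    }
  }
  return meta;
}
```

Feeding it 18 03 20 01 28 03 yields number 3, label 1 (optional) and type 3 (INT64), which is the incarnation_id row of the table.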
NetworkAddressMapping (defined in addresses.proto, descriptor at
0xc1795bd) carries (int32 slice_id [1], int32 host_id [2], repeated HostNetworkAddress addresses [3]). HostNetworkAddress
(descriptor at 0xc17954a) carries (string address [1], string interface_name [2], string host_name_for_debugging [3], int32 numa_node [4]) — there is no port or transport_type field on the
wire; the address is the full endpoint string.
The request is constructed inside
CommunicationBackend::DiscoverTopologyAndAddressBindings(int local_slice_id, TpuTopologyArgsProto args, int local_host_id, int num_slices) (decompiled at 0x1ccacb80), where it is built
field by field:
GetMultiSliceTopologyRequest req(/*arena=*/nullptr);
// 1. address_mapping (field 1) — a NetworkAddressMapping default-
// constructed on the request's arena. Its scalar int32 fields
// are stored at offsets +40 / +44 of the sub-message, and the
// addresses[] repeated field is appended into via
// proto2::Arena::DefaultConstruct<HostNetworkAddress>.
NetworkAddressMapping* am = req.mutable_address_mapping();
am->set_slice_id(/*+40*/ ...);
am->set_host_id(/*+44*/ ...);
// addresses[] is populated from the locally bound
// HostNetworkAddress list (built into NetworkAddressMapping v85
// at function entry, then copied/swapped element-wise).
// 2. tpu_topology_args (field 2) — CopyFrom / InternalSwap of args.
*req.mutable_tpu_topology_args() = args;
The request is then logged with VLOG(3) << "Sending topology discovery request: " << req (line 1046) and dispatched through the
transport's virtual send at *(transport->vtable + 40).
INFERRED — the field decompiles as a nested
NetworkAddressMapping write (offsets +40/+44) rather than direct set_slice_id / set_host_id calls on the request, because slice_id / host_id live on the address_mapping sub-message, not on GetMultiSliceTopologyRequest itself. 0x1ccad780 is PopulateMockMultiSliceTopologyInfo (the fake-discovery path), not a NetworkAddressMapping constructor.
The response
xla.megascale.runtime.GetMultiSliceTopologyResponse carries a
single field — the assembled cluster description serialized as bytes:
| Field # | Name | Label | Type |
|---|---|---|---|
| 1 | serialized_topology_info | optional | bytes |
CONFIRMED — descriptor at
0xbf8179a: serialized_topology_info = 18 01 20 01 28 0c (type 12 = BYTES) followed by field options 42 02 08 01. There is no shared_seed, endpoint_addresses, or top-level MultiSliceTopologyInfo / ...AndLocationProto field on the response message.
The bytes payload is a serialized MultiSliceTopologyInfo (the
caller parses it via MessageLite::MergeFromString, seen in
DiscoverTopologyAndAddressBindings at line 1064 logging "Received
topology discovery response: "). MultiSliceTopologyInfo (descriptor
at 0xbf816ea) has:
- repeated SliceInfo slice_info (field 1)
- repeated NetworkAddressMapping address_mappings (field 2)
- int64 incarnation_id (field 3)
Server-side: OnTopologyRequestReceived
The gRPC server stub
WithCallbackMethod_GetMultiSliceTopology (typeinfo name _ZTS...
at 0xb46ad48; the typeinfo object itself is at 0x21c46e50)
dispatches every inbound request to
CommunicationBackend::OnTopologyRequestReceived(req, reply_cb)
(decompiled at 0x1ccac380, source communication_backend.cc).
The handler is short and just forwards to the TopologyCoordinator
through its vtable:
void OnTopologyRequestReceived(
    GetMultiSliceTopologyRequest const& req,
    AnyInvocable<void(StatusOr<GetMultiSliceTopologyResponse> const&)> cb) {
  VLOG(3) << "Received topology request: " << req;        // line 937
  TracedMutexLock lock(&this->mu_);                       // backend.+0xe0 (224)
  TopologyCoordinator* tc = this->topology_coordinator_;  // +0x1a0 (416)
  if (tc == nullptr) {
    // status #1: INTERNAL "TopologyCoordinator not initialized." (line 1514)
    Status err = InternalError("TopologyCoordinator not initialized.");
    lock.Release();
    LOG(INFO) << "Received toplogy request message with no topology "
                 "coordinator. This occurs if CreateCommunicator has "
                 "not been called.";  // line 941
    // status #2 actually delivered to the caller:
    cb(MakeError<UNAVAILABLE/*14*/>(
        "Topology Coordinator is not ready. Try later.", line 944));
    return;
  }
  // Forward to the coordinator via its virtual at vtable[+0x10],
  // moving the callback in (the request's registration and callback
  // bookkeeping happen inside the coordinator).
  (*tc->vtable[+0x10])(tc, req, std::move(cb));
}
CONFIRMED — the null path constructs two statuses:
MakeErrorImpl<13> (INTERNAL) "TopologyCoordinator not initialized." at line 0x5ea = 1514 (call 0x1ccac441), and the status actually handed to the callback is MakeErrorImpl<14> (UNAVAILABLE) "Topology Coordinator is not ready. Try later." at line 0x3b0 = 944 (call 0x1ccac49e, invoked via *(cb+0x18) at 0x1ccac4ae). The INFO log between them is at line 0x3ad = 941. All strings live in communication_backend.cc rodata (0x878fb6f).
The 0x1ccac380..0x1ccac553 disassembly shows:
- 0x1ccac3a9 (lea 0xe0(%r15)) takes the address of the TracedMutex at backend.+0xe0, and 0x1ccac3b4 constructs TracedMutexLock.
- 0x1ccac3b9 loads topology_coordinator_ from backend.+0x1a0.
- 0x1ccac3c0 / 0x1ccac3c3 (test / je 1ccac426) branch to the "not initialized" path when topology_coordinator_ is null.
- 0x1ccac3dd (call *0x10(%rbx)) moves the inbound AnyInvocable callback out; 0x1ccac3e5 stashes its manager into stack slot -0x70.
- 0x1ccac409 (call *0x10(%rax)) forwards to the coordinator's virtual at (*tc->vtable)[+0x10].
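The control flow of the handler can be modelled in a few lines. The sketch below is an illustration under stated assumptions, not the decompiled code: Backend, Coordinator, AcceptAll, ReplyCallback and the reduced Status type are hypothetical stand-ins, std::mutex replaces TracedMutex, and std::function replaces AnyInvocable. It reproduces only the two behaviours described above: a null coordinator yields UNAVAILABLE through the callback after the lock is released, and a non-null coordinator receives the moved callback.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>

enum class Code { kOk = 0, kUnavailable = 14 };

struct Status {
  Code code = Code::kOk;
  std::string message;
};

using ReplyCallback = std::function<void(const Status&)>;

// Hypothetical stand-in for the coordinator's virtual entry point.
struct Coordinator {
  virtual ~Coordinator() = default;
  virtual void Process(int slice_id, ReplyCallback cb) = 0;
};

// Trivial coordinator used only to exercise the forwarding path.
struct AcceptAll : Coordinator {
  void Process(int /*slice_id*/, ReplyCallback cb) override { cb(Status{}); }
};

class Backend {
 public:
  void set_coordinator(Coordinator* tc) {
    std::lock_guard<std::mutex> lock(mu_);
    coordinator_ = tc;
  }

  void OnRequest(int slice_id, ReplyCallback cb) {
    std::unique_lock<std::mutex> lock(mu_);
    Coordinator* tc = coordinator_;
    if (tc == nullptr) {
      lock.unlock();  // mirror the early release before replying
      cb(Status{Code::kUnavailable,
                "Topology Coordinator is not ready. Try later."});
      return;
    }
    // The lock is still held here, as in the pseudocode above.
    tc->Process(slice_id, std::move(cb));
  }

 private:
  std::mutex mu_;
  Coordinator* coordinator_ = nullptr;  // models topology_coordinator_
};
```

A caller that sees kUnavailable from this path has simply arrived before the coordinator was installed, so the same request can be sent again later.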
OnBarrierRequestReceived at 0x1ccac5c0 follows the same
pattern with two differences:
- The lookup goes through the flat_hash_map<string, unique_ptr<BarrierCoordinator>> at backend.+0x1b0 (432, set up as add $0x1b0,%r15 at 0x1ccac5fc before operator[] at 0x1ccac60e), keyed by req.barrier_id (the request's barrier_id string at req+0x18). For a missing entry, operator[] inserts the slot and a fresh BarrierCoordinator(barrier_id, req.num_participants) is constructed on the fly (ctor at 0x1ccb3fa0); the count is the request's own num_participants field (req+0x28), not a backend num_workers_ member.
- The reply type is BarrierResponse. The request's (slice_id, host_id, barrier_id) are read from req+0x20 / req+0x24 / req+0x18 (used for the VLOG(3) at line 954, "Received barrier request from (SliceId, HostId, barrier_id) ").
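The lookup-or-create step on the barrier map follows a common pattern for maps of owning pointers. The sketch below is hypothetical: BarrierRegistry and the reduced BarrierCoordinator are illustrative types, and std::unordered_map stands in for flat_hash_map. In this sketch the first request for a given barrier_id fixes the participant count; whether the real coordinator behaves the same way for later requests is an assumption.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Reduced stand-in: only the constructor arguments named in the text.
struct BarrierCoordinator {
  BarrierCoordinator(std::string barrier_id, int num_participants)
      : id(std::move(barrier_id)), expected(num_participants) {}
  std::string id;
  int expected;
};

class BarrierRegistry {
 public:
  // operator[] inserts a null unique_ptr for an unseen key; the
  // coordinator is then built from the request's own participant count.
  BarrierCoordinator& GetOrCreate(const std::string& barrier_id,
                                  int num_participants) {
    std::unique_ptr<BarrierCoordinator>& slot = barriers_[barrier_id];
    if (slot == nullptr) {
      slot = std::make_unique<BarrierCoordinator>(barrier_id,
                                                  num_participants);
    }
    return *slot;
  }

  std::size_t size() const { return barriers_.size(); }

 private:
  std::unordered_map<std::string, std::unique_ptr<BarrierCoordinator>>
      barriers_;
};
```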
What the request carries that the coordinator validates
The coordinator does not blindly merge incoming requests. Inside
TopologyCoordinator::ProcessRequest (0x1cf524c0,
topology_coordinator.cc), req.address_mapping.slice_id keys two
maps on the coordinator object:
- flat_hash_map<int, SliceState> at coordinator+232 (0xe8), keyed by slice_id — the per-slice cached topology (insert via PrepareInsertSmallNonSoo((char*)this + 232, ...) at the FlatHashMapPolicy<int, TopologyCoordinator::SliceState> call).
- flat_hash_map<tuple<int,int>, NetworkAddressMapping> at coordinator+200 (0xc8), keyed by (slice_id, host_id) — the per-host address mapping.
Validation is strict: a mismatch returns an error Status rather than
silently dropping the request. Pseudocode reconstruction:
int slice = req.address_mapping.slice_id;  // sub-message +40
int host = req.address_mapping.host_id;    // sub-message +44
// SliceId bounds check (line 183):
if (slice < 0 || slice >= num_slices_)
  return MakeError<INVALID_ARGUMENT/*3*/>(
      "SliceId out of bounds. Expected num slices: $0. Request: $1");
// If this slice already registered, compare topology_args:
if (auto* slot = slice_state_.find(slice)) {
  MessageDifferencer diff; diff.set_message_field_comparison(EQUIVALENT);
  if (!diff.Compare(args_from(req.tpu_topology_args), slot->args))
    return MakeError<INVALID_ARGUMENT/*3*/>(  // line 202
        "Received topology that differs from previously registered "
        "topology at same sliceID. SliceID: $0 Previous HostId: $1 "
        "New HostId: $2 Addresses: $3 Diff: $4");
  // ... and compare the per-host NetworkAddressMapping:
  if (!diff.Compare(req.address_mapping, addr_map_[{slice,host}]))
    return MakeError<INVALID_ARGUMENT/*3*/>(  // line 216
        "Received host address mapping that differs from previous "
        "mapping SliceID: $0 HostId: $1 Prev Address: $2 "
        "New Addresses: $3");
  // ... and compare incarnation_id:
  if (req.incarnation_id != slot->incarnation[host])
    return MakeError<INVALID_ARGUMENT/*3*/>(  // line 226
        "Received incarnation ID that is different from previous "
        "incarnation ID. SliceID: $0 HostId: $1 "
        "Prev IncarnationId: $2 New IncarnationId: $3");
}
// HostId bounds check (lines 235 / 250) also returns INVALID_ARGUMENT.
CONFIRMED — every mismatch path ends in
MakeErrorImpl<3> (INVALID_ARGUMENT), not a log-and-drop. Source lines in topology_coordinator.cc: SliceId-OOB 183, topology-diff 202, address-mapping-diff 216, incarnation-diff 226, HostId-OOB 235/250. The incarnation_id mismatch → MakeErrorImpl<3> matches the documented INVALID_ARGUMENT canon.
The substitute_internal::SubstituteAndAppend format strings live at
rodata 0x9b27486 (topology drift), 0x9c14204 (address mapping
drift) and 0x9c14456 (incarnation drift).
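To make the order of checks concrete, here is a self-contained model of the validation described above. It is a sketch under several assumptions: Registry, Request and HostState are hypothetical types, std::map replaces flat_hash_map, plain string and vector equality replaces MessageDifferencer, and the address and incarnation comparisons are applied only to hosts that have registered before. The HostId bounds check is left out because the text does not state its bound. Identical re-registration returning OK is a property of this sketch, not a confirmed property of the coordinator.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Code { kOk = 0, kInvalidArgument = 3 };

// Flattened, hypothetical view of GetMultiSliceTopologyRequest.
struct Request {
  int slice_id;
  int host_id;
  std::vector<std::string> addresses;  // stand-in for HostNetworkAddress list
  std::string topology_args;           // stand-in for TpuTopologyArgsProto
  std::int64_t incarnation_id;
};

class Registry {
 public:
  explicit Registry(int num_slices) : num_slices_(num_slices) {}

  Code Process(const Request& r) {
    // 1. SliceId bounds.
    if (r.slice_id < 0 || r.slice_id >= num_slices_) {
      return Code::kInvalidArgument;
    }
    // 2. Topology must agree with what this slice registered earlier.
    const auto slice = slices_.find(r.slice_id);
    if (slice != slices_.end() && slice->second != r.topology_args) {
      return Code::kInvalidArgument;
    }
    // 3. A host seen before must repeat its addresses and incarnation.
    const auto key = std::make_pair(r.slice_id, r.host_id);
    const auto host = hosts_.find(key);
    if (host != hosts_.end()) {
      if (host->second.addresses != r.addresses) {
        return Code::kInvalidArgument;
      }
      if (host->second.incarnation_id != r.incarnation_id) {
        return Code::kInvalidArgument;
      }
    }
    // 4. Record the registration.
    slices_.emplace(r.slice_id, r.topology_args);
    hosts_[key] = HostState{r.addresses, r.incarnation_id};
    return Code::kOk;
  }

 private:
  struct HostState {
    std::vector<std::string> addresses;
    std::int64_t incarnation_id = 0;
  };

  int num_slices_;
  std::map<int, std::string> slices_;
  std::map<std::pair<int, int>, HostState> hosts_;
};
```

In this model a worker that restarts and comes back with a new incarnation_id in the same (slice_id, host_id) slot is rejected with kInvalidArgument, which is the silent-restart detection the request field exists for.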
What the request does NOT carry
- No magic / sanity field. The protobuf schema validation is trusted — there's no equivalent of libnccom's 0x61796c69 "ilya" sanity field. A malformed inbound message is rejected by proto2 parsing before reaching the handler.
- No authentication token at the application layer. Authentication is done by the gRPC ServerCredentials selected via MEGASCALE_AUTHENTICATION. Once a peer is authenticated by gRPC, every field in its request is trusted.
- No retry counter. Megascale's bootstrap is a single-shot RPC with a deadline. Retry policy is the caller's responsibility (XLA Megascale runtime does not retry GetMultiSliceTopology internally).
- No request-level lease. The registration stays in the coordinator until the response fires; there is no notion of a worker "renewing" its registration. HeartBeat runs after bootstrap on a separate RPC family and serves only liveness, not re-registration.
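Since retrying is left to the caller, a caller-side wrapper might look like the sketch below. Everything in it is an assumption made for illustration: RetryWhileUnavailable is a hypothetical helper, the attempt is modelled as a plain callable returning a status code, and the fixed backoff is arbitrary. It retries only on UNAVAILABLE (the code the server returns when its coordinator is not yet installed) and hands any other result straight back.

```cpp
#include <chrono>
#include <functional>
#include <thread>

enum class Code { kOk = 0, kInvalidArgument = 3, kUnavailable = 14 };

// Calls attempt() until it returns something other than kUnavailable,
// or until another backoff sleep would overrun the overall deadline.
inline Code RetryWhileUnavailable(const std::function<Code()>& attempt,
                                  std::chrono::milliseconds deadline,
                                  std::chrono::milliseconds backoff) {
  const auto give_up = std::chrono::steady_clock::now() + deadline;
  while (true) {
    const Code code = attempt();
    if (code != Code::kUnavailable) return code;  // success or hard error
    if (std::chrono::steady_clock::now() + backoff >= give_up) return code;
    std::this_thread::sleep_for(backoff);
  }
}
```

Validation failures such as INVALID_ARGUMENT are deliberately not retried here, because resending the same mismatching request would fail the same way.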
Cross-References
- Bootstrap Overview — where worker registration sits in the rendezvous sequence and the gRPC server machinery behind it.
- Topology Exchange — the TopologyCoordinator::ProcessRequest path that consumes the GetMultiSliceTopologyRequest this page documents.
- Coordinator Election — how the single coordinator process that every worker registers against is selected.