Tensor-Split / ND-Plane
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00; .text VMA == file offset 0xe63c000, .rodata VMA == file offset 0x84a0000). Other versions differ.
Abstract
When an embedding-class collective is offloaded to the SparseCore (the SC-offload substrate of On-Pod Collectives §1.2), the substrate must answer two geometric questions before it can build a ring schedule: how many SparseCores participate and how the tensor is cut across them, and which torus axes the collective's replica-groups actually project onto. This page owns those two derivations. The first is the device partition: NumScOffloadDevices (the participating-SC count), the tensor_split_factor (an in-collective tensor cut), and the per-axis ring device split inside GetDimensionRings. The second is the ND-plane geometry: ExtractNDPlaneInfo, the NDPlaneInfo / NDPlaneStrideInfo struct it returns, and the IsNDPlaneSpanAcrossEntireDimension projection that validates each axis stride.
The ND-plane is the SparseCore analog of the dense TensorCore ReplicaGroupsOnNDPlane decomposition (SelectNDStrategy / Overview §1.3). Where the dense path memoizes a vector<MeshNDInfo> and reports a mesh-dimension count through ReplicaGroupsOnNDPlane(plane=2), the SC path computes an NDPlaneInfo descriptor whose embedded NDPlaneStrideInfo carries a per-axis optional<int32> span stride, and reports the dimension count as popcount of the three has-bits — GetCollectiveNDPlaneDimensionCount. Both count the torus axes the replica-groups span; they differ only in representation.
The two derivations meet downstream. The NumScOffloadDevices total is partitioned multiplicatively across the X/Y/Z ring axes by GetDimensionRings; the NDPlaneInfo dimension count gates the builder's twist / ND-plane-count branches (the cmp $3 3-axis twist gate). Scope boundaries: the physical-core placement of the chosen logical colors lives on Physical-Core Placement, and the strategy choice (sub-plane vs ND-ring vs twisted-torus) on SelectNDStrategy. This page owns the split-factor partition, the NDPlaneInfo struct, and the per-axis projection only.
For reimplementation, the contract is:
- The participating-SC count NumScOffloadDevices = (TpuTopology SC-AvailableCoreCount / LogicalDevicesPerChip(SparseCore)) − num_embedding_devices, the offload complement of the reserved embedding partition.
- The tensor_split_factor optional<int> gate: a factor > 1 requires more than one SparseCore (!use_single_core), and the only supported non-trivial factor is 2, which enables split-tensor mode (color duplication).
- The per-axis ring device split segments = extent / devcount − 1, with devcount carried as a running remainder across axes: a multiplicative ND partition of the offload-device total.
- The NDPlaneInfo / NDPlaneStrideInfo byte layout, and the IsNDPlaneSpanAcrossEntireDimension projection: the per-axis stride must evenly divide the torus dimension extent for the plane to be valid.
| Participating-SC count | NumScOffloadDevices @0x1d6b8b00 → long (flag_utils.cc) |
| Embedding-reserve sibling | NumEmbeddingDevices @0x1d6b8a00 → long (complement partition) |
| Tensor-split gate | ConstructConfigForCollectiveUniDirNDGroups<*> @0x133c82c0 / 0x133c2dc0 / 0x133cd800 (trailing optional<int>) |
| Per-axis ring split | GetDimensionRings @0x133df520 → vector<RingConfigAttributes> (segments = extent/devcount − 1) |
| ND-plane extractor | ExtractNDPlaneInfo @0x133bb940 → StatusOr<NDPlaneInfo> (offload_collective_config.cc) |
| Per-axis projection | ExtractNDPlaneInfo::$_0 @0x133bf700 (the IsNDPlaneSpanAcrossEntireDimension RetCheck triple) |
| Dimension count | GetCollectiveNDPlaneDimensionCount @0x133bb6e0 = has_x + has_y + has_z (the popcount of the three has-bits) |
| NDPlaneInfo size | ~0x28 (3 × int32 + NDPlaneStrideInfo at +0xc) |
| NDPlaneStrideInfo size | 0x1c (3 × optional<int32> + bool across_cores_on_chip) |
| Torus extents | Target[+0x3b8][+0x58] (X) / +0x5c (Y) / +0x60 (Z) — shared with dense picker + cost model |
1. The SC offload device partition
The device partition answers "how many SparseCores does the offloaded collective run on, and how is the tensor cut across them" in three independent pieces: the participating-SC count (NumScOffloadDevices), the in-collective tensor cut (tensor_split_factor), and the per-axis ring device split (GetDimensionRings). All three are byte-traced.
1.1 NumScOffloadDevices — the participating-SC count
NumScOffloadDevices(ObjectView<TpuCompEnv>, const TpuTopology&) → long (@0x1d6b8b00, source flag_utils.cc) computes the count of SparseCore logical devices available for collective offload. It is the complement of the embedding reservation: the topology's total SparseCores divided by SparseCores-per-chip gives the per-device SC count; subtracting the reserved num_embedding_devices leaves the offload-available count.
function NumScOffloadDevices(compEnv, topo): // sub_1D6B8B00
sc_total = topo[+0x94] // SC AvailableCoreCount (core type 2)
ldpc_sc = TpuTopology::LogicalDevicesPerChip(topo, 2) // sub_20AD3020, SparseCore
if ldpc_sc > 0:
sc_dev = sc_total / ldpc_sc // signed idiv
else:
sc_dev = 0
auto = AutoOr<long>::FromProtoOrDie(compEnv[+0x898]) // sub_1092F7E0 ("num_embedding_devices")
n_emb = auto.engaged ? auto.value : 0 // dl bit0 = engaged
CHECK(n_emb >= 0) // "num_embedding_devices >= 0", line 1803 — FATAL
CHECK(sc_dev >= n_emb) // "num_embedding_devices <= sc_per_device", line 1805
return sc_dev - n_emb // = NumScOffloadDevices
The TpuTopology[+0x94] field is the SparseCore (core type 2) AvailableCoreCount. It is the SC slot of the per-core-type triple TpuTopology[+0x7c + coretype·12] (TC at +0x7c, an intermediate type at +0x88, SC at +0x94) that Target::CoresPerChip (@0x1d615b40) reads; it is confirmed as the SC core count independently by TpuTopology_MaybeAvailableSparseCoresPerLogicalDevice (@0xf6a1ea0), which for core type 2 calls NumEmbeddingDevices (@0x1d6b8a00).
NOTE: the FATAL CHECK source lines read decimal in the decompile, 1803 and 1805, which are 0x70b and 0x70d. The two CHECKs bracket the result into a valid range: num_embedding_devices must be non-negative, and it must not exceed sc_dev (so the offload complement cannot go negative). On either violation the process aborts with the diagnostic "Invalid number of embedding devices specified" (.rodata @0x871a7c3).
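The sibling NumEmbeddingDevices (@0x1d6b8a00) shares the same sc_dev computation but returns the reserved count: n_emb = engaged ? value : sc_dev (defaulting to all of sc_dev when the flag is unset), bounded 0 <= n_emb <= sc_dev (CHECK lines 0x6f9 / 0x6fb). When the flag is set, the two functions partition sc_dev exactly: num_embedding_devices (reserved) + NumScOffloadDevices (offload) = sc_dev. When the flag is unset the two defaults differ (sc_dev here, 0 in NumScOffloadDevices), so each function returns the full sc_dev and the sum identity does not apply.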
GOTCHA: NumScOffloadDevices is a count of SparseCore logical devices (sc_total / ldpc_sc), not raw cores. A reimplementation that subtracts num_embedding_devices from the raw core count TpuTopology[+0x94] will over-count on a megacore SparseCore (ldpc_sc == 2). The division by LogicalDevicesPerChip(SparseCore) must precede the subtraction.
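The arithmetic of §1.1 can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the binary's code: the function names are hypothetical, the `flag` argument stands in for the `AutoOr<long>` value (with `None` meaning "not engaged"), and the FATAL CHECKs are modeled as `ValueError`. Inputs are assumed non-negative, so Python's floor division matches the signed idiv.

```python
from typing import Optional

ERR = "Invalid number of embedding devices specified"


def sc_logical_devices(sc_total: int, ldpc_sc: int) -> int:
    """SparseCore logical-device count; 0 when ldpc_sc is not positive."""
    return sc_total // ldpc_sc if ldpc_sc > 0 else 0


def num_embedding_devices(sc_total: int, ldpc_sc: int,
                          flag: Optional[int] = None) -> int:
    """Reserved count: an unset flag defaults to all of sc_dev."""
    sc_dev = sc_logical_devices(sc_total, ldpc_sc)
    n_emb = sc_dev if flag is None else flag
    if not 0 <= n_emb <= sc_dev:  # models the two CHECKs
        raise ValueError(ERR)
    return n_emb


def num_sc_offload_devices(sc_total: int, ldpc_sc: int,
                           flag: Optional[int] = None) -> int:
    """Offload count: an unset flag defaults to 0 reserved devices."""
    sc_dev = sc_logical_devices(sc_total, ldpc_sc)
    n_emb = 0 if flag is None else flag
    if not 0 <= n_emb <= sc_dev:  # models the two CHECKs
        raise ValueError(ERR)
    return sc_dev - n_emb  # divide first, then subtract


# Hypothetical topology: 8 raw SparseCores, 2 per chip, 1 reserved.
offload = num_sc_offload_devices(8, 2, flag=1)
reserved = num_embedding_devices(8, 2, flag=1)
print(offload, reserved)  # 3 1
```

With these hypothetical numbers, subtracting from the raw core count would give 8 − 1 = 7 rather than 3, which is the over-count the GOTCHA describes.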
1.2 tensor_split_factor — the in-collective tensor cut
The templated builder ConstructConfigForCollectiveUniDirNDGroups<*> takes a trailing optional<int> tensor_split as its last stack argument ([rbp+0x10], copied to [rbp-0x108] / [rbp-0xf8]). Source offload_collective_config.cc. The gate is small and entirely error-driven:
function ConstructConfigForCollectiveUniDirNDGroups(..., optional<int> tensor_split):
split = tensor_split.has_value() ? tensor_split.value : 1 // cmovne, kDefaultTensorSplitFactor = 1
if split >= 2: // cmp eax, 2 ; jl skip
RetCheck( !use_single_core.value_or(kDefaultUseSingleCore) ) // line 0x650
// fail → "A larger than 1 tensor split factor requires more than one
// sparse core to split the tensor on."
RetCheck( split == 2 ) // "tensor_split_factor.value_or(...) == 2", line 1558
// fail → "We currently only support tensor split factor of 2 across two sparse cores."
set tensor_split_mode = 2 // [rbp-0x234] = 2 ; flag bit `or [..], 0x20`
VLOG("Adopting split tensor mode.") // line 0x65c
// mode-2 effect: "Twisted torus: duplicate colors as indicated by tensor split factor."
The factor lands in the OffloadConfig proto field 5 tensor_split_factor ([variant+0x1c], see SC-Offload Config Builder) via the GenerationOption feed. The optional is forwarded unchanged from the public ND wrappers:
| Wrapper | Address | Forwards tensor_split? |
|---|---|---|
| ConstructConfigForAllReduceUniDirND | @0x133c2c80 | Yes: push [rbp+0x10] @0x133c2d01 |
| ConstructConfigForReduceScatterUniDirND | @0x133ccbe0 | Yes: optional<int> in signature |
| ConstructConfigForAllGatherUniDirND | @0x133c76c0 | No: no optional<int> parameter; receives the default empty optional |
QUIRK: AllGather cannot tensor-split in this build. Its public wrapper ConstructConfigForAllGatherUniDirND (@0x133c76c0) has no optional<int> parameter, so its templated builder instantiation always sees the default-1 factor and the split >= 2 gate never fires. Only AllReduce and ReduceScatter expose the knob. A reimplementation that plumbs tensor_split into the AllGather path is wiring a knob the binary leaves dead.
GOTCHA: the only supported non-trivial factor is exactly 2. The second RetCheck ("tensor_split_factor.value_or(kDefaultTensorSplitFactor) == 2", source line 1558) rejects 3, 4, … with "We currently only support tensor split factor of 2 across two sparse cores." The two-core split is the SC analog of the dense TensorCore megacore data-split; mode-2's effect is to duplicate colors ("Twisted torus: duplicate colors as indicated by tensor split factor."), splitting the tensor across the two cores rather than halving each ring's volume. The per-color emission keyed by this mode is owned by Physical-Core Placement §5 (TensorSplitPerCoreClassifier).
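The gate described in §1.2 reduces to two ordered checks. The Python sketch below models it under assumptions: the function name is hypothetical, `use_single_core` is taken as an already-resolved bool (the binary applies `value_or(kDefaultUseSingleCore)`, whose default is not stated here), and a failed RetCheck is modeled as `ValueError` carrying the message quoted above.

```python
from typing import Optional

DEFAULT_TENSOR_SPLIT_FACTOR = 1  # kDefaultTensorSplitFactor per the text
SPLIT_TENSOR_MODE = 2            # the value stored as tensor_split_mode


def tensor_split_mode(tensor_split: Optional[int],
                      use_single_core: bool) -> Optional[int]:
    """Return SPLIT_TENSOR_MODE when the gate fires, None when skipped."""
    split = (DEFAULT_TENSOR_SPLIT_FACTOR if tensor_split is None
             else tensor_split)
    if split < 2:
        return None  # gate not taken; no split-tensor mode
    if use_single_core:
        raise ValueError(
            "A larger than 1 tensor split factor requires more than one "
            "sparse core to split the tensor on.")
    if split != 2:
        raise ValueError(
            "We currently only support tensor split factor of 2 across "
            "two sparse cores.")
    return SPLIT_TENSOR_MODE


print(tensor_split_mode(None, use_single_core=False))  # None
print(tensor_split_mode(2, use_single_core=False))     # 2
```

Passing `None` mirrors the AllGather wrapper, which never supplies the optional, so the gate is always skipped on that path.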
1.3 GetDimensionRings — the per-axis ring device split
GetDimensionRings(const Target&, IciStrategyRingDim, int devcount, bool, bool megacore_aware) → vector<RingConfigAttributes> (@0x133df520) is the partitioner that consumes the running device count and produces a per-axis ring decomposition. The IciStrategyRingDim (1..7) selects which torus axis's extent to read; the devcount is how many devices remain to be partitioned along this and later axes.
function GetDimensionRings(target, ring_dim, devcount, b, megacore_aware): // sub_133DF520
validate ring_dim in 1..7 // lea -7 ; cmp 0xfffffff9 ; jump table @0xae2eaac
extent = target[+0x3b8][ X=0x58 | Y=0x5c | Z=0x60 ] // chip torus dim extent
set mesh/torus flag [-0x48] per switch arm
ldpc_tc = Target::LogicalDevicesPerChip(target, 0) // sub_1D615B00 (TensorCore)
megacore = (ldpc_tc >= 2) && megacore_aware // setge AND'd with arg
// THE SPLIT — how many ring segments this axis carries:
segments = extent / devcount - 1 // idiv ebx=devcount @0x133df670 ; dec eax @0x133df672 ; mov [-0xb8] @0x133df674
... // build RingConfigAttributes from segments + flags
The flat-path caller (the AllGather builder, @0x133c8d3e..@0x133c8d5c) passes the running remaining device count in rcx = [rbp-0x150], initialized from the per-axis deque tuple's hi field. The quotient extent / devcount becomes the next axis's devcount, so the offload devices are partitioned multiplicatively across the X/Y/Z ring axes — each axis consumes devcount devices, and the remainder flows to the next. The cumulative products accumulate in a std::__tree (set) rooted at [rbp-0x228] — the device-offset / color-index map. NumScOffloadDevices is fetched in rbx immediately before this tree is initialized (@0x133c89d3..@0x133c89e3, via GetTpuCompEnv @0x1d73de80), bounding the total devices the per-axis rings may consume.
NOTE: the −1 in segments = extent / devcount − 1 is the off-by-one that converts a count of devices-per-ring into a count of inter-device hops (a ring of N devices has N−1 forward steps before wrap). It is byte-confirmed at @0x133df670–@0x133df674 (idiv %ebx then dec %eax then mov %eax,-0xb8(%rbp); decompile (int)v12 / v10 - 1), where v12 holds the axis extent and v10 the device count.
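As a worked illustration of the split formula only, the sketch below evaluates segments = extent / devcount − 1 for a few hypothetical extents. The function name is an assumption, inputs are assumed positive (so floor division agrees with the truncating idiv), and the cross-axis carry of devcount is deliberately not modeled.

```python
def ring_segments(extent: int, devcount: int) -> int:
    """Per-axis ring segment count: extent / devcount - 1."""
    if extent <= 0 or devcount <= 0:
        raise ValueError("extent and devcount must be positive in this sketch")
    return extent // devcount - 1


# Hypothetical axis extents and device counts.
for extent, devcount in [(8, 2), (16, 4), (4, 4)]:
    print(extent, devcount, ring_segments(extent, devcount))
# 8 2 3
# 16 4 3
# 4 4 0
```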
2. ExtractNDPlaneInfo — the ND-plane geometry
ExtractNDPlaneInfo projects a collective's replica-groups onto the physical torus and reports, per axis, the span stride — the spacing between the chip coordinates the group touches along that axis. This is the SparseCore counterpart to the dense ReplicaGroupsOnNDPlane decomposition; it produces an NDPlaneInfo descriptor instead of a vector<MeshNDInfo>.
2.1 The algorithm
function ExtractNDPlaneInfo(target, device_assignment, hlo, Span<vector<int>> groups) // sub_133BB940
-> StatusOr<NDPlaneInfo>: // sret in rdi
// (a) read chip torus extents (same offsets as the dense picker + cost model)
X = target[+0x3b8][+0x58]; Y = target[+0x3b8][+0x5c]; Z = target[+0x3b8][+0x60]
// (b) collect the distinct chip coordinate per axis
for each member core_id in each replica group:
loc = FromGlobalCoreId(target, core_id) // sub_133B7BC0 → TC core location
coord = loc / LogicalDevicesPerChip // idiv → chip-relative coordinate
binary-search-insert coord into per-axis sorted-unique list // memmove @0x133bbb30
// (c) per axis (X, then Y, then Z) run the span lambda
for axis in {X, Y, Z}:
stride_slot = $_0(extent_axis, coord_set_axis, axis_name) // sub_133BF700
// (d) assemble the sret NDPlaneInfo: 3 ints + NDPlaneStrideInfo at +0xc
return NDPlaneInfo{ size_x, size_y, size_z, NDPlaneStrideInfo{...} }
The axis-name strings ("X" @0x8a106a1, "Y" @0x8a0f71b, "Z" @0x886531a) are passed to the $_0 lambda purely for the diagnostic messages on a failed span check. The division by LogicalDevicesPerChip (the idiv in step (b)) collapses each global core ID to its chip coordinate, so two cores on the same chip (megacore) map to the same coordinate and contribute one entry to the sorted-unique set.
2.2 ExtractNDPlaneInfo::$_0 — the per-axis projection (IsNDPlaneSpanAcrossEntireDimension)
The $_0 lambda (@0x133bf700) computes one axis's span stride and validates that the plane spans the entire torus dimension. It writes into an NDPlaneStrideInfo slot: an int32 size at +8 and a bool has_size at +0xc (lambda-this-relative), plus the StatusOr ok flag at +0.
function ExtractNDPlaneInfo::$_0(this, dim_size, coords, axis_name): // sub_133BF700
if coords.size() == 1: // single coordinate ⇒ degenerate
this[+8] = 0 // size = 0
this[+0xc] = 0 // has_size = 0 (no span on this axis)
this[+0] = 1 // ok
return
stride = coords[1] - coords[0] // [-0x48]
RetCheck( stride >= 1 ) // line 922; "Stride must be larger or equal to 1."
RetCheck( stride < dim_size ) // line 923; "Stride must be less than the dimension size."
RetCheck( dim_size % stride == 0 ) // line 925; "Stride must divide the dimension size." ← idiv @0x133bf7b4
// consistency: every adjacent pair must share the stride
for k in 2 .. coords.size()-1:
if coords[k] - coords[k-1] != stride:
return FailedPrecondition(
"All devices within a group must have the same stride along the "
"dimension %s. Expected stride: %d but got %d.", axis_name, stride, observed) // line 931
this[+8] = stride // size = stride
this[+0xc] = 1 // has_size = 1
this[+0] = 1 // ok
The three RetChecks are the IsNDPlaneSpanAcrossEntireDimension gate. The crucial one is the third: dim_size % stride == 0. A plane is a valid ND-plane only if its per-axis stride evenly divides the torus dimension extent — i.e. the strided coordinate set wraps cleanly around the entire torus axis. A stride that does not divide the extent would leave a partial ring, which the SC offload substrate does not emit.
GOTCHA: the lambda does not merely read coords[1] - coords[0]; it then verifies that every adjacent pair in the sorted coordinate set has the identical stride (source line 931, error "All devices within a group must have the same stride along the dimension %s."). A group whose members are non-uniformly spaced along an axis (a valid HLO replica-group but not a clean torus plane) is rejected here, not silently approximated by the first pair. The raw single-pair reading is necessary but not sufficient; the full per-pair scan is the real contract.
NOTE: the source lines match the raw findings byte-exactly: stride >= 1 at 922 (0x39a), stride < dim_size at 923 (0x39b), dim_size % stride == 0 at 925 (0x39d). The user-facing string for the second check is "Stride must be less than the dimension size." and for the third "Stride must divide the dimension size."
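The per-axis projection of §2.2 can be sketched as a single Python function. This is a model under assumptions rather than the binary's code: the function name is hypothetical, `None` stands in for "has_size = 0", failed RetChecks are modeled as `ValueError` with the quoted messages, and the empty-input guard is an addition of this sketch (the text only covers a single-coordinate set).

```python
from typing import Optional, Sequence


def axis_span_stride(dim_size: int, coords: Sequence[int],
                     axis_name: str) -> Optional[int]:
    """Return the span stride along one axis, or None if the axis is not spanned.

    coords is assumed to be the sorted, de-duplicated chip coordinates.
    """
    if not coords:
        raise ValueError("empty coordinate set (guard added by this sketch)")
    if len(coords) == 1:
        return None  # degenerate: a single coordinate, no span on this axis
    stride = coords[1] - coords[0]
    if stride < 1:
        raise ValueError("Stride must be larger or equal to 1.")
    if stride >= dim_size:
        raise ValueError("Stride must be less than the dimension size.")
    if dim_size % stride != 0:
        raise ValueError("Stride must divide the dimension size.")
    # Every adjacent pair must share the stride of the first pair.
    for k in range(2, len(coords)):
        observed = coords[k] - coords[k - 1]
        if observed != stride:
            raise ValueError(
                "All devices within a group must have the same stride along "
                f"the dimension {axis_name}. Expected stride: {stride} "
                f"but got {observed}.")
    return stride


print(axis_span_stride(8, [0, 2, 4, 6], "X"))  # 2
print(axis_span_stride(8, [3], "Y"))           # None
```

A set such as `[0, 2, 6]` on an extent of 8 passes the three stride checks on its first pair and is rejected only by the per-pair scan, which is why the scan is part of the contract.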
2.3 NDPlaneInfo / NDPlaneStrideInfo struct layout
The two structs are pinned by their ToString methods (NDPlaneInfo::ToString @0x10fdf2a0, NDPlaneStrideInfo::ToString @0x10fe62e0) and independently cross-checked by the field reads in the two consumers, GetCollectiveNDPlaneDimensionCount (@0x133bb6e0) and GetMinorToMajorOrder (@0x133c1c40).
| Struct | Offset | Type | Field |
|---|---|---|---|
| NDPlaneInfo | +0x00 | int32 | size_x (plane span, X axis) |
| | +0x04 | int32 | size_y |
| | +0x08 | int32 | size_z (also the iteration/plane bound GetMinorToMajorOrder reads at +0x8) |
| | +0x0c | NDPlaneStrideInfo | embedded stride descriptor |
| NDPlaneStrideInfo (at NDPlaneInfo+0xc) | +0x00 | int32 | stride_x (optional<int32> value) |
| | +0x04 | bool | has_stride_x |
| | +0x08 | int32 | stride_y |
| | +0x0c | bool | has_stride_y |
| | +0x10 | int32 | stride_z |
| | +0x14 | bool | has_stride_z |
| | +0x18 | bool | across_cores_on_chip (megacore / both-cores flag) |
sizeof(NDPlaneStrideInfo) is 0x1c (0x19 bytes of fields rounded up to 4-byte alignment); sizeof(NDPlaneInfo) is ~0x28 (three ints, 0xc bytes, plus the embedded 0x1c-byte stride info). Relative to the parent NDPlaneInfo, the embedded stride fields sit at: stride_x@+0xc / has_x@+0x10, stride_y@+0x14 / has_y@+0x18, stride_z@+0x1c / has_z@+0x20, across_cores@+0x24.
The NDPlaneInfo::ToString labels (@0x10fdf2a0) are byte-exactly "size_x: ", "size_y: ", "size_z: ", with a trailing "stride_info: " for the embedded NDPlaneStrideInfo; NDPlaneStrideInfo::ToString labels its three fields "stride_x: ", "stride_y: ", "stride_z: ". The three top-level ints are therefore the plane span sizes (size_x, size_y, size_z) and the embedded struct carries the per-axis strides.
QUIRK: the size/stride pairing is the counter-intuitive part. The top-level NDPlaneInfo ints are the plane spans (size_*), while the strides the $_0 lambda computes (coords[1] - coords[0]) are stored in the embedded NDPlaneStrideInfo as stride_*. A reimplementer who conflates the two, writing the stride into the top-level int, will mis-feed GetMinorToMajorOrder, which reads the embedded stride_* / has_* (@0x133c1c40: has_x@+0x10, stride_x@+0xc relative to the NDPlaneInfo), not the top-level size_*.
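The layout in §2.3 can be checked mechanically with Python's `ctypes`, which applies the platform's native alignment rules. The sketch below is an illustration under assumptions: the class `OptionalInt32` is a hypothetical stand-in for `optional<int32>` modeled as a value followed by a one-byte engaged flag, and the field names follow the ToString labels.

```python
import ctypes


class OptionalInt32(ctypes.Structure):
    """Hypothetical model of optional<int32>: value, then engaged flag."""
    _fields_ = [("value", ctypes.c_int32),
                ("has_value", ctypes.c_bool)]


class NDPlaneStrideInfo(ctypes.Structure):
    _fields_ = [("stride_x", OptionalInt32),
                ("stride_y", OptionalInt32),
                ("stride_z", OptionalInt32),
                ("across_cores_on_chip", ctypes.c_bool)]


class NDPlaneInfo(ctypes.Structure):
    _fields_ = [("size_x", ctypes.c_int32),
                ("size_y", ctypes.c_int32),
                ("size_z", ctypes.c_int32),
                ("stride_info", NDPlaneStrideInfo)]


# Sizes and offsets as laid out by the native ABI rules ctypes follows.
print(hex(ctypes.sizeof(NDPlaneStrideInfo)))             # 0x1c
print(hex(ctypes.sizeof(NDPlaneInfo)))                   # 0x28
print(hex(NDPlaneInfo.stride_info.offset))               # 0xc
print(hex(NDPlaneStrideInfo.across_cores_on_chip.offset))  # 0x18

# The plane span goes in the top-level int; the stride goes in the
# embedded descriptor. Values here are hypothetical.
info = NDPlaneInfo(size_x=4)
info.stride_info.stride_x.value = 2
info.stride_info.stride_x.has_value = True
```

The has-flag of each axis sits 4 bytes after its value, so has_x lands at 0xc + 0x4 = 0x10 relative to the parent, in agreement with the offsets quoted above.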
2.4 The ND-plane dimension count
GetCollectiveNDPlaneDimensionCount (@0x133bb6e0) is the SC analog of the dense ReplicaGroupsOnNDPlane(plane=2).num_mesh_dims. It calls ExtractNDPlaneInfo and reduces the result to a single integer: the number of torus axes the replica-groups span.
function GetCollectiveNDPlaneDimensionCount(target, device_assignment, collective): // sub_133BB6E0
CHECK(collective != nullptr) // line 845
groups = GetPhysicalDeviceGroups(collective, device_assignment)
info = ExtractNDPlaneInfo(target, device_assignment, collective, groups)
if !info.ok: return info.status // AddSourceLocation line 852 / 849
return info.has_stride_x + info.has_stride_y + info.has_stride_z // sum of the 3 has-bytes
The dimension count is literally the sum of the three has_* bytes — byte-confirmed at @0x133bb772..@0x133bb786 (decompile line *((_DWORD *)this + 2) = v26 + v27 + v28;, where v26/v27/v28 are the three has-bits read out of the local NDPlaneInfo at +0x18/+0x20/+0x28). Because each has_* is 0 or 1, the sum is the popcount over the three axes. This count is what the builder's cmp $3 twist gate (3 dims → k_2k_2k / k_k_2k twist) and the AllGather/ReduceScatter ND-plane-count gate (1..4 dims) test.
NOTE: the dense and SC paths compute the same quantity by different routes. The dense GetCommunicationMultiplier (@0x127a16c0, Overview §3) returns ReplicaGroupsOnNDPlane(plane=2).num_mesh_dims + 1 as a link-count divisor; the SC GetCollectiveNDPlaneDimensionCount returns the bare popcount without the +1. The +1 is a cost-model convention on the dense side, not a difference in the underlying axis count: both count the torus axes the replica-groups span.
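The dimension count and its relation to the dense multiplier can be sketched as follows. The function names are hypothetical, and each per-axis stride is modeled as `Optional[int]`, with `None` standing in for a cleared has-bit.

```python
from typing import Optional


def nd_plane_dimension_count(stride_x: Optional[int],
                             stride_y: Optional[int],
                             stride_z: Optional[int]) -> int:
    """Sum of the three has-bits: the number of torus axes spanned."""
    return sum(s is not None for s in (stride_x, stride_y, stride_z))


def dense_communication_multiplier(num_mesh_dims: int) -> int:
    """Dense-side convention described in the text: axis count plus one."""
    return num_mesh_dims + 1


dims = nd_plane_dimension_count(2, None, 4)  # spans X and Z only
print(dims)                                  # 2
print(dims == 3)                             # False: 3-axis twist gate not taken
print(dense_communication_multiplier(dims))  # 3
```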
2.5 The dense ReplicaGroupsOnNDPlane projection (MeshNDInfo builder)
The dense (TensorCore) counterpart of §2.1's SC ExtractNDPlaneInfo is ReplicaGroupsOnNDPlane (@0x1c890960). It is the builder the AllGather ND-ring and ReduceScatter selectors call to both decide dimensionality and build the per-axis vector<MeshNDInfo>. The entry function itself contains no coordinate math: it (a) renders the device-assignment to a string and serializes the topology (TpuTopologySerdes::Distill → TpuTopologyArgs::ToProto → tsl::SerializeToStringDeterministic), (b) takes nd_plane_cache_mutex and looks the composite key up in the NDPlaneCacheKey → optional<vector<MeshNDInfo>> cache (@0x225799b8, guarded singleton GetNDPlaneCache), and (c) on a miss invokes the per-axis lambda ReplicaGroupsOnNDPlaneImpl::$_0 (@0x1c896400) once per mesh axis — axis ∈ {0,1,2} passed as 0x100000000 | axis — then stores the resulting optional<vector<MeshNDInfo>> into the cache.
function ReplicaGroupsOnNDPlane(target, device_assignment, device_list, n_dim, b): // sub_1C890960
CHECK(device_assignment != nullptr)
key.dev = render(device_assignment) // "<id>_<d0>,<d1>,…"
key.topo = SerializeToStringDeterministic(Distill(target.topology).ToProto())
key.n_dim = n_dim; key.b = b
lock(nd_plane_cache_mutex)
CHECK(n_dim == 1 || n_dim == 2 || n_dim == 3) // line 1155, group_utils.cc
// "…only supports dimension n_dim = 1, 2 or 3."
if !cache.contains(key):
v = nullopt
for axis in {0, 1, 2}: // built per mesh axis
v = ReplicaGroupsOnNDPlaneImpl::$_0(&state, groups, 0x100000000 | axis) // sub_1C896400
if !v.has_value: break // projection failed ⇒ no plane
cache[key] = v
result = cache[key] // optional<vector<MeshNDInfo>>
unlock(nd_plane_cache_mutex)
return result
The lambda $_0 (@0x1c896400) is where the projection actually happens; it is the dense analog of ExtractNDPlaneInfo::$_0 (§2.2), but it emits a MeshNDInfo ring geometry rather than a single span stride.
function ReplicaGroupsOnNDPlaneImpl::$_0(state, replica_groups, axis_flag): // sub_1C896400
n_dim = state.n_dim // ***(int***) = 1 | 2 | 3
if n_dim == 1:
// trivial 1-D plane: one MeshNDInfo whose single axis lists all group members
return optional(vector<MeshNDInfo>{ MeshNDInfo_1D(replica_groups) })
ldpc = LogicalDevicesPerChip(target, /*TensorCore*/0) // megacore ⇒ 2
// mesh extents seeded from the chip torus (X,Y,Z at target[+0x58]/[+0x60]/[+0x161,…])
if axis_flag & 0x100000000: // a specific axis was requested
a = axis_flag & 3 // 0|1|2
mesh_extent[a] *= ldpc // fold the per-chip sub-cores into this axis
// multi-slice fan-out
num_slices = GetMultiSliceTopology(target) ? GetNumSlices(target) : 1
alloc PerSliceReplicaData[num_slices] // 88-byte stride per slice
for each group:
for each device d in group:
(core_id, slice) = GetMegascalePerSliceCoreIdAndSliceId(target, da, d) // sub_1C8906E0
loc = TensorCoreLocationForLogicalDeviceId(target, da, core_id, nullopt) // sub_1C8904E0
(cx,cy,cz) = loc.chip_coordinates()
idx = mixed_radix_linearize((cx,cy,cz), mesh_extent) // Horner over mesh strides
per_slice[slice].grid[idx] = group_member_ordinal
// per (slice, group) emit one MeshNDInfo by dispatching on n_dim
for each slice, each group:
if n_dim == 2: m = ReplicaGroupForm2DRing(group, …, mesh_extent) // sub_1C88E6E0
if n_dim == 3: m = ReplicaGroupsOn3DPlane(group, …, NDTopologyInfo) // sub_1C8901E0
if !m.has_value: return nullopt // group does not fit an n_dim plane
out.push_back(m)
return optional(out)
Three facts a reimplementer must preserve, all decompile-verified:
- The per-axis sub-core fold. When a specific axis is requested (axis_flag & 0x100000000), that axis's mesh extent is pre-multiplied by LogicalDevicesPerChip (mesh_extent[axis] *= ldpc, the v253[8*(axis&3)+16] *= v51 store). On a megacore TensorCore (ldpc == 2) the chosen ring axis carries both on-chip cores; the other two axes stay at chip granularity. A reimplementation that linearizes against the raw chip extents on all three axes will mis-place the megacore second core.
- chip_coordinates, not raw core id. Each device is projected through TensorCoreLocationForLogicalDeviceId → chip_coordinates, then mixed-radix-linearized against the (folded) mesh extents, the same Horner-style coord + extent·(…) recurrence the SC path reduces mod LogicalDevicesPerChip. The projection is a chip-coordinate placement, not a core-id sort.
- n_dim dispatch + short-circuit. n_dim == 1 returns a trivial single-axis MeshNDInfo without any ring construction; n_dim == 2 dispatches each group to ReplicaGroupForm2DRing (@0x1c88e6e0); n_dim == 3 to ReplicaGroupsOn3DPlane (@0x1c8901e0). Any group that does not fit an n_dim-axis plane makes the helper return nullopt, which aborts the whole projection (v.has_value break in the entry loop) and is exactly the "device list does not project onto a k-axis plane" signal the AllGather/ReduceScatter selectors test.
NOTE: this resolves the allgather-nd-ring.md "What Was Not Resolved" entry that placed the projection math at 0x1c891402 inside ReplicaGroupsOnNDPlane. The entry function (0x1c890960) only builds the cache key and dispatches; the actual coordinate-projection body is the per-axis lambda ReplicaGroupsOnNDPlaneImpl::$_0 at 0x1c896400. The n_dim == 1 || 2 || 3 assertion ("…only supports dimension n_dim = 1, 2 or 3.", group_utils.cc line 1155) is the entry function's only inline check.
GOTCHA: the multi-slice path is not optional bookkeeping. When GetMultiSliceTopology is set, the lambda allocates one PerSliceReplicaData (88-byte stride) per GetNumSlices, and a device's slot is selected by GetMegascalePerSliceCoreIdAndSliceId (@0x1c8906e0), which returns both a per-slice core id and the slice index. A single-slice reimplementation that ignores the slice index will collide devices from different slices into the same grid cell on a multi-slice (Megascale) topology.
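The fold, the mixed-radix linearization, and the per-slice grid can be sketched together in Python. Everything here is a hedged model: the function names are hypothetical, treating X as the minor (fastest-varying) axis is an assumption of this sketch rather than something the text establishes, and the per-slice grid is a plain dictionary instead of the 88-byte PerSliceReplicaData record.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Coord = Tuple[int, int, int]


def fold_extents(extents: Coord, ldpc: int, axis: Optional[int] = None) -> Coord:
    """Multiply the requested axis extent by ldpc; leave the others as-is."""
    folded: List[int] = list(extents)
    if axis is not None:
        folded[axis] *= ldpc
    return (folded[0], folded[1], folded[2])


def linearize(coord: Coord, extents: Coord) -> int:
    """Horner-style mixed radix: x + ex * (y + ey * z), X assumed minor."""
    idx = 0
    for c, e in zip(reversed(coord), reversed(extents)):
        if not 0 <= c < e:
            raise ValueError("coordinate outside mesh extent")
        idx = idx * e + c
    return idx


def place(devices: Sequence[Tuple[int, Coord]],
          extents: Coord) -> Dict[int, Dict[int, int]]:
    """Map (slice_id, coord) pairs to per-slice grids of member ordinals."""
    grids: Dict[int, Dict[int, int]] = {}
    for ordinal, (slice_id, coord) in enumerate(devices):
        grids.setdefault(slice_id, {})[linearize(coord, extents)] = ordinal
    return grids


# Hypothetical 4x4x2 chip torus, megacore (ldpc == 2), ring axis X.
mesh = fold_extents((4, 4, 2), ldpc=2, axis=0)
print(mesh)                        # (8, 4, 2)
print(linearize((1, 2, 1), mesh))  # 49

# Same coordinate on two slices stays distinct because grids are per slice.
print(place([(0, (1, 2, 1)), (1, (1, 2, 1))], mesh))
# {0: {49: 0}, 1: {49: 1}}
```

Keying the grid by slice first is what keeps devices from different slices out of the same cell; dropping `slice_id` from the key would let the second device overwrite the first.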
3. How the two derivations meet
The device partition (§1) and the ND-plane geometry (§2) are consumed together when the offload config builder lays out the per-color ring schedule. The table summarizes the data flow.
| Quantity | Source | Role |
|---|---|---|
| SC AvailableCoreCount | TpuTopology[+0x94] (core type 2) | total SparseCores in topology |
| LogicalDevicesPerChip(SparseCore) | TpuTopology::LogicalDevicesPerChip(2) @0x20ad3020 | SparseCores per chip (megacore ⇒ 2) |
| sc_dev = total / ldpc_sc | NumScOffloadDevices / NumEmbeddingDevices | SC logical-device count |
| num_embedding_devices | AutoOr<long> of compEnv[+0x898] | reserved-for-embedding partition |
| NumScOffloadDevices | sc_dev − num_embedding_devices | offload-available SC count (builder rbx) |
| tensor_split_factor | optional<int> builder arg → proto field 5 | in-collective tensor cut (==2 ⇒ split mode) |
| per-axis ring segments | GetDimensionRings: extent / devcount − 1 | how the ring is cut along each axis |
| ND-plane dimension count | has_x + has_y + has_z (popcount of the three has-bits) | torus axes the collective spans (twist gate) |
| per-axis span stride | $_0 lambda: coords[1] − coords[0], where stride must divide extent | NDPlaneStrideInfo stride_x/y/z |
| minor-to-major axis order | GetMinorToMajorOrder @0x133c1c40 (reads NDPlaneStrideInfo + topo[+0xa3]) | ring traversal order across spanned axes |
The NumScOffloadDevices total bounds the running devcount that GetDimensionRings partitions across X/Y/Z; the NDPlaneInfo dimension count selects the builder's twist / ND-plane branch; the tensor_split_factor modifies the per-color emission (color duplication in mode 2). GetMinorToMajorOrder then turns the per-axis NDPlaneStrideInfo span strides into the minor-to-major axis ordering the per-color rings iterate, gated by the topo[+0x3b8][+0xa3] across_cores_on_chip flag (FATAL "Stride x/y/z should be set for 3D plane." at lines 2295/2297/2299 if a 3D plane is missing an axis stride).
4. Verification notes
Cross-checked against the IDA decompile of libtpu.so v0.0.40 (build-id 89edbbe8…):

- NumScOffloadDevices (@0x1d6b8b00): sc_total = topo[+0x94]; ldpc_sc = LogicalDevicesPerChip(2); sc_dev = sc_total/ldpc_sc (idiv, ldpc_sc>0 guard); AutoOr<long>::FromProtoOrDie(compEnv[+0x898]) engaged-bit; both FATAL CHECKs ("num_embedding_devices >= 0" line 1803, "num_embedding_devices <= sc_per_device" line 1805); return sc_dev − n_emb. All byte-exact. Sibling NumEmbeddingDevices (@0x1d6b8a00) confirmed as the complement partition.
- tensor_split_factor gate (AllGather builder @0x133c82c0): value_or(1); the cmp 2 (>= 2) gate; RetCheck "!use_single_core.value_or(kDefaultUseSingleCore)" → "A larger than 1 tensor split factor requires more than one sparse core to split the tensor on."; RetCheck "tensor_split_factor.value_or(kDefaultTensorSplitFactor) == 2" (line 1558) → "We currently only support tensor split factor of 2 across two sparse cores."; VLOG "Adopting split tensor mode."; the mode-2 effect VLOG "Twisted torus: duplicate colors as indicated by tensor split factor." All byte-exact. Forwarding from AR (push [rbp+0x10] @0x133c2d01) / RS confirmed; AllGather wrapper @0x133c76c0 confirmed to have no optional<int> parameter.
- GetDimensionRings (@0x133df520): X/Y/Z extents [+0x3b8][0x58/0x5c/0x60]; LogicalDevicesPerChip(0) (TC) megacore detect; the split extent / devcount − 1 byte-confirmed at @0x133df670 (idiv %ebx) / @0x133df672 (dec %eax) / @0x133df674 (mov %eax,-0xb8(%rbp); decompile (int)v12 / v10 − 1). Exact.
- ExtractNDPlaneInfo::$_0 (@0x133bf700): coords.size()==1 fast path (has_size=0); stride = coords[1] − coords[0]; the three RetChecks at lines 922/923/925 ("Stride must be larger or equal to 1." / "Stride must be less than the dimension size." / "Stride must divide the dimension size."); the per-pair stride-consistency scan (line 931); the has_size=1, size=stride success store. All byte-exact.
- NDPlaneInfo / NDPlaneStrideInfo layout: NDPlaneInfo::ToString @0x10fdf2a0 labels "size_x: " / "size_y: " / "size_z: " / "stride_info: "; NDPlaneStrideInfo::ToString @0x10fe62e0 labels "stride_x: " / "stride_y: " / "stride_z: " / "across_cores_on_chip: " with reads at +0/+4, +8/+0xc, +0x10/+0x14, +0x18; the two consumers GetCollectiveNDPlaneDimensionCount @0x133bb6e0 (has-bits) and GetMinorToMajorOrder @0x133c1c40 (has@+0x10/+0x18/+0x20, stride@+0xc/+0x14/+0x1c) agree with the ToString-derived layout. Byte-exact.
- GetCollectiveNDPlaneDimensionCount (@0x133bb6e0): *((_DWORD *)this + 2) = v26 + v27 + v28 (sum of the three has-bytes) confirmed; CHECK "collective != nullptr" line 845; AddSourceLocation lines 849/852. Exact.
- dense ReplicaGroupsOnNDPlane (@0x1c890960) + lambda ReplicaGroupsOnNDPlaneImpl::$_0 (@0x1c896400): the entry function builds the NDPlaneCacheKey (device-assignment render + TpuTopologySerdes::Distill / ToProto / SerializeToStringDeterministic + n_dim + bool) under nd_plane_cache_mutex, asserts n_dim == 1||2||3 ("…only supports dimension n_dim = 1, 2 or 3.", group_utils.cc line 1155), and on a cache miss calls the lambda once per axis with 0x100000000 | axis (axis ∈ {0,1,2}). Byte-confirmed. The lambda's per-axis sub-core fold mesh_extent[axis&3] *= LogicalDevicesPerChip(0) (v253[8*(a5&3)+16] *= v51), the n_dim==1 trivial-MeshNDInfo short-circuit (**(int**)a2 == 1 arm), the per-device TensorCoreLocationForLogicalDeviceId (@0x1c8904e0) → chip_coordinates → mixed-radix linearize, the multi-slice fan-out via GetMultiSliceTopology / GetNumSlices / GetMegascalePerSliceCoreIdAndSliceId (@0x1c8906e0) with an 88-byte PerSliceReplicaData stride, and the n_dim-dispatch to ReplicaGroupForm2DRing (@0x1c88e6e0) / ReplicaGroupsOn3DPlane (@0x1c8901e0) with nullopt-on-no-fit. All decompile-verified.

[LOW] Confirmed by structure / label, not by an independent numeric consumer:

- The exact arithmetic the top-level NDPlaneInfo size_x/y/z ints carry (plane extent vs device-count along axis): the labels are byte-read and size_z (+0x8) is used as the iteration/plane bound by GetMinorToMajorOrder, but the producer arithmetic in ExtractNDPlaneInfo's sret tail (@0x133bc674..@0x133bc880) was not individually decoded. The size semantic is inferred from the "size_x: " label and the consumer use.
- The unification of NDPlaneStrideInfo (SC) with the dense MeshNDInfo (TC): both report a per-axis span, but the field-by-field correspondence was not closed; the SC dimension count uses popcount of the has-bits, the TC uses ReplicaGroupsOnNDPlane.num_mesh_dims.
Cross-References
Scope boundaries (this page's neighbors)
- Physical-Core Placement: where the chosen logical colors land on concrete physical SC cores, and the tensor_split_mode==2 per-core emission keyed by TensorSplitPerCoreClassifier
- SelectNDStrategy: the dense-substrate ND-strategy choice (sub-plane / ND-ring / twisted / strided), the TC ReplicaGroupsOnNDPlane analog this page's NDPlaneInfo mirrors
SparseCore-offload substrate
- SC-Offload Config Builder: the *OffloadConfig proto carrying tensor_split_factor (field 5), and the ConstructConfigForCollectiveUniDirNDGroups<*> builder this page's gate lives inside
- HierarchicalKind: the AutoOr<bool> flat-vs-hierarchical split the offload builder dispatches on
- SC Core-Selection (Offload): SparseCoreConfig.offload op-type classification + core selection
Substrate map + sibling subsystems
- On-Pod Collectives — Section Map — the substrate split, the SC-offload gate, and the shared physical-torus mesh decomposition
- AllToAll Tables: the all-to-all / ragged link tables (EstimatePhysicalLinksUsed)