TPU Buffer Layout
All addresses, struct offsets, and field tags on this page apply to
libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d). The image is not stripped; demangled C++ symbol names are quoted verbatim. Other versions will differ.
Abstract
A PJRT device buffer on a TPU is a flat run of HBM bytes whose internal arrangement is not row-major. Between the logical xla::Shape a user hands to the runtime (a (element_type, dimensions[], layout) triple) and the bytes that actually land in HBM sits one translation, xla::jellyfish::TransferSizeUtil::HostShapeToDeviceShape, which (a) chooses a minor-to-major dimension order, (b) pads the two minor-most dimensions up to the chip's (sublane, lane) tile, (c) packs sub-byte and splits >32-bit element types, and (d) stamps an xla::Tile list onto the layout describing the physical tiling. The padded device shape is what the allocator sizes, what a DMA descriptor addresses, and what ShapeSizeBytesRaw measures. This page documents that translation: the shape↔layout data model, the tile-padding arithmetic, the element-packing rules, and the on-device buffer residency record (xla::ShapedBuffer → XLA_ShapedBuffer) that names which HBM addresses hold a buffer's leaves.
The reader who knows XLA on GPU should hold one analogy and immediately complicate it. On a GPU an xla::Layout is a minor_to_major permutation and little else; the bytes are dense row-major-after-permutation. On TPU the layout additionally carries a tile list (xla::Layout::tiles(), a vector of xla::Tile), and the physical buffer is the logical array reshaped into (sublane, lane) tiles laid out tile-major. The lane dimension is 128 across every generation; the sublane dimension is 8 on this v5+/v6e/TPU7x build (an earlier generation used 16). This is the same (SublaneCount, LaneCount) tile that layout assignment stamps onto every leaf shape. The hardware reads a tile as one contiguous chunk, so a [3, 5] f32 array does not occupy 60 bytes: it pads to [8, 128] (one tile) and occupies one granule, with the trailing lanes and sublanes filled (with 0xFF on the pad path). Wasted padding is real HBM, and minimising it is exactly why the layout-assignment chooser optimises for compact byte size.
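To make the cost of that padding concrete, the sketch below compares the logical byte count of the [3, 5] f32 example with the byte count of its padded [8, 128] shape. The helper name dense_bytes is illustrative rather than a libtpu symbol, and the arithmetic assumes 4-byte elements, as in the example.

```python
from math import prod

def dense_bytes(dims, elem_bytes=4):
    """Bytes of a dense array with the given extents (no metadata)."""
    return prod(dims) * elem_bytes

logical = dense_bytes([3, 5])    # the logical array the user described
padded = dense_bytes([8, 128])   # the one-tile padded shape from the text
waste = padded - logical         # HBM bytes spent purely on padding

print(logical, padded, waste)
```

For this small array almost all of the padded footprint is padding, which is the pressure the compact-size chooser responds to.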
This page owns the on-device tiled buffer: the shape→device-shape mapping (HostShapeToDeviceShape / SetPaddedShape), the tile-padding rules (Pad2ndMinorCompact, GetCompactTiles), the element-packing factors, the byte-size functions (ShapeSizeBytesRaw / ShapeSizeCompact / ShapeWithMetadataSizeBytes), and the residency record. It does not reproduce the compile-time choice of minor_to_major (that is layout-assignment.md), the HBM free-list allocator that hands out the offset (that is hbm-allocator.md), the DMA wire alignment (that is hbm-dma-alignment.md), or the PJRT PJRT_Buffer ABI and external refcounting (that is pjrt/buffer-and-memory.md).
For reimplementation, the contract is:
- The shape↔layout data model: the xla::Shape / xla::Layout fields the device layout reads, and the XLA_Shape / XLA_Layout C-ABI mirror with their exact offsets.
- The shape→device-shape mapping (HostShapeToDeviceShape → SetPaddedShape): unpack sub-byte types, recurse, pad the minor dims, stamp the tile.
- The tile-padding rules: minor dim → multiple of LaneCount (else next pow2); 2nd-minor via Pad2ndMinorCompact; the granule_bytes % sublane_bytes == 0 invariant.
- The element packing: ElementPackingFactor (sub-byte pack via tc_max_packing_factor), and the 64/128-bit split (ComponentShape<64> / <128>).
- The residency record: xla::ShapedBuffer (on-device shape + per-leaf DeviceAddressBase), mirrored to XLA_ShapedBuffer.
| Item | Symbol / address |
|---|---|
| Host→device shape | xla::jellyfish::TransferSizeUtil::HostShapeToDeviceShape (via HardwareLayout::HostShapeToDeviceShape @ 0xeab0e20) |
| Padded-shape core | TransferSizeUtil::SetPaddedShape @ 0x1d6ae0e0 |
| Tile builder | TransferSizeUtil::GetCompactTiles @ 0x1d6b11c0 (returns inlined_vector<xla::Tile,3>) |
| 2nd-minor pad | TransferSizeUtil::Pad2ndMinorCompact @ 0x1d6af5c0 |
| Element packing | TransferSizeUtil::ElementPackingFactor @ 0x1d6b03e0 (table-driven, kPackingFactors[35]) |
| Padded byte size | TransferSizeUtil::ShapeSizeBytesRaw @ 0x1d6add40; with metadata ShapeWithMetadataSizeBytes (via HardwareLayout::ShapeSize @ 0xeab0ec0) |
| Compact byte size | TransferSizeUtil::ShapeSizeCompact @ 0x1d6ae8a0 (via HardwareLayout::ShapeSizeCompact @ 0xeab0f20) |
| Tile stamp | xla::jellyfish::HardwareLayout::PopulateShape; per-leaf TransferSizeUtil::UpdateLeafLayout @ 0x1d6b08c0 |
| Linearizer | TransferSizeUtil::LinearIndex @ 0x1d6b0600 → xla::LayoutUtil::LinearIndex when tiled |
| HBM repacker | RepackToHardwareLayout<sublane,128> @ 0x1d5c3b40 (<16,128>), 0x1d5c2880 (<2,128>), 0x1d5c3f80 (<32,128>) |
| Residency record | xla::ShapedBuffer → XLA_ShapedBuffer (ApiConverter::ToC @ 0xfcb8580) |
| Default tile | (SublaneCount, LaneCount) = (8, 128) this build; (16, 128) on an earlier generation |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
1. The Shape / Layout Data Model
Purpose
Everything on this page operates on two C++ objects: xla::Shape and the xla::Layout it contains. A reimplementer must reproduce these fields exactly, because the device-layout code reads them by raw offset and the runtime mirrors them across the C ABI into XLA_Shape / XLA_Layout for the PJRT/stream_executor boundary. The compile-time pass that fills in minor_to_major and the memory_space is layout assignment; this page is the consumer.
The xla::Layout fields the device path reads
xla::Layout is an absl::InlinedVector-rich struct; the device code touches only a handful of members. The C-ABI mirror XLA_Layout (built by ApiConverter::ToC(const xla::Layout&, XLA_Layout*) @ 0xfcb7ca0) pins the offsets, and the on-device functions reach the same data through xla::Shape::layout() (returns a Layout*).
| Field | xla::Layout access | XLA_Layout offset | Meaning |
|---|---|---|---|
| minor_to_major[] | layout()[2..3] inlined vector (count at [2]>>1) | +0 (ptr), +48 (count) | Physical dim order, minor-first |
| dim_level_types / flags | layout() low bytes | +400, +404 | sparse/dense level kinds |
| index_primitive_type / pointer_primitive_type | *((qword*)layout+1) | +408 | sparse index/pointer types |
| element_size_in_bits | layout()[1] | +416 | non-zero for packed/odd-width elements; 0 ⇒ use dtype byte size |
| memory_space | *((char*)layout+2) | +424 | tier color (HBM/VMEM/…); see overview.md |
| tiles[] | layout()[9] is tiles().size() | +56 (ptr), +392 (count) | the physical tile list (each xla::Tile is a dim vector) |
| tail_padding_alignment_in_elements | *((qword*)layout+24) | +432 | trailing-element pad quantum |
The single most consulted field is layout()[9] — the tile count. LinearIndex (0x1d6b0600), ShapeSizeBytesRaw (0x1d6add40), and HasLinearLayout (0x1d6b0160) all branch on whether tiles().size() >= 2 (or > 1): a populated tile list means the buffer is tiled and addressed via ShapeUtil::ArraySize / LayoutUtil::LinearIndex; an empty tile list means a linear (untiled) buffer.
The xla::Shape fields
xla::Shape (and its XLA_Shape mirror, ApiConverter::ToC @ 0xfcb7940) carries the dims and the layout. XLA_Shape offsets, decoded directly from the converter:
XLA_Shape (536 bytes per element in a tuple array):
+0 element_type (PrimitiveType, int32)
+8 dimensions[] (int64 array) count @ +56
+64 dynamic_dims[] (bool array) count @ +72
+80 tuple_shapes[] (XLA_Shape array, 536 B stride) count @ +88
+92 has_layout (bool)
+96 layout (XLA_Layout, present iff has_layout)
NOTE: the on-device functions special-case three element_type families up front: the searchable array types (≤ 0x21), recovered from the validity bitmask 0x2FFF91FFE tested with _bittest64; the mask-0x20-family token-and-opaque types; and the 64/128-bit families, selected by the bitmask 0x400048000. The same 0x2FFF91FFE mask gates the layout-assignment chooser; it is the canonical "is this a tiled array element type" predicate.
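The bit test in the note can be sketched as follows. The mask constant is the one quoted above. The function names are illustrative, and treating out-of-range type ids as "not set" is an assumption of this sketch rather than recovered behaviour.

```python
TILED_ARRAY_MASK = 0x2FFF91FFE  # validity bitmask quoted in the note above

def bittest64(mask: int, bit: int) -> bool:
    """Mimic _bittest64: report whether `bit` is set in a 64-bit mask."""
    if not 0 <= bit < 64:
        return False  # assumption: out-of-range ids count as "not set"
    return bool((mask >> bit) & 1)

def is_tiled_array_type(element_type: int) -> bool:
    """True when the element type's bit is set in the validity mask."""
    return bittest64(TILED_ARRAY_MASK, element_type)

# TUPLE (13) and TOKEN (17) are the enum values this page quotes in section 4.
print(is_tiled_array_type(13), is_tiled_array_type(17))
```

Both values fall outside the mask, which matches the separate tuple and token branches in the size dispatcher.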
Function Map
| Function | Address | Role |
|---|---|---|
ApiConverter::ToC(const xla::Layout&, XLA_Layout*) | 0xfcb7ca0 | xla::Layout → C ABI; pins layout offsets |
ApiConverter::ToC(const xla::Shape&, XLA_Shape*) | 0xfcb7940 | xla::Shape → C ABI; recurses tuple shapes |
ApiConverter::FromC(XLA_Shape*) | 0xfcb7400 | C ABI → xla::Shape for runtime calls |
xla::jellyfish::TransferSizeUtil::HasLinearLayout(const Layout&) | 0x1d6b0160 | tile list empty ⇒ linear |
2. Host Shape → Device Shape
Purpose
HostShapeToDeviceShape is the entry point that turns a logical host xla::Shape into the physical device shape the TPU stores. It is what the transfer manager, the allocator-sizing path, and the C shim Tpu_OnDeviceShape call. The work is delegated to TransferSizeUtil::HostShapeToDeviceShape, which fans out to SetPaddedShape per leaf and re-stamps the layout. The padded device shape differs from the host shape in three ways: padded minor dims, packed/split element types, and a populated tiles() list.
Entry Point
HardwareLayout::HostShapeToDeviceShape (0xeab0e20) ── C-ABI wrapper
├─ ApiConverter::FromC(host XLA_Shape -> xla::Shape)
├─ GetRegisteredDeepseaPlatform / DeepseaPlatform::GetTopology ── the tpu::TpuTopology*
├─ GlobalTpuCompEnv() ── reads comp-env+4124, +4373 (pack flags)
├─ TransferSizeUtil::HostShapeToDeviceShape(topology, shape, ...)
│ └─ per leaf: SetPaddedShape (0x1d6ae0e0) -> HardwareLayout::PopulateShape
└─ ApiConverter::ToC(device xla::Shape -> XLA_Shape)
Algorithm — SetPaddedShape
SetPaddedShape (0x1d6ae0e0) is the core of the mapping. It is recursive on packed element types and computes, per minor dimension, the padded extent before handing the dims to HardwareLayout::PopulateShape to stamp the tile.
function SetPaddedShape(topology, shape /*buffer leaf*/, out_shape): // 0x1d6ae0e0
    REQUIRE(LayoutUtil::HasLayout(shape)) // FATAL "Can't pad a shape without knowing its layout", line 1019
    // ---- Step 1: sub-byte / packed element types -> unpack, recurse, re-stamp ----
    pack = ShouldPackPREDAsSingleBit(topology, shape) // PRED -> 1-bit packing
    if ElementPackingFactor(shape.element_type(), pack) >= 2:
        unpacked = GetUnpackedShape(topology, shape) // widen the packed dtype
        SetPaddedShape(topology, unpacked, out_shape) // recurse on the unpacked shape
        out_shape.set_element_type(shape.element_type()) // restore original packed dtype
        ForEachMutableSubshape(out_shape, UpdateLayout) // re-stamp tile per leaf
        return OK
    pbytes = ShapeUtil::ByteSizeOfPrimitiveType(shape.element_type())
    // ---- Step 2: rank<=1 scalar/vector fast path ----
    if shape.dimensions().size() <= 1:
        CHECK(scalar_layout.minor_to_major().size() == 0) // line 1040
        chunk_bytes = 4 * topology.chunk_size_elems // this[53]
        CHECK(chunk_bytes % pbytes == 0) // line 1043
        out_shape = PopulateShape(et, {chunk_bytes/pbytes}, tile=1, layout)
        return OK
    // ---- Step 3: only 4-byte (or PRED) element widths are implemented for tiling ----
    if shape.element_type() != PRED and pbytes != 4:
        // line 1050
        return Unimplemented("Attempted to map shape ... to on-device TPU padded shape but this is not implemented")
    // ---- Step 4: pad each minor dim up to its chunk bound ----
    padded_dims = copy(shape.dimensions())
    for d in [0 .. rank): // walk in minor_to_major order
        dim_idx = minor_to_major[rank-1 - d]
        bound = ChunkBound(topology, d, rank, dim_idx) // 0x1d6b22e0
        extent = shape.dimensions(dim_idx)
        padded_dims[dim_idx] = round_up(extent, bound) // ceil(extent/bound) * bound
    // ---- Step 5: stamp the tile + memory_space ----
    out_shape = HardwareLayout::PopulateShape(et, padded_dims, ntiles, layout)
    return OK
GOTCHA: Step 3 is a hard wall. SetPaddedShape only implements tiling for 4-byte element widths (and PRED). Anything wider takes the 64/128-bit split path in ShapeSizeBytesRaw (§4) before it ever reaches a 4-byte SetPaddedShape; anything narrower is packed up to a 4-byte-equivalent in Step 1. A reimplementation that tries to tile a raw bf16 or s8 buffer directly will hit the Unimplemented at line 1050, because the design forces every element width into the 4-byte tiling kernel via pack/split first.
QUIRK: the rank≤1 fast path (Step 2) does not pad to the 2D (sublane, lane) tile. A scalar or 1-D buffer is reshaped to a single chunk of chunk_bytes/pbytes elements (chunk_bytes = 4 * topology.chunk_size_elems). Only rank≥2 buffers get the 2-minor tile. This is why a 1-D vector of N f32 elements rounds up to a chunk multiple, not to a [1, 128] lane tile.
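The Step 4 round-up can be sketched in isolation. In this sketch chunk_bound is a caller-supplied stand-in for ChunkBound, whose per-dimension values are not reproduced here, and toy_bound is a hypothetical bound function chosen only to make the example concrete.

```python
def round_up(extent: int, bound: int) -> int:
    """ceil(extent / bound) * bound, for a positive bound."""
    return -(-extent // bound) * bound

def pad_dims(dims, minor_to_major, chunk_bound):
    """Pad every dimension up to its bound, visiting dims as Step 4 does.

    chunk_bound(d, rank, dim_idx) is a stand-in for ChunkBound.
    """
    rank = len(dims)
    padded = list(dims)
    for d in range(rank):
        dim_idx = minor_to_major[rank - 1 - d]
        padded[dim_idx] = round_up(dims[dim_idx], chunk_bound(d, rank, dim_idx))
    return padded

def toy_bound(d, rank, dim_idx):
    # Hypothetical bounds: 128 for the last logical dim, 8 for the others.
    return 128 if dim_idx == rank - 1 else 8

print(pad_dims([3, 5], [1, 0], toy_bound))
```

With these made-up bounds the [3, 5] example pads to [8, 128]; a faithful reimplementation would replace toy_bound with the recovered ChunkBound logic.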
Function Map
| Function | Address | Role |
|---|---|---|
| HardwareLayout::HostShapeToDeviceShape | 0xeab0e20 | C-ABI host→device shape |
| TransferSizeUtil::SetPaddedShape | 0x1d6ae0e0 | Per-leaf pad + tile stamp |
| TransferSizeUtil::GetUnpackedShape | 0x1d6b2100 | Widen a packed dtype for the recursion |
| TransferSizeUtil::ShouldPackPREDAsSingleBit | 0x1d6b0080 | PRED 1-bit packing predicate |
| TransferSizeUtil::ChunkBound | 0x1d6b22e0 | Per-dim chunk bound used in the pad round-up |
| HardwareLayout::PopulateShape | 0x1d6da360 | Stamp dims + tile + memory_space |
| XlaShapeToTpuPaddedShape (C shim) | 0xeabf0e0 | tensorflow::XlaTpuPaddedShapeFn boundary |
| XlaShapeToTpuShapeRepresentation (C shim) | 0xeabefa0 | Shape-representation boundary |
3. The Tile and the Padding Rules
Purpose
A device buffer's physical bytes are the logical array reshaped into (sublane, lane) tiles, laid out tile-major. This section gives the tile geometry, the exact 2nd-minor padding arithmetic, and the granule % sublane invariant a reimplementer must enforce. The tile values are read from the chip's tpu::TpuTopology at boot, never hard-coded.
The tile geometry
The default leaf tile is (SublaneCount, LaneCount), the same xla::Tile layout assignment stamps onto every leaf:
- LaneCount = 128 across every generation (topology field *((qword*)topology+52), used as the minor-dim bound).
- SublaneCount = topology field *((qword*)topology+51): 8 on this v5+/v6e/TPU7x build, 16 on an earlier generation (the 2nd-minor bound).
- granule_bytes = *(qword*)(topology->[1] + 200): the hardware DMA granule. GetCompactTiles and Pad2ndMinorCompact both CHECK(granule_bytes % sublane_bytes == 0) at line 77, where sublane_bytes = 4 * topology->[51] (4 bytes × sublane count).

A leaf's physical run is therefore a sequence of tiles covering the two minor dims, each tile holding SublaneCount × LaneCount 4-byte slots, with partial tiles padded out. For sub-byte dtypes an extra subtile is appended (see "Element packing" below).
The 2nd-minor padding — Pad2ndMinorCompact
Pad2ndMinorCompact (0x1d6af5c0) computes the padded 2nd-minor extent and the per-tile element count. It is the precise rule a reimplementer must copy.
function Pad2ndMinorCompact(topology, extent /*2nd-minor logical*/, element_type): // 0x1d6af5c0
    lane = topology[52] // LaneCount, 128
    if extent >= lane:
        padded = round_up(extent, lane) // ceil(extent/lane) * lane
    else:
        padded = next_pow2(extent) // 1 << ceil_log2(extent)
    granule_bytes = *(topology->[1] + 200)
    sublane_bytes = 4 * topology[51] // 4 * SublaneCount
    CHECK(granule_bytes % sublane_bytes == 0) // line 77
    per_tile_rows = (granule_bytes / sublane_bytes) * ElementPackingFactor(element_type, /*pack=*/0)
    return max(padded, per_tile_rows)
QUIRK: the sub-LaneCount branch rounds to the next power of two, not to the lane count. A 2nd-minor extent of 5 pads to 8, not to 128; an extent of 100 pads to 128 because that is the next power of two, which here happens to equal the lane count. The padded extent is then raised to at least per_tile_rows so a tile is never smaller than one granule's worth of sublanes. A reimplementation that always rounds the 2nd-minor to LaneCount will over-pad every small buffer and disagree with libtpu's HBM footprint.
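The rule above translates almost line for line into a small function. This is a sketch under stated assumptions: the topology values are passed in as parameters, and the defaults granule_bytes=32 and packing_factor=1 are placeholders chosen so the granule floor does not dominate. The real values come from the topology and from ElementPackingFactor.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def pad_2nd_minor(extent, lane=128, sublane=8, granule_bytes=32, packing_factor=1):
    """Padded 2nd-minor extent, following the pseudocode above."""
    if extent >= lane:
        padded = -(-extent // lane) * lane   # round up to a lane multiple
    else:
        padded = next_pow2(extent)           # small extents: next power of two
    sublane_bytes = 4 * sublane
    if granule_bytes % sublane_bytes != 0:   # mirrors the CHECK at line 77
        raise ValueError("granule_bytes must be a multiple of sublane_bytes")
    per_tile_rows = (granule_bytes // sublane_bytes) * packing_factor
    return max(padded, per_tile_rows)

for extent in (5, 100, 130):
    print(extent, "->", pad_2nd_minor(extent))
```

Passing a larger placeholder granule, for example granule_bytes=512, shows the floor taking over: an extent of 5 then pads to 16 rather than 8.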
Element packing — ElementPackingFactor
ElementPackingFactor (0x1d6b03e0) returns how many logical elements share one 4-byte physical slot. It is fully table-driven:
function ElementPackingFactor(topology, element_type, pack_pred_as_bit): // 0x1d6b03e0
    tc_max = topology->[432] // tc_max_packing_factor, a power of two
    REQUIRE(popcount(tc_max) == 1) // else FATAL "Unsupported tc_max_packing_factor", line 1489
    k = tzcnt(tc_max) // 0..5 (1,2,4,8,16,32)
    REQUIRE(element_type < 0x23) // 35 primitive types
    if k == 0: return 1 // no packing
    return kPackingFactors[2^k][pack_pred_as_bit][element_type]
The kPackingFactors[35] tables are per-tc_max (<2>, <4>, <8>, <16>, <32>) and per-pack-mode (the pack_pred_as_bit flag selects the true table for tc_max >= 8, packing PRED as a single bit). The factor for a given dtype is the number of logical elements that fit in 4 bytes at that tc_max. This is the same packing the layout-assignment ShapeSizeCompact folds into its byte count.
NOTE: sub-byte packing adds subtiles to the tile list. In GetCompactTiles, after the main (sublane, lane) tile is pushed, if ElementPackingFactor(...) >= 2 it appends a GetSubtileForPacking tile (and, when the minor dim has extent 1, a GetSubtileForBreakingMinorDimensionForPacking tile). So a packed buffer's tiles() list has length 2–3, and a reimplementer must emit the subtile or the bytes will not match.
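The control flow of the lookup (power-of-two check, trailing-zero count, range check, table index) can be sketched without the table contents. The tables argument and the toy dictionary below are hypothetical: the actual kPackingFactors values are not reproduced on this page, so the numbers here are placeholders, not libtpu data.

```python
def packing_table_index(tc_max: int) -> int:
    """Validate tc_max as a power of two and return k = log2(tc_max)."""
    if tc_max <= 0 or bin(tc_max).count("1") != 1:
        raise ValueError("Unsupported tc_max_packing_factor")
    return (tc_max & -tc_max).bit_length() - 1   # trailing-zero count

def element_packing_factor(tc_max, element_type, pack_pred_as_bit, tables):
    """tables[(tc_max, pack_pred_as_bit)] is a 35-entry list from the caller."""
    k = packing_table_index(tc_max)
    if not 0 <= element_type < 0x23:
        raise ValueError("element_type out of range")
    if k == 0:
        return 1                                  # tc_max == 1: no packing
    return tables[(tc_max, bool(pack_pred_as_bit))][element_type]

# Placeholder table, NOT the libtpu contents: all ones except one made-up
# entry at index 2, present only to show the lookup path.
toy = {(4, False): [1] * 35}
toy[(4, False)][2] = 4
print(element_packing_factor(4, 2, False, toy))
```

Keeping the table as an input makes it straightforward to substitute the recovered per-tc_max tables once they are extracted.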
Function Map
| Function | Address | Role |
|---|---|---|
| TransferSizeUtil::GetCompactTiles | 0x1d6b11c0 | Build the inlined_vector<xla::Tile,3> for a leaf |
| TransferSizeUtil::Pad2ndMinorCompact | 0x1d6af5c0 | 2nd-minor pad + per-tile rows |
| TransferSizeUtil::ElementPackingFactor | 0x1d6b03e0 | Logical elements per 4-byte slot (table-driven) |
| TransferSizeUtil::GetSubtileForPacking | 0x1d6b1f20 | Sub-byte subtile |
| TransferSizeUtil::GetSubtileForBreakingMinorDimensionForPacking | 0x1d6b1fe0 (called in 0x1d6b11c0) | Subtile when minor extent == 1 |
| TransferSizeUtil::DoesShapeRequireMultiChunkPacking | 0x1d6b0720 | Multi-chunk packing predicate |
| Target::LaneCount / SublaneCount | 0x1d60f400 / 0x1d60f300 | Tile dims from chip descriptor |
4. Byte Sizes — ShapeSizeBytesRaw and Friends
Purpose
The allocator must know how many HBM bytes a device buffer occupies. Three size functions exist, all driven off the padded device shape, and a reimplementer must pick the right one: ShapeSizeBytesRaw (the padded data bytes), ShapeSizeCompact (the same with compact tiling), and ShapeWithMetadataSizeBytes (data + per-buffer metadata, used by the transfer manager). They are surfaced to the C ABI as HardwareLayout::ShapeSize (0xeab0ec0), HardwareLayout::ShapeSizeCompact (0xeab0f20), and HardwareLayout::ShapeSizeCompactRaw (0xeab0f80).
Algorithm — ShapeSizeBytesRaw
ShapeSizeBytesRaw (0x1d6add40) is the dispatcher over element-type families. The branch order is the reimplementation contract.
function ShapeSizeBytesRaw(topology, shape): // 0x1d6add40
    et = shape.element_type()
    if et == TUPLE (13): // tuple = array of pointers
        n = ShapeUtil::TupleElementCount(shape)
        return round_up(4 * n, ptr_granule) // 4 B / element, topology->[1]+200
    if et == TOKEN (17) or IsZeroElementArray(shape) or MaxElementsInPerSplit == 0:
        return 0
    if ElementHasBitWidth(shape, 64): // s64/u64/f64/c64-half
        return 2 * ShapeSizeBytesRaw(topology, ComponentShape<64>(shape)) // hi/lo split
    if ElementHasBitWidth(shape, 128): // c128
        return 2 * ShapeSizeBytesRaw(topology, ComponentShape<128>(shape))
    if ColorToMemorySpace(layout.memory_space) == kSparseCoreSequencerSflag (12):
        return ExtentProduct(shape) * ByteSizeOfPrimitiveType(et) // SparseCore: untiled, dense
    if shape.layout().tiles().size() >= 2: // already tiled
        return ShapeUtil::ArraySize(shape)
    // untiled 4-byte / packed leaf: pad it, then measure
    tmp = thread_local Shape
    CHECK(SetPaddedShape(topology, shape, &tmp) is OK) // line 264
    max_elems = LayoutUtil::MaxElementsInPerSplit(tmp, shape)
    if layout.element_size_in_bits == 0:
        return max_elems * ByteSizeOfPrimitiveType(et)
    else:
        bits = element_size_in_bits * max_elems // packed: bit-accurate
        return bits/8 + (rounding correction) // ceil to byte
NOTE: the literal 12 that ShapeSizeBytesRaw (0x1d6add40) tests on the untiled-dense branch is sparse_core_sequencer_sflag in the canonical LLO MemorySpace enum, not sparse_core_sequencer_smem, which is 14 (byte-confirmed by MakeSparseCoreSequencerSmemConstant @ 0x1d60bc60 = mov $0xe,%esi). ColorToMemorySpace (0x1d6ffb00) is a byte_B5435CA[color] remap gated on color < 0xA, so its output is already a canonical MemorySpace enum value, not a raw layout color. See memory-space-table.md for the 17-value owner table.
GOTCHA: the 64- and 128-bit element split is not "8 bytes per element". A 64-bit buffer is decomposed by ComponentShape<64> into two 32-bit component buffers (high and low words), each tiled independently as a 4-byte buffer, and the sizes summed (2 * ...). So an [N] f64 buffer is physically two [N] f32 tiled buffers, not one [N] 8-byte tiled buffer. A reimplementation that treats f64 as an 8-byte element and tiles it directly will mis-size and mis-address every double-precision buffer. The same split governs how DMA reads the buffer.
NOTE: the element_size_in_bits != 0 branch computes the packed-buffer size by multiplying the element count by the bit width and dividing by 8, with a ceiling correction. This is how a bit-packed PRED buffer (1 bit/element) reports its byte size. A value of 0 means "use the dtype's natural byte size"; a non-zero value means "this many bits per element, round the total up to bytes".
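The last two branches of the dispatcher, and the 64-bit doubling, reduce to a few lines of arithmetic. In this sketch the function names are illustrative, and the (bits + 7) // 8 ceiling is one reading of the "rounding correction" in the pseudocode, not a recovered instruction sequence.

```python
def leaf_bytes(max_elems, dtype_bytes, element_size_in_bits=0):
    """Final branch of the dispatcher: natural-width or bit-packed size."""
    if element_size_in_bits == 0:
        return max_elems * dtype_bytes      # use the dtype's byte size
    bits = element_size_in_bits * max_elems
    return (bits + 7) // 8                  # round the bit total up to bytes

def split64_bytes(component_bytes):
    """A 64-bit leaf is sized as two 32-bit component buffers."""
    return 2 * component_bytes

print(leaf_bytes(1024, 4))                          # 4-byte elements
print(leaf_bytes(1000, 1, element_size_in_bits=1))  # bit-packed PRED
print(split64_bytes(leaf_bytes(1024, 4)))           # f64 via two components
```

The max_elems input stands for the padded element count, so the padding rules have to be applied before this step.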
Function Map
| Function | Address | Role |
|---|---|---|
| TransferSizeUtil::ShapeSizeBytesRaw | 0x1d6add40 | Padded data bytes (family dispatch) |
| TransferSizeUtil::ShapeSizeCompact | 0x1d6ae8a0 | Compact-tiled byte size |
| TransferSizeUtil::ShapeSizeCompactRaw | 0x1d6aea60 | Compact bytes w/o metadata |
| TransferSizeUtil::ShapeWithMetadataSizeBytes | 0x1d6aea00 (via 0xeab0ec0) | Data + per-buffer metadata |
| HardwareLayout::ShapeSize / ShapeSizeCompact | 0xeab0ec0 / 0xeab0f20 | C-ABI byte-size wrappers |
| HardwareLayout::ComponentShape<64> / <128> | 0x1d6d9cc0 / 0x1d6d9e40 | 64/128-bit element split |
| xla::ShapeUtil::ArraySize | (OSS) | Tiled-array byte size |
5. Linearization and the HBM Repacker
Purpose
Two operations turn the tiled device shape into actual byte movement: linearization (LinearIndex, the tiled multi-index → flat offset) and the repack kernel (RepackToHardwareLayout, the SIMD routine that physically shuffles host bytes into tiled hardware order). A reimplementer needs both: the index math to address a single element, and the repack to stage a whole buffer for DMA.
Algorithm — LinearIndex
function LinearIndex(topology, shape, multi_index): // 0x1d6b0600
    CHECK(!ElementHasBitWidth(shape, 64)) // 64-bit handled by component split, line 646
    if shape.layout().tiles().size() > 1: // tiled
        return LayoutUtil::LinearIndex(shape, multi_index) // OSS tiled index math
    // untiled: stamp a default device layout first, then linearize
    tmp = shape
    ForEachMutableSubshape(tmp, UpdateLayout) // 0x1d6b05a0 - fill linear layout
    return LayoutUtil::LinearIndex(tmp, multi_index)
QUIRK: LinearIndex refuses 64-bit element widths outright (CHECK(!ElementHasBitWidth(shape, 64)), line 646). Addressing into an f64/s64 buffer is done on its 32-bit component shapes, never on the 64-bit logical shape, consistent with the ShapeSizeBytesRaw split. The branch on tiles().size() > 1 is the same tiled/linear test as everywhere else: a populated tile list routes to the OSS tiled LinearIndex; an empty one stamps a default layout first.
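For intuition about what the tiled branch computes, here is a minimal sketch of tile-major indexing for a 2-D array. It assumes minor_to_major = {1, 0}, a single (sublane, lane) tile in the tile list, and padded extents that are exact tile multiples. It is a simplified model of tiled indexing, not a transcription of LayoutUtil::LinearIndex, and it ignores packing subtiles.

```python
def tiled_linear_index(row, col, padded_cols, sublane=8, lane=128):
    """Element offset of (row, col) in a 2-D array tiled by (sublane, lane).

    Tiles are ordered row-major across the array, and elements are ordered
    row-major inside each tile.
    """
    if padded_cols % lane != 0:
        raise ValueError("padded_cols must be a multiple of lane")
    tiles_per_row = padded_cols // lane
    tile_index = (row // sublane) * tiles_per_row + (col // lane)
    within_tile = (row % sublane) * lane + (col % lane)
    return tile_index * sublane * lane + within_tile

print(tiled_linear_index(0, 1, 256))    # next lane in the same tile
print(tiled_linear_index(1, 0, 256))    # next sublane in the same tile
print(tiled_linear_index(0, 128, 256))  # first element of the second tile
print(tiled_linear_index(8, 0, 256))    # first element of the second tile row
```

The jump at column 128 is what distinguishes this from plain row-major addressing, where (0, 128) would sit at offset 128.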
The repack kernel
RepackToHardwareLayout<SUBLANE, 128> (instantiated <2,128> @ 0x1d5c2880, <16,128> @ 0x1d5c3b40, <32,128> @ 0x1d5c3f80) is the hand-written AVX routine that shuffles a contiguous host buffer into hardware tile order before DMA. It processes the buffer in 2048-byte (0x800) tiles (size >> 11 whole tiles + size & 0x7FF remainder):
function RepackToHardwareLayout<S,128>(dst, size_bytes, src, pad_flag): // e.g. 0x1d5c3b40
    whole_tiles = size_bytes >> 11 // 2048-byte tiles
    rem = size_bytes & 0x7FF
    for t in [0 .. whole_tiles):
        // AVX interleave: vpunpcklbw/vpunpckhbw/vpunpcklwd transpose
        // 4-bit/2-bit field packs via vpsllw + vpand with masks
        // dword_84A2F00 / dword_84A2580 / byte_84A2D4C / dword_84A2A98
        transpose_tile(dst + t*512, src + t*2048)
    if rem:
        copy rem bytes into a scratch tile
        if pad_flag: memset(scratch + rem, 0xFF, 2048 - rem) // FILL TRAILING PAD WITH 0xFF
        transpose_tile(dst_tail, scratch)
NOTE: the pad_flag argument is the on-device tile-padding made physical. When set, the trailing partial tile is filled with 0xFF (memset(..., 255, 2048 - rem)) before the SIMD transpose. The 0xFF fill is the byte pattern that lands in the padding lanes/sublanes of a buffer whose logical extent did not fill its last tile. A reimplementation that zero-fills instead will produce byte-different (though numerically equivalent for most reductions) HBM contents, and any checksum/SDC path that hashes raw HBM will disagree. The fixed 2048-byte tile (= 512 B packed output × the 4-way unpack) is generation-independent across the <2/16/32,128> instantiations.
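The tile split and the 0xFF tail fill can be modelled without the SIMD transpose. The sketch below only reproduces the loop structure: split_into_tiles is an illustrative name, and zero-filling the scratch tail when pad_flag is false is an assumption, since this page does not say what libtpu leaves there.

```python
TILE_BYTES = 2048  # 0x800, the fixed source tile size quoted above

def split_into_tiles(src: bytes, pad_flag: bool):
    """Return the 2048-byte source tiles the repack loop would consume."""
    whole_tiles = len(src) >> 11
    rem = len(src) & 0x7FF
    tiles = [src[t * TILE_BYTES:(t + 1) * TILE_BYTES] for t in range(whole_tiles)]
    if rem:
        fill = 0xFF if pad_flag else 0x00   # 0x00 is this sketch's assumption
        tail = src[whole_tiles * TILE_BYTES:]
        tiles.append(tail + bytes([fill]) * (TILE_BYTES - rem))
    return tiles

tiles = split_into_tiles(bytes(5000), pad_flag=True)
print(len(tiles), tiles[-1][903], tiles[-1][904])
```

A 5000-byte input yields two whole tiles and a 904-byte remainder, so the padding starts at byte 904 of the scratch tile.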
Function Map
| Function | Address | Role |
|---|---|---|
| TransferSizeUtil::LinearIndex | 0x1d6b0600 | Tiled/linear multi-index → flat offset |
| TransferSizeUtil::UpdateLayout | 0x1d6b05a0 | Stamp default device layout (per-subshape) |
| TransferSizeUtil::UpdateLeafLayout | 0x1d6b08c0 | Stamp tile on a single leaf |
| RepackToHardwareLayout<2,128> | 0x1d5c2880 | SIMD host→tile repack, sublane 2 |
| RepackToHardwareLayout<16,128> | 0x1d5c3b40 | SIMD host→tile repack, sublane 16 |
| RepackToHardwareLayout<32,128> | 0x1d5c3f80 | SIMD host→tile repack, sublane 32 |
6. The Device-Buffer Residency Record
Purpose
A logical buffer can be a tuple of many leaves, and each leaf lives at its own HBM address. The runtime tracks "which device addresses hold this buffer's leaves, and what is the on-device shape" in an xla::ShapedBuffer, mirrored across the C ABI as XLA_ShapedBuffer. This is the residency record the transfer manager, the executable, and PJRT pass around; the allocator hands out the addresses (hbm-allocator.md), and this record remembers them.
The XLA_ShapedBuffer record
Recovered from ApiConverter::ToC(const xla::ShapedBuffer&, XLA_ShapedBuffer*) (0xfcb8580):
xla::ShapedBuffer (the on-device residency record)
+336 on_device_shape (xla::Shape; the PADDED device shape from §2)
+656 device_ordinal (int)
+664 buffers (ShapeTree<DeviceAddressBase>; one leaf per tuple element)
XLA_ShapedBuffer (C-ABI mirror)
+0.. on_device_shape (XLA_Shape, via ApiConverter::ToC)
+536 device_ordinal (int)
+544 addresses[] (SE_DeviceAddressBase array, 24 B stride) count @ +552
Each leaf is a SE_DeviceAddressBase (the stream_executor::DeviceAddressBase mirror), a 48-byte record in the source vector copied down to a 24-byte C-ABI entry of {opaque_ptr, size, payload}. The (opaque_ptr, size) pair is the HBM (offset, byte-size) the allocator returned; size is exactly the ShapeSizeBytesRaw of that leaf's padded device shape.
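A reader inspecting a captured XLA_ShapedBuffer might decode the address array along these lines. The 24-byte stride and the {opaque_ptr, size, payload} field order come from the text above. The 8/8/8 field widths, the little-endian byte order, and the names LeafAddress and parse_addresses are assumptions of this sketch.

```python
import struct
from dataclasses import dataclass

ENTRY_STRIDE = 24  # bytes per C-ABI entry, as stated above

@dataclass
class LeafAddress:
    opaque_ptr: int   # HBM offset handed out by the allocator
    size: int         # byte size of the leaf's padded device shape
    payload: int

def parse_addresses(raw: bytes, count: int):
    """Decode `count` entries as three little-endian 64-bit words each."""
    if len(raw) < count * ENTRY_STRIDE:
        raise ValueError("buffer too short for the requested entry count")
    return [LeafAddress(*struct.unpack_from("<QQQ", raw, i * ENTRY_STRIDE))
            for i in range(count)]

# Synthetic two-leaf record built only for the demonstration.
blob = struct.pack("<QQQ", 0x1000, 4096, 0) + struct.pack("<QQQ", 0x2000, 8192, 0)
for leaf in parse_addresses(blob, 2):
    print(hex(leaf.opaque_ptr), leaf.size)
```

The count would be read from offset +552 of the record before calling the parser.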
NOTE: the record stores the on-device (padded, tiled) shape, not the host shape. Two leaves with the same logical host shape but different memory_space colors are distinct device shapes and occupy different tiers. The single device_ordinal pins the whole buffer to one chip; a buffer never straddles devices at this layer (cross-device is sharding, above this). The MaybeOwningDeviceMemory / ScopedShapedBuffer variants (seen in XlaComputationLaunchContext::PopulateOutputs @ 0xeadbb80) wrap the same record with an ownership flag deciding whether dropping the record frees the HBM; see buffer-donation-aliasing.md.
The driver-side buffer handle
Below the XLA ShapedBuffer, the driver moves bytes through MaybeOwningDmaBuffer (seen in JfHbmWriteQueue and TpuPxcDriver::WriteToMemoryHelper), a (ptr, size, optional<SyncFlag>) tuple. The optional<SyncFlag> is the completion handshake; the DMA into the buffer signals it on completion. This is the wire end of the residency record: the ShapedBuffer's per-leaf address becomes a MaybeOwningDmaBuffer at DMA-issue time. The SyncFlag protocol is sflag-protocol.md; the DMA alignment floor that constrains the leaf offset is hbm-dma-alignment.md.
Function Map
| Function | Address | Role |
|---|---|---|
| ApiConverter::ToC(const xla::ShapedBuffer&, XLA_ShapedBuffer*) | 0xfcb8580 | Residency record → C ABI |
| ApiConverter::FromC(XLA_ShapedBuffer*) | 0xfcb7000 | C ABI → xla::ShapedBuffer |
| ApiConverter::ToC(const DeviceAddressBase&) | 0xfcb78c0 | Per-leaf (ptr,size) → SE_DeviceAddressBase |
| ApiConverter::ToC(const xla::ShapeIndex&) | 0xfcb82e0 | Leaf index within the tuple tree |
| XlaComputationLaunchContext::PopulateOutputs | 0xeadbb80 | Build output ScopedShapedBuffers post-execute |
| TpuTransferManager::CanShapedBufferBeAccessedNow | 0xeaba6e0 | Residency-readiness check |
| TpuTransferManager::ReadDynamicShapes | 0xe9735a0 | Read back dynamic dims into the shape |
Related Components
| Component | Relationship |
|---|---|
| xla::jellyfish::TransferSizeUtil | Owns the whole shape→device-shape mapping, tile padding, and byte sizing on this page |
| xla::jellyfish::HardwareLayout (PopulateShape / ComponentShape) | Stamps the tile + memory_space and splits 64/128-bit elements |
| TpuTransferManager (0xeab0… family) | The C-ABI surface that calls HostShapeToDeviceShape and the size functions |
| ApiConverter (0xfcb7… family) | Marshals Shape/Layout/ShapedBuffer across the PJRT C ABI |
| RepackToHardwareLayout<S,128> | The SIMD kernel that physically realises the tiling for DMA |
Cross-References
- layout-assignment.md: the compile-time pass that chooses minor_to_major and stamps the (SublaneCount, LaneCount) tile this page consumes
- overview.md: the six-region memory taxonomy and the memory_space enum that colors each leaf's tier
- hbm-allocator.md: the BestFitAllocator that hands out the HBM offset stored in the residency record's DeviceAddressBase
- hbm-dma-alignment.md: the 1024-B DMA floor vs. 16-KiB compile-time alignment that constrains each leaf's offset
- vmem-allocator.md: the on-chip kAlternate tier whose word size feeds the per-tier tile/granule
- buffer-donation-aliasing.md: MaybeOwningDeviceMemory ownership and input/output aliasing on the residency record
- sflag-protocol.md: the optional<SyncFlag> completion handshake on the driver-side MaybeOwningDmaBuffer
- ../compiler/tpu-program-serialization.md: how the ComputationLayout (input/output device shapes) is serialized into the executable
- ../pjrt/buffer-and-memory.md: the PJRT_Buffer ABI and external refcounting above this layer