TPU Buffer Layout

All addresses, struct offsets, and field tags on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d). The image is not stripped; demangled C++ symbol names are quoted verbatim. Other versions will differ.

Abstract

A PJRT device buffer on a TPU is a flat run of HBM bytes whose internal arrangement is not row-major. Between the logical xla::Shape a user hands to the runtime (a (element_type, dimensions[], layout) triple) and the bytes that actually land in HBM sits one translation, xla::jellyfish::TransferSizeUtil::HostShapeToDeviceShape, which (a) chooses a minor-to-major dimension order, (b) pads the two minor-most dimensions up to the chip's (sublane, lane) tile, (c) packs sub-byte and splits >32-bit element types, and (d) stamps an xla::Tile list onto the layout describing the physical tiling. The padded device shape is what the allocator sizes, what a DMA descriptor addresses, and what ShapeSizeBytesRaw measures. This page documents that translation: the shape↔layout data model, the tile-padding arithmetic, the element-packing rules, and the on-device buffer residency record (xla::ShapedBuffer → XLA_ShapedBuffer) that names which HBM addresses hold a buffer's leaves.

The reader who knows XLA on GPU should hold one analogy and immediately complicate it. On a GPU an xla::Layout is a minor_to_major permutation and little else; the bytes are dense row-major-after-permutation. On TPU the layout additionally carries a tile list (xla::Layout::tiles(), a vector of xla::Tile), and the physical buffer is the logical array reshaped into (sublane, lane) tiles laid out tile-major. The lane dimension is 128 across every generation; the sublane dimension is 8 on this v5+/v6e/TPU7x build (an earlier generation used 16). This is the same (SublaneCount, LaneCount) tile that layout assignment stamps onto every leaf shape. The hardware reads a tile as one contiguous chunk, so a [3, 5] f32 array does not occupy 60 bytes: it pads to [8, 128] (one tile) and occupies one granule, with the trailing lanes/sublanes filled (with 0xFF on the pad path). Wasted padding is real HBM, and minimising it is exactly why the layout-assignment chooser optimises for compact byte size.
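The [3, 5] f32 example above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the simplest reading of the paragraph: both minor dimensions are rounded up to a whole (8, 128) tile of 4-byte slots. The helper names are hypothetical, and the finer per-dimension rules described later on this page are deliberately ignored here.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def padded_tile_bytes(rows: int, cols: int, elem_bytes: int = 4,
                      sublane: int = 8, lane: int = 128) -> int:
    """Bytes occupied once both minor dims are padded to whole tiles.

    ASSUMPTION: naive whole-tile rounding, matching only the intro example.
    """
    return round_up(rows, sublane) * round_up(cols, lane) * elem_bytes


logical_bytes = 3 * 5 * 4                 # dense row-major size: 60 bytes
physical_bytes = padded_tile_bytes(3, 5)  # one [8, 128] tile of f32
print(logical_bytes, physical_bytes)      # 60 4096
```

Under these assumptions the padded buffer is roughly 68 times larger than the logical data, which is the kind of waste the compact-size chooser tries to avoid.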

This page owns the on-device tiled buffer: the shape→device-shape mapping (HostShapeToDeviceShape / SetPaddedShape), the tile-padding rules (Pad2ndMinorCompact, GetCompactTiles), the element-packing factors, the byte-size functions (ShapeSizeBytesRaw / ShapeSizeCompact / ShapeWithMetadataSizeBytes), and the residency record. It does not reproduce the compile-time choice of minor_to_major (that is layout-assignment.md), the HBM free-list allocator that hands out the offset (that is hbm-allocator.md), the DMA wire alignment (that is hbm-dma-alignment.md), or the PJRT PJRT_Buffer ABI and external refcounting (that is pjrt/buffer-and-memory.md).

For reimplementation, the contract is:

  • The shape↔layout data model — the xla::Shape / xla::Layout fields the device layout reads, and the XLA_Shape / XLA_Layout C-ABI mirror with their exact offsets.
  • The shape→device-shape mapping — HostShapeToDeviceShape → SetPaddedShape: unpack sub-byte types, recurse, pad the minor dims, stamp the tile.
  • The tile-padding rules — minor dim → multiple of LaneCount (else next pow2); 2nd-minor via Pad2ndMinorCompact; the granule_bytes % sublane_bytes == 0 invariant.
  • The element packing — ElementPackingFactor (sub-byte pack via tc_max_packing_factor), and the 64/128-bit split (ComponentShape<64>/<128>).
  • The residency record — xla::ShapedBuffer (on-device shape + per-leaf DeviceAddressBase), mirrored to XLA_ShapedBuffer.

| Item | Value |
|---|---|
| Host→device shape | xla::jellyfish::TransferSizeUtil::HostShapeToDeviceShape (via HardwareLayout::HostShapeToDeviceShape @ 0xeab0e20) |
| Padded-shape core | TransferSizeUtil::SetPaddedShape @ 0x1d6ae0e0 |
| Tile builder | TransferSizeUtil::GetCompactTiles @ 0x1d6b11c0 (returns inlined_vector<xla::Tile,3>) |
| 2nd-minor pad | TransferSizeUtil::Pad2ndMinorCompact @ 0x1d6af5c0 |
| Element packing | TransferSizeUtil::ElementPackingFactor @ 0x1d6b03e0 (table-driven, kPackingFactors[35]) |
| Padded byte size | TransferSizeUtil::ShapeSizeBytesRaw @ 0x1d6add40; with metadata ShapeWithMetadataSizeBytes (via HardwareLayout::ShapeSize @ 0xeab0ec0) |
| Compact byte size | TransferSizeUtil::ShapeSizeCompact @ 0x1d6ae8a0 (via HardwareLayout::ShapeSizeCompact @ 0xeab0f20) |
| Tile stamp | xla::jellyfish::HardwareLayout::PopulateShape; per-leaf TransferSizeUtil::UpdateLeafLayout @ 0x1d6b08c0 |
| Linearizer | TransferSizeUtil::LinearIndex @ 0x1d6b0600 → xla::LayoutUtil::LinearIndex when tiled |
| HBM repacker | RepackToHardwareLayout<sublane,128> @ 0x1d5c3b40 (<16,128>), 0x1d5c2880 (<2,128>), 0x1d5c3f80 (<32,128>) |
| Residency record | xla::ShapedBuffer → XLA_ShapedBuffer (ApiConverter::ToC @ 0xfcb8580) |
| Default tile | (SublaneCount, LaneCount) = (8, 128) this build; (16, 128) on an earlier generation |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |

1. The Shape / Layout Data Model

Purpose

Everything on this page operates on two C++ objects: xla::Shape and the xla::Layout it contains. A reimplementer must reproduce these fields exactly, because the device-layout code reads them by raw offset and the runtime mirrors them across the C ABI into XLA_Shape / XLA_Layout for the PJRT/stream_executor boundary. The compile-time pass that fills in minor_to_major and the memory_space is layout assignment; this page is the consumer.

The xla::Layout fields the device path reads

xla::Layout is an absl::InlinedVector-rich struct; the device code touches only a handful of members. The C-ABI mirror XLA_Layout (built by ApiConverter::ToC(const xla::Layout&, XLA_Layout*) @ 0xfcb7ca0) pins the offsets, and the on-device functions reach the same data through xla::Shape::layout() (returns a Layout*).

| Field | xla::Layout access | XLA_Layout offset | Meaning |
|---|---|---|---|
| minor_to_major[] | layout()[2..3] inlined vector (count at [2]>>1) | +0 (ptr), +48 (count) | Physical dim order, minor-first |
| dim_level_types / flags | layout() low bytes | +400, +404 | sparse/dense level kinds |
| index_primitive_type / pointer_primitive_type | *((qword*)layout+1) | +408 | sparse index/pointer types |
| element_size_in_bits | layout()[1] | +416 | non-zero for packed/odd-width elements; 0 ⇒ use dtype byte size |
| memory_space | *((char*)layout+2) | +424 | tier color (HBM/VMEM/…); see overview.md |
| tiles[] | layout()[9] is tiles().size() | +56 (ptr), +392 (count) | the physical tile list (each xla::Tile is a dim vector) |
| tail_padding_alignment_in_elements | *((qword*)layout+24) | +432 | trailing-element pad quantum |

The single most consulted field is layout()[9], the tile count. LinearIndex (0x1d6b0600), ShapeSizeBytesRaw (0x1d6add40), and HasLinearLayout (0x1d6b0160) all branch on whether tiles().size() >= 2 (or > 1): a populated tile list means the buffer is tiled and addressed via ShapeUtil::ArraySize / LayoutUtil::LinearIndex; an empty tile list means a linear (untiled) buffer.

The xla::Shape fields

xla::Shape (and its XLA_Shape mirror, ApiConverter::ToC @ 0xfcb7940) carries the dims and the layout. XLA_Shape offsets, decoded directly from the converter:

XLA_Shape (536 bytes per element in a tuple array):
  +0    element_type      (PrimitiveType, int32)
  +8    dimensions[]      (int64 array)          count @ +56
  +64   dynamic_dims[]    (bool array)           count @ +72
  +80   tuple_shapes[]    (XLA_Shape array, 536 B stride)  count @ +88
  +92   has_layout        (bool)
  +96   layout            (XLA_Layout, present iff has_layout)

NOTE — the on-device functions special-case three element_type families up front, recovered from the validity bitmask 0x2FFF91FFE tested with _bittest64 (the searchable array types, ≤ 0x21), the mask-0x20-family token-and-opaque types, and the bitmask 0x400048000 for the 64/128-bit families. The same 0x2FFF91FFE mask gates the layout-assignment chooser; it is the canonical "is this a tiled array element type" predicate.
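The two masks quoted in the note can be decoded mechanically. The sketch below reproduces a `_bittest64`-style membership test in Python and lists which element_type values each mask admits. The helper names are hypothetical, and mapping the resulting integers back to PrimitiveType names is left to the enum definition rather than guessed here.

```python
ARRAY_TYPE_MASK = 0x2FFF91FFE   # "is this a tiled array element type" mask from the note
WIDE_TYPE_MASK = 0x400048000    # mask the note associates with the 64/128-bit families


def bit_test(mask: int, element_type: int) -> bool:
    """Python equivalent of _bittest64(&mask, element_type)."""
    if not 0 <= element_type < 64:
        return False
    return bool((mask >> element_type) & 1)


def set_bits(mask: int) -> list[int]:
    """All bit positions that are set in `mask`, ascending."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def is_tiled_array_type(element_type: int) -> bool:
    """Range check (<= 0x21) followed by the bitmask test."""
    return element_type <= 0x21 and bit_test(ARRAY_TYPE_MASK, element_type)


print(set_bits(ARRAY_TYPE_MASK))  # 1..12, 16, 19..31, 33
print(set_bits(WIDE_TYPE_MASK))   # [15, 18, 34]
```

Values 13 and 17, which this page's ShapeSizeBytesRaw pseudocode labels TUPLE and TOKEN, are absent from the array-type mask, consistent with their separate handling.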

Function Map

| Function | Address | Role |
|---|---|---|
| ApiConverter::ToC(const xla::Layout&, XLA_Layout*) | 0xfcb7ca0 | xla::Layout → C ABI; pins layout offsets |
| ApiConverter::ToC(const xla::Shape&, XLA_Shape*) | 0xfcb7940 | xla::Shape → C ABI; recurses tuple shapes |
| ApiConverter::FromC(XLA_Shape*) | 0xfcb7400 | C ABI → xla::Shape for runtime calls |
| xla::jellyfish::TransferSizeUtil::HasLinearLayout(const Layout&) | 0x1d6b0160 | tile list empty ⇒ linear |

2. Host Shape → Device Shape

Purpose

HostShapeToDeviceShape is the entry point that turns a logical host xla::Shape into the physical device shape the TPU stores. It is what the transfer manager, the allocator-sizing path, and the C shim Tpu_OnDeviceShape call. The work is delegated to TransferSizeUtil::HostShapeToDeviceShape, which fans out to SetPaddedShape per leaf and re-stamps the layout. The padded device shape differs from the host shape in three ways: padded minor dims, packed/split element types, and a populated tiles() list.

Entry Point

HardwareLayout::HostShapeToDeviceShape (0xeab0e20)        ── C-ABI wrapper
  ├─ ApiConverter::FromC(host XLA_Shape -> xla::Shape)
  ├─ GetRegisteredDeepseaPlatform / DeepseaPlatform::GetTopology   ── the tpu::TpuTopology*
  ├─ GlobalTpuCompEnv()                                    ── reads comp-env+4124, +4373 (pack flags)
  ├─ TransferSizeUtil::HostShapeToDeviceShape(topology, shape, ...)
  │     └─ per leaf: SetPaddedShape (0x1d6ae0e0) -> HardwareLayout::PopulateShape
  └─ ApiConverter::ToC(device xla::Shape -> XLA_Shape)

Algorithm — SetPaddedShape

SetPaddedShape (0x1d6ae0e0) is the core of the mapping. It is recursive on packed element types and computes, per minor dimension, the padded extent before handing the dims to HardwareLayout::PopulateShape to stamp the tile.

function SetPaddedShape(topology, shape /*buffer leaf*/, out_shape):   // 0x1d6ae0e0
    REQUIRE(LayoutUtil::HasLayout(shape))            // FATAL "Can't pad a shape without knowing its layout", line 1019

    // ---- Step 1: sub-byte / packed element types -> unpack, recurse, re-stamp ----
    pack = ShouldPackPREDAsSingleBit(topology, shape)         // PRED -> 1-bit packing
    if ElementPackingFactor(topology, shape.element_type(), pack) >= 2:
        unpacked = GetUnpackedShape(topology, shape)          // widen the packed dtype
        SetPaddedShape(topology, unpacked, out_shape)         // recurse on the unpacked shape
        out_shape.set_element_type(shape.element_type())      // restore original packed dtype
        ForEachMutableSubshape(out_shape, UpdateLayout)       // re-stamp tile per leaf
        return OK

    pbytes = ShapeUtil::ByteSizeOfPrimitiveType(shape.element_type())

    // ---- Step 2: rank<=1 scalar/vector fast path ----
    if shape.dimensions().size() <= 1:
        CHECK(scalar_layout.minor_to_major().size() == 0)     // line 1040
        chunk_bytes = 4 * topology.chunk_size_elems            // this[53]
        CHECK(chunk_bytes % pbytes == 0)                       // line 1043
        out_shape = PopulateShape(et, {chunk_bytes/pbytes}, tile=1, layout)
        return OK

    // ---- Step 3: only 4-byte (or PRED) element widths are implemented for tiling ----
    if shape.element_type() != PRED and pbytes != 4:
        // line 1050
        return Unimplemented("Attempted to map shape ... to on-device TPU padded shape "
                             "but this is not implemented")

    // ---- Step 4: pad each minor dim up to its chunk bound ----
    padded_dims = copy(shape.dimensions())
    for d in [0 .. rank):                                       // walk in minor_to_major order
        dim_idx  = minor_to_major[rank-1 - d]
        bound    = ChunkBound(topology, d, rank, dim_idx)        // 0x1d6b22e0
        extent   = shape.dimensions(dim_idx)
        padded_dims[dim_idx] = round_up(extent, bound)           // ceil(extent/bound) * bound

    // ---- Step 5: stamp the tile + memory_space ----
    out_shape = HardwareLayout::PopulateShape(et, padded_dims, ntiles, layout)
    return OK
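Steps 3 and 4 of the pseudocode above reduce to a width check followed by a per-dimension round-up. The following is a minimal sketch under stated assumptions: the per-dimension bounds are passed in directly as a hypothetical stand-in for ChunkBound (whose internals are not documented here), and the function and parameter names are illustrative, not libtpu symbols.

```python
def round_up(extent: int, bound: int) -> int:
    """ceil(extent / bound) * bound, as in Step 4."""
    return -(-extent // bound) * bound


def pad_dims(dims: list[int], bounds: list[int],
             elem_bytes: int = 4, is_pred: bool = False) -> list[int]:
    """Pad every dimension of a rank>=2 shape up to its bound.

    ASSUMPTION: `bounds[i]` is whatever ChunkBound would return for dims[i].
    """
    if len(dims) <= 1:
        raise ValueError("rank <= 1 takes the chunk fast path (Step 2)")
    if not is_pred and elem_bytes != 4:
        # Mirrors the Step 3 wall: only 4-byte widths (or PRED) are tiled here.
        raise NotImplementedError("only 4-byte element widths are tiled directly")
    return [round_up(extent, bound) for extent, bound in zip(dims, bounds)]


print(pad_dims([3, 5], [8, 128]))            # [8, 128]
print(pad_dims([2, 10, 200], [1, 8, 128]))   # [2, 16, 256]
```

Keeping the bounds as an explicit argument makes the sketch independent of any particular chip generation.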

GOTCHA — Step 3 is a hard wall: SetPaddedShape only implements tiling for 4-byte element widths (and PRED). Anything wider takes the 64/128-bit split path in ShapeSizeBytesRaw (§4) before it ever reaches a 4-byte SetPaddedShape; anything narrower is packed up to a 4-byte-equivalent in Step 1. A reimplementation that tries to tile a raw bf16 or s8 buffer directly will hit the Unimplemented at line 1050 — the design forces every element width into the 4-byte tiling kernel via pack/split first.

QUIRK — the rank≤1 fast path (Step 2) does not pad to the 2D (sublane, lane) tile. A scalar or 1-D buffer is reshaped to a single chunk of chunk_bytes/pbytes elements (chunk_bytes = 4 * topology.chunk_size_elems). Only rank≥2 buffers get the 2-minor tile. This is why a 1-D vector of N f32 elements rounds up to a chunk multiple, not to a [1, 128] lane tile.
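The rank≤1 arithmetic described in the quirk can be sketched as follows. The value of chunk_size_elems is read from the topology and is not given on this page, so the 1024 used in the example is purely hypothetical, as are the function names. The round-up to a chunk multiple follows the wording of the quirk.

```python
def chunk_elems(chunk_size_elems: int, elem_bytes: int) -> int:
    """Elements per chunk: chunk_bytes / pbytes with chunk_bytes = 4 * chunk_size_elems."""
    chunk_bytes = 4 * chunk_size_elems
    if chunk_bytes % elem_bytes != 0:
        # Mirrors CHECK(chunk_bytes % pbytes == 0)
        raise ValueError("chunk_bytes must be a multiple of the element size")
    return chunk_bytes // elem_bytes


def padded_vector_elems(n: int, chunk_size_elems: int, elem_bytes: int = 4) -> int:
    """Round a 1-D extent up to a whole number of chunks (no lane tile involved)."""
    per_chunk = chunk_elems(chunk_size_elems, elem_bytes)
    return -(-n // per_chunk) * per_chunk


# HYPOTHETICAL chunk size, chosen only to make the arithmetic visible.
print(padded_vector_elems(1000, chunk_size_elems=1024))  # 1024
print(padded_vector_elems(1025, chunk_size_elems=1024))  # 2048
```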

Function Map

| Function | Address | Role |
|---|---|---|
| HardwareLayout::HostShapeToDeviceShape | 0xeab0e20 | C-ABI host→device shape |
| TransferSizeUtil::SetPaddedShape | 0x1d6ae0e0 | Per-leaf pad + tile stamp |
| TransferSizeUtil::GetUnpackedShape | 0x1d6b2100 | Widen a packed dtype for the recursion |
| TransferSizeUtil::ShouldPackPREDAsSingleBit | 0x1d6b0080 | PRED 1-bit packing predicate |
| TransferSizeUtil::ChunkBound | 0x1d6b22e0 | Per-dim chunk bound used in the pad round-up |
| HardwareLayout::PopulateShape | 0x1d6da360 | Stamp dims + tile + memory_space |
| XlaShapeToTpuPaddedShape (C shim) | 0xeabf0e0 | tensorflow::XlaTpuPaddedShapeFn boundary |
| XlaShapeToTpuShapeRepresentation (C shim) | 0xeabefa0 | Shape-representation boundary |

3. The Tile and the Padding Rules

Purpose

A device buffer's physical bytes are the logical array reshaped into (sublane, lane) tiles, laid out tile-major. This section gives the tile geometry, the exact 2nd-minor padding arithmetic, and the granule % sublane invariant a reimplementer must enforce. The tile values are read from the chip's tpu::TpuTopology at boot, never hard-coded.

The tile geometry

The default leaf tile is (SublaneCount, LaneCount), the same xla::Tile layout assignment stamps onto every leaf:

  • LaneCount = 128 across every generation (topology field *((qword*)topology+52), used as the minor-dim bound).
  • SublaneCount = topology field *((qword*)topology+51): 8 on this v5+/v6e/TPU7x build, 16 on an earlier generation (the 2nd-minor bound).
  • granule_bytes = *(qword*)(topology->[1] + 200) — the hardware DMA granule. GetCompactTiles and Pad2ndMinorCompact both CHECK(granule_bytes % sublane_bytes == 0) at line 77, where sublane_bytes = 4 * topology->[51] (4 bytes × sublane count).

A leaf's physical run is therefore: tiles tiled over the two minor dims, each tile SublaneCount × LaneCount 4-byte slots, padded out. For sub-byte dtypes an extra subtile is appended (see "Element packing" below).

The 2nd-minor padding — Pad2ndMinorCompact

Pad2ndMinorCompact (0x1d6af5c0) computes the padded 2nd-minor extent and the per-tile element count. It is the precise rule a reimplementer must copy.

function Pad2ndMinorCompact(topology, extent /*2nd-minor logical*/, element_type):  // 0x1d6af5c0
    lane = topology[52]                                  // LaneCount, 128
    if extent >= lane:
        padded = round_up(extent, lane)                  // ceil(extent/lane) * lane
    else:
        padded = next_pow2(extent)                       // 1 << ceil_log2(extent)

    granule_bytes  = *(topology->[1] + 200)
    sublane_bytes  = 4 * topology[51]                    // 4 * SublaneCount
    CHECK(granule_bytes % sublane_bytes == 0)            // line 77
    per_tile_rows  = (granule_bytes / sublane_bytes) * ElementPackingFactor(topology, element_type, /*pack=*/0)

    return max(padded, per_tile_rows)

QUIRK — the sub-LaneCount branch rounds to the next power of two, not to the lane count. A 2nd-minor extent of 5 pads to 8, not to 128; an extent of 100 pads to 128 only because 128 is the next power of two, which here coincides with the lane count. The padded extent is then raised to at least per_tile_rows, so a tile is never smaller than one granule's worth of sublanes. A reimplementation that always rounds the 2nd-minor to LaneCount will over-pad every small buffer and disagree with libtpu's HBM footprint.
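The Pad2ndMinorCompact rule translates almost line for line into Python. In this sketch the lane and sublane counts default to the (8, 128) values quoted for this build, while granule_bytes and packing_factor are explicit parameters: the real granule value lives in the topology and is not stated on this page, so the default of 32 is a hypothetical choice that makes per_tile_rows equal to 1.

```python
def next_pow2(n: int) -> int:
    """1 << ceil_log2(n) for n >= 1."""
    if n < 1:
        raise ValueError("extent must be positive")
    return 1 << (n - 1).bit_length()


def pad_2nd_minor_compact(extent: int, lane: int = 128, sublane: int = 8,
                          granule_bytes: int = 32, packing_factor: int = 1) -> int:
    """Padded 2nd-minor extent, following the pseudocode above.

    ASSUMPTION: granule_bytes=32 is a placeholder, not a recovered value.
    """
    if extent >= lane:
        padded = -(-extent // lane) * lane      # round up to a lane multiple
    else:
        padded = next_pow2(extent)              # small extents: next power of two
    sublane_bytes = 4 * sublane
    if granule_bytes % sublane_bytes != 0:      # the granule % sublane invariant
        raise ValueError("granule_bytes must be a multiple of sublane_bytes")
    per_tile_rows = (granule_bytes // sublane_bytes) * packing_factor
    return max(padded, per_tile_rows)


for extent in (1, 5, 100, 129):
    print(extent, pad_2nd_minor_compact(extent))
```

With a larger granule, for example `pad_2nd_minor_compact(5, granule_bytes=512)`, the per_tile_rows floor of 16 wins over the power-of-two result of 8.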

Element packing — ElementPackingFactor

ElementPackingFactor (0x1d6b03e0) returns how many logical elements share one 4-byte physical slot. It is fully table-driven:

function ElementPackingFactor(topology, element_type, pack_pred_as_bit):   // 0x1d6b03e0
    tc_max = topology->[432]                              // tc_max_packing_factor, a power of two
    REQUIRE(popcount(tc_max) == 1)                        // else FATAL "Unsupported tc_max_packing_factor", line 1489
    k = tzcnt(tc_max)                                     // 0..5  (1,2,4,8,16,32)
    REQUIRE(element_type < 0x23)                          // 35 primitive types
    if k == 0: return 1                                   // no packing
    return kPackingFactors[2^k][pack_pred_as_bit][element_type]

The kPackingFactors[35] tables are per-tc_max (<2>, <4>, <8>, <16>, <32>) and per-pack-mode (the pack_pred_as_bit flag selects the true table for tc_max >= 8, packing PRED as a single bit). The factor for a given dtype is the number of logical elements that fit in 4 bytes at that tc_max. This is the same packing the layout-assignment ShapeSizeCompact folds into its byte count.

NOTE — sub-byte packing adds subtiles to the tile list. In GetCompactTiles, after the main (sublane, lane) tile is pushed, if ElementPackingFactor(...) >= 2 it appends a GetSubtileForPacking tile (and, when the minor dim has extent 1, a GetSubtileForBreakingMinorDimensionForPacking tile). So a packed buffer's tiles() list has length 2–3, and a reimplementer must emit the subtile or the bytes will not match.
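The control flow of ElementPackingFactor (power-of-two check, trailing-zero count, early exit for no packing) can be sketched without the tables. The kPackingFactors contents are not reproduced on this page, so the final lookup is replaced by a labelled stand-in formula, "elements that fit in 32 bits, capped at tc_max". That formula is an assumption made for illustration and may not match the real tables for every dtype.

```python
def packing_factor(tc_max: int, bit_width: int) -> int:
    """Logical elements per 4-byte slot for an element of `bit_width` bits."""
    if tc_max < 1 or tc_max & (tc_max - 1):
        # Mirrors REQUIRE(popcount(tc_max) == 1)
        raise ValueError("Unsupported tc_max_packing_factor")
    k = (tc_max & -tc_max).bit_length() - 1   # tzcnt(tc_max)
    if k == 0:
        return 1                              # tc_max == 1: no packing
    # ASSUMPTION: stand-in for kPackingFactors[2^k][pack_pred_as_bit][element_type].
    return max(1, min(32 // bit_width, tc_max))


for tc_max in (1, 2, 4):
    print(tc_max, [packing_factor(tc_max, bits) for bits in (4, 8, 16, 32)])
```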

Function Map

| Function | Address | Role |
|---|---|---|
| TransferSizeUtil::GetCompactTiles | 0x1d6b11c0 | Build the inlined_vector<xla::Tile,3> for a leaf |
| TransferSizeUtil::Pad2ndMinorCompact | 0x1d6af5c0 | 2nd-minor pad + per-tile rows |
| TransferSizeUtil::ElementPackingFactor | 0x1d6b03e0 | Logical elements per 4-byte slot (table-driven) |
| TransferSizeUtil::GetSubtileForPacking | 0x1d6b1f20 | Sub-byte subtile |
| TransferSizeUtil::GetSubtileForBreakingMinorDimensionForPacking | 0x1d6b1fe0 (called in 0x1d6b11c0) | Subtile when minor extent == 1 |
| TransferSizeUtil::DoesShapeRequireMultiChunkPacking | 0x1d6b0720 | Multi-chunk packing predicate |
| Target::LaneCount / SublaneCount | 0x1d60f400 / 0x1d60f300 | Tile dims from chip descriptor |

4. Byte Sizes — ShapeSizeBytesRaw and Friends

Purpose

The allocator must know how many HBM bytes a device buffer occupies. Three size functions exist, all driven off the padded device shape, and a reimplementer must pick the right one: ShapeSizeBytesRaw (the padded data bytes), ShapeSizeCompact (the same with compact tiling), and ShapeWithMetadataSizeBytes (data + per-buffer metadata, used by the transfer manager). They are surfaced to the C ABI as HardwareLayout::ShapeSize (0xeab0ec0), HardwareLayout::ShapeSizeCompact (0xeab0f20), and HardwareLayout::ShapeSizeCompactRaw (0xeab0f80).

Algorithm — ShapeSizeBytesRaw

ShapeSizeBytesRaw (0x1d6add40) is the dispatcher over element-type families. The branch order is the reimplementation contract.

function ShapeSizeBytesRaw(topology, shape):                      // 0x1d6add40
    et = shape.element_type()

    if et == TUPLE (13):                                          // tuple = array of pointers
        n = ShapeUtil::TupleElementCount(shape)
        return round_up(4 * n, ptr_granule)                       // 4 B / element, topology->[1]+200

    if et == TOKEN (17) or IsZeroElementArray(shape) or MaxElementsInPerSplit == 0:
        return 0

    if ElementHasBitWidth(shape, 64):                             // s64/u64/f64/c64-half
        return 2 * ShapeSizeBytesRaw(topology, ComponentShape<64>(shape))   // hi/lo split
    if ElementHasBitWidth(shape, 128):                            // c128
        return 2 * ShapeSizeBytesRaw(topology, ComponentShape<128>(shape))

    if ColorToMemorySpace(layout.memory_space) == kSparseCoreSequencerSflag (12):
        return ExtentProduct(shape) * ByteSizeOfPrimitiveType(et)  // SparseCore: untiled, dense

    if shape.layout().tiles().size() >= 2:                        // already tiled
        return ShapeUtil::ArraySize(shape)

    // untiled 4-byte / packed leaf: pad it, then measure
    tmp = thread_local Shape
    CHECK(SetPaddedShape(topology, shape, &tmp) is OK)            // line 264
    max_elems = LayoutUtil::MaxElementsInPerSplit(tmp, shape)
    if layout.element_size_in_bits == 0:
        return max_elems * ByteSizeOfPrimitiveType(et)
    else:
        bits = element_size_in_bits * max_elems                   // packed: bit-accurate
        return bits/8 + (rounding correction)                     // ceil to byte

NOTE — the literal 12 that ShapeSizeBytesRaw (0x1d6add40) tests on the untiled-dense branch is sparse_core_sequencer_sflag in the canonical LLO MemorySpace enum — not sparse_core_sequencer_smem, which is 14 (byte-confirmed by MakeSparseCoreSequencerSmemConstant @ 0x1d60bc60 = mov $0xe,%esi). ColorToMemorySpace (0x1d6ffb00) is a byte_B5435CA[color] remap gated on color < 0xA, so its output is already a canonical MemorySpace enum value, not a raw layout color. See memory-space-table.md for the 17-value owner table.

GOTCHA — the 64- and 128-bit element split is not "8 bytes per element". A 64-bit buffer is decomposed by ComponentShape<64> into two 32-bit component buffers (high and low words), each tiled independently as a 4-byte buffer, and the sizes summed (2 * ...). So an [N] f64 buffer is physically two [N] f32 tiled buffers, not one [N] 8-byte tiled buffer. A reimplementation that treats f64 as an 8-byte element and tiles it directly will mis-size and mis-address every double-precision buffer. The same split governs how DMA reads the buffer.
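To make the high/low decomposition concrete, here is a small sketch that splits 64-bit bit patterns into two 32-bit component lists and sizes the result as twice a 32-bit buffer. It illustrates the idea only: the function names are hypothetical, and the order in which the two component buffers are placed in HBM is not stated on this page, so no placement is implied.

```python
MASK32 = 0xFFFFFFFF


def split_u64(values: list[int]) -> tuple[list[int], list[int]]:
    """Split 64-bit patterns into (high words, low words)."""
    hi = [(v >> 32) & MASK32 for v in values]
    lo = [v & MASK32 for v in values]
    return hi, lo


def join_u64(hi: list[int], lo: list[int]) -> list[int]:
    """Inverse of split_u64."""
    return [(h << 32) | l for h, l in zip(hi, lo)]


def size_of_64bit_buffer(size_of_32bit_component: int) -> int:
    """The 2 * ShapeSizeBytesRaw(ComponentShape<64>(shape)) rule."""
    return 2 * size_of_32bit_component


hi, lo = split_u64([0x1122334455667788])
print(hex(hi[0]), hex(lo[0]))  # 0x11223344 0x55667788
```

For a floating-point buffer the same split would apply to the raw bit pattern of each element, which is an assumption consistent with the component-shape description above.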

NOTE — the element_size_in_bits != 0 branch is the packed-buffer size: it multiplies by the bit width and divides by 8, with a ceiling correction. This is how a bit-packed PRED buffer (1 bit/element) reports its byte size — the 0 value means "use the dtype's natural byte size", the non-zero value means "this many bits per element, round the total up to bytes".
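The final branch of ShapeSizeBytesRaw comes down to a choice between two formulas. The sketch below assumes the "rounding correction" is the usual integer ceiling, `(bits + 7) // 8`, which is the standard way to round a bit count up to whole bytes. The function name and argument order are illustrative.

```python
def leaf_bytes(max_elems: int, dtype_bytes: int, element_size_in_bits: int = 0) -> int:
    """Byte size of a padded leaf.

    element_size_in_bits == 0 -> use the dtype's natural byte size.
    element_size_in_bits != 0 -> bit-accurate size, rounded up to whole bytes
                                 (ASSUMPTION: ceiling via (bits + 7) // 8).
    """
    if element_size_in_bits == 0:
        return max_elems * dtype_bytes
    bits = element_size_in_bits * max_elems
    return (bits + 7) // 8


print(leaf_bytes(1024, 4))                          # 4096: plain f32 tile
print(leaf_bytes(1000, 1, element_size_in_bits=1))  # 125: 1000 bit-packed PRED values
print(leaf_bytes(1001, 1, element_size_in_bits=1))  # 126: one extra bit costs a byte
```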

Function Map

| Function | Address | Role |
|---|---|---|
| TransferSizeUtil::ShapeSizeBytesRaw | 0x1d6add40 | Padded data bytes (family dispatch) |
| TransferSizeUtil::ShapeSizeCompact | 0x1d6ae8a0 | Compact-tiled byte size |
| TransferSizeUtil::ShapeSizeCompactRaw | 0x1d6aea60 | Compact bytes w/o metadata |
| TransferSizeUtil::ShapeWithMetadataSizeBytes | 0x1d6aea00 (via 0xeab0ec0) | Data + per-buffer metadata |
| HardwareLayout::ShapeSize / ShapeSizeCompact | 0xeab0ec0 / 0xeab0f20 | C-ABI byte-size wrappers |
| HardwareLayout::ComponentShape<64> / <128> | 0x1d6d9cc0 / 0x1d6d9e40 | 64/128-bit element split |
| xla::ShapeUtil::ArraySize | (OSS) | Tiled-array byte size |

5. Linearization and the HBM Repacker

Purpose

Two operations turn the tiled device shape into actual byte movement: linearization (LinearIndex, the tiled multi-index → flat offset) and the repack kernel (RepackToHardwareLayout, the SIMD routine that physically shuffles host bytes into tiled hardware order). A reimplementer needs both: the index math to address a single element, and the repack to stage a whole buffer for DMA.

Algorithm — LinearIndex

function LinearIndex(topology, shape, multi_index):              // 0x1d6b0600
    CHECK(!ElementHasBitWidth(shape, 64))                        // 64-bit handled by component split, line 646
    if shape.layout().tiles().size() > 1:                        // tiled
        return LayoutUtil::LinearIndex(shape, multi_index)       // OSS tiled index math
    // untiled: stamp a default device layout first, then linearize
    tmp = shape
    ForEachMutableSubshape(tmp, UpdateLayout)                    // 0x1d6b05a0 - fill linear layout
    return LayoutUtil::LinearIndex(tmp, multi_index)

QUIRK — LinearIndex refuses 64-bit element widths outright (CHECK(!ElementHasBitWidth(shape, 64)), line 646). Addressing into an f64/s64 buffer is done on its 32-bit component shapes, never on the 64-bit logical shape — consistent with the ShapeSizeBytesRaw split. The branch on tiles().size() > 1 is the same tiled/linear test as everywhere else: a populated tile list routes to the OSS tiled LinearIndex; an empty one stamps a default layout first.
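For intuition about what a tiled index computes, the sketch below gives generic tile-major addressing for a rank-2 array with one (sublane, lane) tile level. It assumes row-major order both across tiles and within a tile and returns an element index, not a byte offset. This is an illustrative model only; the OSS LayoutUtil::LinearIndex remains the authority for the actual ordering, and the function name here is hypothetical.

```python
def tiled_linear_index(row: int, col: int, padded_cols: int,
                       sublane: int = 8, lane: int = 128) -> int:
    """Element index of (row, col) in a tile-major [rows, padded_cols] buffer.

    ASSUMPTION: tiles are ordered row-major, and elements inside a tile
    are ordered row-major as well.
    """
    if padded_cols % lane != 0:
        raise ValueError("padded_cols must be a multiple of the lane count")
    tiles_per_row = padded_cols // lane
    tile = (row // sublane) * tiles_per_row + (col // lane)   # which tile
    within = (row % sublane) * lane + (col % lane)            # offset inside it
    return tile * sublane * lane + within


print(tiled_linear_index(0, 127, 256))  # 127: last lane of the first tile row
print(tiled_linear_index(1, 0, 256))    # 128: next sublane, same tile
print(tiled_linear_index(0, 128, 256))  # 1024: first element of the second tile
```

Note how column 128 jumps a whole tile (1024 elements) ahead of column 127, which is the difference from a dense row-major buffer.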

The repack kernel

RepackToHardwareLayout<SUBLANE, 128> (instantiated <2,128> @ 0x1d5c2880, <16,128> @ 0x1d5c3b40, <32,128> @ 0x1d5c3f80) is the hand-written AVX routine that shuffles a contiguous host buffer into hardware tile order before DMA. It processes the buffer in 2048-byte (0x800) tiles (size >> 11 whole tiles + size & 0x7FF remainder):

function RepackToHardwareLayout<S,128>(dst, size_bytes, src, pad_flag):  // e.g. 0x1d5c3b40
    whole_tiles = size_bytes >> 11          // 2048-byte tiles
    rem         = size_bytes & 0x7FF
    for t in [0 .. whole_tiles):
        // AVX interleave: vpunpcklbw/vpunpckhbw/vpunpcklwd transpose
        // 4-bit/2-bit field packs via vpsllw + vpand with masks
        //   dword_84A2F00 / dword_84A2580 / byte_84A2D4C / dword_84A2A98
        transpose_tile(dst + t*512, src + t*2048)
    if rem:
        copy rem bytes into a scratch tile
        if pad_flag: memset(scratch + rem, 0xFF, 2048 - rem)   // FILL TRAILING PAD WITH 0xFF
        transpose_tile(dst_tail, scratch)

NOTE — the pad_flag argument is the on-device tile-padding made physical: when set, the trailing partial tile is filled with 0xFF (memset(..., 255, 2048 - rem)) before the SIMD transpose. The 0xFF fill is the byte pattern that lands in the padding lanes/sublanes of a buffer whose logical extent did not fill its last tile. A reimplementation that zero-fills instead will produce byte-different (though numerically equivalent for most reductions) HBM contents, and any checksum/SDC path that hashes raw HBM will disagree. The fixed 2048 tile (= 512 B packed output × the 4-way unpack) is generation-independent across the <2/16/32,128> instantiations.
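The tile bookkeeping around the repack (whole tiles, remainder, 0xFF fill) is easy to model separately from the SIMD transpose. This sketch only performs the split and the tail padding; it does not attempt the byte interleave. The function name is hypothetical, and with pad disabled it simply returns the short tail, whereas the real scratch tile's trailing contents in that case are not described on this page.

```python
TILE_BYTES = 0x800  # 2048-byte repack tile


def split_into_tiles(src: bytes, pad: bool = True) -> list[bytes]:
    """Cut `src` into 2048-byte tiles, padding a partial tail with 0xFF."""
    whole_tiles = len(src) >> 11        # size_bytes >> 11
    rem = len(src) & 0x7FF              # size_bytes & 0x7FF
    tiles = [src[t * TILE_BYTES:(t + 1) * TILE_BYTES] for t in range(whole_tiles)]
    if rem:
        tail = src[whole_tiles * TILE_BYTES:]
        if pad:
            tail += b"\xff" * (TILE_BYTES - rem)   # trailing pad is 0xFF, not zero
        tiles.append(tail)
    return tiles


tiles = split_into_tiles(bytes(5000))
print(len(tiles), len(tiles[-1]), tiles[-1][-1])  # 3 2048 255
```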

Function Map

| Function | Address | Role |
|---|---|---|
| TransferSizeUtil::LinearIndex | 0x1d6b0600 | Tiled/linear multi-index → flat offset |
| TransferSizeUtil::UpdateLayout | 0x1d6b05a0 | Stamp default device layout (per-subshape) |
| TransferSizeUtil::UpdateLeafLayout | 0x1d6b08c0 | Stamp tile on a single leaf |
| RepackToHardwareLayout<2,128> | 0x1d5c2880 | SIMD host→tile repack, sublane 2 |
| RepackToHardwareLayout<16,128> | 0x1d5c3b40 | SIMD host→tile repack, sublane 16 |
| RepackToHardwareLayout<32,128> | 0x1d5c3f80 | SIMD host→tile repack, sublane 32 |

6. The Device-Buffer Residency Record

Purpose

A logical buffer can be a tuple of many leaves, and each leaf lives at its own HBM address. The runtime tracks "which device addresses hold this buffer's leaves, and what is the on-device shape" in an xla::ShapedBuffer, mirrored across the C ABI as XLA_ShapedBuffer. This is the residency record the transfer manager, the executable, and PJRT pass around; the allocator hands out the addresses (hbm-allocator.md), and this record remembers them.

The XLA_ShapedBuffer record

Recovered from ApiConverter::ToC(const xla::ShapedBuffer&, XLA_ShapedBuffer*) (0xfcb8580):

xla::ShapedBuffer  (the on-device residency record)
  +336  on_device_shape   (xla::Shape; the PADDED device shape from §2)
  +656  device_ordinal    (int)
  +664  buffers           (ShapeTree<DeviceAddressBase>; one leaf per tuple element)

XLA_ShapedBuffer  (C-ABI mirror)
  +0..   on_device_shape  (XLA_Shape, via ApiConverter::ToC)
  +536   device_ordinal   (int)
  +544   addresses[]      (SE_DeviceAddressBase array, 24 B stride)   count @ +552

Each leaf is a SE_DeviceAddressBase (the stream_executor::DeviceAddressBase mirror), a 48-byte record in the source vector copied down to a 24-byte C-ABI entry of {opaque_ptr, size, payload}. The (opaque_ptr, size) pair is the HBM (offset, byte-size) the allocator returned; size is exactly the ShapeSizeBytesRaw of that leaf's padded device shape.
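A reader inspecting a dumped addresses[] array could decode the 24-byte entries along these lines. The sketch assumes each entry is three little-endian 8-byte unsigned fields in the order {opaque_ptr, size, payload}. The field widths and byte order are assumptions inferred from the 24-byte stride, not confirmed facts, and the names below are hypothetical.

```python
import struct
from typing import NamedTuple

# ASSUMPTION: three little-endian u64 fields per 24-byte entry.
ENTRY = struct.Struct("<QQQ")


class DeviceAddress(NamedTuple):
    opaque_ptr: int   # HBM offset handed out by the allocator
    size: int         # byte size of the leaf's padded device shape
    payload: int


def parse_addresses(blob: bytes, count: int) -> list[DeviceAddress]:
    """Decode `count` consecutive 24-byte entries from `blob`."""
    if len(blob) < count * ENTRY.size:
        raise ValueError("blob too short for the requested entry count")
    return [DeviceAddress(*ENTRY.unpack_from(blob, i * ENTRY.size))
            for i in range(count)]


# Synthetic two-leaf example (values are made up for illustration).
blob = ENTRY.pack(0x1000, 4096, 0) + ENTRY.pack(0x2000, 8192, 0)
for leaf in parse_addresses(blob, 2):
    print(hex(leaf.opaque_ptr), leaf.size)
```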

NOTE — the record stores the on-device (padded, tiled) shape, not the host shape. Two leaves with the same logical host shape but different memory_space colors are distinct device shapes and occupy different tiers. The single device_ordinal pins the whole buffer to one chip; a buffer never straddles devices at this layer (cross-device is sharding, above this). The MaybeOwningDeviceMemory / ScopedShapedBuffer variants (seen in XlaComputationLaunchContext::PopulateOutputs @ 0xeadbb80) wrap the same record with an ownership flag deciding whether dropping the record frees the HBM — see buffer-donation-aliasing.md.

The driver-side buffer handle

Below the XLA ShapedBuffer, the driver moves bytes through MaybeOwningDmaBuffer (seen in JfHbmWriteQueue and TpuPxcDriver::WriteToMemoryHelper), a (ptr, size, optional<SyncFlag>) tuple. The optional<SyncFlag> is the completion handshake; the DMA into the buffer signals it on completion. This is the wire end of the residency record: the ShapedBuffer's per-leaf address becomes a MaybeOwningDmaBuffer at DMA-issue time. The SyncFlag protocol is sflag-protocol.md; the DMA alignment floor that constrains the leaf offset is hbm-dma-alignment.md.

Function Map

| Function | Address | Role |
|---|---|---|
| ApiConverter::ToC(const xla::ShapedBuffer&, XLA_ShapedBuffer*) | 0xfcb8580 | Residency record → C ABI |
| ApiConverter::FromC(XLA_ShapedBuffer*) | 0xfcb7000 | C ABI → xla::ShapedBuffer |
| ApiConverter::ToC(const DeviceAddressBase&) | 0xfcb78c0 | Per-leaf (ptr,size) → SE_DeviceAddressBase |
| ApiConverter::ToC(const xla::ShapeIndex&) | 0xfcb82e0 | Leaf index within the tuple tree |
| XlaComputationLaunchContext::PopulateOutputs | 0xeadbb80 | Build output ScopedShapedBuffers post-execute |
| TpuTransferManager::CanShapedBufferBeAccessedNow | 0xeaba6e0 | Residency-readiness check |
| TpuTransferManager::ReadDynamicShapes | 0xe9735a0 | Read back dynamic dims into the shape |

| Component | Relationship |
|---|---|
| xla::jellyfish::TransferSizeUtil | Owns the whole shape→device-shape mapping, tile padding, and byte sizing on this page |
| xla::jellyfish::HardwareLayout (PopulateShape / ComponentShape) | Stamps the tile + memory_space and splits 64/128-bit elements |
| TpuTransferManager (0xeab0… family) | The C-ABI surface that calls HostShapeToDeviceShape and the size functions |
| ApiConverter (0xfcb7… family) | Marshals Shape/Layout/ShapedBuffer across the PJRT C ABI |
| RepackToHardwareLayout<S,128> | The SIMD kernel that physically realises the tiling for DMA |

Cross-References

  • layout-assignment.md — the compile-time pass that chooses minor_to_major and stamps the (SublaneCount, LaneCount) tile this page consumes
  • overview.md — the six-region memory taxonomy and the memory_space enum that colors each leaf's tier
  • hbm-allocator.md — the BestFitAllocator that hands out the HBM offset stored in the residency record's DeviceAddressBase
  • hbm-dma-alignment.md — the 1024-B DMA floor vs. 16-KiB compile-time alignment that constrains each leaf's offset
  • vmem-allocator.md — the on-chip kAlternate tier whose word size feeds the per-tier tile/granule
  • buffer-donation-aliasing.md — MaybeOwningDeviceMemory ownership and input/output aliasing on the residency record
  • sflag-protocol.md — the optional<SyncFlag> completion handshake on the driver-side MaybeOwningDmaBuffer
  • ../compiler/tpu-program-serialization.md — how the ComputationLayout (input/output device shapes) is serialized into the executable
  • ../pjrt/buffer-and-memory.md — the PJRT_Buffer ABI and external refcounting above this layer
  • back to index — Part X — On-Chip Memory & DMA