TpuTransferManager Roster

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, ELF x86-64 DYN, not stripped; demangled C++ symbols and IDA-recovered C names quoted verbatim). .text VMA equals file offset. Other versions will differ.

Abstract

TpuTransferManager_* is the C-ABI cluster that backs xla::TransferManager across the TfTpu C-API shim. It is the host↔device data-movement surface: marshal an xla::Literal from the host into a TPU xla::ShapedBuffer (and back), push/pull streaming infeed/outfeed literals, ask the byte size of a shape on-device, linearize a literal into raw device-layout buffers, and answer the two "can I touch this buffer right now" predicates. The cluster consists of nineteen extern "C" free functions, recovered by IDA from .rodata references and call targets and all named TpuTransferManager_<Method>. Eighteen of them live in one tight 0xeaba0a0–0xeabb827 band of .text; the one outlier, GetInfeedLayout @ 0xf6a1a80, sits in a different translation unit. The IDA source-path string learning/45eac/tfrc/executor/stream_executor/tpu_transfer_manager_c_api.cc (GetInfeedLayout line 28) pins the cluster's origin file.

The defining structural fact — and the one that separates this roster from TpuExecutor_* — is how the C function reaches the real implementation. TpuExecutor_* functions receive an opaque SE_StreamExecutor* and dispatch through its vtable into the deepsea driver. TpuTransferManager_* functions receive no executor handle; instead each one resolves the singleton xla::TransferManager for the TPU platform on the fly — GetRegisteredDeepseaPlatform() (cached behind a GetUnderlyingDeepseaPlatform::platform Meyers-static guard) → xla::TransferManager::GetForPlatform(platform) @ 0x1342f180 (a StatusOr<TransferManager*>) — and then bounces the call through that manager's C++ vtable at a fixed byte offset. So the whole cluster is a resolve-then-bounce shim: marshal the C structs into C++ via ApiConverter::FromC, look up the per-platform TransferManager singleton, invoke one vtable slot, marshal results back with ApiConverter::ToC, and destroy the temporaries.

This page owns the function roster + per-function impl-symbol / vtable-slot map. The opaque-handle / ApiConverter::ToC/FromC convention and the three-table *ApiFn accessor model are on the shim overview. The runtime-level infeed/outfeed queue mechanism — the PJRT-native tpu::System path, span chunking, blocking semantics, the on-chip queue driver — is on Infeed / Outfeed Queues; the two TpuTransferManager_TransferLiteral{ToInfeed,FromOutfeed} rows here are the legacy StreamExecutor entry into that subsystem and that page documents them as the legacy half. The PJRT buffer/memory ABI that the device-side ShapedBuffer belongs to is on PJRT Buffer & Memory.

For reimplementation, the contract is:

  • The resolve-then-bounce idiom — no SE_* handle is passed; each function resolves the per-platform xla::TransferManager singleton (GetForPlatform @ 0x1342f180) and calls one vtable slot. The C++ TransferManager vtable offset is the dispatch key.
  • The marshalling discipline — every C argument that is a shape / layout / literal / shaped-buffer is ApiConverter::FromC'd into a stack C++ object before the bounce and ~Dtor'd after; out-params come back through ApiConverter::ToC; absl::Status results are returned as a refcounted StatusRep* written into a caller out-pointer.
  • The vtable-slot map — the table below pins each C function to its xla::TransferManager virtual offset (+16, +24, +32, +40, +48, +56, +64, +72, +80, +88, +104, +112, +120, +136), which is the single thing a reimplementer must keep byte-stable.
  • The non-vtable members: New/Free are trivial heap ops, and LinearizeToBuffers / GetInfeedLayout bypass the manager vtable entirely and call the xla::jellyfish linearizer / TransferSizeUtil directly off the resolved TpuTopology.

Roster size: 19 extern "C" TpuTransferManager_* free functions (matches overview count)
Address band: 0xeaba0a0–0xeabb827 contiguous (18 fns; ReadDynamicShapes ends at 0xeabb827, then TpuComputationPlacer_New @ 0xeabb840) + GetInfeedLayout @ 0xf6a1a80 (outlier TU)
Backing C++ class: xla::TransferManager (TPU subclass; resolved per call, not held)
Singleton resolve: GetRegisteredDeepseaPlatform → xla::TransferManager::GetForPlatform @ 0x1342f180 (StatusOr)
Platform cache: GetUnderlyingDeepseaPlatform::platform (function-local static, __cxa_guard-protected)
Dispatch key: C++ xla::TransferManager vtable byte offset, (*(*manager + off))(manager, …)
Marshalling: ApiConverter::FromC (in) / ToC (out); xla::Shape / Layout / LiteralSlice / MutableBorrowingLiteral / ShapedBuffer temporaries
Status out: absl::status_internal::StatusRep* written to caller out-ptr, old value Unref'd
Reached via: ExecutorApiFn table slots (populated by TfTpu_Initialize Bootstrap)
Origin file: learning/45eac/tfrc/executor/stream_executor/tpu_transfer_manager_c_api.cc
Evidence grade: Reimplementation-grade / byte-confirmed against IDA decompile (19/19 bodies inspected)

Scope — the per-function ExecutorApiFn slot that points at each of these (and when it is populated) belongs to TfTpu_Initialize Bootstrap. The streaming-queue mechanism the infeed/outfeed pair bottoms out in is owned by Infeed / Outfeed Queues. This page documents the C-ABI roster and the C++-vtable bounce only.


1. The Resolve-then-Bounce Shape of Every Function

Purpose

Every non-trivial TpuTransferManager_* function has the same skeleton, and a reimplementer who internalises it once can read all fourteen vtable-backed members by inspecting only their slot offset and argument marshalling. There is no per-call executor handle to thread; the platform is a process-global singleton.

Algorithm

// canonical body — sub_EABA*  (e.g. GetByteSizeRequirement @ 0xeaba4c0)
function TpuTransferManager_<Method>(<C args...>):
    // 1. marshal every rich C struct arg into a stack C++ object
    Shape         host_shape;  ApiConverter::FromC(&host_shape, c_shape)    // 320-byte xla::Shape
    // (literals → MutableBorrowingLiteral/LiteralSlice; buffers → 784-byte xla::ShapedBuffer)

    // 2. resolve the per-platform TransferManager singleton (cached after first call)
    if !GetUnderlyingDeepseaPlatform::platform.guard:                       // __cxa_guard_acquire
        GetUnderlyingDeepseaPlatform::platform =
            deepsea::executor::GetRegisteredDeepseaPlatform()               // one-time
    StatusOr<TransferManager*> mgr =
        xla::TransferManager::GetForPlatform(platform)                      // 0x1342f180
    if mgr.is_error():                                                      // payload ptr != &dword_0+1
        absl::internal_statusor::ThrowBadStatusOrAccess(mgr)               // does not return; dead on a healthy TPU

    // 3. bounce through ONE C++ vtable slot — the offset is the dispatch key
    TransferManager* m = mgr.value();
    result = (*(*(void**)m + <VTABLE_OFFSET>))(m, <unwrapped args>, host_shape)

    // 4. marshal results back out; destroy temporaries (reverse order)
    ApiConverter::ToC(&out_cpp, c_out)         // for out-param functions
    xla::Shape::~Shape(&host_shape)            // every FromC'd temp gets its ~Dtor
    return result

The mgr.is_error() check is the recovered StatusOr idiom: the success sentinel is a payload pointer equal to &dword_0 + 1 (a tagged "ok" value). Any other value denotes an error status, so the code takes a reference on it and calls ThrowBadStatusOrAccess, which does not return. On a correctly initialised TPU platform the manager always resolves, so this branch is dead in practice but must be reproduced for ABI parity.
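The four steps above can be sketched in compilable form. This is a minimal illustration under stated assumptions, not the recovered code: Shape, XLA_Shape, Platform, ManagerOrError, FakeTpuTransferManager and every Sketch_* name are hypothetical stand-ins, and an ordinary virtual call replaces the raw byte-offset dispatch.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-ins; none of these are the real XLA or libtpu types.
struct Shape { std::int64_t elements; std::int64_t bytes_per_element; };
struct XLA_Shape { std::int64_t elements; std::int64_t bytes_per_element; };

struct TransferManager {
  virtual ~TransferManager() = default;
  virtual std::int64_t GetByteSizeRequirement(const Shape& s) const = 0;
};

struct FakeTpuTransferManager final : TransferManager {
  std::int64_t GetByteSizeRequirement(const Shape& s) const override {
    return s.elements * s.bytes_per_element;
  }
};

struct Platform { bool registered; };

// One-time platform lookup cached in a function-local static; the compiler
// emits the guarded-initialisation sequence for it.
inline Platform* GetUnderlyingPlatform() {
  static Platform platform{true};
  return &platform;
}

// StatusOr-like result: exactly one of the two fields is non-null.
struct ManagerOrError { TransferManager* value; const char* error; };

// Per-call manager lookup keyed by the platform.
inline ManagerOrError GetForPlatform(Platform* p) {
  static FakeTpuTransferManager manager;
  if (p == nullptr || !p->registered) return {nullptr, "no manager registered"};
  return {&manager, nullptr};
}

// C-ABI entry point: note that no executor or manager handle is passed in.
extern "C" std::int64_t Sketch_GetByteSizeRequirement(const XLA_Shape* c_shape) {
  Shape host_shape{c_shape->elements, c_shape->bytes_per_element};  // 1. FromC
  ManagerOrError mgr = GetForPlatform(GetUnderlyingPlatform());     // 2. resolve
  if (mgr.error != nullptr) std::abort();  // stands in for ThrowBadStatusOrAccess
  return mgr.value->GetByteSizeRequirement(host_shape);             // 3. bounce
}  // 4. host_shape is destroyed on return
```

The point of the shape is that the C signature carries only data: the platform lookup is cached, while the manager lookup is repeated on every call.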

Why no executor handle

xla::TransferManager in upstream XLA is a per-platform object retrieved from a static registry (TransferManager::GetForPlatform), not a per-device object. The TPU build keeps that model: there is exactly one TPU TransferManager for the process, so the C shim has nothing device-specific to pass and resolves the singleton itself. Contrast TpuExecutor_*, where the executor is the per-device handle and must be threaded through every call.

QUIRK — the manager pointer is never freed by these functions. New/Free allocate and release a 1-byte placeholder (see §2) that the host treats as the "transfer manager handle," but the real xla::TransferManager is the process singleton resolved on every call and outlives every handle. A reimplementer who tries to store device state in the New'd object will find it is a dummy; all state lives in the singleton.


2. Lifecycle (New / Free)

Function Map

| Function | Address | Size (bytes) | Impl |
|---|---|---|---|
| TpuTransferManager_New | 0xeaba0a0 | 10 | return operator new(1u); allocates a 1-byte opaque placeholder handle |
| TpuTransferManager_Free | 0xeaba0c0 | 16 | if (h) free(h); releases the placeholder |

// TpuTransferManager_New   sub_EABA0A0
void* TpuTransferManager_New():  return operator new(1)   // dummy handle, no fields

// TpuTransferManager_Free  sub_EABA0C0
void TpuTransferManager_Free(void* h):  if (h) free(h)

GOTCHA — the handle from New carries no state. The actual transfer machinery is the GetForPlatform singleton, lazily resolved on first method call and cached in GetUnderlyingDeepseaPlatform::platform. New/Free exist only so the host's TpuTransferManager C++ shim has an opaque this to pass; mismatching New/Free calls leak/double-free 1 byte, harmless to device state but a host-side allocator bug.
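A reimplementation of the placeholder pair can be as small as the sketch below. It deliberately swaps operator new(1) for malloc(1) so that the free() in the release path is a matched pair under any allocator configuration; that substitution and the Sketch* names are assumptions of this example, not something recovered from the binary.

```cpp
#include <cstdlib>

// Hypothetical opaque handle type; it is never defined because it has no fields.
struct SketchTransferManagerHandle;

extern "C" SketchTransferManagerHandle* Sketch_TransferManager_New() {
  // One byte gives the host a unique, non-null pointer to pass around.
  return static_cast<SketchTransferManagerHandle*>(std::malloc(1));
}

extern "C" void Sketch_TransferManager_Free(SketchTransferManagerHandle* h) {
  if (h != nullptr) std::free(h);
}
```

Two handles obtained this way are distinct pointers, yet every method call made with either one would reach the same process-wide manager.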


3. Host → Device

Purpose

Move a host xla::Literal into device memory (ShapedBuffer), and answer the layout/size questions a caller needs before allocating that device buffer. All four bounce through the TransferManager vtable.

Function Map

| Function | Address | Size (bytes) | Vtable slot | C++ call (unwrapped) |
|---|---|---|---|---|
| TpuTransferManager_TransferLiteralToDeviceAsync | 0xeaba240 | 287 | +40 | m->TransferLiteralToDevice(stream, LiteralSlice(lit), shaped_buf, opts=0) |
| TpuTransferManager_GetByteSizeRequirement | 0xeaba4c0 | 165 | +80 | m->GetByteSizeRequirement(host_shape) → int64 |
| TpuTransferManager_HostShapeToDeviceShape | 0xeaba160 | 207 | +24 | m->HostShapeToDeviceShape(host_shape) → Shape (out via ToC) |
| TpuTransferManager_ChooseCompactLayoutForShape | 0xeaba580 | 339 | +88 | m->ChooseCompactLayoutForShape(host_shape) → StatusOr<Shape> |

Algorithm — TransferLiteralToDeviceAsync

// TpuTransferManager_TransferLiteralToDeviceAsync   sub_EABA240
// args: (a1=handle, a2=&SE_Stream, a3=XLA_Literal*, a4=XLA_ShapedBuffer*, a5=StatusRep** out)
ApiConverter::FromC(&lit,  a3)          // 24-byte MutableBorrowingLiteral
ApiConverter::FromC(&buf,  a4)          // 784-byte xla::ShapedBuffer
m      = GetForPlatform(platform).value()
stream = *a2                            // raw SE_Stream pointer, not converted
LiteralSlice slice(&lit)                // wrap the borrowing literal as a slice
status = (*(*m + 40))(m, stream, &slice, &buf, /*opts*/0)   // vtable +40
write_status_out(a5, status)            // *a5 = status; Unref(old) unless aliased
~LiteralBase(&slice); ~ShapedBuffer(&buf); ~MutableBorrowingLiteral(&lit)

The XLA_Literal arrives as a MutableBorrowingLiteral (it borrows host memory the caller still owns) and is re-wrapped as a LiteralSlice for the transfer call, because the transfer reads the host literal and never writes it. The SE_Stream is passed raw (*a2), with no ApiConverter step, as in the other stream-taking members (TransferLiteralFromDevice, ReadDynamicShapes, WriteSingleTupleIndexTable); it is the async-ordering token that the transfer enqueues against.
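The write_status_out step in the body above (store the new status, Unref the old one unless the two alias) can be sketched as follows. SketchStatusRep, Unref and WriteStatusOut are hypothetical names, the layout is not that of absl::status_internal::StatusRep, and using nullptr to mean OK is an assumption made only for this example.

```cpp
#include <string>

// Hypothetical refcounted status payload; not the real StatusRep layout.
struct SketchStatusRep {
  int refs;
  int code;
  std::string message;
};

// Drops one reference and deletes the payload when the count reaches zero.
inline void Unref(SketchStatusRep* rep) {
  if (rep != nullptr && --rep->refs == 0) delete rep;
}

// Stores `fresh` in the caller's slot, transferring the reference the callee
// holds. The previous occupant is released unless it is the same object, in
// which case the slot already owns that reference and nothing changes.
inline void WriteStatusOut(SketchStatusRep** out, SketchStatusRep* fresh) {
  SketchStatusRep* old = *out;
  if (old == fresh) return;  // aliased: self-assignment
  *out = fresh;
  Unref(old);
}
```

The aliasing check matters because releasing the old value first would destroy the very object about to be stored.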

NOTE — GetByteSizeRequirement (+80) and HostShapeToDeviceShape (+24) are pure shape→scalar / shape→shape queries: they FromC only the host Shape (320-byte stack object), bounce, and (for HostShapeToDeviceShape) ToC the resulting device Shape into the out-param. They allocate no device memory and touch no stream. ChooseCompactLayoutForShape is the third shape-only query; its +88 slot was read directly from the (*(*m + 88))(…) bounce (out StatusRep** first arg, StatusOr<Shape> shape).


4. Device → Host

Purpose

Pull a device ShapedBuffer back into a host literal, and read the dynamic dimensions a device buffer carries (for dynamic-shape programs). TransferLiteralFromDevice reports completion through a callback; ReadDynamicShapes hands its result back through an out Shape.

Function Map

| Function | Address | Size (bytes) | Vtable slot | C++ call (unwrapped) |
|---|---|---|---|---|
| TpuTransferManager_TransferLiteralFromDevice | 0xeaba360 | 352 | +32 | m->TransferLiteralFromDevice(stream, shaped_buf, MutableBorrowingLiteral, done_cb, opts=0) |
| TpuTransferManager_ReadDynamicShapes | 0xeabb660 | 455 | +48 | m->ReadDynamicShapes(stream, shaped_buf, &out_shape) |

Algorithm — TransferLiteralFromDevice

// TpuTransferManager_TransferLiteralFromDevice   sub_EABA360
// args: (a1, a2=&SE_Stream, a3=XLA_ShapedBuffer*, a4=XLA_Literal*, a5=StatusRep*, a6=callback ctx)
ApiConverter::FromC(&buf, a3)           // 784-byte ShapedBuffer
ApiConverter::FromC(&lit, a4)           // MutableBorrowingLiteral (host dest)
m      = GetForPlatform(platform).value()
stream = *a2
copy_construct(&lit2, &lit); copy_construct(&lit3, &lit2)   // 2 MBL copies for the closure
// build std::function<void(absl::Status)> closure capturing (a5=StatusRep out, a6=ctx)
done_cb = { __call_func = TpuTransferManager_TransferLiteralFromDevice::$_0,
            policy      = …::__create<$_0>() }
(*(*m + 32))(m, stream, &buf, &lit3, &done_cb, /*opts*/0)   // vtable +32, async
if done_cb.policy.dtor: done_cb.policy.dtor(captured)        // tear down closure
~MutableBorrowingLiteral(×3); ~ShapedBuffer(&buf)

Unlike the host→device path, this one builds a real std::function<void(absl::Status)> completion callback (the $_0 lambda + a __policy_func thunk; both lambda thunks survive in the symbol table as _ZNSt3__u…TransferLiteralFromDeviceE3$_0E…). The callback writes the final status into the caller's StatusRep* slot when the async device read completes. The double MutableBorrowingLiteral copy is the closure capturing the destination literal by value-of-borrow so it outlives the synchronous return.
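The completion-callback pattern described above can be illustrated with a toy queue. Everything here is a hypothetical stand-in (SketchStatus, SketchDeviceQueue, SketchBorrowedLiteral, SketchTransferFromDevice); the sketch only shows why the closure captures the destination by value and why the status slot is written after the call has already returned.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct SketchStatus { int code; std::string message; };  // hypothetical

// Hypothetical async engine: stores callbacks and fires them on completion.
class SketchDeviceQueue {
 public:
  void EnqueueRead(std::function<void(SketchStatus)> done) {
    pending_.push_back(std::move(done));
  }
  void CompleteAll(const SketchStatus& s) {
    std::vector<std::function<void(SketchStatus)>> ready;
    ready.swap(pending_);
    for (auto& cb : ready) cb(s);
  }
 private:
  std::vector<std::function<void(SketchStatus)>> pending_;
};

// Borrows host memory that the caller keeps alive until completion.
struct SketchBorrowedLiteral { float* data; std::size_t count; };

// Returns immediately; *status_out is written only when the read completes.
inline void SketchTransferFromDevice(SketchDeviceQueue& q,
                                     SketchBorrowedLiteral dest,
                                     SketchStatus* status_out) {
  // `dest` is copied into the closure so it outlives this stack frame.
  q.EnqueueRead([dest, status_out](SketchStatus s) {
    if (s.code == 0) {
      for (std::size_t i = 0; i < dest.count; ++i) dest.data[i] = 1.0f;  // pretend payload
    }
    *status_out = std::move(s);
  });
}
```

As in the recovered function, the caller must keep both the destination memory and the status slot alive until the callback has run.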

NOTE — ReadDynamicShapes (+48) reads the runtime-resolved dimensions of a dynamic-shape device buffer; it FromCs the ShapedBuffer, bounces, and ToCs an out Shape. Slot +48 was read directly from the call *0x30(%rax) (*(*m + 48)) bounce — it sits just above TransferLiteralToDevice (+40), not between the predicates as a naïve roster ordering would suggest.


5. Infeed / Outfeed (legacy StreamExecutor path)

Purpose

The streaming host↔device channels: enqueue a host literal into the on-chip infeed FIFO, dequeue a result literal from the outfeed FIFO. These are the legacy entry into the queue subsystem — they marshal through the C-shim and the TransferManager vtable into the deepsea driver. The modern PJRT-native path (tpu::System::EnqueueInfeed) bypasses this cluster entirely. See Infeed / Outfeed Queues for the queue mechanism, span chunking, and blocking semantics; this section documents only the two C-ABI rows and their vtable bounce.

Function Map

| Function | Address | Size (bytes) | Vtable slot | C++ call (unwrapped) |
|---|---|---|---|---|
| TpuTransferManager_TransferLiteralToInfeed | 0xeabafa0 | 241 | +56 | m->TransferLiteralToInfeed(executor, LiteralSlice(lit)) |
| TpuTransferManager_TransferBuffersToInfeed | 0xeabb0a0 | 638 | +136 | infeed of pre-linearized device buffers |
| TpuTransferManager_TransferLiteralFromOutfeed | 0xeabb320 | 260 | +64 | m->TransferLiteralFromOutfeed(executor, MutableBorrowingLiteral) |
| TpuTransferManager_GetInfeedLayout | 0xf6a1a80 | 163 | (no vtable; see §7) | TransferSizeUtil::ChooseGoodInfeedLayout(topology, shape) |

Algorithm — TransferLiteralToInfeed / FromOutfeed

// TpuTransferManager_TransferLiteralToInfeed   sub_EABAFA0
// args: (a1, a2=&SE_StreamExecutor, a3=XLA_Literal*, a4=StatusRep** out)
ApiConverter::FromC(&lit, a3)                  // MutableBorrowingLiteral
m  = GetForPlatform(platform).value()
ex = *a2                                       // the executor IS the queue selector here
LiteralSlice slice(&lit)
status = (*(*m + 56))(m, ex, &slice)           // vtable +56  → driver EnqueueInfeed
write_status_out(a4, status); ~LiteralBase(&slice); ~MutableBorrowingLiteral(&lit)

// TpuTransferManager_TransferLiteralFromOutfeed   sub_EABB320
// args: (a1, a2=&SE_StreamExecutor, a3=XLA_Shape*, a4=XLA_Literal*, a5=StatusRep** out)
ApiConverter::FromC(&shape, a3)                // 320-byte Shape (the expected outfeed shape)
m  = GetForPlatform(platform).value()
ex = *a2
ApiConverter::FromC(&lit, a4)                  // MutableBorrowingLiteral (host dest)
status = (*(*m + 64))(m, ex, &lit)             // vtable +64  → driver DequeueOutfeed
write_status_out(a5, status); ~MutableBorrowingLiteral(&lit); ~Shape(&shape)

Both pass the SE_StreamExecutor* (*a2) into the vtable call; here the executor selects which device's infeed/outfeed queue is used. Elsewhere in this cluster an executor appears only in the two access predicates and ResetDevices (§6). The infeed call wraps the literal as a read-only LiteralSlice; the outfeed call passes a writable MutableBorrowingLiteral for the result. Slot +56 lands in DeepseaExecutor::EnqueueInfeed, +64 in DequeueOutfeed (the driver leaves are mapped on the Infeed / Outfeed page, §legacy path).

Note — two distinct addresses back the same infeed/outfeed call, and a reimplementer must not conflate them. The Infeed / Outfeed page anchors the host-side TpuTransferManager::TransferLiteralToInfeed C++ shim at 0xe9721c0 and FromOutfeed at 0xe972660, reached through ExecutorApiFn()+560/+576 — that is the caller half (the SE shim that forwards into a *ApiFn slot). The 0xeabafa0/0xeabb320 functions documented here are the callee half — the C-ABI implementations the slot points at. Both halves live in this binary because XLA is statically linked.

NOTE — TransferBuffersToInfeed (0xeabb0a0, 638 bytes) is the pre-linearized variant: instead of a literal it takes an array of already-device-layout buffers and enqueues them, skipping the linearizer. It bounces through vtable slot +136 (call *0x88(%rax)) — a separate, higher slot than TransferLiteralToInfeed's +56, not a shared infeed arm. The buffer pointer/length array argument (like LinearizeToBuffers' output) is FromC'd in a loop before the bounce.


6. Shape, Layout & Buffer-Access Predicates

Purpose

The synchronous metadata side: write a tuple index table into a device buffer (so the device can find each tuple element), and the two predicates that ask whether a buffer can be read/written on the host right now without a device sync.

Function Map

| Function | Address | Size (bytes) | Vtable slot | C++ call (unwrapped) |
|---|---|---|---|---|
| TpuTransferManager_WriteSingleTupleIndexTable | 0xeaba840 | 689 | +120 | m->WriteSingleTupleIndexTable(stream, device_addrs[], shape, &out_addr) |
| TpuTransferManager_CanShapedBufferBeAccessedNow | 0xeaba6e0 | 174 | +104 | m->CanShapedBufferBeAccessedNow(executor, shaped_buf) → bool |
| TpuTransferManager_CanBufferBeAccessedNow | 0xeaba7a0 | 141 | +112 | m->CanBufferBeAccessedNow(executor, device_addr) → bool |
| TpuTransferManager_PlatformId | 0xeaba0e0 | 117 | +16 | m->PlatformId() → se::Platform::Id |
| TpuTransferManager_ResetDevices | 0xeabb440 | 525 | +72 | m->ResetDevices(executors[]) |

Algorithm — the predicates and WriteSingleTupleIndexTable

// TpuTransferManager_CanBufferBeAccessedNow   sub_EABA7A0
// args: (a1, a2=&SE_StreamExecutor, a3=SE_DeviceAddressBase*)
ApiConverter::FromC(&addr, a3)                 // 24-byte DeviceAddressBase
m  = GetForPlatform(platform).value()
return (*(*m + 112))(m, *a2, &addr)            // vtable +112 → bool

// TpuTransferManager_CanShapedBufferBeAccessedNow   sub_EABA6E0
ApiConverter::FromC(&buf, a3)                   // 784-byte ShapedBuffer
m  = GetForPlatform(platform).value()
r  = (*(*m + 104))(m, *a2, &buf)                // vtable +104 → bool
~ShapedBuffer(&buf); return r

// TpuTransferManager_WriteSingleTupleIndexTable   sub_EABA840
// builds a heap vector<DeviceAddressBase> from the C address array (24 bytes each)
addrs_vec = operator new(24 * count)
for i in 0..count:  ApiConverter::FromC(&addrs_vec[i], <C address i>)   // FromC each region
ApiConverter::FromC(&shape, a_shape)
m  = GetForPlatform(platform).value()
(*(*m + 120))(m, stream, addrs_vec, count, &shape, &out_c_addr, opts)   // vtable +120

PlatformId (+16) is the simplest vtable member — no marshalling, just resolve and return the platform id integer. The two CanBe...AccessedNow predicates are the cheap host-side checks XLA uses to decide whether a host pointer into device-visible memory is coherent without forcing a stream sync; both take the SE_StreamExecutor* plus the buffer/address and return a bool. WriteSingleTupleIndexTable is the heaviest vtable member (689 bytes) because it builds a heap std::vector<DeviceAddressBase> from the C array (each element FromC'd into a 24-byte slot) before the +120 bounce.
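The array-to-vector marshalling that makes WriteSingleTupleIndexTable the largest vtable member can be sketched like this. The field names are hypothetical: the source only establishes that each element occupies 24 bytes, and the three 8-byte fields below are an assumption chosen to match that size on x86-64.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical C-side element; only the 24-byte size comes from the source.
extern "C" {
struct Sketch_SE_DeviceAddressBase {
  void* opaque;
  std::uint64_t size;
  std::uint64_t payload;
};
}
static_assert(sizeof(Sketch_SE_DeviceAddressBase) == 24,
              "sketch assumes the 24-byte x86-64 layout");

// Hypothetical C++-side counterpart of one device region.
struct SketchDeviceAddress {
  void* opaque;
  std::uint64_t size;
  std::uint64_t payload;
};

// Converts a single element across the C/C++ seam.
inline SketchDeviceAddress FromC(const Sketch_SE_DeviceAddressBase& c) {
  return SketchDeviceAddress{c.opaque, c.size, c.payload};
}

// Converts the whole C array with one up-front allocation, then one
// FromC per element, mirroring the loop described above.
inline std::vector<SketchDeviceAddress> FromCArray(
    const Sketch_SE_DeviceAddressBase* elems, std::size_t count) {
  std::vector<SketchDeviceAddress> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) out.push_back(FromC(elems[i]));
  return out;
}
```

Because the vector owns its storage, it is released automatically after the bounce, which is the tidy equivalent of the explicit teardown in the recovered code.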

GOTCHA — CanBufferBeAccessedNow (+112) takes a single SE_DeviceAddressBase (24 bytes); CanShapedBufferBeAccessedNow (+104) takes a whole XLA_ShapedBuffer (784 bytes) and must ~ShapedBuffer it after the call. They are adjacent vtable slots with swapped numeric order (the shaped-buffer variant is the lower offset +104); a reimplementer who assumes monotonic naming↔offset will mis-wire the table. ResetDevices (+72) takes an array of executors; its call *0x48(%rax) (*(*m + 72)) bounce was read directly from the disassembly.


7. The Two Non-Vtable Members — Direct Linearizer Calls

Purpose

LinearizeToBuffers and GetInfeedLayout are the cluster's odd pair: they do not dispatch through the xla::TransferManager vtable. Instead they resolve the tpu::TpuTopology directly and call xla::jellyfish machinery — the same linearizer / size-util the PJRT-native infeed path uses (see Infeed / Outfeed Queues, §layout contract). This is where the C-shim and the modern runtime share code.

Function Map

| Function | Address | Size (bytes) | Direct callee |
|---|---|---|---|
| TpuTransferManager_LinearizeToBuffers | 0xeabab00 | 594 | xla::jellyfish::LiteralLinearizer::LinearizeToBuffers(topology, …) |
| TpuTransferManager_GetInfeedLayout | 0xf6a1a80 | 163 | xla::jellyfish::TransferSizeUtil::ChooseGoodInfeedLayout(topology, shape) |

Algorithm

// TpuTransferManager_GetInfeedLayout   sub_F6A1A80   (tpu_transfer_manager_c_api.cc:28)
// args: (a1=XLA_Shape* in, a2=XLA_Shape* out)
ApiConverter::FromC(&shape, a1)
topology = GetTopology(GetUnderlyingDeepseaPlatform::platform)
CHECK(topology != nullptr)        // FATAL "topology != nullptr" if null
TransferSizeUtil::ChooseGoodInfeedLayout(&out_shape, topology, &shape)
ApiConverter::ToC(&out_shape, a2)
~Shape(&out_shape); ~Shape(&shape)

// TpuTransferManager_LinearizeToBuffers   sub_EABAB00
// args: (…, a3=XLA_Literal*, a4=XLA_Shape*, a5/a6 = out buffer ptr/size arrays, a8=StatusRep**)
ApiConverter::FromC(&lit,   a3)
ApiConverter::FromC(&shape, a4)
topology = platform->topology   // *(*(platform+8)+184)
status = LiteralLinearizer::LinearizeToBuffers(topology, &lit, &shape, &out_vec,
                                               InvokeObject<$_0…unique_ptr<uchar[]>>)
// hand the device-layout buffer vector back to C as two parallel new[] arrays:
*a6 = operator new(8 * count)   // sizes
*a5 = operator new(8 * count)   // pointers; each buffer memcpy'd into a fresh operator new
// (each entry copied from the linearizer's internal 56-byte-stride buffer descriptors)

GetInfeedLayout is a pure shape→layout helper (its FATAL-check source string tpu_transfer_manager_c_api.cc:28 pins the cluster's TU). LinearizeToBuffers is the heavyweight: it linearizes a host literal into device-tiled byte buffers and copies each into a freshly operator new'd block, returning two parallel new[] arrays (pointers + sizes) the host must later release with FreeBuffers.

FreeBuffers — the matching deallocator

// TpuTransferManager_FreeBuffers   sub_EABAF20
// args: (ptr_array, size_array, count)
void TpuTransferManager_FreeBuffers(void** ptrs, void* sizes, int64 count):
    for i in 0..count:  if ptrs[i]: free(ptrs[i])   // each buffer
    free(ptrs)                                       // the pointer array
    if sizes: free(sizes)                            // the size array

| Function | Address | Size (bytes) | Role |
|---|---|---|---|
| TpuTransferManager_FreeBuffers | 0xeabaf20 | 114 | frees the LinearizeToBuffers pointer array + each buffer + size array |

GOTCHA — LinearizeToBuffers allocates with operator new (per-buffer + two new[] arrays) but FreeBuffers releases with free(). On this glibc build operator new forwards to malloc, so the pair is balanced — but a reimplementer who wires operator new/operator delete[] to a different allocator than malloc/free will corrupt the heap. The ABI contract is "allocate so that free() releases it." Pair every LinearizeToBuffers with exactly one FreeBuffers(ptrs, sizes, count).
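One way to honour the "allocate so that free() releases it" contract regardless of how operator new is wired is to allocate with malloc/calloc on the producing side. The sketch below is an assumption-laden illustration: SketchExportBuffers and SketchFreeBuffers are hypothetical names, and the input is a plain vector of byte chunks rather than linearizer output.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Releases each buffer, then the pointer array, then the size array.
extern "C" void SketchFreeBuffers(unsigned char** ptrs, std::int64_t* sizes,
                                  std::int64_t count) {
  if (ptrs != nullptr) {
    for (std::int64_t i = 0; i < count; ++i) std::free(ptrs[i]);  // free(NULL) is a no-op
  }
  std::free(ptrs);
  std::free(sizes);
}

// Hands `chunks` to a C caller as two parallel arrays (pointers + sizes),
// every block coming from malloc/calloc so that free() is always valid.
inline bool SketchExportBuffers(
    const std::vector<std::vector<unsigned char>>& chunks,
    unsigned char*** out_ptrs, std::int64_t** out_sizes,
    std::int64_t* out_count) {
  const std::size_t n = chunks.size();
  unsigned char** ptrs = static_cast<unsigned char**>(
      std::calloc(n ? n : 1, sizeof(unsigned char*)));
  std::int64_t* sizes = static_cast<std::int64_t*>(
      std::calloc(n ? n : 1, sizeof(std::int64_t)));
  if (ptrs == nullptr || sizes == nullptr) {
    std::free(ptrs);
    std::free(sizes);
    return false;
  }
  for (std::size_t i = 0; i < n; ++i) {
    const std::size_t len = chunks[i].size();
    ptrs[i] = static_cast<unsigned char*>(std::malloc(len ? len : 1));
    if (ptrs[i] == nullptr) {  // roll back everything allocated so far
      SketchFreeBuffers(ptrs, sizes, static_cast<std::int64_t>(i));
      return false;
    }
    if (len != 0) std::memcpy(ptrs[i], chunks[i].data(), len);
    sizes[i] = static_cast<std::int64_t>(len);
  }
  *out_ptrs = ptrs;
  *out_sizes = sizes;
  *out_count = static_cast<std::int64_t>(n);
  return true;
}
```

Each successful export is paired with exactly one release call carrying the same count, the same pairing rule the recovered LinearizeToBuffers / FreeBuffers pair imposes.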


8. Complete Vtable-Slot Map

The single table a reimplementer needs: each C function, its address, and the xla::TransferManager vtable byte offset it bounces through. Every offset was read directly from the decompiled (*(*m + N))(…) expression and cross-checked against the call *0xNN(%rax) disassembly, so all rows are CERTAIN.

| C function | Address | Vtable off | C++ method (inferred) |
|---|---|---|---|
| New | 0xeaba0a0 | none | operator new(1) |
| Free | 0xeaba0c0 | none | free |
| PlatformId | 0xeaba0e0 | +16 | PlatformId() |
| HostShapeToDeviceShape | 0xeaba160 | +24 | HostShapeToDeviceShape(Shape) |
| TransferLiteralFromDevice | 0xeaba360 | +32 | TransferLiteralFromDevice(…, cb) |
| TransferLiteralToDeviceAsync | 0xeaba240 | +40 | TransferLiteralToDevice(…) |
| ReadDynamicShapes | 0xeabb660 | +48 | ReadDynamicShapes(…) |
| TransferLiteralToInfeed | 0xeabafa0 | +56 | TransferLiteralToInfeed(exec, LiteralSlice) |
| TransferLiteralFromOutfeed | 0xeabb320 | +64 | TransferLiteralFromOutfeed(exec, MBL) |
| ResetDevices | 0xeabb440 | +72 | ResetDevices(executors) |
| GetByteSizeRequirement | 0xeaba4c0 | +80 | GetByteSizeRequirement(Shape) |
| ChooseCompactLayoutForShape | 0xeaba580 | +88 | ChooseCompactLayoutForShape(Shape) |
| CanShapedBufferBeAccessedNow | 0xeaba6e0 | +104 | CanShapedBufferBeAccessedNow(exec, buf) |
| CanBufferBeAccessedNow | 0xeaba7a0 | +112 | CanBufferBeAccessedNow(exec, addr) |
| WriteSingleTupleIndexTable | 0xeaba840 | +120 | WriteSingleTupleIndexTable(…) |
| TransferBuffersToInfeed | 0xeabb0a0 | +136 | infeed of device buffers |
| LinearizeToBuffers | 0xeabab00 | none (direct) | jellyfish::LiteralLinearizer::LinearizeToBuffers |
| FreeBuffers | 0xeabaf20 | none (free) | n/a |
| GetInfeedLayout | 0xf6a1a80 | none (direct) | jellyfish::TransferSizeUtil::ChooseGoodInfeedLayout |

QUIRK — the vtable offsets are not contiguous-by-roster-order. The C functions are emitted in source order (New, Free, PlatformId, …) but their slots track the xla::TransferManager base-class vtable layout (+16, +24, +32, …), which interleaves base-class virtuals the C shim does not expose. A reimplementer building the C++ TransferManager subclass must reproduce the base-class vtable order, not the C-roster order, or every slot offset above is wrong by a frame.
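The slot map can be kept as data and checked mechanically. The sketch below encodes the fourteen offsets from the table and, assuming 8-byte vtable entries (x86-64), computes which slots inside the used range the C roster never touches. SlotEntry, kSlotMap, SlotIndex and UnexposedOffsets are hypothetical names introduced for this example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct SlotEntry { const char* c_function; std::size_t vtable_offset; };

// The fourteen vtable-backed rows, sorted by byte offset.
constexpr std::array<SlotEntry, 14> kSlotMap = {{
    {"PlatformId", 16},
    {"HostShapeToDeviceShape", 24},
    {"TransferLiteralFromDevice", 32},
    {"TransferLiteralToDeviceAsync", 40},
    {"ReadDynamicShapes", 48},
    {"TransferLiteralToInfeed", 56},
    {"TransferLiteralFromOutfeed", 64},
    {"ResetDevices", 72},
    {"GetByteSizeRequirement", 80},
    {"ChooseCompactLayoutForShape", 88},
    {"CanShapedBufferBeAccessedNow", 104},
    {"CanBufferBeAccessedNow", 112},
    {"WriteSingleTupleIndexTable", 120},
    {"TransferBuffersToInfeed", 136},
}};

constexpr std::size_t kPointerSize = 8;  // assumption: x86-64 vtable entries

// Converts a byte offset into a zero-based slot index.
constexpr std::size_t SlotIndex(std::size_t byte_offset) {
  return byte_offset / kPointerSize;
}

// Offsets between the first and last used slot that no C function bounces through.
inline std::vector<std::size_t> UnexposedOffsets() {
  std::vector<std::size_t> gaps;
  for (std::size_t off = kSlotMap.front().vtable_offset;
       off <= kSlotMap.back().vtable_offset; off += kPointerSize) {
    bool used = false;
    for (const SlotEntry& e : kSlotMap) used = used || (e.vtable_offset == off);
    if (!used) gaps.push_back(off);
  }
  return gaps;
}
```

With the offsets listed on this page the gaps come out as +96 and +128, which are the interleaved base-class virtuals a subclass still has to declare in order for the later slots to land at the right offsets.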


| Name | Relationship |
|---|---|
| xla::TransferManager (TPU subclass) | the C++ class every vtable-backed row dispatches into |
| xla::TransferManager::GetForPlatform @ 0x1342f180 | the per-platform singleton resolver every function calls |
| deepsea::executor::GetRegisteredDeepseaPlatform | resolves the TPU Platform, cached in GetUnderlyingDeepseaPlatform::platform |
| ApiConverter::ToC / FromC | marshals XLA_Shape / XLA_Literal / XLA_ShapedBuffer / SE_DeviceAddressBase across the seam |
| xla::jellyfish::LiteralLinearizer / TransferSizeUtil | the direct callees of LinearizeToBuffers / GetInfeedLayout (no vtable) |
| ExecutorApiFn table | the function-pointer struct whose slots point at these C functions |
| TpuExecutor_* roster | the contrasting shim: passes an SE_StreamExecutor* handle instead of resolving a singleton |

Cross-References

  • The TfTpu C-API Shim — the *ApiFn accessor model, opaque-handle convention, and ApiConverter marshalling this roster relies on
  • TpuExecutor Roster — the contrasting per-device cluster that does thread an SE_StreamExecutor* handle through every call
  • TpuProgram Roster — the sibling serialized-program C-ABI cluster reached through the same ExecutorApiFn table
  • Infeed / Outfeed Queues — the runtime queue mechanism the TransferLiteral{ToInfeed,FromOutfeed} rows are the legacy StreamExecutor entry into
  • PJRT Buffer & Memory — the PJRT buffer/memory ABI the device-side ShapedBuffer belongs to
  • TfTpu_Initialize Bootstrap — the one-time population of the ExecutorApiFn slots that point at these functions