PJRT RawBuffer Extension (type 8)
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 745 MB, ELF x86-64, not stripped). .text is mapped at 0xe63c000; for functions in .text the listed VA equals the file offset. Other wheel versions will differ.
Abstract
The RawBuffer extension (PJRT_Extension_Type id 8, struct_size 0x50 = 80 bytes) is libtpu's untyped byte-level device-memory surface. It exposes seven methods — CreateRawAliasOfBuffer, Destroy, GetOnDeviceSizeInBytes, GetMemorySpace, CopyRawHostToDevice, CopyRawDeviceToHost, GetHostPointer — that move and address raw bytes in HBM (or pinned host memory) by (offset, size) tuples, with no element type, no shape, no tiling, and no de-tiling. It is the deliberate sibling of the typed PJRT_Buffer ABI: where that surface validates element types, marshals dimensions, and de-tiles on readback, the RawBuffer surface is a flat void*-and-length DMA channel into the same underlying device allocation.
The two surfaces wrap different C++ class hierarchies. The typed PJRT_Buffer shim wraps xla::PjRtBuffer → xla::CommonPjRtBufferImpl; the RawBuffer shim wraps xla::PjRtRawBuffer (abstract) → xla::CommonPjRtRawBuffer → xla::CommonPjRtRawBufferImpl (the holder of the shared copy methods) → xla::TpuRawBuffer (concrete TPU, vtable 0x2177cfe0) / xla::CpuRawBuffer (concrete CPU staging, vtable 0x21789af8, identical ordering). The two C wrapper objects also differ in size and ownership: the typed wrapper is 272 bytes (0x110) and exclusively owns its inner buffer, whereas the RawBuffer wrapper is 16 bytes (0x10) holding a tsl::RCReference<xla::PjRtRawBuffer> — a shared, ref-counted co-owner of the device allocation. That single difference is why CreateRawAliasOfBuffer can hand out a zero-copy alias over an already-live typed buffer.
This page owns the extension struct, its seven-method set, the 16-byte wrapper layout, the per-method args offsets and struct_size versioning, the TpuRawBuffer vtable that backs the methods, and the raw host↔device DMA semantics down to the tpu::System::Transfer* entry. The chain node that links this extension into the PJRT_Api is on Extension Chain; the cross-memory-space copy routing and the DMA engine internals are on DMA & Cross-Host Receive; the StreamExecutor allocator bridge that ultimately backs HBM is on Allocator Integration.
For reimplementation, the contract is:
- The extension struct: struct_size 0x50, type 8, seven fn-ptr slots at +0x18..+0x48, populated by a single flat table-initializer creator.
- The 16-byte RawBuffer C wrapper ({ RCReference impl@+0x00; PJRT_Client* client@+0x08 }) and its ref-counted Destroy.
- Each method's struct_size (min, cur) literals, its args offsets, and the single xla::PjRtRawBuffer vtable bounce it performs.
- The raw copy semantics: an (offset, size) slice transfer that returns an 80-byte PJRT_Event, with no shape/tile transformation.
- The two device-pointer/host-pointer accessors and the rule that gates host-addressability (pinned_host only).
| Extension type id | 8 (PJRT_Extension_Type RawBuffer) |
| Extension struct size | 0x50 (80 bytes); 7 fn-ptr slots at +0x18..+0x48 |
| Creator | pjrt::CreateRawBufferExtension(PJRT_Extension_Base* next) @ 0xe6f52c0 |
| .bss storage | 0x224c3990 (raw_buffer_extension); next → profiler 0x22255b98 |
| C wrapper size | 0x10 (16 bytes) — { RCReference<PjRtRawBuffer> impl@+0x00; PJRT_Client* client@+0x08 } |
| Concrete backing (TPU) | xla::TpuRawBuffer (vtable 0x2177cfe0, vptr base 0x2177cff0) |
| Copy-method holder | xla::CommonPjRtRawBufferImpl (vtable +0x28/+0x30 point here) |
| Async gate | every raw copy returns an 80-byte PJRT_Event wrapping a PjRtFuture |
The Extension Struct (type 8, 80 bytes)
Purpose
The extension struct is a flat function-pointer table sharing the common PJRT_Extension_Base header (struct_size, type, _pad, next) with seven raw-buffer method pointers appended. It is .bss-resident at 0x224c3990 and one-shot initialized on the first GetTpuPjrtApi call.
Layout
struct PJRT_RawBuffer_Extension { // struct_size 0x50 (80 bytes)
PJRT_Extension_Base base; // +0x00 struct_size; +0x08 type=8; +0x0c _pad; +0x10 next
/* +0x18 */ PJRT_Error* (*CreateRawAliasOfBuffer)(PJRT_RawBuffer_CreateRawAliasOfBuffer_Args*);
/* +0x20 */ PJRT_Error* (*Destroy) (PJRT_RawBuffer_Destroy_Args*);
/* +0x28 */ PJRT_Error* (*GetOnDeviceSizeInBytes)(PJRT_RawBuffer_GetOnDeviceSizeInBytes_Args*);
/* +0x30 */ PJRT_Error* (*GetMemorySpace) (PJRT_RawBuffer_GetMemorySpace_Args*);
/* +0x38 */ PJRT_Error* (*CopyRawHostToDevice) (PJRT_RawBuffer_CopyRawHostToDevice_Args*);
/* +0x40 */ PJRT_Error* (*CopyRawDeviceToHost) (PJRT_RawBuffer_CopyRawDeviceToHost_Args*);
/* +0x48 */ PJRT_Error* (*GetHostPointer) (PJRT_RawBuffer_GetHostPointer_Args*);
};
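The layout above can be restated in plain C with compile-time offset checks. This is a minimal sketch, not the real PJRT headers: the Sketch* type names and the generic SketchFn pointer type are hypothetical stand-ins, and an LP64 target (such as x86-64 Linux) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the common extension header; LP64 assumed. */
typedef struct SketchExtensionBase {
    size_t   struct_size;              /* +0x00 */
    uint32_t type;                     /* +0x08, 8 for RawBuffer */
    uint32_t pad;                      /* +0x0c */
    struct SketchExtensionBase *next;  /* +0x10 */
} SketchExtensionBase;

/* Placeholder for the PJRT_Error* (*)(Args*) method-pointer shape. */
typedef void *(*SketchFn)(void *args);

typedef struct SketchRawBufferExtension {
    SketchExtensionBase base;
    SketchFn create_raw_alias_of_buffer;   /* +0x18 */
    SketchFn destroy;                      /* +0x20 */
    SketchFn get_on_device_size_in_bytes;  /* +0x28 */
    SketchFn get_memory_space;             /* +0x30 */
    SketchFn copy_raw_host_to_device;      /* +0x38 */
    SketchFn copy_raw_device_to_host;      /* +0x40 */
    SketchFn get_host_pointer;             /* +0x48 */
} SketchRawBufferExtension;

/* The offsets documented on this page, checked at compile time. */
static_assert(offsetof(SketchExtensionBase, type) == 0x08, "type offset");
static_assert(offsetof(SketchExtensionBase, next) == 0x10, "next offset");
static_assert(offsetof(SketchRawBufferExtension, create_raw_alias_of_buffer) == 0x18,
              "first method slot");
static_assert(offsetof(SketchRawBufferExtension, get_host_pointer) == 0x48,
              "last method slot");
static_assert(sizeof(SketchRawBufferExtension) == 0x50, "struct_size is 80 bytes");
```

If any assertion fails on a given toolchain, the target does not match the LP64 assumption and the documented offsets do not carry over unchanged.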
Creator
pjrt::CreateRawBufferExtension @ 0xe6f52c0 is a pure table initializer — no allocation, no branching, a single ret. The decompile is literal:
function CreateRawBufferExtension(slot, next): // 0xe6f52c0
*(u64*)(slot + 0x00) = 80 // struct_size
*(u32*)(slot + 0x08) = 8 // type
*(u64*)(slot + 0x10) = next // chain link (arg)
*(u64*)(slot + 0x18) = &PJRT_RawBuffer_CreateRawAliasOfBuffer
*(u64*)(slot + 0x20) = &PJRT_RawBuffer_Destroy
*(u64*)(slot + 0x28) = &PJRT_RawBuffer_GetOnDeviceSizeInBytes
*(u64*)(slot + 0x30) = &PJRT_RawBuffer_GetMemorySpace
*(u64*)(slot + 0x38) = &PJRT_RawBuffer_CopyRawHostToDevice
*(u64*)(slot + 0x40) = &PJRT_RawBuffer_CopyRawDeviceToHost
*(u64*)(slot + 0x48) = &PJRT_RawBuffer_GetHostPointer
return slot
Because the table is fully static, the struct can live in zero-initialized .bss and only needs a one-shot __cxa_guard-protected creator call. The next argument is the previously-built node; RawBuffer is the first .bss node constructed inside GetTpuPjrtApi, so its next is set to the .data-resident Profiler extension at 0x22255b98 — the chain terminator. See Extension Chain for the full 17-node walk and why RawBuffer ends up at walk position 16 despite being built first.
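The header stores and the consumer-side chain walk can be sketched as follows. The names SketchNode, sketch_init_header and sketch_find_extension are hypothetical, only the three header fields are modeled, and the type value used for the terminator node in the usage note is a placeholder rather than a confirmed enum value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the common extension header (LP64 assumed). */
typedef struct SketchNode {
    size_t   struct_size;
    uint32_t type;
    uint32_t pad;
    struct SketchNode *next;
} SketchNode;

/* Mirrors the creator's header stores: size, type, chain link. */
static SketchNode *sketch_init_header(SketchNode *slot, size_t struct_size,
                                      uint32_t type, SketchNode *next) {
    assert(slot != NULL);
    slot->struct_size = struct_size;
    slot->type = type;
    slot->pad = 0;
    slot->next = next;
    return slot;
}

/* Consumer side: follow next pointers until a node of the wanted type appears. */
static const SketchNode *sketch_find_extension(const SketchNode *head,
                                               uint32_t type) {
    for (const SketchNode *n = head; n != NULL; n = n->next) {
        if (n->type == type) {
            return n;
        }
    }
    return NULL; /* reached the terminator without a match */
}
```

A caller would build the terminator first, pass it as next to sketch_init_header(&raw, 80, 8, &terminator), and later locate the RawBuffer node with sketch_find_extension(head, 8).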
The 16-byte C wrapper
struct PJRT_RawBuffer { // sizeof = 0x10 (16 bytes)
/* +0x00 */ tsl::RCReference<xla::PjRtRawBuffer> impl; // SHARED, ref-counted co-owner
/* +0x08 */ PJRT_Client* client; // borrowed (not owned)
};
Both fields are byte-confirmed: Destroy (0xe6f4e40) reads the inner RCReference at wrapper+0x00 and frees the wrapper after the refcount path; CreateRawAliasOfBuffer and GetMemorySpace read the borrowed client at wrapper+0x08. The impl is a tsl::RCReference — the wrapper participates in shared ownership of the underlying xla::PjRtRawBuffer, in contrast to the typed PJRT_Buffer wrapper which owns its inner PjRtBuffer* outright. This is the structural enabler for raw aliasing: two RawBuffer wrappers can co-own the same device allocation by sharing the RCReference.
NOTE: do not conflate the two buffer wrappers. The typed surface (buffer-and-memory.md) uses a 272-byte exclusively-owned wrapper; this surface uses a 16-byte ref-counted wrapper over a different C++ base (xla::PjRtRawBuffer, not xla::PjRtBuffer). A reimplementer who reuses one wrapper layout for both will mis-size the free and mis-model ownership.
Args convention
Every method's args struct follows the canonical { size_t struct_size; void* priv; <handle>; ... } shape. The RawBuffer wrapper handle is at args+0x10 (priv occupies +0x08); each scalar/pointer output is written at args+0x18 and beyond. The first action of every wrapper is pjrt::ActualStructSizeIsGreaterOrEqual(name, min_fields, cur_bytes, args->struct_size) @ 0xf8a4ec0; on mismatch the wrapper operator new(8)-allocates a PJRT_Error carrying the size status and returns it without touching the buffer.
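The args shape and the size gate can be sketched as below. This assumes the gate is an at-least comparison of the caller's struct_size against the expected byte size; the Sketch* struct names and sketch_struct_size_ok are hypothetical, and the real wrapper allocates a PJRT_Error on mismatch rather than returning a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle-only args (the Destroy shape): 24 bytes on LP64. */
typedef struct SketchDestroyArgs {
    size_t struct_size;  /* +0x00 */
    void  *priv;         /* +0x08 */
    void  *buffer;       /* +0x10 */
} SketchDestroyArgs;

/* Hypothetical single-output args (the GetOnDeviceSizeInBytes shape): 32 bytes. */
typedef struct SketchGetSizeArgs {
    size_t  struct_size;     /* +0x00 */
    void   *priv;            /* +0x08 */
    void   *buffer;          /* +0x10 */
    int64_t on_device_size;  /* +0x18, OUT */
} SketchGetSizeArgs;

static_assert(sizeof(SketchDestroyArgs) == 24, "handle-only args are 24 bytes");
static_assert(sizeof(SketchGetSizeArgs) == 32, "single-output args are 32 bytes");
static_assert(offsetof(SketchGetSizeArgs, on_device_size) == 0x18,
              "first output sits at +0x18");

/* Assumed rule: accept when the caller's struct covers the expected bytes. */
static int sketch_struct_size_ok(size_t expected_bytes, size_t actual_bytes) {
    return actual_bytes >= expected_bytes;
}
```

Under this rule a caller compiled against a newer, larger args struct still passes, while a truncated struct is rejected before the handle is read.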
Method Set
Slot Map
All seven methods are byte-confirmed against the decompile: the args-name string, the (min, cur) struct_size literals, the args-output offset, and the single xla::PjRtRawBuffer vtable offset each bounces through. The "vtable bounce" column is the offset into the inner object's vtable (object vptr base 0x2177cff0 for TpuRawBuffer).
| Off | Method | C symbol | Addr | min/cur | vtable bounce / backing |
|---|---|---|---|---|---|
| 0x18 | CreateRawAliasOfBuffer | pjrt::PJRT_RawBuffer_CreateRawAliasOfBuffer | 0xe6f4d40 | 42 / 32 | static xla::PjRtRawBuffer::CreateRawAliasOfBuffer @ 0xf93f540 |
| 0x20 | Destroy | pjrt::PJRT_RawBuffer_Destroy | 0xe6f4e40 | 27 / 24 | RCReference dec-ref + vtable+0x08 (deleting dtor) + free(0x10) |
| 0x28 | GetOnDeviceSizeInBytes | pjrt::PJRT_RawBuffer_GetOnDeviceSizeInBytes | 0xe6f4f20 | 42 / 32 | vtable+0x20 GetOnDeviceSizeInBytes() |
| 0x30 | GetMemorySpace | pjrt::PJRT_RawBuffer_GetMemorySpace | 0xe6f4f80 | 34 / 32 | vtable+0x10 memory_space() + PJRT_Client_FindMemoryWrapper @ 0xf8605e0 |
| 0x38 | CopyRawHostToDevice | pjrt::PJRT_RawBuffer_CopyRawHostToDevice | 0xe6f5040 | 39 / 56 | vtable+0x28 CommonPjRtRawBufferImpl::CopyRawHostToDevice @ 0xf91c640 |
| 0x40 | CopyRawDeviceToHost | pjrt::PJRT_RawBuffer_CopyRawDeviceToHost | 0xe6f5180 | 39 / 56 | vtable+0x30 CommonPjRtRawBufferImpl::CopyRawDeviceToHost @ 0xf91c780 |
| 0x48 | GetHostPointer | pjrt::PJRT_RawBuffer_GetHostPointer | 0xe6f4ec0 | 34 / 32 | vtable+0x18 TpuRawBuffer::GetHostPointer (pinned_host only) |
The args-name strings are present verbatim in .rodata ("PJRT_RawBuffer_Destroy_Args", "PJRT_RawBuffer_CopyRawHostToDevice_Args", etc.), confirming the public API names.
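QUIRK: the (min, cur) pair for the two copy methods is (39, 56); the other five methods pair a first literal of 27 to 42 with a cur of 24 or 32 bytes. cur tracks the args layout: 56 bytes covers the copy args through the out-event at +0x30, while 24 and 32 cover the handle-only and single-output args. The first literal does not behave like a minimum size, since 39 is smaller than the 42 used by two of the 32-byte methods. In every row it equals the character length of the args-name string (27 for "PJRT_RawBuffer_Destroy_Args", 39 for "PJRT_RawBuffer_CopyRawHostToDevice_Args", 42 for "PJRT_RawBuffer_CreateRawAliasOfBuffer_Args"), which suggests it is the length half of a string_view name argument rather than a field count. On that reading, the validator accepts a caller whose struct_size is at least cur and rejects anything smaller.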
CreateRawAliasOfBuffer (slot 0x18)
Purpose
Create a raw, untyped alias over an existing typed PJRT_Buffer. The alias is a new 16-byte RawBuffer wrapper that shares — via RCReference — the donor buffer's underlying device allocation. No bytes are copied; the alias is a second co-owner of the same HBM. This is how a caller obtains a byte-level handle to a buffer that was created through the typed BufferFromHostBuffer path.
Algorithm
function PJRT_RawBuffer_CreateRawAliasOfBuffer(args): // 0xe6f4d40
if !ActualStructSizeIsGreaterOrEqual("PJRT_RawBuffer_CreateRawAliasOfBuffer_Args", 42, 32, args.struct_size):
return new PJRT_Error{ size_status } // operator new(8)
// args.buffer @ +0x10 is a typed PJRT_Buffer wrapper; **(a1+0x10) = inner xla::PjRtBuffer*
inner_typed = *(*(args + 0x10)) // typed wrapper -> PjRtBuffer*
sor = xla::PjRtRawBuffer::CreateRawAliasOfBuffer(inner_typed) // 0xf93f540 -> StatusOr<RCReference>
if sor.is_error: // discriminant low bit set
err = new PJRT_Error(8); *err = sor.status // and Unref the StatusRep
return err
// success: build the 16-byte raw wrapper
raw_wrapper = operator new(0x10)
raw_wrapper[+0x00] = sor.value (the RCReference, move-out)
raw_wrapper[+0x08] = *(args[+0x10] + 0x08) // copy client from DONOR wrapper+0x08
args[+0x18] = raw_wrapper // OUT
return NULL
The factory xla::PjRtRawBuffer::CreateRawAliasOfBuffer(PjRtBuffer*) @ 0xf93f540 walks the global per-platform factory registry xla::GetFactoryFuncs()::funcs @ 0x224c70c8 (guard 0x224c70d0) — each entry a bool(*)(StatusOr<RCReference>&, PjRtBuffer*) — and returns the first registered factory that accepts the buffer's platform. If none accepts, it builds a StrCat error. The client pointer for the new alias is copied from the donor's wrapper +0x08, so the alias shares the donor's PJRT_Client.
QUIRK: the alias shares the donor's device allocation through the RCReference; it is not a copy and does not pin a second HBM region. Destroying the alias decrements the shared refcount; the device memory is freed only when the last co-owner drops. A reimplementer must route the alias's Destroy through the same ref-counted decrement, not a flat free of the device buffer.
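The first-match registry walk described for the static factory can be sketched as below. Everything here is a hypothetical stand-in: the real entries take a StatusOr<RCReference> out-parameter and a PjRtBuffer*, whereas this sketch tags buffers with a platform string and uses a plain bool for success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: a typed buffer tagged by platform, and a raw alias. */
typedef struct SketchTypedBuffer { const char *platform; } SketchTypedBuffer;
typedef struct SketchAlias {
    const SketchTypedBuffer *donor;  /* shared allocation, not a copy */
    const char *made_by;             /* which factory accepted */
} SketchAlias;

/* A factory fills *out and returns true only when it accepts the platform. */
typedef bool (*SketchFactory)(SketchAlias *out, const SketchTypedBuffer *buf);

static bool sketch_cpu_factory(SketchAlias *out, const SketchTypedBuffer *buf) {
    if (strcmp(buf->platform, "cpu") != 0) return false;
    out->donor = buf;
    out->made_by = "cpu";
    return true;
}

static bool sketch_tpu_factory(SketchAlias *out, const SketchTypedBuffer *buf) {
    if (strcmp(buf->platform, "tpu") != 0) return false;
    out->donor = buf;
    out->made_by = "tpu";
    return true;
}

/* First accepting factory wins; false models the "no factory accepted" error. */
static bool sketch_create_alias(const SketchFactory *funcs, size_t count,
                                SketchAlias *out, const SketchTypedBuffer *buf) {
    assert(funcs != NULL && out != NULL && buf != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (funcs[i](out, buf)) return true;
    }
    return false;
}
```

Registration order matters in this scheme only when two factories accept the same platform; the sketch keeps the factories disjoint so the result does not depend on order.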
Function Map
| Function | Addr | Role |
|---|---|---|
| pjrt::PJRT_RawBuffer_CreateRawAliasOfBuffer | 0xe6f4d40 | C wrapper; build raw alias wrapper |
| xla::PjRtRawBuffer::CreateRawAliasOfBuffer | 0xf93f540 | static factory; walks platform registry |
| xla::GetFactoryFuncs()::funcs | 0x224c70c8 | registry of per-platform raw-alias factories |
Destroy (slot 0x20)
Purpose
Release the caller's reference to a RawBuffer. Because the wrapper holds a ref-counted RCReference, Destroy is not a flat free — it decrements the shared refcount and only runs the inner deleting destructor when the count reaches zero. Then it frees the 16-byte wrapper unconditionally.
Algorithm
function PJRT_RawBuffer_Destroy(args): // 0xe6f4e40
if !ActualStructSizeIsGreaterOrEqual("PJRT_RawBuffer_Destroy_Args", 27, 24, args.struct_size):
return new PJRT_Error{ size_status }
wrapper = args[+0x10] // a1[2]
if wrapper != NULL:
inner = wrapper[+0x00] // the RCReference's pointer
if inner != NULL:
// canonical TSL refcount: count at inner+0x08 (4-byte)
if inner.refcount == 1 || atomic_dec(inner+0x08) == 0:
inner.vtable[+0x08](inner) // deleting dtor (~PjRtRawBuffer/D0)
free(wrapper) // 16-byte wrapper, unconditional
return NULL
The fast-path check refcount == 1 skips the atomic when the caller holds the sole reference; otherwise a lock-prefixed decrement determines whether the device buffer is freed. The deleting destructor lives at the inner object's vtable+0x08 (xla::TpuRawBuffer::~TpuRawBuffer [D0] @ 0xf83c5e0).
GOTCHA: the wrapper is freed every time Destroy is called, even when the shared refcount has not reached zero (another alias still co-owns the device memory). The wrapper free and the device-memory free are decoupled: the 16-byte C handle goes away immediately; the HBM survives until the last RCReference drops. Treat Destroy as "drop my handle," not "free the buffer."
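The decoupling of wrapper free and inner destruction can be modeled with C11 atomics. This is a sketch under stated assumptions: the Sketch* names are hypothetical, the deleting destructor is a counting stub so the effect is observable, and the refcount == 1 fast path from the decompile is omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical inner object: one function pointer stands in for vtable+0x08. */
typedef struct SketchInner {
    void (*deleting_dtor)(struct SketchInner *self);
    atomic_int refcount;
} SketchInner;

/* Hypothetical 16-byte wrapper: shared impl plus borrowed client. */
typedef struct SketchRawWrapper {
    SketchInner *impl;
    void *client;
} SketchRawWrapper;

static int sketch_dtor_calls = 0;
static void sketch_counting_dtor(SketchInner *self) { (void)self; ++sketch_dtor_calls; }

static void sketch_inner_init(SketchInner *inner) {
    inner->deleting_dtor = sketch_counting_dtor;
    atomic_init(&inner->refcount, 1);  /* the first wrapper takes this reference */
}

static SketchRawWrapper *sketch_wrap(SketchInner *inner, void *client) {
    SketchRawWrapper *w = malloc(sizeof *w);
    if (w != NULL) { w->impl = inner; w->client = client; }
    return w;
}

/* Alias: new wrapper, same inner, one more reference, client copied from donor. */
static SketchRawWrapper *sketch_alias(const SketchRawWrapper *donor) {
    assert(donor != NULL && donor->impl != NULL);
    SketchRawWrapper *w = sketch_wrap(donor->impl, donor->client);
    if (w != NULL) atomic_fetch_add(&donor->impl->refcount, 1);
    return w;
}

/* Destroy: drop one reference, run the dtor on the last drop, always free the wrapper. */
static void sketch_destroy(SketchRawWrapper *w) {
    if (w == NULL) return;
    SketchInner *inner = w->impl;
    if (inner != NULL && atomic_fetch_sub(&inner->refcount, 1) == 1) {
        inner->deleting_dtor(inner);
    }
    free(w);
}
```

Destroying the first of two wrappers frees only that wrapper; the counting stub runs once, when the second wrapper is destroyed.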
GetOnDeviceSizeInBytes (slot 0x28) and GetHostPointer (slot 0x48)
These two are minimal single-bounce accessors with identical structure: validate struct_size, triple-dereference the wrapper to reach the inner vtable, call one virtual method, write the result to args+0x18. They never allocate on the success path.
function PJRT_RawBuffer_GetOnDeviceSizeInBytes(args): // 0xe6f4f20
// min 42, cur 32
inner = *(args[+0x10]) // **(a1+0x10)
args[+0x18] = inner.vtable[+0x20].GetOnDeviceSizeInBytes() // int64
function PJRT_RawBuffer_GetHostPointer(args): // 0xe6f4ec0
// min 34, cur 32
inner = *(args[+0x10])
args[+0x18] = inner.vtable[+0x18].GetHostPointer() // void* or NULL
GetOnDeviceSizeInBytes returns the concrete HBM byte count (xla::TpuRawBuffer::GetOnDeviceSizeInBytes @ 0xf838880, which reads the resolved TpuBufferBase+0x50). This includes any device-side tile padding — it is the physical allocation size, not product(dims) * elem_size.
GetHostPointer (xla::TpuRawBuffer::GetHostPointer @ 0xf837b60) returns a host-dereferenceable pointer only for pinned_host buffers; for tpu_hbm it returns NULL. The implementation gates on the buffer's memory-space kind string (an 11-byte overlapping compare against the "pinned_host" magic) and, on a pinned buffer, follows the AsyncValue indirect-node chain to return the resolved TpuBufferBase+0x48.
QUIRK: GetHostPointer returning NULL is not an error. It is the correct answer for any HBM-resident buffer, because HBM is not host-addressable. A caller must check for NULL and fall back to CopyRawDeviceToHost; treating NULL as a failure code will break on every device buffer. This mirrors the typed surface's OpaqueDeviceMemoryDataPointer (slot 81), which returns a raw HBM virtual address valid only on the owning device/core; see the QUIRK on buffer-and-memory.md.
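The NULL-then-fallback pattern on the caller side might look like the sketch below. SketchFakeBuffer and the three sketch_* functions are hypothetical: the fake buffer keeps its bytes in host memory so the example can run anywhere, and the fallback copy is synchronous here, whereas the real CopyRawDeviceToHost returns an event that must be awaited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fake buffer: host bytes plus a flag for the pinned_host case. */
typedef struct SketchFakeBuffer {
    unsigned char bytes[16];
    int pinned_host;   /* nonzero models a pinned_host buffer */
    int copy_calls;    /* counts fallback copies, for observation only */
} SketchFakeBuffer;

/* Stand-in for GetHostPointer: NULL for anything that is not pinned_host. */
static void *sketch_get_host_pointer(SketchFakeBuffer *buf) {
    return buf->pinned_host ? buf->bytes : NULL;
}

/* Stand-in for CopyRawDeviceToHost; synchronous in this sketch. */
static int sketch_copy_device_to_host(SketchFakeBuffer *buf, void *dst,
                                      int64_t offset, int64_t size) {
    buf->copy_calls++;
    memcpy(dst, buf->bytes + offset, (size_t)size);
    return 0;
}

/* Read a slice: direct load when host-addressable, copy otherwise. */
static int sketch_read_bytes(SketchFakeBuffer *buf, void *dst,
                             int64_t offset, int64_t size) {
    assert(offset >= 0 && size >= 0 &&
           offset + size <= (int64_t)sizeof buf->bytes);
    const unsigned char *host = sketch_get_host_pointer(buf);
    if (host != NULL) {
        memcpy(dst, host + offset, (size_t)size);
        return 0;
    }
    return sketch_copy_device_to_host(buf, dst, offset, size);  /* NULL is not an error */
}
```

The design point is that NULL selects a code path rather than signalling failure; only the copy itself can fail.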
GetMemorySpace (slot 0x30)
Purpose
Return the C PJRT_Memory* wrapper for the memory space the raw buffer lives in. The inner buffer's memory_space() yields a C++ xla::PjRtMemorySpace*; the wrapper round-trips it through the client's memory-wrapper cache to obtain the C handle.
Algorithm
function PJRT_RawBuffer_GetMemorySpace(args): // 0xe6f4f80
if !ActualStructSizeIsGreaterOrEqual("PJRT_RawBuffer_GetMemorySpace_Args", 34, 32, args.struct_size):
return new PJRT_Error{ size_status }
inner = *(args[+0x10]) // **(a1+0x10)
mem_cpp = inner.vtable[+0x10].memory_space() // xla::PjRtMemorySpace*
mem_c = PJRT_Client_FindMemoryWrapper(mem_cpp, args[+0x10].client) // 0xf8605e0; client @ wrapper+0x08
args[+0x18] = mem_c
if mem_c == NULL:
return new PJRT_Error{ MakeErrorImpl<12>("Could find memory_space() for RawBuffer") }
return NULL
The error path is byte-confirmed down to the literal: absl::status_internal::MakeErrorImpl<12> (code 12 = Internal) with message "Could find memory_space() for RawBuffer" sourced from pjrt_c_api_raw_buffer_internal.cc. The client used by the finder is the borrowed client at wrapper+0x08 — the same field copied into a raw alias by CreateRawAliasOfBuffer. The five memory-space classes and their kind strings (tpu_hbm, pinned_host, unpinned_host, device, cross-pod megascale) are documented on buffer-and-memory.md; this method just resolves the wrapper, it does not classify.
CopyRawHostToDevice (slot 0x38) and CopyRawDeviceToHost (slot 0x40)
Purpose
The byte-granular transfer surface. Each copies size bytes between a host void* and an (offset, size) slice of the device buffer, returning an 80-byte PJRT_Event that fires when the DMA completes. There is no element type, no shape, and no tiling transformation — the bytes are moved verbatim. This is the defining contrast with the typed ToHostBuffer (slot 75), which de-tiles via ShapeUtil::DeviceShapeToHostShape before transfer.
Algorithm
The two directions are byte-for-byte mirror images; only the vtable offset (+0x28 vs +0x30) and the host-pointer semantics (source vs destination) differ.
function PJRT_RawBuffer_CopyRawHostToDevice(args): // 0xe6f5040 (mirror: 0xe6f5180, vtable+0x30)
if !ActualStructSizeIsGreaterOrEqual("PJRT_RawBuffer_CopyRawHostToDevice_Args", 39, 56, args.struct_size):
return new PJRT_Error{ size_status }
inner = *(args[+0x10]) // **(a1+0x10)
host = args[+0x18] // host_src (or host_dst for D2H)
offset = args[+0x20] // device byte offset
size = args[+0x28] // byte count
// single bounce into the shared copy-method holder:
future = inner.vtable[+0x28].CopyRawHostToDevice(host, offset, size) // CommonPjRtRawBufferImpl @ 0xf91c640
// returns PjRtFuture on stack (-0x58)
// wrap the future as an 80-byte PJRT_Event:
event = operator new(0x50)
event[+0x00] = future.async_value (move) // tsl::AsyncValue*
event[+0x08..0x18] = profiling closure 0 // AnyInvocable (NULL-policy patched if empty)
event[+0x28..0x38] = profiling closure 1 // AnyInvocable
event[+0x48] = 0
args[+0x30] = event // OUT: PJRT_Event*
// tear down the temp future: run any non-empty closure dtors,
// then AsyncValue::Destroy on refcount->0
return NULL
The args layout for both:
| Offset | Field | Direction |
|---|---|---|
| +0x00 | struct_size | validated first |
| +0x10 | buffer | PJRT_RawBuffer* wrapper |
| +0x18 | host | source (H2D) / destination (D2H) host pointer |
| +0x20 | offset | device byte offset of the slice |
| +0x28 | size | byte count to transfer |
| +0x30 | event | OUT: PJRT_Event* (operator new(0x50)) |
The vtable+0x28/+0x30 slots both point into xla::CommonPjRtRawBufferImpl (the shared copy-method holder), not into TpuRawBuffer directly — the concrete TPU/CPU classes inherit these copy methods. The middle layer in turn calls the AndReturnEvent variants (TpuRawBuffer::CopyRawHostToDeviceAndReturnEvent @ 0xf8388c0 at vtable+0x40, CopyRawDeviceToHostAndReturnEvent @ 0xf838ea0 at vtable+0x48), which bounds-check the slice (ValidateSlice @ 0xf837be0), take a sub-view (SliceBuffer @ 0xf837d80), wrap the DMA in an RAII tpu::WithTransferRequirements, and drive the hardware byte-mover through tpu::System::TransferToDevice @ 0x1d0afa20 / TransferFromDevice @ 0x1d0b0160. The full three-layer DMA pipeline and the tpu::TpuPxcDriver byte-mover are documented on DMA & Cross-Host Receive; this page stops at the C-ABI wrapper and the first vtable bounce.
GOTCHA: the (offset, size) slice is bounds-checked inside the AndReturnEvent layer (ValidateSlice), not in the C wrapper. The C wrapper performs no validation of offset/size against the buffer's on-device size; it forwards them straight to the inner method. A reimplementer must not assume the wrapper guards against an out-of-range slice; the guard is one layer down, and the failure surfaces as an error inside the returned event/future, not as a synchronous PJRT_Error from the wrapper.
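The copy args layout and a slice bounds check can be sketched as follows. The struct and function names are hypothetical, and the check shown is only one plausible, overflow-safe formulation; the exact conditions inside ValidateSlice were not traced on this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the copy args layout (LP64 assumed). */
typedef struct SketchCopyArgs {
    size_t  struct_size;  /* +0x00 */
    void   *priv;         /* +0x08 */
    void   *buffer;       /* +0x10 */
    void   *host;         /* +0x18 */
    int64_t offset;       /* +0x20 */
    int64_t size;         /* +0x28 */
    void   *event;        /* +0x30, OUT */
} SketchCopyArgs;

static_assert(offsetof(SketchCopyArgs, host) == 0x18, "host pointer offset");
static_assert(offsetof(SketchCopyArgs, event) == 0x30, "out-event offset");
static_assert(sizeof(SketchCopyArgs) == 56, "matches the 56-byte cur literal");

/* Assumed rule: [offset, offset + size) must lie inside [0, on_device_size). */
static bool sketch_slice_in_bounds(int64_t offset, int64_t size,
                                   int64_t on_device_size) {
    if (offset < 0 || size < 0 || on_device_size < 0) return false;
    if (offset > on_device_size) return false;
    /* Subtract instead of adding so offset + size cannot overflow. */
    return size <= on_device_size - offset;
}
```

A reimplementation that wants synchronous failures could run a check like this in the C wrapper, at the cost of diverging from the observed behavior where the error arrives through the event.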
Function Map
| Function | Addr | Role |
|---|---|---|
| pjrt::PJRT_RawBuffer_CopyRawHostToDevice | 0xe6f5040 | C wrapper H2D; wrap future as event |
| pjrt::PJRT_RawBuffer_CopyRawDeviceToHost | 0xe6f5180 | C wrapper D2H (mirror) |
| xla::CommonPjRtRawBufferImpl::CopyRawHostToDevice | 0xf91c640 | vtable+0x28; PjRtFuture-returning |
| xla::CommonPjRtRawBufferImpl::CopyRawDeviceToHost | 0xf91c780 | vtable+0x30 |
| xla::TpuRawBuffer::CopyRawHostToDeviceAndReturnEvent | 0xf8388c0 | vtable+0x40; bounds-check + DMA |
| xla::TpuRawBuffer::CopyRawDeviceToHostAndReturnEvent | 0xf838ea0 | vtable+0x48 |
| tpu::System::TransferToDevice | 0x1d0afa20 | hardware DMA (host→HBM) |
| tpu::System::TransferFromDevice | 0x1d0b0160 | hardware DMA (HBM→host) |
TpuRawBuffer Backing Vtable
The seven C wrappers bounce into the concrete xla::TpuRawBuffer vtable @ 0x2177cfe0 (object vptr base 0x2177cff0), reconstructed from R_X86_64_RELATIVE relocations. xla::CpuRawBuffer (vtable 0x21789af8) has identical ordering — the abstract xla::PjRtRawBuffer base fixes the slot layout. Only the slots the C wrappers actually use are reproduced; the full vtable continues with copy/slice/async helpers used by the cross-memory-space path.
| Vtbl off | Target | Method | Used by |
|---|---|---|---|
| +0x00 | 0xf83c5a0 | ~TpuRawBuffer() [complete, D2] | — |
| +0x08 | 0xf83c5e0 | ~TpuRawBuffer() [deleting, D0] | Destroy |
| +0x10 | 0xf83c640 | memory_space() → this+0x10 | GetMemorySpace |
| +0x18 | 0xf837b60 | GetHostPointer() → pinned_host only, else 0 | GetHostPointer |
| +0x20 | 0xf838880 | GetOnDeviceSizeInBytes() → TpuBufferBase+0x50 | GetOnDeviceSizeInBytes |
| +0x28 | 0xf91c640 | CommonPjRtRawBufferImpl::CopyRawHostToDevice(const void*, l, l) | CopyRawHostToDevice |
| +0x30 | 0xf91c780 | CommonPjRtRawBufferImpl::CopyRawDeviceToHost(void*, l, l) | CopyRawDeviceToHost |
| +0x38 | 0xf83c660 | OpaqueDeviceMemoryDataPointer() → TpuBufferBase+0x48 | (typed slot 81) |
| +0x40 | 0xf8388c0 | CopyRawHostToDeviceAndReturnEvent(const void*, l, l) | copy middle layer |
| +0x48 | 0xf838ea0 | CopyRawDeviceToHostAndReturnEvent(void*, l, l) | copy middle layer |
The relevant TpuRawBuffer / TpuBufferBase field offsets (decoded from the accessor bodies):
TpuRawBuffer+0x10 = PjRtMemorySpace* (returned by memory_space())
TpuRawBuffer+0x18 = tsl::AsyncValueRef<tpu::TpuBufferBase> (the device handle; accessors
BlockUntilReady then follow the indirect-node chain
while (state & 3) node = node->[+0x10])
resolved TpuBufferBase+0x48 = raw HBM device pointer
resolved TpuBufferBase+0x50 = on-device size in bytes
NOTE: memory_space() (vtable+0x10) returns this+0x10 directly; the memory space pointer is a cached field on the TpuRawBuffer, not a computed value. The device pointer and size, by contrast, require resolving the AsyncValueRef<TpuBufferBase> at +0x18 (blocking until ready and walking the indirect chain), which is why GetHostPointer / OpaqueDeviceMemoryDataPointer can block while GetMemorySpace cannot.
Ownership and Async Model
Shared ref-counted ownership
The RawBuffer wrapper holds a tsl::RCReference<xla::PjRtRawBuffer> — the wrapper is one co-owner among possibly several. Destroy decrements the shared refcount and only deletes the device buffer on the last drop; CreateRawAliasOfBuffer adds a co-owner that shares the same allocation. This is the opposite of the typed PJRT_Buffer wrapper, which owns its inner PjRtBuffer* exclusively and destroys it outright on Destroy.
The practical consequence: a raw alias and its donor share HBM. Freeing the donor's device memory (via the typed surface's Delete) while a raw alias still reads it is a use-after-free at the device level that the C-ABI layer does not prevent — ownership coordination is the caller's responsibility, mediated only by the shared refcount on the RCReference (which guards the C++ object's lifetime, not the eager device-memory Delete).
PJRT_Event-gated readiness
Both raw copies return an 80-byte PJRT_Event (operator new(0x50)) wrapping a PjRtFuture: an AsyncValue* plus two profiling AnyInvocable closures and a zeroed tail. The exact field split matches the canonical event layout on events-and-async.md. Readiness is therefore polled (PJRT_Event_IsReady, slot 11), awaited (PJRT_Event_Await, slot 13), or callback-registered (PJRT_Event_OnReady, slot 14) — the same machinery the typed buffer's ReadyEvent (slot 77) uses. The device-side completion token underneath is a tpu::TpuEvent (tpu::ReadyTpuEvent @ 0x1d0b62e0); the middle layer converts it to a client-tracked PjRtFuture via CommonPjRtClient::MakeTrackedReadyFuture @ 0xf91c2e0.
GOTCHA: the returned event signals transfer completion, not buffer readiness. After CopyRawHostToDevice returns NULL (success), the bytes are not yet in HBM; the caller must await the event before assuming the device buffer reflects the host data, and before reusing or freeing the host source. A reimplementer who treats the synchronous NULL return as "transfer done" will race the DMA.
Considerations
- No type/shape safety. The raw surface trusts the caller's (offset, size). It is intended for foreign consumers that already know the byte layout (dlpack import/export, NumPy zero-copy, custom device kernels), not for general typed I/O. Use the typed PJRT_Buffer surface when element type and de-tiling matter.
- CpuRawBuffer parity. The CPU staging backing class shares the vtable ordering, so the same seven methods work for CPU-staging raw buffers; only the concrete DMA path differs (CpuRawBuffer::CopyRaw*AndReturnEvent vs the TPU tpu::System::Transfer*).
- Aliasing vs donation. CreateRawAliasOfBuffer is a zero-copy co-ownership alias (shared RCReference), distinct from the typed surface's DonateWithControlDependency (slot 130), which invalidates the donor and produces a new buffer whose HBM aliases the donor's. The raw alias leaves the donor fully usable.
- Not byte-traced (LOW). The exact AsyncValue/closure field split inside the 80-byte event is taken from the sibling event page rather than re-derived here; the C wrapper's operator new(0x50) and the move-in sequence (async value + two AnyInvocable closures + zeroed +0x48) are byte-confirmed, but the precise semantic role of each closure offset is owned by events-and-async.md. The AndReturnEvent / tpu::System::Transfer* chain addresses are confirmed by symbol but the per-instruction DMA body was not re-traced on this page (HIGH, deferred to dma-and-cross-host-recv.md).
Cross-References
- PJRT Buffer ABI & Memory Layouts: the typed/shaped PJRT_Buffer sibling (272-byte exclusive wrapper, de-tiling readback), the contrast this page is defined against
- Extension Chain: the 17-node PJRT_Api extension chain; how the RawBuffer node links to the Profiler terminator
- PJRT API Overview: the 140-slot PJRT_Api and how extensions hang off extension_start
- API Vtable Reconstruction: how struct_size versioning and vtable offsets are recovered from the binary
- DMA & Cross-Host Receive: the three-layer raw DMA pipeline beneath CopyRaw* and the tpu::TpuPxcDriver byte-mover
- Events & Async: the 80-byte PJRT_Event / PjRtFuture layout every raw copy returns
- Remaining Extensions: the other 12 chain extensions and the construction-order rationale
- Allocator Integration: the StreamExecutor allocator bridge backing the HBM the raw buffer addresses