TpuExecutable Roster
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, ELF x86-64 DYN, not stripped). .text VMA equals file offset. The C-ABI names (TpuExecutable_*) are IDA-recovered from call targets and the tpu_executor_c_api.cc / tpu_execute_c_api.cc source-path strings baked into the CHECK / MakeErrorImpl call sites; demangled C++ targets are quoted verbatim. Other versions will differ.
Abstract
TpuExecutable_* is the C-ABI roster for the running form of a compiled TPU program — the handle a StreamExecutor host holds after a compilation has been deserialized onto a device, and through which it enqueues launches, queries the program's HLO, fingerprints it for caching, and re-serializes it. Where TpuProgram_* owns the serializable, queryable compiled artifact and TpuCompiler_* produces it, TpuExecutable_* is the thin extern "C" skin over the live xla::Executable subtree: the host never sees an xla::TpuExecutable by value, it sees an opaque handle and a fixed set of free functions that un-marshal C structs and bounce into the C++ executable.
The roster is exactly 9 free functions, the smallest of the named device-runtime clusters. They split cleanly into four areas: execution (LoadProgramAndEnqueueToStream, ExecuteAsyncOnStream — the two large marshalling functions that drive the device), serialization (Serialize, Deserialize — proto round-trip through xla::DeepseaExecutableProto), metadata (Fingerprint, HloModule — read-only accessors), and lifecycle / array frees (Free, FreeXlaShapeIndexArray, FreeMaybeOwningDeviceAddressArray). Every entry is a forwarder: the C-ABI function holds an opaque XLA_TpuExecutable*-style handle whose first slot points at a C++ object, and dispatches into a method of xla::TpuExecutable / xla::legacy::TpuExecutableInterface, whose concrete leaf is xla::jellyfish::DeepseaExecutable.
The C++ side is a three-layer class tower. xla::Executable is the upstream base; xla::legacy::TpuExecutableInterface (ExecuteAsyncOnStream @ 0x1342cd20) overrides the argument/output marshaling once; the concrete xla::jellyfish::DeepseaExecutable implements the device enqueue (LoadProgramAndEnqueueToStream @ 0x13426260, ToProto @ 0x134282e0, FromProto @ 0x134283e0, fingerprint @ 0x13428a80). The wrapper class xla::TpuExecutable (methods clustered at 0xf8a7560–0xf8adce0) is the legacy StreamExecutor face that the C-ABI roster mirrors slot-for-slot. This page owns the per-function roster and the impl-symbol map; it does not re-derive the runtime execution algorithm.
For reimplementation, the contract is:
- The 9-function roster: name, impl symbol + address, the C++ method it bounces to, and which of the four areas it belongs to.
- The handle convention: every entry takes an opaque handle whose *handle (or **handle) is a C++ xla::Executable-subtree pointer; the C-ABI function only un-marshals C structs and dispatches.
- The serialization wire form: xla::DeepseaExecutableProto is the on-the-wire shape; Serialize = ToProto + proto copy, Deserialize = ParseFromArray + FromProto, both returning a StatusOr-style C-ABI status object.
- The ownership rules: Free runs the C++ virtual destructor through slot +8 then free()s the handle box; the two FreeArray helpers are bare free(); HloModule is the only metadata accessor that allocates (an XLA_HloModule via ApiConverter::ToC).
| Property | Value |
|---|---|
| Roster size | 9 extern "C" TpuExecutable_* functions (verified against the function table) |
| Source files | tpu_executor_c_api.cc (most), tpu_execute_c_api.cc (LoadProgramAndEnqueueToStream) |
| Wrapper class | xla::TpuExecutable (methods 0xf8a7560–0xf8adce0) — legacy SE face |
| Marshaling base | xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream @ 0x1342cd20 |
| Concrete leaf | xla::jellyfish::DeepseaExecutable (LoadProgram @ 0x13426260, ToProto @ 0x134282e0, FromProto @ 0x134283e0, fingerprint @ 0x13428a80) |
| Serialization proto | xla::DeepseaExecutableProto (operator new(0x28) = 40 bytes) |
| Handle marshalling | ApiConverter::ToC(xla::HloModule const&) for HloModule |
| Evidence grade | Reimplementation-grade / byte-confirmed against IDA decompile |
Scope: the opaque-handle + *ApiFn() accessor model that frames this roster (how the host reaches TpuExecutable_Fingerprint through a function-pointer slot rather than by symbol) is on The TfTpu C-API Shim. The compiled-program handle this executable wraps is on TpuProgram Roster. The runtime execution algorithm behind the two big entries is owned by Execute Async on Stream and Load Program & Enqueue; this page is the C-ABI roster over those entries, not a second copy of them. The PJRT executable lifecycle that coexists with (and does not route through) this legacy path is on PJRT Executable Execution.
1. The Roster at a Glance
The nine functions, grouped by area, with the impl symbol (always the C-ABI free function itself), its address and size, and the C++ method it dispatches into. Confidence reflects how directly the bounce target was observed in the decompiled body.
| TpuExecutable_* | Address | Size (bytes) | Bounces to (C++) | Area |
|---|---|---|---|---|
| LoadProgramAndEnqueueToStream | 0xeaafba0 | 4496 | xla::jellyfish::DeepseaExecutable::LoadProgramAndEnqueueToStream @ 0x13426260 (via vtable+96) | Execution |
| ExecuteAsyncOnStream | 0xeabd500 | 4708 | xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream @ 0x1342cd20 (via vtable+24) | Execution |
| Serialize | 0xeabea80 | 178 | xla::jellyfish::DeepseaExecutable::ToProto @ 0x134282e0 | Serialization |
| Deserialize | 0xeabede0 | 288 | xla::jellyfish::DeepseaExecutable::FromProto @ 0x134283e0 | Serialization |
| Fingerprint | 0xeabea40 | 54 | DeepseaExecutable::fingerprint @ 0x13428a80 (cached field at obj+96 → +648/+656) | Metadata |
| HloModule | 0xeabef00 | 86 | ApiConverter::ToC(xla::HloModule const&) over executable.hlo_module_ | Metadata |
| Free | 0xeabef60 | 51 | C++ virtual destructor via vtable +8, then free() | Lifecycle |
| FreeXlaShapeIndexArray | 0xeabea00 | 10 | bare free(ptr) | Lifecycle |
| FreeMaybeOwningDeviceAddressArray | 0xeabea20 | 10 | bare free(ptr) | Lifecycle |
NOTE: the PJRT_TpuExecutable_* family (PJRT_TpuExecutable_RunHloCostAnalysis_Args, ..._GetCompiledMemoryStats_Args, ..._GetHloModuleWithConfig_Args, ..._SetTpuCompilationEnv_Args, and friends) is a different surface and is not on this roster. Those are PJRT-extension argument structs handled by pjrt::(anonymous namespace)::RunHloCostAnalysis / GetCompiledMemoryStats / etc., and belong to the PJRT pages. The C-ABI roster owned here is exactly the nine functions whose dynamic-symbol name begins with TpuExecutable_ (no PJRT_ prefix). Filtering on the prefix is the only reliable separator: the two families share verbs (HloModule, cost analysis) but live on opposite sides of the ABI.
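The prefix test described in the note can be written as a one-line predicate. The sketch below is illustrative only: IsRosterSymbol and RosterFilterExample are hypothetical names, and the sample symbols are the ones quoted on this page.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: true when a symbol name belongs to the TpuExecutable_*
// C-ABI roster. The prefix must sit at offset 0, so names of the form
// "PJRT_TpuExecutable_..." are rejected even though they contain the prefix.
inline bool IsRosterSymbol(std::string_view name) {
  constexpr std::string_view kPrefix = "TpuExecutable_";
  return name.substr(0, kPrefix.size()) == kPrefix;
}

// Usage example: both families expose an HLO-module verb, but only the
// unprefixed one passes the filter.
inline void RosterFilterExample() {
  assert(IsRosterSymbol("TpuExecutable_HloModule"));
  assert(!IsRosterSymbol("PJRT_TpuExecutable_GetHloModuleWithConfig_Args"));
}
```

A substring search would wrongly accept the PJRT names, which is why the sketch compares only the leading characters.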
QUIRK: the wrapper class is named xla::TpuExecutable (so TpuExecutable_HloModule ↔ a member that reads hlo_module_), but the execution entries bounce one or two layers deeper than xla::TpuExecutable's own methods: ExecuteAsyncOnStream lands on the xla::legacy::TpuExecutableInterface base (the class that implements marshaling), and LoadProgramAndEnqueueToStream lands on the xla::jellyfish::DeepseaExecutable leaf. A reimplementer who maps every TpuExecutable_* C function 1:1 onto an xla::TpuExecutable method will mis-place the two big ones: they target the interface/leaf, not the wrapper.
2. Execution Entries
Purpose
The two execution functions are the only large bodies in the roster (4496 B and 4708 B); the other seven total under 700 B combined. Both are C-ABI marshalling layers: they receive a flat launch/run struct from the SE shim, un-marshal it into the typed C++ objects the device runtime expects (xla::ExecutableRunOptions, xla::ExecutionInput vectors, DeviceAddressBase spans, a deserialized DeviceAssignment), dispatch through a vtable slot into the actual enqueue logic, and re-marshal the result (ExecutionOutput / absl::Status) back into C structs for the caller. Neither contains the device-enqueue algorithm — they are the C boundary over it.
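The size claim can be checked mechanically from the roster table. The following is a small sketch that copies the names, addresses and sizes from this page into a constant table; RosterEntry, Area and NonExecutionBytes are hypothetical names introduced for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

enum class Area { kExecution, kSerialization, kMetadata, kLifecycle };

struct RosterEntry {
  std::string_view name;   // suffix after "TpuExecutable_"
  std::uint64_t address;   // address of the C-ABI function in this build
  std::size_t size;        // function size in bytes
  Area area;
};

// Values copied from the roster table on this page.
inline constexpr std::array<RosterEntry, 9> kRoster = {{
    {"LoadProgramAndEnqueueToStream", 0xeaafba0, 4496, Area::kExecution},
    {"ExecuteAsyncOnStream", 0xeabd500, 4708, Area::kExecution},
    {"Serialize", 0xeabea80, 178, Area::kSerialization},
    {"Deserialize", 0xeabede0, 288, Area::kSerialization},
    {"Fingerprint", 0xeabea40, 54, Area::kMetadata},
    {"HloModule", 0xeabef00, 86, Area::kMetadata},
    {"Free", 0xeabef60, 51, Area::kLifecycle},
    {"FreeXlaShapeIndexArray", 0xeabea00, 10, Area::kLifecycle},
    {"FreeMaybeOwningDeviceAddressArray", 0xeabea20, 10, Area::kLifecycle},
}};

// Sums the sizes of every entry outside the execution area.
constexpr std::size_t NonExecutionBytes() {
  std::size_t total = 0;
  for (const RosterEntry& e : kRoster) {
    if (e.area != Area::kExecution) total += e.size;
  }
  return total;
}

// 178 + 288 + 54 + 86 + 51 + 10 + 10 = 677, which is under 700.
static_assert(NonExecutionBytes() == 677, "seven small entries");
```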
Entry Point
TpuExecutable_LoadProgramAndEnqueueToStream 0xeaafba0 (tpu_execute_c_api.cc)
└─ un-marshal C structs → dispatch through the handle vtable
└─ vtable+96 ──► xla::jellyfish::DeepseaExecutable::LoadProgramAndEnqueueToStream
0x13426260 (deepsea_executable.cc) ── real load+enqueue
TpuExecutable_ExecuteAsyncOnStream 0xeabd500 (tpu_executor_c_api.cc)
└─ vtable+24 ──► xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream
0x1342cd20 ── arg/output marshal + dispatch
└─ (leaf) DeepseaExecutable::LoadProgramAndEnqueueToStream 0x13426260
Algorithm
The C-ABI shell is uniform; the device work is on the runtime pages. The shell shape for ExecuteAsyncOnStream:
// TpuExecutable_ExecuteAsyncOnStream 0xeabd500
function TpuExecutable_ExecuteAsyncOnStream(handle /*a1*/, run_opts /*a2*/,
inputs /*a3*/, n_inputs /*a4*/,
out /*a5*/, status /*a6*/):
exe = *handle // opaque box → xla::Executable subtree
// un-marshal C structs into ExecutableRunOptions + vector<ExecutionInput>
cpp_opts = FromC(run_opts)
cpp_inputs = FromC(inputs, n_inputs)
// dispatch through the executable vtable slot +24 (ExecuteAsyncOnStream)
result = (*exe->vtable[+24])(exe, &cpp_opts, move(cpp_inputs)) // 0x1342cd20
// re-marshal ExecutionOutput / Status back to C
if result.ok: ToC(result.value, out)
else: set_status(status, result.status)
LoadProgramAndEnqueueToStream follows the same pattern but dispatches through vtable slot +96 into the DeepseaExecutable leaf, and its un-marshal step is heavier — it rebuilds an HloModule (for the shape signature only), an ExecutableRunOptions, a deserialized DeviceAssignment, and two DeviceAddressBase vectors. The full line-by-line reconstruction of both bodies — including the absl::Status handling and the device-side DeepseaExecutor::LoadProgram → DeepseaStream::EnqueueRequest chain — lives on the runtime pages.
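The shell shape (unbox, un-marshal, dispatch through a virtual slot, re-marshal) can be illustrated with a self-contained mock. Every type and function below is a hypothetical stand-in rather than an XLA or libtpu definition, the "execution" is a toy that sums input sizes, and the non-zero error code is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Flat C-side structs (hypothetical).
struct CRunOptions { int device_ordinal; };
struct CInput { const void* data; std::size_t size; };
struct COutput { std::size_t total_bytes; };
struct CStatus { int code; char message[64]; };  // code 0 means OK

// Typed C++-side counterparts (hypothetical).
struct RunOptions { int device_ordinal; };
struct Input { const void* data; std::size_t size; };
struct Result { int code; std::string message; std::size_t total_bytes; };

// Stand-in for the executable subtree; the virtual call plays the role of the
// vtable-slot dispatch.
struct MockExecutable {
  virtual ~MockExecutable() = default;
  virtual Result ExecuteAsyncOnStream(const RunOptions& opts,
                                      std::vector<Input> inputs) = 0;
};

// Toy leaf: sums the input sizes and rejects an empty input list.
struct SummingExecutable final : MockExecutable {
  Result ExecuteAsyncOnStream(const RunOptions&,
                              std::vector<Input> inputs) override {
    if (inputs.empty()) return {3, "no inputs", 0};  // arbitrary mock error
    std::size_t total = 0;
    for (const Input& in : inputs) total += in.size;
    return {0, "", total};
  }
};

struct MockHandle { MockExecutable* exe; };  // one-slot opaque box

extern "C" void Mock_ExecuteAsyncOnStream(MockHandle* handle,
                                          const CRunOptions* run_opts,
                                          const CInput* inputs,
                                          std::size_t n_inputs, COutput* out,
                                          CStatus* status) {
  assert(handle != nullptr && handle->exe != nullptr);  // mock precondition
  MockExecutable* exe = handle->exe;                    // unbox
  RunOptions cpp_opts{run_opts->device_ordinal};        // un-marshal options
  std::vector<Input> cpp_inputs;                        // un-marshal inputs
  cpp_inputs.reserve(n_inputs);
  for (std::size_t i = 0; i < n_inputs; ++i) {
    cpp_inputs.push_back(Input{inputs[i].data, inputs[i].size});
  }
  Result r = exe->ExecuteAsyncOnStream(cpp_opts, std::move(cpp_inputs));
  status->code = r.code;                                // re-marshal status
  std::snprintf(status->message, sizeof status->message, "%s",
                r.message.c_str());
  if (r.code == 0) out->total_bytes = r.total_bytes;    // re-marshal output
}
```

The C function itself holds no execution logic; swapping SummingExecutable for another subclass changes the behavior without touching the shell.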
Function Map
| Function | Size | Role |
|---|---|---|
| TpuExecutable_LoadProgramAndEnqueueToStream @ 0xeaafba0 | 4496 | C-ABI marshal → DeepseaExecutable::LoadProgramAndEnqueueToStream (vtable+96) |
| TpuExecutable_ExecuteAsyncOnStream @ 0xeabd500 | 4708 | C-ABI marshal → TpuExecutableInterface::ExecuteAsyncOnStream (vtable+24) |
Considerations
The two entries are two public doors into one device runtime. ExecuteAsyncOnStream is reached by the xla::LocalClient / xla::Service StreamExecutor stack (LocalExecutable::RunAsync → Executable::ExecuteAsyncOnStreamWrapper → vtable+24); the C-ABI shim is a second door that un-marshals C structs and calls the same +24 slot. LoadProgramAndEnqueueToStream is the lower-level enqueue primitive that the interface's ExecuteAsyncOnStream itself eventually drives via the leaf. The modern PJRT path (xla::TpuClient over tpu::System) does not route through either of these — it reaches the same device runtime through tpu::System::Execute (0x1d0b33e0). A reimplementer wiring the legacy SE backend reproduces these two C functions; one wiring PJRT ignores them.
3. Serialization Entries
Purpose
Serialize and Deserialize round-trip the executable through xla::DeepseaExecutableProto, the on-the-wire form. Serialize produces a heap-owned proto the caller later frees; Deserialize parses a byte buffer into a fresh executable, returning a C-ABI status object. Both are short (178 B / 288 B) because the real work is in the DeepseaExecutable::ToProto / FromProto C++ methods; the C functions are proto-allocation + status plumbing.
Algorithm
// TpuExecutable_Serialize 0xeabea80
function TpuExecutable_Serialize(self /*_XMM0/a1*/, out_proto /*a3*/):
*out_proto = NULL
proto = operator new(0x28) // 40-byte DeepseaExecutableProto
DeepseaExecutableProto::DeepseaExecutableProto(proto, 0)
*out_proto = proto
tmp = DeepseaExecutable::ToProto(self) // 0x134282e0 — fill a stack proto
// move-or-copy tmp into the heap proto, fast-pathing the arena-equal case
if tmp.arena == proto.arena: InternalSwap(proto, tmp) // same arena → swap
else: CopyFrom(proto, tmp) // cross-arena → deep copy
~DeepseaExecutableProto(tmp)
// TpuExecutable_Deserialize 0xeabede0
function TpuExecutable_Deserialize(len /*a1*/, bytes /*a2*/, out /*a3*/, status /*a4*/):
proto = DeepseaExecutableProto(0) // stack proto
if !MessageLite::ParseFromArray(proto, bytes, len):
set_status(status,
MakeErrorImpl<INTERNAL>( // code 13
"TpuExecutable_Deserialize: proto deserialization failed"))
goto cleanup
exe = DeepseaExecutable::FromProto(proto) // 0x134283e0 → StatusOr
if exe.ok():
box = operator new(8) // 8-byte handle box
*box = exe.value; *out = box // hand off ownership
else:
set_status(status, exe.status)
if exe.value: (*exe.value->vtable[+8])(exe.value) // destroy partial leaf
cleanup:
~DeepseaExecutableProto(proto)
GOTCHA: the deserialized handle is an 8-byte heap box (operator new(8u)) whose single slot holds the C++ DeepseaExecutable*. That is the same box shape TpuExecutable_Free expects: *handle is the C++ object, the box itself is the thing free() releases. A reimplementer that returns the bare C++ pointer instead of boxing it will make Free dereference a non-box and corrupt the heap. The box is one pointer of indirection on purpose: it lets the host hold an opaque XLA_TpuExecutable* while the plugin keeps the real object behind it.
NOTE: Deserialize returns the standard SE-shim status object via a4 (a refcounted absl::status_internal::StatusRep**), with code 13 = INTERNAL for a proto-parse failure. The decompiled body is dominated by StatusRep::Unref / _InterlockedIncrement refcount juggling, not by deserialization logic; that is entirely inside DeepseaExecutable::FromProto.
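The three outcomes of the Deserialize shell (parse failure, FromProto failure, boxed success) can be sketched with mocks. Everything here is hypothetical: the two-byte wire format, the MockParseFromArray / MockFromProto helpers, the plain int status and the FromProto error code are invented for illustration; only the code 13 for a parse failure and the one-pointer box come from this page. The sketch allocates the box with malloc so that a free()-based release matches it in portable C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

constexpr int kOk = 0;
constexpr int kInvalidArgument = 3;  // arbitrary mock FromProto failure
constexpr int kInternal = 13;        // parse-failure code named on this page

struct MockProto { std::uint8_t payload = 0; };

struct MockExecutable {
  explicit MockExecutable(std::uint8_t p) : payload(p) {}
  virtual ~MockExecutable() = default;
  std::uint8_t payload;
};

struct MockHandle { MockExecutable* exe; };
static_assert(sizeof(MockHandle) == sizeof(void*), "box is one pointer slot");

// Hypothetical wire format: exactly two bytes, magic 0x7E then the payload.
inline bool MockParseFromArray(MockProto* proto, const std::uint8_t* bytes,
                               std::size_t len) {
  if (len != 2 || bytes[0] != 0x7E) return false;
  proto->payload = bytes[1];
  return true;
}

// Stand-in for FromProto: rejects payload 0xFF, otherwise builds an object.
inline std::unique_ptr<MockExecutable> MockFromProto(const MockProto& proto,
                                                     int* code) {
  if (proto.payload == 0xFF) {
    *code = kInvalidArgument;
    return nullptr;
  }
  *code = kOk;
  return std::make_unique<MockExecutable>(proto.payload);
}

extern "C" void Mock_Deserialize(std::size_t len, const std::uint8_t* bytes,
                                 MockHandle** out, int* status_code) {
  assert(out != nullptr && status_code != nullptr);  // mock precondition
  *out = nullptr;
  *status_code = kOk;
  MockProto proto;
  if (!MockParseFromArray(&proto, bytes, len)) {
    *status_code = kInternal;  // parse failure
    return;
  }
  std::unique_ptr<MockExecutable> exe = MockFromProto(proto, status_code);
  if (exe == nullptr) return;  // status already carries the FromProto error
  auto* box = static_cast<MockHandle*>(std::malloc(sizeof(MockHandle)));
  assert(box != nullptr);
  box->exe = exe.release();    // hand ownership to the box
  *out = box;
}
```

On failure the out-handle stays null, so a caller only needs to check the status before using the handle.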
Function Map
| Function | Size | Role |
|---|---|---|
| TpuExecutable_Serialize @ 0xeabea80 | 178 | ToProto + arena-aware swap/copy into a heap DeepseaExecutableProto |
| TpuExecutable_Deserialize @ 0xeabede0 | 288 | ParseFromArray + FromProto; boxes the result, sets a StatusRep out-param |
Considerations
The proto type xla::DeepseaExecutableProto is the executable serialization, distinct from the tensorflow::TPUExecutableProto that backs the TpuProgram_* handle (which is the program serialization fed to the compilation cache). Serialize here produces the executable proto for transport/caching of a loaded executable; Deserialize reconstructs one from bytes. The arena-equal fast path in Serialize (InternalSwap vs CopyFrom) is a proto2 idiom — when the source and destination share an arena the move is a pointer swap, otherwise it is a deep copy. Reproduce both branches: a swap across distinct arenas would alias freed memory.
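The swap-or-copy decision can be shown without the protobuf library. This is a minimal sketch under stated assumptions: MockArena and MockMessage are hypothetical stand-ins, the arena is modeled as an identity pointer only, and the payload is a plain std::string rather than real message fields.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct MockArena {};  // hypothetical: only its address matters here

struct MockMessage {
  MockArena* arena = nullptr;  // nullptr models a heap-owned message
  std::string payload;

  // Cheap exchange of contents; only valid when both sides share an arena.
  void InternalSwap(MockMessage& other) {
    assert(arena == other.arena);
    std::swap(payload, other.payload);
  }

  // Deep copy; valid across arenas because nothing is shared afterwards.
  void CopyFrom(const MockMessage& other) { payload = other.payload; }
};

// Moves src into dst when both live on the same arena, otherwise deep-copies.
// Returns true when the swap path was taken.
inline bool MoveOrCopy(MockMessage& dst, MockMessage& src) {
  if (dst.arena == src.arena) {
    dst.InternalSwap(src);
    return true;
  }
  dst.CopyFrom(src);
  return false;
}
```

After the swap path the source holds the destination's old contents, while after the copy path the source is left unchanged.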
4. Metadata Entries
Purpose
Two read-only accessors. Fingerprint returns the executable's cached fingerprint (a string/Cord pair) for cache keying; HloModule marshals the executable's xla::HloModule out into an XLA_HloModule C struct for the host to inspect. Fingerprint allocates nothing — it returns pointers into the executable's own fields; HloModule is the only metadata function that allocates (through ApiConverter::ToC).
Algorithm
// TpuExecutable_Fingerprint 0xeabea40
function TpuExecutable_Fingerprint(handle /*a1*/, out_ptr /*a2*/, out_len /*a3*/):
fp = *(*handle + 96) // DeepseaExecutable cached fingerprint field
if (int8)*(fp+671) >= 0: // SSO discriminator byte: non-negative selects the inline (short) form
*out_len = *(uint8*)(fp+671) // inline string: length is the byte at fp+671, data starts at fp+648
*out_ptr = fp+648
else:
*out_len = *(fp+656) // heap string: length at +656
*out_ptr = *(fp+648) // data ptr at +648
return *out_ptr
// TpuExecutable_HloModule 0xeabef00
function TpuExecutable_HloModule(out /*a1*/, handle /*a2*/):
mod = *(*handle + 8) // xla::Executable::hlo_module_
if mod == NULL:
LOG(FATAL) "hlo_module_ != nullptr" // executable.h:335
ApiConverter::ToC(out, mod) // fill caller-provided XLA_HloModule
return out
QUIRK: Fingerprint reads the cached fingerprint directly out of the C++ object at offset +96 and decodes a std::string's short-string-optimization layout inline (the sign of the byte at +671 discriminates inline vs. heap, length at +671 or +656, data at +648 or *(+648)). It does not call DeepseaExecutable::fingerprint (0x13428a80) at call time: that method populates the field; this accessor just exposes it. A reimplementer who recomputes the fingerprint on every Fingerprint call will be correct but slow; the binary reads a precomputed field and returns borrowed pointers (so the caller must not free them).
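The inline-versus-heap decode can be sketched over a raw 24-byte string representation. The offsets below are relative to the start of the string (+648 in the binary), so the discriminator at +671 becomes byte 23 and the heap length at +656 becomes byte 8. This follows the layout as this page describes it and assumes 64-bit pointers; BorrowedBytes and DecodeSsoString are hypothetical names, and the sketch is not a statement about any standard library's documented ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");

// A non-owning (pointer, length) view, mirroring the borrowed result.
struct BorrowedBytes {
  const char* data;
  std::size_t size;
};

// rep points at a 24-byte string representation:
//   byte 23 high bit clear: inline form, length in byte 23, chars at rep[0..]
//   byte 23 high bit set:   heap form, data pointer at rep+0, length at rep+8
inline BorrowedBytes DecodeSsoString(const unsigned char* rep) {
  assert(rep != nullptr);
  const bool is_inline = (rep[23] & 0x80) == 0;  // same test as "byte >= 0"
  if (is_inline) {
    return {reinterpret_cast<const char*>(rep),
            static_cast<std::size_t>(rep[23])};
  }
  const char* data = nullptr;
  std::uint64_t size = 0;
  std::memcpy(&data, rep, sizeof data);      // heap data pointer
  std::memcpy(&size, rep + 8, sizeof size);  // heap length
  return {data, static_cast<std::size_t>(size)};
}
```

In both branches the returned pointer aliases storage owned by the string, which is why the result must never be freed by the caller.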
GOTCHA: HloModule LOG(FATAL)s if hlo_module_ (at executable+8) is null; it does not return an error status. The HLO module is retained only when the executable was built with module retention enabled; an executable compiled without it will crash here, not fail gracefully. The fatal-log source anchor is executable.h:335. This is upstream XLA behavior surfaced through the C ABI unchanged.
Function Map
| Function | Size | Role |
|---|---|---|
| TpuExecutable_Fingerprint @ 0xeabea40 | 54 | borrow (ptr,len) of the cached fingerprint field at obj+96; SSO-decoded |
| TpuExecutable_HloModule @ 0xeabef00 | 86 | ApiConverter::ToC of hlo_module_ into a caller XLA_HloModule; FATAL if null |
Considerations
Fingerprint returns borrowed pointers into the executable; there is no paired free, and the data is valid only while the handle lives. HloModule is the lone allocator: ApiConverter::ToC(xla::HloModule const&) writes into a caller-provided XLA_HloModule out-struct (passed as the leading sret argument) and fills it with heap-owned interior (the marshalling convention on the shim overview), so the host must run the matching Destroy/free for the XLA_HloModule. The related richer metadata accessors (GetCompiledMemoryStats, GetCostAnalysis, GetOutputLayouts, …) are C++ methods on xla::TpuExecutable (0xf8a7560–0xf8aab40) reached through the PJRT extension surface, not through this C-ABI roster — see the note in §1.
5. Lifecycle and Array-Free Entries
Purpose
Three deallocators. Free tears down a full executable handle (destruct the C++ object, then free the box); the two FreeArray helpers free arrays that other roster functions (on this and sibling pages) hand back to the caller as raw heap pointers.
Algorithm
// TpuExecutable_Free 0xeabef60
function TpuExecutable_Free(box /*ptr*/):
if box != NULL:
if *box != NULL:
(*(*box)->vtable[+8])(*box) // C++ virtual destructor (delete-style)
free(box) // free the 8-byte handle box
// TpuExecutable_FreeXlaShapeIndexArray 0xeabea00
function TpuExecutable_FreeXlaShapeIndexArray(p): if p: free(p)
// TpuExecutable_FreeMaybeOwningDeviceAddressArray 0xeabea20
function TpuExecutable_FreeMaybeOwningDeviceAddressArray(p): if p: free(p)
NOTE: the two array-free helpers are byte-for-byte identical (if (a1) free(a1);, 10 bytes each at adjacent addresses 0xeabea00 / 0xeabea20). They are kept as two distinct symbols deliberately: each pairs with a different array-producing call elsewhere in the ABI (an XLA_ShapeIndex[] and an SE_MaybeOwningDeviceAddress[]), so a reimplementer must keep both names even though the implementation is the same free. Collapsing them into one function would break the ABI's named-pairing contract, not the runtime behavior.
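The named-pairing idea can be sketched with two producers and two frees that share one implementation. The element types, the producer functions and the typed parameters are assumptions made for the example (the decompiled helpers only show a single pointer argument); typing the parameters simply turns a crossed pairing into a compile error in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical element types standing in for the two array kinds.
struct MockShapeIndex { long indices[4]; };
struct MockDeviceAddress { void* opaque; std::size_t size; };

// Hypothetical producers: return zeroed raw heap arrays owned by the caller.
inline MockShapeIndex* MockMakeShapeIndexArray(std::size_t n) {
  return static_cast<MockShapeIndex*>(std::calloc(n, sizeof(MockShapeIndex)));
}
inline MockDeviceAddress* MockMakeDeviceAddressArray(std::size_t n) {
  return static_cast<MockDeviceAddress*>(
      std::calloc(n, sizeof(MockDeviceAddress)));
}

// Two distinct symbols with the same body, one per producer.
extern "C" void Mock_FreeXlaShapeIndexArray(MockShapeIndex* p) {
  if (p) std::free(p);
}
extern "C" void Mock_FreeMaybeOwningDeviceAddressArray(MockDeviceAddress* p) {
  if (p) std::free(p);
}
```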
GOTCHA: TpuExecutable_Free expects the boxed handle. Here box is the operator new(8) box from Deserialize, *box is the C++ DeepseaExecutable*, and the virtual destructor is invoked through (*box)->vtable[+8]. The order is fixed: destruct the object via the vtable slot first, then free() the box. Calling free(box) without first running the virtual destructor leaks the C++ object and all its device-side state (loaded programs, output buffers); invoking the destructor on the box instead of on *box calls a destructor on garbage.
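The teardown order can be sketched with a mock whose destructor bumps a counter, which makes it observable that the object is destroyed exactly once before the box is released. MockExecutable, MockHandle, MockBox and Mock_Free are hypothetical names; the sketch allocates the box with malloc so that it matches free() in portable C++.

```cpp
#include <cassert>
#include <cstdlib>

// Stand-in for the C++ executable; the destructor records that it ran.
struct MockExecutable {
  explicit MockExecutable(int* destroyed) : destroyed_(destroyed) {}
  virtual ~MockExecutable() { ++*destroyed_; }
  int* destroyed_;
};

struct MockHandle { MockExecutable* exe; };  // one-slot box

// Hypothetical helper that mints a box around an object (or around null).
inline MockHandle* MockBox(MockExecutable* exe) {
  auto* box = static_cast<MockHandle*>(std::malloc(sizeof(MockHandle)));
  assert(box != nullptr);
  box->exe = exe;
  return box;
}

extern "C" void Mock_Free(MockHandle* box) {
  if (box == nullptr) return;
  if (box->exe != nullptr) {
    delete box->exe;  // step 1: virtual destructor on *box, not on box
  }
  std::free(box);     // step 2: release the box itself
}
```

Reversing the two steps would read box->exe from memory that has already been returned to the allocator.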
Function Map
| Function | Size | Role |
|---|---|---|
| TpuExecutable_Free @ 0xeabef60 | 51 | virtual-destruct *box via vtable+8, then free(box) |
| TpuExecutable_FreeXlaShapeIndexArray @ 0xeabea00 | 10 | if (p) free(p) (frees an XLA_ShapeIndex[]) |
| TpuExecutable_FreeMaybeOwningDeviceAddressArray @ 0xeabea20 | 10 | if (p) free(p) (frees an SE_MaybeOwningDeviceAddress[]) |
Considerations
Free is the destructor for the whole roster: every handle minted by Deserialize (and by the compiler/program path that produces a loaded executable) is reclaimed through it. The two array frees exist because the C ABI cannot return owning C++ containers — a function that conceptually returns vector<ShapeIndex> returns a raw XLA_ShapeIndex* + count and a named free, mirroring the *FreeArray pattern on TpuProgram_* and TpuTopology_*. Pair each producer with its named free; do not cross them (a ShapeIndex[] freed by the device-address helper is still just free, but the named contract documents intent and may diverge in other builds).
Related Components
| Name | Relationship |
|---|---|
| xla::TpuExecutable (0xf8a7560–0xf8adce0) | the legacy StreamExecutor wrapper class the roster mirrors; also the home of the PJRT-only metadata methods |
| xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream @ 0x1342cd20 | the marshaling base that TpuExecutable_ExecuteAsyncOnStream dispatches into (vtable+24) |
| xla::jellyfish::DeepseaExecutable (leaf) | concrete impl: LoadProgram 0x13426260, ToProto 0x134282e0, FromProto 0x134283e0, fingerprint 0x13428a80 |
| xla::DeepseaExecutableProto (40 bytes) | the serialization wire form behind Serialize / Deserialize |
| ApiConverter::ToC(xla::HloModule const&) | marshals hlo_module_ into the XLA_HloModule C struct for TpuExecutable_HloModule |
| TpuProgram_* roster | the serializable compiled-program handle this running executable wraps |
Cross-References
- The TfTpu C-API Shim: the opaque-handle + *ApiFn() accessor model that frames this roster (the three-hop accessor → slot → C-ABI impl pattern)
- TpuProgram Roster: the TpuProgram_* compiled-program handle (tensorflow::TPUExecutableProto base) that this executable wraps
- TpuCompiler Roster: the TpuCompiler_* / TpuCompile_* surface that produces the programs loaded into these executables
- TpuExecutor Roster: the TpuExecutor_* per-device runtime the executable enqueues onto
- Execute Async on Stream: the runtime algorithm behind TpuExecutable_ExecuteAsyncOnStream (this page is the C-ABI roster over it)
- Load Program & Enqueue: the runtime algorithm behind TpuExecutable_LoadProgramAndEnqueueToStream (load + device enqueue)
- PJRT Executable Execution: the modern PJRT executable lifecycle that coexists with, and does not route through, this legacy SE path