ExecuteAsyncOnStream
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (libtpu_lts_20260413_b_RC00, BuildID 89edbbe81c5b328a958fe628a9f2207d, ELF x86-64, ~745 MB). Other builds will differ.
Abstract
ExecuteAsyncOnStream is the single per-execution C++ entry point of libtpu's legacy StreamExecutor execution path — the xla::Executable virtual that turns a vector of xla::ExecutionInput buffers into a populated xla::ExecutionOutput and enqueues the device program. It is the TPU analogue of upstream XLA's xla::Executable::ExecuteAsyncOnStream, reached through xla::LocalClient::Compile → xla::LocalExecutable::RunAsync, not through the modern PJRT C-API. The concrete override is xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream @ 0x1342cd20 (3650 B), and it is the function this page reconstructs.
NOTE — the task framing calls this "TpuExecutable::ExecuteAsyncOnStream" and references xla::PjRtStreamExecutorLoadedExecutable. Neither symbol exists in this binary (HIGH — exhaustive rg over 884,843 decompiled functions returns zero hits for PjRtStreamExecutorLoadedExecutable). The PJRT client in this build is xla::TpuClient (derived from xla::CommonPjRtClient : xla::PjRtClient) over the TFRT-native tpu::System async-value runtime, which does not route through ExecuteAsyncOnStream at all — its execute path is PJRT_LoadedExecutable_Execute → CommonPjRtLoadedExecutable::Execute → tpu::System::Execute. The ExecuteAsyncOnStream virtual that is present belongs to the parallel xla::LocalClient / xla::Service StreamExecutor stack, whose TPU executable is xla::legacy::TpuExecutableInterface (and its subclass xla::jellyfish::DeepseaExecutable). This page documents that real entry; where the contract names a PJRT concept, it is mapped to the StreamExecutor object that fills the role.
The entry does four things in order: marshal the host ExecutableRunOptions and ExecutionInput arguments into a buffer tree, allocate the output ScopedShapedBuffer with input-buffer donation/aliasing, dispatch the device work through a vtable slot into the enqueue lower half, and assemble an ExecutionOutput (or an absl::Status) for the caller. The enqueue lower half — DeepseaExecutable::LoadProgramAndEnqueueToStream @ 0x13426260 — is on Load Program and Enqueue; this page owns the entry, the argument/output marshaling, and the dispatch into that layer.
For reimplementation, the contract is:
- The dispatch lattice. Two call paths reach one virtual: LocalExecutable::RunAsync → Executable::ExecuteAsyncOnStreamWrapper → vtable slot +24 (ExecuteAsyncOnStream), and a second public door, the C-ABI shim TpuExecutable_ExecuteAsyncOnStream @ 0xeabd500, which calls the same +24 slot after un-marshaling C structs.
- Argument marshaling. How an xla::ExecutionInput vector is walked into a flat DeviceAddressBase array indexed by IndexTable, plus the dynamic-shape side channel.
- Output construction. AllocateOutputMemoryWithInputReuse (@ 0x1342ba00) building a ScopedShapedBuffer, the input→output aliasing fixups driven by HloInputOutputAliasConfig, and MarkToBeReleasedArguments.
- The hand-off. The vtable +96 indirect call into LoadProgramAndEnqueueToStream and, on success, the move of the pre-allocated output buffer into the caller's ExecutionOutput.
| Item | Value |
|---|---|
| Entry point | xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream @ 0x1342cd20 (3650 B) |
| C-ABI shim | TpuExecutable_ExecuteAsyncOnStream @ 0xeabd500 (4708 B) — tpu_executor_c_api.cc |
| Public C++ wrapper | xla::Executable::ExecuteAsyncOnStreamWrapper @ 0x1dad98a0 (579 B, ExecutionInput overload) |
| Client driver | xla::LocalClient → xla::LocalExecutable::RunAsync @ 0x1084d140 (2489 B) |
| Vtable dispatch slot | +24 (enter ExecuteAsyncOnStream); +96 (leaf LoadProgramAndEnqueueToStream) |
| Output allocator | TpuExecutableInterface::AllocateOutputMemoryWithInputReuse @ 0x1342ba00 (4828 B) |
| Enqueue lower half | xla::jellyfish::DeepseaExecutable::LoadProgramAndEnqueueToStream @ 0x13426260 (7512 B) |
| Arg type | std::vector<xla::ExecutionInput> (192 B/element) |
| Result type | xla::ExecutionOutput (wraps a ScopedShapedBuffer) returned by value as StatusOr |
| Source file (asserts) | stream_executor/tpu/tpu_executable_interface.cc |
Object Model and Class Hierarchy
Purpose
ExecuteAsyncOnStream is a virtual on the upstream xla::Executable base. On TPU the override lives on xla::legacy::TpuExecutableInterface, an abstract class that implements argument/output marshaling once and defers the actual device enqueue to a pure-virtual leaf implemented by the concrete xla::jellyfish::DeepseaExecutable.
Inheritance
xla::Executable  (upstream base; ExecuteAsyncOnStream is virtual @ vtable+24)
 └─ xla::legacy::TpuExecutableInterface  ── implements ExecuteAsyncOnStream @ 0x1342cd20
     │                                      (marshal args, allocate outputs, dispatch via +96)
     └─ xla::jellyfish::DeepseaExecutable  ── implements LoadProgramAndEnqueueToStream @ 0x13426260
                                              (the leaf invoked through vtable+96; load-program-enqueue.md)
The legacy:: namespace on TpuExecutableInterface is the binary's own label (mangled _ZN3xla6legacy22TpuExecutableInterface…), and is the clearest single signal that this whole code path is the deprecated StreamExecutor execution model retained for LocalClient/Service and the TF-TPU op kernels. The modern PJRT front door uses xla::TpuClient instead.
QUIRK — the two vtable slots are different objects. ExecuteAsyncOnStream is dispatched at +24 on the Executable vtable (called by ExecuteAsyncOnStreamWrapper line 45 and the C shim line 703). Inside ExecuteAsyncOnStream the device work is dispatched at +96 ((*(...)(*v89 + 96)), interface line 650) — that is the pure-virtual LoadProgramAndEnqueueToStream. A reimplementer who collapses the two into one method loses the abstract/concrete split that lets DeepseaExecutable swap the device backend without touching marshaling.
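The abstract/concrete split described in this quirk can be illustrated with a small template-method sketch. Everything here is hypothetical: FakeInput, FakeOutput, ExecutableInterfaceSketch and DeepseaSketch are stand-ins invented for illustration, not types from the binary, and the real entry is itself a virtual on xla::Executable rather than a plain member function.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for xla::ExecutionInput / xla::ExecutionOutput.
struct FakeInput { int buffer_id; };
struct FakeOutput { std::vector<int> consumed; std::string backend; };

// Plays the role of the interface class: marshaling is written once here.
class ExecutableInterfaceSketch {
 public:
  virtual ~ExecutableInterfaceSketch() = default;

  // Analogue of the +24 entry: shared marshaling, then a hand-off to the leaf.
  FakeOutput ExecuteAsyncOnStream(std::vector<FakeInput> args) {
    std::vector<int> flat;
    flat.reserve(args.size());
    for (const FakeInput& a : args) flat.push_back(a.buffer_id);
    FakeOutput out;
    out.backend = LoadProgramAndEnqueueToStream(flat);  // analogue of the +96 slot
    out.consumed = std::move(flat);
    return out;
  }

 protected:
  // Pure-virtual leaf: only the device backend implements this.
  virtual std::string LoadProgramAndEnqueueToStream(const std::vector<int>& flat) = 0;
};

// Plays the role of the concrete backend; it never touches marshaling.
class DeepseaSketch final : public ExecutableInterfaceSketch {
 protected:
  std::string LoadProgramAndEnqueueToStream(const std::vector<int>& flat) override {
    return "deepsea:" + std::to_string(flat.size());
  }
};
```

Swapping DeepseaSketch for another subclass changes only the leaf, which is the property the two-slot layout preserves.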
Function Map
| Function | Address | Size | Role |
|---|---|---|---|
| xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream | 0x1342cd20 | 3650 B | The entry: marshal → allocate → dispatch → output |
| xla::legacy::TpuExecutableInterface::AllocateOutputMemoryWithInputReuse | 0x1342ba00 | 4828 B | Build output ScopedShapedBuffer, honor donation/aliasing |
| xla::jellyfish::DeepseaExecutable::LoadProgramAndEnqueueToStream | 0x13426260 | 7512 B | Pure-virtual leaf (vtable+96) — device enqueue |
| xla::Executable::ExecuteAsyncOnStreamWrapper (ExecutionInput) | 0x1dad98a0 | 579 B | Profiled public wrapper around the +24 virtual |
| xla::Executable::ExecuteAsyncOnStreamWrapper (ShapedBuffer) | 0x1dad9780 | 259 B | Legacy ShapedBuffer-span overload |
| xla::ExecuteWrapperAfterExecution | 0x1dad9b00 | 266 B | Post-exec profiling/HLO-profile finalize |
| xla::LocalExecutable::RunAsync | 0x1084d140 | 2489 B | LocalClient driver → wrapper |
| TpuExecutable_ExecuteAsyncOnStream | 0xeabd500 | 4708 B | C-ABI shim: C structs ↔ C++ objects |
Entry Point and Dispatch Lattice
Purpose
There is exactly one execution virtual, reached from two directions: the in-process C++ client (LocalExecutable::RunAsync) and the C-ABI boundary (TpuExecutable_ExecuteAsyncOnStream, exported for the StreamExecutor TpuExecutor shim).
Entry Point
xla::LocalExecutable::RunAsync (0x1084d140)                                ── LocalClient per-call driver
 └─ xla::Executable::ExecuteAsyncOnStreamWrapper (0x1dad98a0)
     ├─ xla::ExecuteWrapperBeforeExecution                                 ── start HLO execution profile
     ├─ [vtable+24] ExecuteAsyncOnStream ────────────────────────────────┐
     └─ xla::ExecuteWrapperAfterExecution (0x1dad9b00)                   │ ── finalize profile, stamp Status
                                                                         │
TpuExecutable_ExecuteAsyncOnStream (0xeabd500)                           │ ── C-ABI: FromC args, build RunOptions,
 └─ [vtable+24] ExecuteAsyncOnStream ────────────────────────────────────┤    then ToC the ExecutionOutput
                                                                         │
xla::legacy::TpuExecutableInterface::ExecuteAsyncOnStream (0x1342cd20) <─┘
 ├─ AllocateOutputMemoryWithInputReuse (0x1342ba00)                        ── ScopedShapedBuffer + aliasing
 ├─ xla::Executable::MarkToBeReleasedArguments                             ── donation bookkeeping
 └─ [vtable+96] DeepseaExecutable::LoadProgramAndEnqueueToStream (0x13426260)  ── device enqueue
Algorithm — the wrapper
ExecuteAsyncOnStreamWrapper is thin: it brackets the virtual call with the profiling hooks. Both hooks exist even when profiling is disabled (they no-op on a null HloExecutionProfile).
function ExecuteAsyncOnStreamWrapper(self, run_options, args):     // 0x1dad98a0
    state  = ExecuteWrapperBeforeExecution(run_options)            // start span; capture stream
    out    = (*self.vtable[24])(self, run_options, &args)          // -> ExecuteAsyncOnStream; moves args
    stream = run_options->stream()                                 // line 70
    status = ExecuteWrapperAfterExecution(self, &state,            // 0x1dad9b00
                                          out.status, stream)      // finalize profile
    return out                                                     // ExecutionOutput by value
NOTE — the wrapper moves out of the caller's args vector (lines 36-44 zero the source vector header, then destroy the moved-from ExecutionInputs). The argument vector is consumed by the call; a reimplementer must not reuse it afterward.
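The bracketing and the consumed argument vector can be sketched as follows. This is a minimal illustration under stated assumptions: InputSketch, ResultSketch, ProfileHooks and ExecuteWrapperSketch are hypothetical names, the hooks are reduced to counters, and the virtual call is modeled as a std::function.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

struct InputSketch { int id; };           // hypothetical stand-in for xla::ExecutionInput
struct ResultSketch { bool ok; int n; };  // hypothetical stand-in for the returned StatusOr

// Counts hook invocations so the bracketing is observable.
struct ProfileHooks { int before = 0; int after = 0; };

ResultSketch ExecuteWrapperSketch(
    ProfileHooks& hooks, std::vector<InputSketch>& caller_args,
    const std::function<ResultSketch(std::vector<InputSketch>)>& execute) {
  ++hooks.before;  // analogue of ExecuteWrapperBeforeExecution
  std::vector<InputSketch> consumed = std::move(caller_args);
  caller_args.clear();  // leave the moved-from source in a known-empty state
  ResultSketch out = execute(std::move(consumed));  // analogue of the +24 call
  ++hooks.after;   // analogue of ExecuteWrapperAfterExecution
  return out;
}
```

After the call the caller's vector is empty, which mirrors the rule that the arguments must not be reused.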
The C-ABI shim
TpuExecutable_ExecuteAsyncOnStream (0xeabd500, from tpu_executor_c_api.cc) is the boundary the StreamExecutor TpuExecutor C-shim crosses. It is pure marshaling around the same +24 virtual:
function TpuExecutable_ExecuteAsyncOnStream(self, c_run_opts, c_args[], n, c_out, status_out):  // 0xeabd500
    run_options.set_device_ordinal(c_run_opts->device_ordinal)         // a2+32
    if c_run_opts->allocator:                                          // a2+8
        run_options.set_allocator(new DeviceAddressAllocator{          // 0x18 B object
            GetUnderlyingDeepseaPlatform(), c_run_opts })              // wraps deepsea platform
    run_options.set_stream(*c_run_opts->stream)                        // a2+40
    if c_run_opts->host_to_device_stream:                              // a2+48
        run_options.set_host_to_device_stream(...)
    if c_run_opts->device_assignment:                                  // a2+56
        proto = DeserializeProto<DeviceAssignmentProto>(...)           // TpuSerializedProto
        run_options.set_device_assignment(DeviceAssignment::Deserialize(proto))
    run_options.set_rng_seed(c_run_opts->rng_seed)                     // a2+72
    run_options.set_run_id(c_run_opts->run_id)                         // a2+80
    // -- marshal each C SE_ExecutionInput into an xla::ExecutionInput --
    for i in 0..n:                                                     // line 261
        arg = ExecutionInput(ApiConverter::FromC(c_args[i]))           // shape
        TF_CHECK_OK(arg.SetDynamicShape(FromC(c_args[i]+560)))         // dynamic shape side channel
        for each buffer in c_args[i].buffers (stride 72, base +536):   // line 312
            arg.SetUnownedBuffer / SetBuffer(FromC(buffer))            // MaybeOwningDeviceAddress
        for each aliased index (stride 72, base +544, count +552):     // line 504
            arg.MutableBuffers()->insert(ShapeIndex)                   // IndexTable entry
        args.push_back(move(arg))
    out = (*self.vtable[24])(&out, self, &run_options, &args)          // line 703 -> ExecuteAsyncOnStream
    if out.ok():
        scoped = ScopedShapedBuffer(out.result)
        ApiConverter::ToC(c_out, scoped.release())                     // populate SE_ExecutionOutput
    else:
        *status_out = out.status                                       // line 887
    // destroy args, run_options, allocator
GOTCHA — SetDynamicShape is asserted with TF_CHECK_OK (the shim asserts at tpu_executor_c_api.cc:1190). A malformed dynamic-shape blob from the C side is a hard LogMessageFatal, not a returned error. The dynamic shape lives at a fixed +560 offset from each C argument struct, separate from the static shape at offset 0.
Argument Marshaling
Purpose
The interface entry receives args already as std::vector<xla::ExecutionInput>. Its first job is to flatten every leaf buffer of every argument into a single contiguous DeviceAddressBase array that the enqueue layer consumes positionally, while preserving the tree shape via the per-argument IndexTable.
Algorithm
function TpuExecutableInterface::ExecuteAsyncOnStream(self, run_options, args):   // 0x1342cd20
    n = args.size()                                                     // a4[1]
    // ---- Stage 1: flatten argument leaf buffers ----
    flat = new DeviceAddressBase[n]                                     // 24 B/elem (line 142)
    for i in 0..n:                                                      // walk arg[i].Buffers()
        entry = IndexTable::GetEntry(arg[i].buffers, root, /*index*/0)  // tuple_tree.h:332 CHECK
        addr = entry.AsDeviceAddress()                                  // MaybeOwningDeviceAddress
        flat[i] = addr                                                  // moved into flat array
    // ---- Stage 2: fetch result shape + aliasing config from the program ----
    if program (a2):
        result_shape = program->result_shape()                          // vtable+40 (line 238)
        alias_config = program->input_output_alias_config()             // HloInputOutputAliasConfig (+2840…)
    else:
        result_shape = Shape{}                                          // empty (line 249)
        alias_config = ShapeTree<optional<Alias>>{}                     // line 334
    CHECK(run_options->allocator() != nullptr)                          // tpu_executable_interface.cc:219
    ...
The argument vector element stride is 192 bytes (an xla::ExecutionInput), visible in every destructor loop (192 * count, e.g. lines 666, 711, 945 of the C shim and 53 of the wrapper). Each ExecutionInput carries a Shape, a dynamic Shape, and a ShapeTree<MaybeOwningDeviceAddress> of leaf buffers indexed through xla::internal::IndexTable. The leaf device addresses are 24-byte DeviceAddressBase records (opaque pointer + size + memory-space tag); the flatten loop asserts each AsDeviceAddress().opaque() != nullptr (tuple_tree.h:332).
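A minimal sketch of the flatten step, assuming the record layout described above. DeviceAddressSketch, ArgumentSketch and FlattenRoots are hypothetical names; the sketch keeps only one root address per argument and throws where the binary would CHECK-fail, so that the failure path can be exercised in a test.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical 24-byte record mirroring the described layout:
// opaque pointer + size + memory-space tag (assumes a 64-bit target).
struct DeviceAddressSketch {
  void* opaque;
  std::uint64_t size;
  std::uint64_t memory_space;
};
static_assert(sizeof(DeviceAddressSketch) == 24, "sketch assumes a 24-byte record");

// Stand-in for one argument; only the root entry of its buffer tree is modeled.
struct ArgumentSketch {
  DeviceAddressSketch root;
};

std::vector<DeviceAddressSketch> FlattenRoots(const std::vector<ArgumentSketch>& args) {
  std::vector<DeviceAddressSketch> flat;
  flat.reserve(args.size());  // element count is known up front
  for (const ArgumentSketch& arg : args) {
    if (arg.root.opaque == nullptr) {
      // The binary treats this as a fatal CHECK; an exception is used here instead.
      throw std::invalid_argument("null device address");
    }
    flat.push_back(arg.root);  // positional: flat[i] belongs to args[i]
  }
  return flat;
}
```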
QUIRK — the flatten array uses a hand-rolled growable vector with the 2*cap doubling policy (v16 = 2*v7, interface line 195) and a 0xAAAAAAAAAAAAAAAA length-error guard (interface lines 139, 196 — the max element count for the 24-byte DeviceAddressBase stride). This is not a std::vector<DeviceAddressBase> with default growth — the leaf count is known up front (24 * n, line 142), so a reimplementation can pre-size exactly and skip the reallocation path entirely (interface lines 142–218). (The 0xAAAAAAAAAAAAAAAB division magic that recovers a /192 element count from a byte span belongs to the wrapper's ExecutionInput vector teardown at 0x1dad98a0 line 64, not to this 24-byte flat array.)
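The /192 constant mentioned above is consistent with a standard exact-division idiom: 192 = 64 × 3, and 0xAAAAAAAAAAAAAAAB is the multiplicative inverse of 3 modulo 2^64, so a byte span that is an exact multiple of 192 can be turned into an element count with a shift and a multiply. The sketch below shows the arithmetic only; the function name is hypothetical, and whether the binary uses exactly this instruction sequence is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// 0xAAAAAAAAAAAAAAAB * 3 wraps to 1 in 64-bit unsigned arithmetic,
// so multiplying by it undoes an exact multiplication by 3.
constexpr std::uint64_t kInverseOf3 = 0xAAAAAAAAAAAAAAABull;
static_assert(kInverseOf3 * 3ull == 1ull, "inverse of 3 modulo 2^64");

// Recover an element count from a byte span that is an exact multiple of 192.
// Shift out the factor 64, then multiply by the inverse of the factor 3.
// The result is meaningless if byte_span is not a multiple of 192.
constexpr std::uint64_t CountOf192ByteElements(std::uint64_t byte_span) {
  return (byte_span >> 6) * kInverseOf3;
}
```

For example, a span of 192 × 1000 bytes shifts to 3000 and multiplies back to 1000 elements.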
Output Construction and Aliasing
Purpose
The output ScopedShapedBuffer is allocated before the device runs, so the enqueue layer can write results directly into it. Donation lets an input buffer become an output buffer in place, avoiding an allocation and a copy.
Algorithm
// ---- Stage 3: allocate outputs, reusing donated input buffers ----
out_or = AllocateOutputMemoryWithInputReuse(                 // 0x1342ba00
    result_shape, alias_config,
    run_options->allocator(),
    args,                                                    // donor source
    run_options->stream(),
    run_options->host_to_device_stream())
if !out_or.ok():
    return out_or.status.AddSourceLocation(tpu_executable_interface.cc:228)
output = ScopedShapedBuffer(out_or)                          // line 344
// ---- Stage 4: re-wire donated input buffers into the output tree ----
if program->aliasing_table (a2+3488, count a2+3496):         // line 382
    for (param, output_index) in aliasing_table:             // 6-qword stride
        CHECK(param < args.size())                           // ...interface.cc:236
        in_entry = IndexTable::GetEntry(args[param], index)  // ...:242
        CHECK(!in_entry.AsDeviceAddress().is_null())         // ...:243
        CHECK(in_entry.is_owning /* offset */)               // ...:244
        aliased.push_back(in_entry)                          // donor address list
        buffers_to_release.push_back(param_index)            // uint32 vector
// ---- Stage 5: bookkeeping + dispatch ----
MarkToBeReleasedArguments(program, args[0], n, output)       // line 643
root = output.root_buffer()                                  // line 644
status = (*program.vtable[96])(                              // line 650 -> LoadProgramAndEnqueueToStream
    program, run_options,
    flat /*device addresses*/, n,
    aliased /*donor list*/, buffers_to_release,
    root.opaque())
if status.ok():
    result.set_result(move(output))                          // ScopedShapedBuffer into ExecutionOutput
else:
    result.status = status.AddSourceLocation(...:266)
return result                                                // ExecutionOutput
AllocateOutputMemoryWithInputReuse (0x1342ba00) walks the result Shape with ShapeUtil::ForEachMutableSubshapeHelper (callback at 0x1342dc00); for each leaf subshape it consults the HloInputOutputAliasConfig to decide whether to allocate fresh device memory through the DeviceAddressAllocator or to claim a donated input buffer. The result is a ScopedShapedBuffer — an owning ShapedBuffer whose leaf addresses are RAII-freed on the supplied stream unless explicitly release()d into the ExecutionOutput.
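The per-leaf decision can be sketched as below. This is a simplified model, not the binary's logic: the alias config is flattened to a hypothetical map from output leaf index to donating parameter index, BufferSketch and AllocatorSketch are invented types, and the allocation and the aliasing fixup are collapsed into one loop.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct BufferSketch { int id; bool owned; };  // hypothetical device buffer handle

// Hypothetical flattened alias config: output leaf index -> donating parameter index.
using AliasSketch = std::map<std::size_t, std::size_t>;

struct AllocatorSketch {
  int next_id = 1000;
  int allocations = 0;
  BufferSketch Allocate() { ++allocations; return BufferSketch{next_id++, true}; }
};

std::vector<BufferSketch> AllocateOutputsWithReuse(
    std::size_t num_output_leaves, const AliasSketch& aliases,
    std::vector<BufferSketch>& inputs, AllocatorSketch& allocator) {
  std::vector<BufferSketch> outputs;
  outputs.reserve(num_output_leaves);
  for (std::size_t leaf = 0; leaf < num_output_leaves; ++leaf) {
    auto it = aliases.find(leaf);
    if (it != aliases.end()) {
      BufferSketch& donor = inputs.at(it->second);  // throws if the parameter is missing
      assert(donor.owned);       // the donor must own its buffer to give it away
      outputs.push_back(donor);  // claim the donated buffer in place
      donor.owned = false;       // ownership now sits with the output
    } else {
      outputs.push_back(allocator.Allocate());  // no donation: fresh device memory
    }
  }
  return outputs;
}
```

With three output leaves and one alias entry, only two allocations are made and the donated input loses ownership.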
GOTCHA — the aliasing fixup re-reads the input buffer through IndexTable::GetEntry and asserts it != arguments[parameter].MutableBuffers()->end() (...interface.cc:242). A donation declared in the HLO input_output_alias_config for a parameter index that the caller did not actually pass (or passed without that leaf) is a fatal CHECK, not a graceful fallback. The donor must be present and owning (offset truthy, :244).
QUIRK — MarkToBeReleasedArguments runs before the device enqueue, not after. It records which argument buffers the program is allowed to consume so the caller's ExecutionInput destructors do not double-free buffers the executable now owns. The actual release is deferred to whoever holds the resulting ExecutionOutput. Reimplementing this after the enqueue would race the device against argument teardown.
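The ownership rule behind this ordering can be shown with a small RAII sketch. ScopedBufferSketch and DonateThenEnqueue are hypothetical; "freeing" is modeled as incrementing a counter, and the enqueue itself is left as a comment.

```cpp
#include <cassert>
#include <utility>

// Hypothetical owner: "frees" by bumping a counter unless ownership was released.
class ScopedBufferSketch {
 public:
  ScopedBufferSketch(int id, int* free_count) : id_(id), free_count_(free_count) {}
  ScopedBufferSketch(const ScopedBufferSketch&) = delete;
  ScopedBufferSketch& operator=(const ScopedBufferSketch&) = delete;
  ~ScopedBufferSketch() {
    if (id_ >= 0) ++*free_count_;
  }

  // Gives up ownership; the destructor becomes a no-op afterward.
  int release() { return std::exchange(id_, -1); }

 private:
  int id_;
  int* free_count_;
};

// Marks the donated input as released before the enqueue step would run,
// then lets the input go out of scope. Returns the id now owned by the output.
int DonateThenEnqueue(int buffer_id, int* free_count) {
  int output_owned_id = -1;
  {
    ScopedBufferSketch input(buffer_id, free_count);
    output_owned_id = input.release();  // bookkeeping first
    // ... the device enqueue would happen here ...
  }  // the input destructor runs but frees nothing
  return output_owned_id;
}
```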
The Dispatch into the Enqueue Layer
Purpose
The single hand-off from marshaling to device work. Everything above is target-independent buffer plumbing; everything below the +96 call is the DeepseaExecutable device backend.
What crosses the boundary
The vtable+96 call (LoadProgramAndEnqueueToStream) receives, in order: the ServiceExecutableRunOptions (carrying the stream, allocator, device assignment, run id, rng seed), the flat DeviceAddressBase argument array plus its count, the donor-buffer span, the buffers_to_release index vector, and the opaque pointer of the output root buffer. The leaf returns an absl::Status; on success the pre-allocated output is now populated on the device and is moved into the returned ExecutionOutput.
ExecuteAsyncOnStream                        LoadProgramAndEnqueueToStream (load-program-enqueue.md)
─────────────────────                       ──────────────────────────────────────────────────────
run_options                          ─────▶ stream / allocator / device_assignment / run_id
flat[] (DeviceAddressBase, n)        ─────▶ positional input device addresses
aliased[] (donated input addrs)      ─────▶ in-place output reuse
buffers_to_release[] (uint32)        ─────▶ donation index list
output.root_buffer().opaque()        ─────▶ output root device pointer (written by device)
                                     ◀───── absl::Status (ok ⇒ output is live)
Stream ordering, the on-device program load (tpu::System::LoadProgram / TpuCoreProgram), and the completion event are not established here — they live entirely below the +96 boundary. See the cross-references.
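The hand-off contract can be sketched as a bundle passed to a leaf that returns a status, with the pre-allocated output moved into the result only on success. All names here (EnqueueArgsSketch, OutputSketch, HandOffResult, HandOff) are hypothetical, addresses are plain integers, and an empty string plays the role of an OK status.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bundle, in the order the text lists the arguments.
struct EnqueueArgsSketch {
  int stream_id;                                  // stands in for the run options
  std::vector<std::uintptr_t> flat;               // positional input device addresses
  std::vector<std::uintptr_t> aliased;            // donated input addresses
  std::vector<std::uint32_t> buffers_to_release;  // donation index list
  std::uintptr_t output_root;                     // opaque pointer of the output root
};

struct OutputSketch { std::uintptr_t root; };
struct HandOffResult { std::unique_ptr<OutputSketch> output; std::string error; };

template <typename Leaf>
HandOffResult HandOff(std::unique_ptr<OutputSketch> output, EnqueueArgsSketch args, Leaf leaf) {
  args.output_root = output->root;
  std::string error = leaf(args);  // empty string means the enqueue succeeded
  HandOffResult result;
  if (error.empty()) {
    result.output = std::move(output);  // the pre-allocated output becomes the result
  } else {
    result.error = std::move(error);    // the output is dropped on failure
  }
  return result;
}
```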
Considerations
- Replica / partition handling.
ExecuteAsyncOnStreamis single-device per call. The device assignment arrives viarun_options->device_assignment()(deserialized from aDeviceAssignmentProtoin the C shim, interface line 236-245). Multi-replica fan-out is the caller's responsibility —LocalExecutable::RunAsyncresolves the stream and device ordinal from the run options before the wrapper call (RunAsync line 195). There is no replica loop inside the entry itself. - Error surface. Two error styles coexist. Buffer-shape and aliasing violations are fatal
CHECK/LogMessageFatal(the binary treats a malformed buffer tree as a programming error). Allocation failure and device-enqueue failure are returnedabsl::StatuswithAddSourceLocationImplstamps (:228for allocation,:266for enqueue) — these propagate to the caller as a failedExecutionOutput. - No StreamExecutor
Streamcreation. Despite the name, this path does not create streams.run_options->stream()is supplied by the caller (LocalExecutable::RunAsyncor the C shim'sc_run_opts->stream). The "async" is the StreamExecutor stream model, distinct from the PJRTtpu::Systemasync-value model documented under the adapter pages.
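The two error styles listed above can be contrasted in a short sketch. StatusSketch, AddSourceLocation, CheckOrDie and RunSketch are hypothetical, and the bracketed file:line suffix is an invented format for illustration; only the line numbers 219 and 228 come from the text.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

struct StatusSketch { bool ok; std::string message; };  // hypothetical status type

// Recoverable style: stamp a failed status with where it was observed.
StatusSketch AddSourceLocation(StatusSketch s, const std::string& file, int line) {
  if (!s.ok) s.message += " [" + file + ":" + std::to_string(line) + "]";
  return s;
}

// Fatal style: a violated invariant terminates the process (CHECK analogue).
inline void CheckOrDie(bool condition) {
  if (!condition) std::abort();
}

StatusSketch RunSketch(bool allocator_present, bool allocation_succeeds) {
  CheckOrDie(allocator_present);  // programming error: no recovery path (cf. :219)
  StatusSketch alloc{allocation_succeeds,
                     allocation_succeeds ? "" : "allocation failed"};
  if (!alloc.ok) {
    return AddSourceLocation(alloc, "tpu_executable_interface.cc", 228);
  }
  return StatusSketch{true, ""};
}
```

A caller can recover from the stamped status, whereas the fatal path never returns.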
Related Components
| Name | Relationship |
|---|---|
| xla::jellyfish::DeepseaExecutable::LoadProgramAndEnqueueToStream | The vtable+96 leaf this entry dispatches into — the device enqueue lower half |
| xla::Executable::ExecuteAsyncOnStreamWrapper | The profiled public C++ wrapper that calls the +24 virtual |
| xla::LocalExecutable::RunAsync | The LocalClient per-call driver that resolves stream/ordinal and calls the wrapper |
| TpuExecutable_ExecuteAsyncOnStream | The C-ABI shim that marshals C structs across the StreamExecutor boundary |
| xla::TpuClient (PJRT path) | The modern execution path; bypasses ExecuteAsyncOnStream entirely via tpu::System::Execute |
Cross-References
- Load Program and Enqueue — the vtable+96 leaf (LoadProgramAndEnqueueToStream); device program load and command-stream enqueue
- Stream Semantics — how run_options->stream() orders the enqueued work
- Completion Loop — the async completion event the enqueue layer produces
- Allocator Integration — the DeviceAddressAllocator that AllocateOutputMemoryWithInputReuse and the C shim build
- Host Callbacks — host-side infeed/outfeed and callback dispatch during execution
- PJRT Executable Execution — the modern PJRT_LoadedExecutable_Execute → CommonPjRtLoadedExecutable::Execute path this entry is the legacy counterpart of
- Runtime Overview — where the StreamExecutor execution path sits relative to the PJRT path