StreamExecutor → PJRT Adapter
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, libtpu_lts_20260413_b_RC00). Other wheels will differ.
Abstract
In upstream XLA, the bridge from a StreamExecutor LocalClient / Platform to the modern PjRtClient is xla::PjRtStreamExecutorClient (and the TPU-specific TfrtTpuClient): a PjRtClient subclass that holds a LocalDeviceState per device, each owning a stream_executor::StreamExecutor and a small pool of stream_executor::Streams, and forwards execution through Stream::ThenLaunch / Stream::ThenMemcpy. A reimplementer arriving at libtpu.so expecting to find that class will not find it. There is no xla::PjRtStreamExecutorClient in this binary — the symbol is undefined (confirmed: no defined PjRtStreamExecutorClient symbol exists; the only residual SE-PjRt symbol is PjRtStreamExecutorDeviceDescription, dead GPU-path code). The "StreamExecutor → PJRT adapter" is instead xla::TpuClient, derived from xla::CommonPjRtClient : xla::PjRtClient, and it sits directly on the TFRT-native TPU device runtime tpu::System through tsl::AsyncValue events — bypassing the legacy stream_executor::tpu::TpuExecutor / TpuStream / TpuTransferManager stack entirely.
This page documents the bridge layer: the GetTpuPjRtClient construction path that turns a parsed config into a wired TpuClient, the TpuClient : CommonPjRtClient : PjRtClient class shape and its 3-arg-then-vector TFRT constructor, the device-enumeration loop that wraps each tpu::TpuCoreLocation into an xla::TpuDevice, and how an SE Stream/Executor would have been bound under each PjRtDevice — except that here a TpuDevice binds a shared tpu::System* plus a TpuCoreLocation and a throttle Semaphore, not a per-device executor. The C-ABI surface above this (PJRT_Client_Create, the 208-byte PJRT_Client wrapper, the device/memory accessor slots) is owned by Client, Device & Topology; the SE platform registration and host interpreter below it by StreamExecutor & Host Interpreter; the per-execution hot path by Execute Async On Stream. This page owns the seam in the middle: GetTpuPjRtClient → TpuClient → TpuDevice-wraps-tpu::System.
For reimplementation, the contract is:
- The construction chain. GetTpuPjRtClient(config) → one-shot HostContext build → TpuStatesManager / TpuSystemState singleton → GetTpuPjRtClientInternal device enumeration → make_unique<TpuClient>(…) → optional MegaScale / Tf decorator → CreateWrapperClient.
- The class hierarchy and ctor shape. TpuClient : CommonPjRtClient : PjRtClient; the eight-argument TFRT constructor that takes the device vector, host context, host allocator, system state, and states manager.
- The device-wrapping rule. One xla::TpuDevice per tpu::TpuCoreLocation from TpuTopology::logical_devices, each holding a shared tpu::System* (not a private executor), bound to the client by a post-construction SetClient back-edge that also installs the AsyncWorkRunner.
- The SE→TPU concept mapping. Which SE abstraction each TPU-runtime object replaces, so a reader carrying the upstream model can translate it forward.
| Bridge entry | xla::GetTpuPjRtClient(const PjRtTpuClientConfig&) @ 0xF8008C0 (1226 B) |
| Device enumeration | xla::GetTpuPjRtClientInternal(…) @ 0xF800DA0 |
| Adapter client class | xla::TpuClient — ctor 0xF801980, vtable 0x2177B598, typeinfo 0x2177BA40 |
| Client base framework | xla::CommonPjRtClient — vtable 0x2178A108, typeinfo 0x21789EB0 |
| Abstract PJRT base | xla::PjRtClient — vtable 0x21CA9C98, typeinfo 0x21CA9E28 |
| Adapter device class | xla::TpuDevice — ctor 0xF7FDC40 (object 0x1E0 = 480 B), vtable 0x2177B4D0 |
| Device→client bind | xla::TpuDevice::SetClient(TpuClient*, AsyncWorkRunner*) @ 0xF7FE520 |
| Device runtime backing | tpu::System — Initialize @ 0x1D0AE420 (the SE StreamExecutor analogue) |
| C-ABI wrap | pjrt::CreateWrapperClient(unique_ptr<PjRtClient>) @ 0xF872060 |
The SE → TPU-Runtime Concept Mapping
Purpose
The fastest way to read this layer is as a delta from the upstream PjRtStreamExecutorClient model the reader already owns. Every StreamExecutor abstraction has a TPU-runtime counterpart, but the shape differs: SE is a synchronous-stream model (a FIFO of operations per Stream, with Events to cross-stream-order); the TPU runtime is an async-value DAG model (a graph of tpu::TpuEvent dependencies issued through tpu::TpuEventIssuer sequence points). The adapter does not translate one into the other at runtime — it never constructs an SE object at all. It is built natively against the async model.
The Mapping Table
This is the single most important table for a reader carrying the upstream model. Each row is "what you would have looked for" → "what libtpu uses instead" → "the underlying TPU-runtime object."
| Upstream SE / PjRt concept | libtpu adapter object | TPU-runtime backing |
|---|---|---|
| PjRtStreamExecutorClient / TfrtTpuClient | xla::TpuClient (: CommonPjRtClient : PjRtClient) | tpu::System (singleton, via TpuSystemState / TpuStatesManager) |
| PjRtStreamExecutorDevice | xla::TpuDevice | tpu::System* + tpu::TpuCoreLocation + xla::Semaphore |
| LocalDeviceState + StreamExecutor (per device) | (none — no per-device executor) | shared tpu::System + AsyncWorkRunner + tfrt::ConcurrentWorkQueue |
| stream_executor::StreamExecutor | (none) | tpu::System (one, shared across all devices) |
| Stream::ThenLaunch | (no Stream) | tpu::System::Execute(AsyncValueRef<ProgramHandle>, …) @ 0x1D0B33E0 |
| Stream::ThenMemcpy | (no Stream) | tpu::System::TransferTo/FromDevice @ 0x1D0AFA20 / 0x1D0B0160 |
| stream_executor::Event | xla::TpuTrackedDeviceEvent | tpu::TpuEvent (a tsl::AsyncValueRef) |
| Stream ordering FIFO | (no FIFO) | tpu::TpuEventIssuer sequence points + tsl::AsyncValue deps |
| DeviceMemoryBase | tpu::TpuSharedMemoryLocation | TPU HBM address |
| PjRtStreamExecutorBuffer | xla::TpuRawBuffer (: CommonPjRtRawBuffer) | tpu::TpuBuffer at a TpuSharedMemoryLocation |
| TransferManager | (inline in TpuRawBuffer / TpuClient) | tpu::System::TransferTo/FromDevice |
| SE Compiler (registered via factory) | xla::TpuCompiler (: PjRtCompiler) | jellyfish JIT (TpuJitCompileHloWithOptions) |
QUIRK — the empty cells are the point of the table. A reimplementer who builds a LocalDeviceState-style object per device, each owning a StreamExecutor and a stream pool, will have reproduced upstream XLA, not libtpu. The TPU adapter shares one tpu::System across every TpuDevice; there is no per-device executor and no Stream at all. The per-device state that does exist is small: a TpuCoreLocation (which physical core this device maps to) and a Semaphore (an in-flight-launch throttle). Ordering that an SE backend would get from a per-stream FIFO comes instead from a process-wide TpuEventIssuer DAG.
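The shape described in the table can be sketched in a few lines of C++. This is a minimal illustration, not the layout recovered from the binary: every type and function name below lives in a hypothetical `sketch` namespace, and the two-cores-per-chip numbering is an arbitrary assumption made only so the devices differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical stand-in for the one shared device runtime (tpu::System in the text).
struct System {
  int launches = 0;
  // A launch names its target core; there is no per-device executor to call into.
  void Execute(std::int32_t core_index) {
    (void)core_index;
    ++launches;
  }
};

// Hypothetical stand-in for TpuCoreLocation: just enough to tell cores apart.
struct CoreLocation {
  std::int32_t chip;
  std::int32_t core;
};

// Per-device state is small: a borrowed runtime pointer plus a core location.
struct Device {
  System* system;         // the same pointer in every Device
  CoreLocation location;  // the only thing that distinguishes devices
};

// Builds one Device per core, all pointing at the same System.
// ASSUMPTION: two cores per chip, purely for illustration.
inline std::vector<Device> MakeDevices(System* system, int num_cores) {
  assert(system != nullptr);
  std::vector<Device> devices;
  for (int i = 0; i < num_cores; ++i) {
    devices.push_back(Device{system, CoreLocation{i / 2, i % 2}});
  }
  return devices;
}

}  // namespace sketch
```

Every launch funnels through the single `System` object regardless of which `Device` issued it, which is the property a per-device `LocalDeviceState` design would not have.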
Why the SE Stack Is Absent, Not Adapted
The legacy SE TPU stack does exist in the binary — stream_executor::tpu::TpuExecutor, tensorflow::tpu::TpuStream, tensorflow::tpu::TpuTransferManager, and the Tpu*_* / TpuExecutor_* / TpuStream_* C-ABI exports — but it is a parallel path, not a layer underneath PJRT. It backs xla::LocalClient / xla::Service and the TF-TPU op kernels, reached through the SE TpuPlatform registered at module init (RegisterTpuPlatform → PlatformManager, documented on StreamExecutor & Host Interpreter). The PJRT TpuClient adapter never touches it: across the entire TpuClient / TpuDevice / TpuRawBuffer / TpuLoadedExecutable code range (0xF800DA0–0xF816EC0) there is no reference to ExecutorApiFn, SE_StreamExecutor, or TpuExecutor. The two host abstractions share the same TPU driver core but use different host-side scheduling: TFRT async-values (PJRT) versus SE Streams (legacy). This page documents only the PJRT branch.
GetTpuPjRtClient — The Construction Bridge
Purpose
GetTpuPjRtClient is where the C-ABI options layer hands off to the runtime. By the time it is called, PJRT_Client_Create has already validated the option map and parsed it into a typed PjRtTpuClientConfig (that path is owned by Client, Device & Topology). GetTpuPjRtClient brings up the process-wide infrastructure — the TFRT HostContext and the tpu::System singleton — exactly once, then enumerates devices and constructs the TpuClient. It returns a StatusOr<unique_ptr<TpuClient>> that the C-ABI layer optionally decorates and then wraps.
Entry Point
pjrt::tpu_plugin::PJRT_Client_Create (0xE6A8840) ── slot 15; owns option parse (see client-and-device.md)
 └─ xla::GetTpuPjRtClient (0xF8008C0) ── THIS PAGE: brings up HostContext + tpu::System, builds TpuClient
     ├─ xla::CreateDefaultHostContext (lambda, __cxa_guard one-shot)
     │   ├─ tfrt::CreateMallocAllocator
     │   └─ tfrt::CreateMultiThreadedWorkQueue(DefaultThreadPoolSize(), …)
     ├─ xla::GetSingletonTpuStatesManager(bool) (0xF958360) ── mutex+guard singleton
     │   └─ TpuStatesManager::GetOrCreateTpuSystemState (0xF956E40)
     │       └─ xla::CreateTpuSystemState (0xF95A4A0) ── lazily brings up tpu::System
     └─ xla::GetTpuPjRtClientInternal(config, host_ctx, system_state, states_mgr) (0xF800DA0)
         ├─ tpu::System::topology() / hal_location()
         ├─ for each TpuCoreLocation in TpuTopology::logical_devices:
         │   ├─ xla::TpuDeviceDescription(core_loc, topology, "TPU", "TpuDevice")
         │   └─ operator new(0x1E0) + xla::TpuDevice::TpuDevice(desc, name, ordinal, tpu::System*)
         ├─ make_unique<xla::TpuClient>(config, process_index, platform_version, devices,
         │                              host_ctx, host_allocator, system_state, states_mgr)
         └─ for each device: TpuDevice::SetClient(TpuClient*, AsyncWorkRunner*) (0xF7FE520)
Algorithm
function GetTpuPjRtClient(config): // 0xF8008C0
    // (1) One-shot process-wide TFRT HostContext. Guarded by the
    //     CreateDefaultHostContext::kInstance __cxa_guard so every client
    //     in the process shares one HostContext / thread pool.
    if !kInstance: // guard var, lines 70-83
        if cxa_guard_acquire(&kInstance_guard):
            alloc = tfrt::CreateMallocAllocator() // line 85
            queue = tfrt::CreateMultiThreadedWorkQueue( // line 92
                DefaultThreadPoolSize(config), …) // thread count from config
            kInstance = make_shared<tfrt::HostContext>(alloc, queue) // line 94
            cxa_guard_release(&kInstance_guard)
    host_ctx = kInstance // shared_ptr; refcount bumped (line 116)

    // (2) The tpu::System singleton, reached through the states manager.
    //     GetOrCreateTpuSystemState lazily runs CreateTpuSystemState the
    //     first time, which initialises tpu::System (Initialize @ 0x1D0AE420).
    states_mgr = GetSingletonTpuStatesManager(use_global_tpu_system) // 0xF958360, line 212
    system_state = states_mgr.GetOrCreateTpuSystemState(opts) // 0xF956E40, line 214

    // (3) Tail into device enumeration + client construction.
    //     The 'a6' boolean selects an own-vs-borrow path for system_state.
    return GetTpuPjRtClientInternal(config, host_ctx, // 0xF800DA0, lines 179/236
                                    system_state, states_mgr)
NOTE — the HostContext is a process-wide singleton, not per-client. The __cxa_guard over CreateDefaultHostContext::kInstance means a second PJRT_Client_Create in the same process reuses the same malloc allocator, the same multi-threaded work queue, and therefore the same AsyncWorkRunner thread pool. A reimplementer must not allocate a fresh thread pool per client or the throttle/scheduling assumptions in the execute path break.
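The one-shot behaviour maps naturally onto a C++ function-local static, which compilers targeting the Itanium C++ ABI guard with `__cxa_guard_acquire` / `__cxa_guard_release`. The sketch below is an assumption about how such a singleton could be written, not a reconstruction of `CreateDefaultHostContext`; `HostContext`, `BuildCount`, and `GetDefaultHostContext` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

namespace sketch {

// Hypothetical stand-in for tfrt::HostContext; it only records the pool size.
struct HostContext {
  int num_threads;
};

// Counts how many times the context was actually built (illustration only).
inline int& BuildCount() {
  static int count = 0;
  return count;
}

// Function-local static: initialised exactly once, thread-safely, on first call.
// Later calls ignore num_threads and hand back the already-built instance.
inline std::shared_ptr<HostContext> GetDefaultHostContext(int num_threads) {
  assert(num_threads > 0);
  static std::shared_ptr<HostContext> instance = [num_threads] {
    ++BuildCount();
    return std::make_shared<HostContext>(HostContext{num_threads});
  }();
  return instance;  // copies the shared_ptr, bumping the refcount
}

}  // namespace sketch
```

A consequence worth noticing in this sketch is that the first caller's thread count wins: a second client asking for a different pool size still receives the original context.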
Device Enumeration — GetTpuPjRtClientInternal
GetTpuPjRtClientInternal (0xF800DA0) is the loop that turns physical TPU cores into PjRt devices. It is the closest analogue to upstream's "build a LocalDeviceState per device," except no executor or stream pool is created — only a description and a TpuDevice, each pointed at the one shared tpu::System.
function GetTpuPjRtClientInternal(config, host_ctx, system_state, states_mgr): // 0xF800DA0
    system = system_state->system() // tpu::System* (line 136: j+64)
    topology = system->topology() // tpu::TpuTopology* (line 138)
    hal_loc = system->hal_location() // line 132
    devices = vector<unique_ptr<TpuDevice>>{}
    cores = topology->logical_devices(TpuCoreType) // line 140; stride 56 bytes (line 148/323)
    for core_loc in cores: // each TpuCoreLocation, 56 B apart
        // (a) static identity: id/process/kind/attributes/memory
        desc = TpuDeviceDescription(core_loc, topology, // line 155
                                    type_prefix="TPU", name="TpuDevice")
        // (b) the PjRt device object — 0x1E0 (480) bytes, holds a SHARED system ptr
        dev_mem = operator new(0x1E0) // line 156
        device = new (dev_mem) TpuDevice(desc, name, ordinal, system) // ctor 0xF7FDC40 run in place on dev_mem (line 206)
        devices.push_back(unique_ptr<TpuDevice>{device}) // grows the vector (line 260 = length-error guard)

    // record local-device-id -> TpuSharedMemoryLocation (LocalSharedMemory, line 374)

    // (c) build the client; ctor takes ownership of the device vector
    client = make_unique<TpuClient>( // line 545
        config, process_index, platform_version,
        devices, host_ctx, host_allocator,
        system_state, states_mgr)

    // (d) close the back-edge: each device learns its client + work runner
    for device in client->devices():
        device->SetClient(client.get(), client->async_work_runner()) // 0xF7FE520
    return client
GOTCHA — the device→client link is a post-construction back-edge (SetClient), not a constructor argument. The TpuClient ctor receives the device vector, but each TpuDevice does not know its owning client (or its AsyncWorkRunner) until SetClient runs afterward. A reimplementer who tries to use device->client() from inside the TpuClient constructor will read an uninitialised pointer. The two-phase wiring exists because the AsyncWorkRunner is a member of the fully-constructed client; it cannot be handed to the devices before the client object exists.
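The two-phase wiring can be illustrated with a small C++14 sketch. All class names here are hypothetical stand-ins in a `sketch` namespace. One deliberate difference from the text: the sketch initialises the back-pointers to `nullptr` so the "not yet wired" state is observable, whereas the text warns that the real field is uninitialised before `SetClient`.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

struct WorkRunner {};  // hypothetical stand-in for AsyncWorkRunner
class Client;

class Device {
 public:
  // Phase 2: the back-edge is installed only after the client exists.
  void SetClient(Client* client, WorkRunner* runner) {
    client_ = client;
    runner_ = runner;
  }
  Client* client() const { return client_; }
  WorkRunner* runner() const { return runner_; }

 private:
  Client* client_ = nullptr;  // null until SetClient runs (sketch-only choice)
  WorkRunner* runner_ = nullptr;
};

class Client {
 public:
  // Phase 1: take ownership of the devices; device->client() is not valid yet.
  explicit Client(std::vector<std::unique_ptr<Device>> devices)
      : devices_(std::move(devices)) {}
  const std::vector<std::unique_ptr<Device>>& devices() const { return devices_; }
  WorkRunner* work_runner() { return &runner_; }

 private:
  std::vector<std::unique_ptr<Device>> devices_;
  WorkRunner runner_;  // a member of the client, so it cannot predate it
};

inline std::unique_ptr<Client> BuildClient(int num_devices) {
  assert(num_devices >= 0);
  std::vector<std::unique_ptr<Device>> devices;
  for (int i = 0; i < num_devices; ++i) {
    devices.push_back(std::make_unique<Device>());
  }
  auto client = std::make_unique<Client>(std::move(devices));
  for (const auto& device : client->devices()) {
    device->SetClient(client.get(), client->work_runner());
  }
  return client;
}

}  // namespace sketch
```

Because the runner is a member of `Client`, there is no address to hand to a `Device` until the `Client` constructor has finished, which is the constraint that forces the second loop.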
QUIRK — the TpuCoreLocation array is walked with a hard-coded 56-byte stride (lines 148, 323). TpuTopology::logical_devices returns a contiguous block, and the loop advances v18 += 56 rather than calling an iterator. That 56-byte TpuCoreLocation size is an ABI fact a reimplementer of the topology side must match exactly, or the device-enumeration loop reads garbage cores.
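A reimplementer can pin the stride at compile time. The sketch below shows one way to do that and to walk a contiguous block by raw byte stride, as the decompiled loop does. The field layout of `CoreLocation` is entirely hypothetical (seven 64-bit slots chosen only to reach 56 bytes); the actual members of `TpuCoreLocation` are not known from this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace sketch {

// HYPOTHETICAL 56-byte record; only the total size mirrors the text.
struct CoreLocation {
  std::int64_t fields[7];
};
static_assert(sizeof(CoreLocation) == 56, "stride must match the producer's layout");

constexpr std::size_t kStride = 56;

// Walks a contiguous byte block with a fixed stride (ptr += 56) and
// returns the first 8-byte field of each record.
inline std::vector<std::int64_t> FirstFieldOfEach(const unsigned char* base,
                                                  std::size_t count) {
  assert(base != nullptr || count == 0);
  std::vector<std::int64_t> out;
  const unsigned char* p = base;
  for (std::size_t i = 0; i < count; ++i, p += kStride) {
    std::int64_t first;
    std::memcpy(&first, p, sizeof(first));  // avoids aliasing assumptions
    out.push_back(first);
  }
  return out;
}

}  // namespace sketch
```

The `static_assert` turns a silent layout drift into a build failure, which is cheaper than debugging an enumeration loop that reads misaligned records.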
After Internal Returns — Decorators and the C Wrapper
Back in PJRT_Client_Create, the bare unique_ptr<TpuClient> is optionally wrapped:
- MegaScalePjRtClient::CreateMegaScalePjRtClient (0xE6EA680) when multi-slice is active (FLAGS_megascale_port >= 0 and not skipped).
- TfPjRtClient::CreateTfPjRtClient (0x108524E0) when use_tf_pjrt_client == 1 (the default).
Then CreateWrapperClient (0xF872060) boxes the outermost xla::PjRtClient* into the 208-byte C-ABI PJRT_Client. These three steps and the wrapper layout are owned by Client, Device & Topology; this page stops at the unique_ptr<TpuClient> that GetTpuPjRtClient produces.
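The decorate-then-wrap step is an ordinary wrapper pattern over an owning pointer. The following C++14 sketch shows the idea with hypothetical classes; the stacking order (MegaScale first, then Tf outermost) is an assumption taken from the order in which the two decorators are listed, not something traced in the binary.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical minimal client interface.
struct Client {
  virtual ~Client() = default;
  virtual std::string Name() const = 0;
};

struct BareClient : Client {
  std::string Name() const override { return "tpu"; }
};

// A decorator owns the client it wraps and forwards to it.
class Decorator : public Client {
 public:
  Decorator(std::string tag, std::unique_ptr<Client> inner)
      : tag_(std::move(tag)), inner_(std::move(inner)) {}
  std::string Name() const override { return tag_ + "(" + inner_->Name() + ")"; }

 private:
  std::string tag_;
  std::unique_ptr<Client> inner_;
};

// Optionally stacks decorators; the caller always gets "the outermost client".
// ASSUMPTION: the multi-slice wrapper is applied before the TF wrapper.
inline std::unique_ptr<Client> Decorate(std::unique_ptr<Client> client,
                                        bool multi_slice, bool tf_wrapper) {
  assert(client != nullptr);
  if (multi_slice) {
    client = std::make_unique<Decorator>("megascale", std::move(client));
  }
  if (tf_wrapper) {
    client = std::make_unique<Decorator>("tf", std::move(client));
  }
  return client;
}

}  // namespace sketch
```

Since each layer is a wrapper rather than a subclass of the bare client, code holding the outermost pointer must unwrap to reach the adapter, which matches the text's remark that C-ABI slots virtual-call in "after unwrapping any MegaScale/Tf decorator".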
TpuClient — The Adapter Client Class
Purpose
xla::TpuClient is the concrete PjRtClient for TPU. It is the object every C-ABI client slot ultimately virtual-calls into (after unwrapping any MegaScale/Tf decorator and the PJRT_Client POD). It overrides roughly 75 of the CommonPjRtClient framework methods — buffer creation, executable load, device assignment, transfer plumbing — and owns the per-process runtime handles a StreamExecutor client would have split across LocalDeviceStates.
Class Hierarchy
xla::PjRtClient (abstract base; vtable 0x21CA9C98, typeinfo 0x21CA9E28)
▲
xla::CommonPjRtClient (async-value framework; vtable 0x2178A108, typeinfo 0x21789EB0)
│ BufferFromHostBuffer, DefineBuffer, PrepareArguments, CreateOutputs,
│ TrackFuture, ShouldRetryOnOom, GetTransposePlan, … (~79 methods)
▲
xla::TpuClient (the adapter; ctor 0xF801980, vtable 0x2177B598, typeinfo 0x2177BA40)
overrides ~75 methods; holds devices, HostContext, HostMemoryAllocator,
TpuSystemState, TpuStatesManager, Semaphore
Two optional decorators sit outside this hierarchy as wrappers, not subclasses:
| Decorator | Role | Engage condition | Symbol |
|---|---|---|---|
| xla::MegaScalePjRtClient | Multi-slice overlay over a unique_ptr<TpuClient> + MultiSliceConfig; forwards/augments compile+execute across slices | FLAGS_megascale_port >= 0, not skipped | CreateMegaScalePjRtClient @ 0xE6EA680; typeinfo 0x215FD068 |
| xla::TfPjRtClient | TF-thread-safety decorator (DestroyWrappedBuffersAndClient); owns wrapped buffers | use_tf_pjrt_client == 1 (default) | CreateTfPjRtClient @ 0x108524E0; typeinfo 0x217EFED0 |
xla::AbstractTpuClient (typeinfo 0x215FD080) is a thin base sharing topology helpers — e.g. GetSubsliceTopologyForCompilation (0xE6EE120) — shared between TpuClient and the MegaScale path.
The TFRT Constructor
The constructor signature is byte-confirmed from the demangled symbol at 0xF801980. It is the TFRT shape — async-value-native, taking pre-built infrastructure rather than constructing executors:
TpuClient::TpuClient( // 0xF801980
    const PjRtTpuClientConfig& config,
    int process_index,
    std::string platform_version,
    std::vector<unique_ptr<TpuDevice>> devices, // takes ownership of the enumerated devices
    std::shared_ptr<tfrt::HostContext> host_context, // the process-wide singleton
    std::unique_ptr<HostMemoryAllocator> host_allocator, // host-staging allocator
    std::unique_ptr<TpuSystemState, MaybeOwningDeleter> system_state, // may own or borrow
    std::shared_ptr<TpuStatesManager> states_manager)
NOTE — the MaybeOwningDeleter on the TpuSystemState is the hinge of single-process-many-clients behaviour. When use_global_tpu_system is set, the system state is borrowed from the global singleton (the deleter is a no-op); otherwise the client owns it. This is why the a6 boolean flows from GetTpuPjRtClient into GetTpuPjRtClientInternal — it selects the own-vs-borrow construction of the MaybeOwningDeleter.
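An own-or-borrow `unique_ptr` is straightforward to express with a stateful deleter. The sketch below is one plausible shape, under the assumption that the deleter carries a single ownership flag; the real `MaybeOwningDeleter` layout was not decoded, and `SystemState`, `Own`, and `Borrow` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

namespace sketch {

// Hypothetical state object that tracks how many instances are alive.
struct SystemState {
  explicit SystemState(int* live_counter) : live(live_counter) {
    assert(live != nullptr);
    ++*live;
  }
  ~SystemState() { --*live; }
  int* live;
};

// Stateful deleter: frees the object only when it was told the pointer is owned.
class MaybeOwningDeleter {
 public:
  explicit MaybeOwningDeleter(bool owns) : owns_(owns) {}
  void operator()(SystemState* state) const {
    if (owns_) delete state;
  }

 private:
  bool owns_;
};

using SystemStatePtr = std::unique_ptr<SystemState, MaybeOwningDeleter>;

// The client owns the state and destroys it with the pointer.
inline SystemStatePtr Own(SystemState* state) {
  return SystemStatePtr(state, MaybeOwningDeleter(true));
}

// The state belongs to a global singleton; destroying the pointer is a no-op.
inline SystemStatePtr Borrow(SystemState* state) {
  return SystemStatePtr(state, MaybeOwningDeleter(false));
}

}  // namespace sketch
```

Keeping the choice inside the deleter means the constructor signature is the same in both modes, so the client code never needs to branch on who owns the state.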
Object Field Layout — Not Fully Decoded
The constructor body's exact field stores were not dereferenced from the binary. From the ctor signature and the call sites, the object holds: the device vector, the HostContext shared_ptr, the HostMemoryAllocator, the TpuSystemState pointer (with its maybe-owning deleter), the TpuStatesManager shared_ptr, and a per-client Semaphore (the in-flight-computation throttle). The concrete byte offsets of these members are LOW confidence — a reimplementer needs a full decompile of 0xF801980 to place them. What is certain is the membership set and that async_work_runner() and devices() are accessible immediately after construction (they are read in GetTpuPjRtClientInternal's SetClient loop).
Function Map
| Function | Addr | Role |
|---|---|---|
| TpuClient::TpuClient (ctor) | 0xF801980 | Build the adapter from enumerated devices + infra |
| TpuClient::LookupDevice | 0xF8033A0 | global device id → PjRtDevice* |
| TpuClient::platform_name | 0xF816D20 | "tpu" |
| TpuClient::platform_id | 0xF816D00 | TPU platform id constant |
| TpuClient::process_index | 0xF816C40 | host process index |
| TpuClient::memory_spaces | 0xF816CE0 | HBM + pinned/unpinned host spaces |
| TpuClient::AllocateBuffer | 0xF7FC5A0 | device HBM alloc → tpu::AllocateBuffer |
| TpuClient::AllocateRawBuffer | 0xF7FB1E0 | host/CPU-space alloc dispatch |
| TpuClient::CompileAndLoad (XlaComputation) | 0xF804F20 | compile entry → jellyfish JIT |
| TpuClient::CompileAndLoad (MLIR) | 0xF8068E0 | MLIR compile entry |
| TpuClient::TrackFuture | 0xF7FAD60 | tsl::AsyncValue → PJRT_Event future |
| TpuClient::CreateProfiledFuture | 0xF7FAE80 | profiled completion future |
| TpuClient::CreateErrorEvent | 0xF808420 | tpu::TpuEvent set to error status |
| TpuClient::CreateDeviceEventSet | 0xF813D40 | mint wait/define event sets |
NOTE — the compile, execute, buffer, and event surfaces above are documented in depth on their owning pages (Executable & Execution, Buffer & Memory, Events & Async). They appear here only to show what the adapter client exposes; this page owns their wiring into the SE→PJRT bridge, not their slot-level details.
TpuDevice — Wrapping tpu::System Under Each PjRtDevice
Purpose
xla::TpuDevice is the PjRtDevice the C-ABI device slots wrap. It is the object that, in the upstream SE model, would have held a StreamExecutor and a stream pool. Here it holds no executor: it points at the shared tpu::System, names the one physical core it maps to, and carries a launch-throttle semaphore. This is the crux of the "SE → PJRT adapter" claim — the device-level binding is to an async runtime, not to a stream-based executor.
What a TpuDevice Holds
The TpuDevice object is 0x1E0 (480) bytes (the operator new(0x1E0) in the enumeration loop). Its members, recovered from the ctor signature (0xF7FDC40) and SetClient (0xF7FE520):
| Member | Source | Role |
|---|---|---|
| tpu::System* | ctor arg 4 | The shared device runtime — same pointer in every TpuDevice |
| tpu::TpuCoreLocation | from the enumeration loop's core_loc | The physical core this PjRt device maps to |
| xla::TpuDeviceDescription | ctor arg 1 | Static identity: id/process/kind/attributes/memory |
| xla::TpuClient* | SetClient arg 1 | Back-pointer to the owning client |
| xla::AsyncWorkRunner* | SetClient arg 2 | Host scheduler for this device's continuations |
| xla::Semaphore | constructed; fenced by ExecutablesStart/Complete | In-flight-computation throttle (bounded by max_inflight_computations) |
The tpu::System* being identical across all devices is the structural difference from SE. In upstream XLA, device[i]->local_device_state()->executor() returns a distinct StreamExecutor per device. Here, device[i] and device[j] share one tpu::System; what distinguishes them is the TpuCoreLocation, which tpu::System::Execute and LoadProgram consume to target the right core.
The SetClient Back-Edge
function TpuDevice::SetClient(client, work_runner): // 0xF7FE520
    this->client = client // back-pointer used by C-ABI device→client lookups
    this->work_runner = work_runner // the client's AsyncWorkRunner (host scheduler)
This runs once per device immediately after the TpuClient is constructed. It is the only way a TpuDevice learns its scheduler. The client() accessor that the C-ABI PJRT_Device slots use to translate an implementation device back to its wrapper (via the client's device_map) reads this field, so it is invalid before SetClient. That is why the construction order in GetTpuPjRtClientInternal (build client, then loop SetClient) is load-bearing.
The Launch Throttle — The Closest Thing to a Stream
TpuDevice carries a per-device xla::Semaphore bounded by max_inflight_computations (default 1, from the config). ExecutablesStart (0xF800300) acquires it before a launch; ExecutablesComplete (0xF800740) releases it when the launch's completion event fires. This is the functional stand-in for an SE stream's depth limit: it is what stops the host from queueing unboundedly many executions ahead of the device. It is not a FIFO — ordering between launches is enforced by the TpuEventIssuer sequence points, not by the semaphore. The semaphore only bounds concurrency.
QUIRK — with max_inflight_computations == 1 (the default), a single TpuDevice behaves much like a single SE stream: at most one execution in flight, the next blocked until the previous completes. But the resemblance is shallow — raising the knob makes the device accept multiple concurrent launches with no FIFO ordering between them, which an SE stream could never do. A reimplementer must treat the throttle as a concurrency cap, not as a serialising queue.
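The "cap, not queue" distinction is easy to see in a counting throttle. The sketch below uses a non-blocking `TryAcquire` built on `std::atomic` so the behaviour can be checked without threads; a real throttle would presumably block in the acquire path instead. `Throttle` is a hypothetical name and is not claimed to match `xla::Semaphore`.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

// Counting throttle: bounds how many launches are in flight.
// It records no arrival order, so it cannot serialise launches by itself.
class Throttle {
 public:
  explicit Throttle(int max_inflight) : available_(max_inflight) {
    assert(max_inflight > 0);
  }

  // Takes a slot if one is free (the role ExecutablesStart plays in the text,
  // minus the blocking). Returns false when the cap is reached.
  bool TryAcquire() {
    int current = available_.load();
    while (current > 0) {
      // On failure compare_exchange_weak reloads `current`, so the loop retries.
      if (available_.compare_exchange_weak(current, current - 1)) return true;
    }
    return false;
  }

  // Returns a slot (the role ExecutablesComplete plays in the text).
  void Release() { available_.fetch_add(1); }

 private:
  std::atomic<int> available_;
};

}  // namespace sketch
```

With a cap of 1 the second `TryAcquire` fails until `Release`, which looks like a single stream. With a cap of 2, two launches are admitted at once and nothing in the throttle says which finishes first; ordering has to come from elsewhere.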
Device Memory Spaces
Each TpuDevice attaches three memory spaces (the per-device AttachMemorySpace calls):
- xla::TpuHbmMemorySpace (kind "tpu_hbm") — device HBM.
- xla::PinnedHostMemorySpace / xla::UnpinnedHostMemorySpace — host staging.
The canonical kind constants are xla::kBuiltinTpuMemorySpaces / xla::kTpuHbmMemorySpaceKind. The slot-level memory accessors and the buffer objects that live in these spaces are owned by Buffer & Memory.
Function Map
| Function | Addr | Role |
|---|---|---|
| TpuDevice::TpuDevice (ctor) | 0xF7FDC40 | Build device from desc + name + ordinal + tpu::System* |
| TpuDevice::SetClient | 0xF7FE520 | Install client back-pointer + AsyncWorkRunner |
| TpuDevice::ExecutablesStart | 0xF800300 | Acquire the in-flight semaphore before launch |
| TpuDevice::ExecutablesComplete | 0xF800740 | Release the in-flight semaphore on completion |
NOTE — the TpuDeviceDescription (vtable 0x21787AC0, typeinfo 0x21787C38) is the PjRt device description carrying id / process_index / device_kind / Attributes / memory. The SE-flavoured PjRtStreamExecutorDeviceDescription (vtable 0x2177D950, typeinfo 0x2177D9B0) is GPU-only dead code in this binary — it is the only residual SE-PjRt symbol, and it is never reached on the TPU path. Do not confuse the two when reading the binary.
The Runtime Backing — tpu::System as the StreamExecutor Analogue
Purpose
tpu::System (Initialize @ 0x1D0AE420) is the object the whole adapter sits on — the TFRT-native TPU device runtime. It is what the page title calls "StreamExecutor," mapped forward: it is the surface that offers launch, transfer, allocate, and event primitives, but async-value-shaped instead of stream-shaped. One tpu::System is shared by every TpuClient and every TpuDevice in the process (it is reached through the TpuStatesManager / TpuSystemState singleton built in GetTpuPjRtClient).
The API Surface (the SE-Stream-equivalent operations)
| Operation | VA | SE equivalent |
|---|---|---|
| Initialize(shared_ptr<TpuTopology>, InitOptions, ConcurrentWorkQueue*) | 0x1D0AE420 | StreamExecutor::Init |
| Execute(AsyncValueRef<ProgramHandle>, ExecuteOptions, Span inputs, outputs, wait, define) | 0x1D0B33E0 | Stream::ThenLaunch |
| LoadProgram(TpuCoreLocation, shared_ptr<TpuCoreProgram>) | 0x1D0B2240 | StreamExecutor::GetKernel (load once) |
| TransferToDevice(AsyncValueRef<TpuBuffer>, Span<uint8>, opt<TpuSyncFlagOnChip>) | 0x1D0AFA20 | Stream::ThenMemcpy (H2D) |
| TransferFromDevice(Span<uint8>, AsyncValueRef<TpuBuffer>, opt<…>) | 0x1D0B0160 | Stream::ThenMemcpy (D2H) |
| AllocateHostBuffer / MakeTpuHostBuffer | 0x1D0AF180 / 0x1D0AF660 | host-pinned alloc |
| EnqueueInfeed / DequeueOutfeed | 0x1D0B5D00 / 0x1D0B5F00 | infeed/outfeed |
Execute takes the TpuCoreLocation (indirectly, via the ProgramHandle loaded for that core) — this is how one shared tpu::System services many TpuDevices without per-device executors. The internal hardware command-stream / ring layout behind Execute is private to tpu::System::Impl and the TPU driver core; it was not traced (it is a separate concern from this bridge layer).
NOTE — the device-side ordering primitive is tpu::TpuEventIssuer (AddDepsNoReserve, AggregateDeps, RunWhenDepsReady, NextSequencePoint, FulfillArgs, ChainScope). It is the sequence-point engine that orders Execute / Transfer / Allocate against each other — the functional replacement for an SE per-stream FIFO plus cross-stream Events, expressed as a DAG of tpu::TpuEvent (each a tsl::AsyncValueRef). Its exact sequencing algorithm was not byte-traced. The host-side scheduler that drives dependent continuations is xla::ThreadPoolAsyncWorkRunner (AsyncWorkRunner::ExecuteWhenReady), fed by the HostContext's tfrt::ConcurrentWorkQueue. Both are reached through TpuClient::async_work_runner().
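Since the issuer's real algorithm was not traced, the following is only a generic illustration of dependency-driven ordering, the general technique the note describes: work runs when all of its input events are ready and publishes a completion event of its own. It is single-threaded, C++14, and every name (`Event`, `RunWhenDepsReady` as a free function) is a hypothetical stand-in, not the `TpuEventIssuer` API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical event: becomes ready exactly once, then runs its waiters.
class Event {
 public:
  bool ready() const { return ready_; }

  // Runs fn now if already ready, otherwise when SetReady is called.
  void AndThen(std::function<void()> fn) {
    if (ready_) {
      fn();
      return;
    }
    waiters_.push_back(std::move(fn));
  }

  void SetReady() {
    assert(!ready_);
    ready_ = true;
    std::vector<std::function<void()>> waiters;
    waiters.swap(waiters_);
    for (auto& fn : waiters) fn();
  }

 private:
  bool ready_ = false;
  std::vector<std::function<void()>> waiters_;
};

using EventRef = std::shared_ptr<Event>;

// Runs `work` once every dependency is ready; returns its completion event.
inline EventRef RunWhenDepsReady(const std::vector<EventRef>& deps,
                                 std::function<void()> work) {
  auto done = std::make_shared<Event>();
  auto remaining = std::make_shared<std::size_t>(deps.size());
  auto fire = [done, work = std::move(work)] {
    work();
    done->SetReady();
  };
  if (deps.empty()) {
    fire();
    return done;
  }
  for (const auto& dep : deps) {
    dep->AndThen([remaining, fire] {
      if (--*remaining == 0) fire();
    });
  }
  return done;
}

}  // namespace sketch
```

Ordering here falls out of the graph edges alone: two pieces of work with no path between them may run in either order, which is exactly what distinguishes this model from a per-stream FIFO.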
Considerations for a Reimplementer
- Do not build a LocalDeviceState. The single most likely mistake is to reproduce upstream XLA's per-device StreamExecutor + stream pool. The TPU adapter has none. Build one shared tpu::System, hand its pointer to every TpuDevice, and distinguish devices by TpuCoreLocation alone.
- Singletons are process-wide, not per-client. Both the HostContext (via CreateDefaultHostContext::kInstance) and the tpu::System (via TpuStatesManager) are __cxa_guard / mutex one-shots. A second client in the process must reuse them, not rebuild them. The MaybeOwningDeleter on TpuSystemState encodes the borrow.
- The device↔client wiring is two-phase. Construct the client with the device vector, then loop SetClient to install the back-pointer and the AsyncWorkRunner. Devices are not fully usable until that loop runs.
- Concurrency comes from a semaphore, ordering from an event DAG. The per-device Semaphore caps in-flight launches; TpuEventIssuer sequence points order them. These are orthogonal — conflating the throttle with a FIFO will produce wrong ordering at max_inflight_computations > 1.
- The legacy SE stack is a decoy. TpuExecutor / TpuStream / TpuTransferManager exist in the same binary but serve LocalClient / TF kernels, never PJRT. A reimplementer of the PJRT path should ignore every ExecutorApiFn / SE_* symbol.
Related Components
| Component | Relationship |
|---|---|
| PJRT_Client_Create (slot 15) | Calls GetTpuPjRtClient; owns the option parse and the C-ABI wrapper above this bridge |
| xla::CommonPjRtClient | The async-value framework TpuClient derives from; supplies buffer/executable/event plumbing |
| MegaScalePjRtClient / TfPjRtClient | Optional decorators stacked on the bare TpuClient by PJRT_Client_Create |
| tpu::System | The shared device runtime every TpuDevice points at; the SE-StreamExecutor analogue |
| tpu::TpuEventIssuer | Device-side sequence-point engine; replaces SE per-stream FIFO + events |
| Legacy SE TpuPlatform | The parallel, non-PJRT device layer; never referenced by TpuClient |
Cross-References
- Client, Device & Topology — owns PJRT_Client_Create, the option-kv ingest, and the 208-byte PJRT_Client wrapper that boxes the TpuClient this page builds
- StreamExecutor & Host Interpreter — the legacy SE TpuPlatform registration and host-interpreter path that runs parallel to (never under) this adapter
- Executable & Execution — PJRT_LoadedExecutable_Execute and the TpuExecutable / TpuLoadedExecutable / TpuRawLoadedExecutable framework the adapter client loads programs into
- Execute Async On Stream — the per-execution path down to tpu::System::Execute, the runtime side of the launch primitive this page maps from Stream::ThenLaunch
- Buffer & Memory — TpuRawBuffer over tpu::TpuBuffer, the memory spaces each TpuDevice attaches, and the allocator glue
- Events & Async — PJRT_Event ↔ tsl::Future<void>, the TpuTrackedDeviceEvent / TpuEventIssuer model this page references for ordering
- Overview — the PJRT plugin entry, extension chain, and GetPjrtApi population