StreamExecutor → PJRT Adapter

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, libtpu_lts_20260413_b_RC00). Other wheels will differ.

Abstract

In upstream XLA, the bridge from a StreamExecutor LocalClient / Platform to the modern PjRtClient is xla::PjRtStreamExecutorClient (and the TPU-specific TfrtTpuClient): a PjRtClient subclass that holds a LocalDeviceState per device, each owning a stream_executor::StreamExecutor and a small pool of stream_executor::Streams, and forwards execution through Stream::ThenLaunch / Stream::ThenMemcpy. A reimplementer arriving at libtpu.so expecting to find that class will not find it. There is no xla::PjRtStreamExecutorClient in this binary — the symbol is undefined (confirmed: no defined PjRtStreamExecutorClient symbol exists; the only residual SE-PjRt symbol is PjRtStreamExecutorDeviceDescription, dead GPU-path code). The "StreamExecutor → PJRT adapter" is instead xla::TpuClient, derived from xla::CommonPjRtClient : xla::PjRtClient, and it sits directly on the TFRT-native TPU device runtime tpu::System through tsl::AsyncValue events — bypassing the legacy stream_executor::tpu::TpuExecutor / TpuStream / TpuTransferManager stack entirely.

This page documents the bridge layer: the GetTpuPjRtClient construction path that turns a parsed config into a wired TpuClient; the TpuClient : CommonPjRtClient : PjRtClient class shape and its eight-argument TFRT constructor (three leading arguments, then the device vector and the infrastructure handles); the device-enumeration loop that wraps each tpu::TpuCoreLocation into an xla::TpuDevice; and what takes the place of the SE Stream/Executor that upstream binds under each PjRtDevice. Here a TpuDevice binds a shared tpu::System* plus a TpuCoreLocation and a throttle Semaphore, not a per-device executor. The C-ABI surface above this (PJRT_Client_Create, the 208-byte PJRT_Client wrapper, the device/memory accessor slots) is owned by Client, Device & Topology; the SE platform registration and host interpreter below it by StreamExecutor & Host Interpreter; the per-execution hot path by Execute Async On Stream. This page owns the seam in the middle: GetTpuPjRtClient → TpuClient → TpuDevice-wraps-tpu::System.

For reimplementation, the contract is:

  • The construction chain. GetTpuPjRtClient(config) → one-shot HostContext build → TpuStatesManager / TpuSystemState singleton → GetTpuPjRtClientInternal device enumeration → make_unique<TpuClient>(…) → optional MegaScale / Tf decorator → CreateWrapperClient.
  • The class hierarchy and ctor shape. TpuClient : CommonPjRtClient : PjRtClient; the eight-argument TFRT constructor that takes the device vector, host context, host allocator, system state, and states manager.
  • The device-wrapping rule. One xla::TpuDevice per tpu::TpuCoreLocation from TpuTopology::logical_devices, each holding a shared tpu::System* (not a private executor), bound to the client by a post-construction SetClient back-edge that also installs the AsyncWorkRunner.
  • The SE→TPU concept mapping. Which SE abstraction each TPU-runtime object replaces, so a reader carrying the upstream model can translate it forward.

| | |
|---|---|
| Bridge entry | xla::GetTpuPjRtClient(const PjRtTpuClientConfig&) @ 0xF8008C0 (1226 B) |
| Device enumeration | xla::GetTpuPjRtClientInternal(…) @ 0xF800DA0 |
| Adapter client class | xla::TpuClient (ctor 0xF801980, vtable 0x2177B598, typeinfo 0x2177BA40) |
| Client base framework | xla::CommonPjRtClient (vtable 0x2178A108, typeinfo 0x21789EB0) |
| Abstract PJRT base | xla::PjRtClient (vtable 0x21CA9C98, typeinfo 0x21CA9E28) |
| Adapter device class | xla::TpuDevice (ctor 0xF7FDC40, object 0x1E0 = 480 B, vtable 0x2177B4D0) |
| Device→client bind | xla::TpuDevice::SetClient(TpuClient*, AsyncWorkRunner*) @ 0xF7FE520 |
| Device runtime backing | tpu::System (Initialize @ 0x1D0AE420; the SE StreamExecutor analogue) |
| C-ABI wrap | pjrt::CreateWrapperClient(unique_ptr<PjRtClient>) @ 0xF872060 |

The SE → TPU-Runtime Concept Mapping

Purpose

The fastest way to read this layer is as a delta from the upstream PjRtStreamExecutorClient model the reader already owns. Every StreamExecutor abstraction has a TPU-runtime counterpart, but the shape differs: SE is a synchronous-stream model (a FIFO of operations per Stream, with Events to cross-stream-order); the TPU runtime is an async-value DAG model (a graph of tpu::TpuEvent dependencies issued through tpu::TpuEventIssuer sequence points). The adapter does not translate one into the other at runtime — it never constructs an SE object at all. It is built natively against the async model.
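
The difference between the two models can be made concrete with a toy. The sketch below is not libtpu code: ToyEvent, RunWhenReady, and the operation names are hypothetical stand-ins for tpu::TpuEvent and the issuer's dependency tracking, and everything runs on one thread. It only illustrates the idea that an operation runs once its declared dependencies are fulfilled, independent of the order in which it was submitted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an async-value event: fulfilled once, then it
// runs every continuation that was registered on it.
struct ToyEvent {
  bool ready = false;
  std::vector<std::function<void()>> waiters;

  void Fulfill() {
    ready = true;
    auto pending = std::move(waiters);
    waiters.clear();
    for (auto& w : pending) w();
  }
};

using EventRef = std::shared_ptr<ToyEvent>;

// Runs `fn` once every event in `deps` is ready and returns the event that
// marks fn's completion. Assumes each dependency appears at most once.
inline EventRef RunWhenReady(const std::vector<EventRef>& deps,
                             std::function<void()> fn) {
  auto done = std::make_shared<ToyEvent>();
  auto remaining = std::make_shared<std::size_t>(0);
  for (const auto& d : deps) {
    if (!d->ready) ++*remaining;
  }
  auto fire = [done, fn = std::move(fn)] { fn(); done->Fulfill(); };
  if (*remaining == 0) {
    fire();
    return done;
  }
  for (const auto& d : deps) {
    if (d->ready) continue;
    d->waiters.push_back([remaining, fire] {
      if (--*remaining == 0) fire();
    });
  }
  return done;
}

// Submits "execute" before its input transfer has completed. The dependency
// graph, not the submission order, decides when each operation runs.
inline std::vector<std::string> DemoDagOrder() {
  std::vector<std::string> log;
  auto h2d_done = std::make_shared<ToyEvent>();  // a transfer still pending
  auto exec_done = RunWhenReady({h2d_done}, [&] { log.push_back("execute"); });
  RunWhenReady({}, [&] { log.push_back("independent"); });  // no deps: runs now
  log.push_back("h2d");
  h2d_done->Fulfill();  // unblocks "execute"
  RunWhenReady({exec_done}, [&] { log.push_back("d2h"); });
  return log;
}
```

In this toy, the operation with no dependencies runs immediately even though it was submitted after "execute"; a single FIFO stream would have kept it behind the blocked launch.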

The Mapping Table

This is the single most important table for a reader carrying the upstream model. Each row is "what you would have looked for" → "what libtpu uses instead" → "the underlying TPU-runtime object."

| Upstream SE / PjRt concept | libtpu adapter object | TPU-runtime backing |
|---|---|---|
| PjRtStreamExecutorClient / TfrtTpuClient | xla::TpuClient (: CommonPjRtClient : PjRtClient) | tpu::System (singleton, via TpuSystemState / TpuStatesManager) |
| PjRtStreamExecutorDevice | xla::TpuDevice | tpu::System* + tpu::TpuCoreLocation + xla::Semaphore |
| LocalDeviceState + StreamExecutor (per device) | (none; no per-device executor) | shared tpu::System + AsyncWorkRunner + tfrt::ConcurrentWorkQueue |
| stream_executor::StreamExecutor | (none) | tpu::System (one, shared across all devices) |
| Stream::ThenLaunch | (no Stream) | tpu::System::Execute(AsyncValueRef<ProgramHandle>, …) @ 0x1D0B33E0 |
| Stream::ThenMemcpy | (no Stream) | tpu::System::TransferTo/FromDevice @ 0x1D0AFA20 / 0x1D0B0160 |
| stream_executor::Event | xla::TpuTrackedDeviceEvent | tpu::TpuEvent (a tsl::AsyncValueRef) |
| Stream ordering FIFO | (no FIFO) | tpu::TpuEventIssuer sequence points + tsl::AsyncValue deps |
| DeviceMemoryBase | tpu::TpuSharedMemoryLocation | TPU HBM address |
| PjRtStreamExecutorBuffer | xla::TpuRawBuffer (: CommonPjRtRawBuffer) | tpu::TpuBuffer at a TpuSharedMemoryLocation |
| TransferManager | (inline in TpuRawBuffer / TpuClient) | tpu::System::TransferTo/FromDevice |
| SE Compiler (registered via factory) | xla::TpuCompiler (: PjRtCompiler) | jellyfish JIT (TpuJitCompileHloWithOptions) |

QUIRK — the empty cells are the point of the table. A reimplementer who builds a LocalDeviceState-style object per device, each owning a StreamExecutor and a stream pool, will have reproduced upstream XLA, not libtpu. The TPU adapter shares one tpu::System across every TpuDevice; there is no per-device executor and no Stream at all. The per-device state that does exist is small: a TpuCoreLocation (which physical core this device maps to) and a Semaphore (an in-flight-launch throttle). Ordering that an SE backend would get from a per-stream FIFO comes instead from a process-wide TpuEventIssuer DAG.

Why the SE Stack Is Absent, Not Adapted

The legacy SE TPU stack does exist in the binary (stream_executor::tpu::TpuExecutor, tensorflow::tpu::TpuStream, tensorflow::tpu::TpuTransferManager, and the Tpu*_* / TpuExecutor_* / TpuStream_* C-ABI exports), but it is a parallel path, not a layer underneath PJRT. It backs xla::LocalClient / xla::Service and the TF-TPU op kernels, reached through the SE TpuPlatform registered at module init (RegisterTpuPlatform → PlatformManager, documented on StreamExecutor & Host Interpreter). The PJRT TpuClient adapter never touches it: across the entire TpuClient / TpuDevice / TpuRawBuffer / TpuLoadedExecutable code range (0xF800DA0 to 0xF816EC0) there is no reference to ExecutorApiFn, SE_StreamExecutor, or TpuExecutor. The two host abstractions share the same TPU driver core but use different host-side scheduling: TFRT async-values (PJRT) versus SE Streams (legacy). This page documents only the PJRT branch.


GetTpuPjRtClient — The Construction Bridge

Purpose

GetTpuPjRtClient is where the C-ABI options layer hands off to the runtime. By the time it is called, PJRT_Client_Create has already validated the option map and parsed it into a typed PjRtTpuClientConfig (that path is owned by Client, Device & Topology). GetTpuPjRtClient brings up the process-wide infrastructure — the TFRT HostContext and the tpu::System singleton — exactly once, then enumerates devices and constructs the TpuClient. It returns a StatusOr<unique_ptr<TpuClient>> that the C-ABI layer optionally decorates and then wraps.

Entry Point

pjrt::tpu_plugin::PJRT_Client_Create (0xE6A8840)        ── slot 15; owns option parse (see client-and-device.md)
  └─ xla::GetTpuPjRtClient (0xF8008C0)                   ── THIS PAGE: brings up HostContext + tpu::System, builds TpuClient
       ├─ xla::CreateDefaultHostContext (lambda, __cxa_guard one-shot)
       │    ├─ tfrt::CreateMallocAllocator
       │    └─ tfrt::CreateMultiThreadedWorkQueue(DefaultThreadPoolSize(), …)
       ├─ xla::GetSingletonTpuStatesManager(bool)        (0xF958360)  ── mutex+guard singleton
       │    └─ TpuStatesManager::GetOrCreateTpuSystemState (0xF956E40)
       │         └─ xla::CreateTpuSystemState (0xF95A4A0)  ── lazily brings up tpu::System
       └─ xla::GetTpuPjRtClientInternal(config, host_ctx, system_state, states_mgr) (0xF800DA0)
            ├─ tpu::System::topology() / hal_location()
            ├─ for each TpuCoreLocation in TpuTopology::logical_devices:
            │    ├─ xla::TpuDeviceDescription(core_loc, topology, "TPU", "TpuDevice")
            │    └─ operator new(0x1E0) + xla::TpuDevice::TpuDevice(desc, name, ordinal, tpu::System*)
            ├─ make_unique<xla::TpuClient>(config, process_index, platform_version, devices,
            │                              host_ctx, host_allocator, system_state, states_mgr)
            └─ for each device: TpuDevice::SetClient(TpuClient*, AsyncWorkRunner*)  (0xF7FE520)

Algorithm

function GetTpuPjRtClient(config):                       // 0xF8008C0
    // (1) One-shot process-wide TFRT HostContext. Guarded by the
    //     CreateDefaultHostContext::kInstance __cxa_guard so every client
    //     in the process shares one HostContext / thread pool.
    if !kInstance:                                        // guard var, lines 70-83
        if cxa_guard_acquire(&kInstance_guard):
            alloc = tfrt::CreateMallocAllocator()         // line 85
            queue = tfrt::CreateMultiThreadedWorkQueue(   // line 92
                        DefaultThreadPoolSize(config), …) //   thread count from config
            kInstance = make_shared<tfrt::HostContext>(alloc, queue)   // line 94
            cxa_guard_release(&kInstance_guard)
    host_ctx = kInstance                                  // shared_ptr; refcount bumped (line 116)

    // (2) The tpu::System singleton, reached through the states manager.
    //     GetOrCreateTpuSystemState lazily runs CreateTpuSystemState the
    //     first time, which initialises tpu::System (Initialize @ 0x1D0AE420).
    states_mgr  = GetSingletonTpuStatesManager(use_global_tpu_system)  // 0xF958360, line 212
    system_state = states_mgr.GetOrCreateTpuSystemState(opts)          // 0xF956E40, line 214

    // (3) Tail into device enumeration + client construction.
    //     The 'a6' boolean selects an own-vs-borrow path for system_state.
    return GetTpuPjRtClientInternal(config, host_ctx,    // 0xF800DA0, lines 179/236
                                    system_state, states_mgr)

NOTE — the HostContext is a process-wide singleton, not per-client. The __cxa_guard over CreateDefaultHostContext::kInstance means a second PJRT_Client_Create in the same process reuses the same malloc allocator, the same multi-threaded work queue, and therefore the same AsyncWorkRunner thread pool. A reimplementer must not allocate a fresh thread pool per client or the throttle/scheduling assumptions in the execute path break.
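
A minimal sketch of that one-shot pattern, assuming nothing about the real tfrt::HostContext: ToyHostContext, GetDefaultHostContext, and ToyClient are hypothetical names. A function-local static is initialised exactly once under C++11 rules, and Itanium-ABI compilers emit the __cxa_guard_acquire / __cxa_guard_release pair for it, which matches the guard pattern described above.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for tfrt::HostContext; counts how often it is built.
struct ToyHostContext {
  static int& BuildCount() {
    static int n = 0;
    return n;
  }
  ToyHostContext() { ++BuildCount(); }
};

// Function-local static: initialised once, thread-safely, on first call.
inline std::shared_ptr<ToyHostContext> GetDefaultHostContext() {
  static std::shared_ptr<ToyHostContext> instance =
      std::make_shared<ToyHostContext>();
  return instance;  // copies the shared_ptr, bumping the refcount
}

// Hypothetical client that shares the process-wide context instead of
// building its own allocator and thread pool.
struct ToyClient {
  std::shared_ptr<ToyHostContext> host_ctx = GetDefaultHostContext();
};
```

Constructing two ToyClient objects yields the same ToyHostContext pointer in both, and the context is built only once.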

Device Enumeration — GetTpuPjRtClientInternal

GetTpuPjRtClientInternal (0xF800DA0) is the loop that turns physical TPU cores into PjRt devices. It is the closest analogue to upstream's "build a LocalDeviceState per device," except no executor or stream pool is created — only a description and a TpuDevice, each pointed at the one shared tpu::System.

function GetTpuPjRtClientInternal(config, host_ctx, system_state, states_mgr):  // 0xF800DA0
    system   = system_state->system()                    // tpu::System* (line 136: j+64)
    topology = system->topology()                         // tpu::TpuTopology* (line 138)
    hal_loc  = system->hal_location()                     // line 132

    devices = vector<unique_ptr<TpuDevice>>{}
    cores   = topology->logical_devices(TpuCoreType)      // line 140; stride 56 bytes (line 148/323)
    for core_loc in cores:                                // each TpuCoreLocation, 56 B apart
        // (a) static identity: id/process/kind/attributes/memory
        desc = TpuDeviceDescription(core_loc, topology,   // line 155
                                    type_prefix="TPU", name="TpuDevice")
        // (b) the PjRt device object — 0x1E0 (480) bytes, holds a SHARED system ptr
        dev_mem = operator new(0x1E0)                      // line 156
        device  = new (dev_mem) TpuDevice(desc, name, ordinal, system)   // line 206, ctor 0xF7FDC40, constructed in dev_mem
        devices.push_back(unique_ptr<TpuDevice>{device})   // grows the vector (line 260 = length-error guard)
        // record local-device-id -> TpuSharedMemoryLocation (LocalSharedMemory, line 374)

    // (c) build the client; ctor takes ownership of the device vector
    client = make_unique<TpuClient>(                       // line 545
                 config, process_index, platform_version,
                 devices, host_ctx, host_allocator,
                 system_state, states_mgr)

    // (d) close the back-edge: each device learns its client + work runner
    for device in client->devices():
        device->SetClient(client.get(), client->async_work_runner())   // 0xF7FE520
    return client

GOTCHA — the device→client link is a post-construction back-edge (SetClient), not a constructor argument. The TpuClient ctor receives the device vector, but each TpuDevice does not know its owning client (or its AsyncWorkRunner) until SetClient runs afterward. A reimplementer who tries to use device->client() from inside the TpuClient constructor will read an uninitialised pointer. The two-phase wiring exists because the AsyncWorkRunner is a member of the fully-constructed client; it cannot be handed to the devices before the client object exists.
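
The two-phase wiring can be sketched as follows. All types here (ToyClient, ToyDevice, ToyRunner) are hypothetical and far smaller than the real classes; unlike the field described above, this toy initialises the back-pointers to nullptr so that the "not yet wired" state is observable rather than undefined.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct ToyClient;     // forward declaration; every name here is hypothetical
struct ToyRunner {};  // stand-in for the AsyncWorkRunner

struct ToyDevice {
  explicit ToyDevice(int ordinal) : ordinal(ordinal) {}
  void SetClient(ToyClient* c, ToyRunner* r) {
    client = c;
    runner = r;
  }
  int ordinal;
  ToyClient* client = nullptr;  // unset until SetClient runs
  ToyRunner* runner = nullptr;
};

struct ToyClient {
  explicit ToyClient(std::vector<std::unique_ptr<ToyDevice>> devs)
      : devices(std::move(devs)) {}
  std::vector<std::unique_ptr<ToyDevice>> devices;
  ToyRunner runner;  // a member, so it exists only once the client is built
};

inline std::unique_ptr<ToyClient> BuildClient(int num_devices) {
  std::vector<std::unique_ptr<ToyDevice>> devs;
  for (int i = 0; i < num_devices; ++i) {
    devs.push_back(std::make_unique<ToyDevice>(i));
  }
  // Phase 1: construct the client; the devices have no back-pointer yet.
  auto client = std::make_unique<ToyClient>(std::move(devs));
  // Phase 2: close the back-edge now that the runner exists.
  for (auto& d : client->devices) {
    d->SetClient(client.get(), &client->runner);
  }
  return client;
}
```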

QUIRK — the TpuCoreLocation array is walked with a hard-coded 56-byte stride (lines 148, 323). TpuTopology::logical_devices returns a contiguous block, and the loop advances v18 += 56 rather than calling an iterator. That 56-byte TpuCoreLocation size is an ABI fact a reimplementer of the topology side must match exactly, or the device-enumeration loop reads garbage cores.
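
One way to pin such a stride on the producing side is a static_assert on the record size. The sketch below assumes a hypothetical ToyCoreLocation whose only known property is its 56-byte size; the real field layout is not documented on this page, and WalkByStride is an illustrative helper, not a recovered function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 56-byte record. Only the size is taken from the text above.
struct ToyCoreLocation {
  std::int32_t fields[14];
};
static_assert(sizeof(ToyCoreLocation) == 56,
              "record size must match the consumer's hard-coded stride");

constexpr std::size_t kStride = 56;

// Walks a contiguous block by raw byte stride, as the enumeration loop is
// described to do, and returns the first field of each record.
inline std::vector<std::int32_t> WalkByStride(const unsigned char* base,
                                              std::size_t count) {
  std::vector<std::int32_t> ids;
  for (std::size_t i = 0; i < count; ++i) {
    ToyCoreLocation loc;
    std::memcpy(&loc, base + i * kStride, sizeof(loc));
    ids.push_back(loc.fields[0]);
  }
  return ids;
}
```

If the record ever grows or shrinks, the static_assert fails at compile time instead of the walk silently reading misaligned records.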

After Internal Returns — Decorators and the C Wrapper

Back in PJRT_Client_Create, the bare unique_ptr<TpuClient> is optionally wrapped:

  • MegaScalePjRtClient::CreateMegaScalePjRtClient (0xE6EA680) when multi-slice is active (FLAGS_megascale_port >= 0 and not skipped).
  • TfPjRtClient::CreateTfPjRtClient (0x108524E0) when use_tf_pjrt_client == 1 (the default).

Then CreateWrapperClient (0xF872060) boxes the outermost xla::PjRtClient* into the 208-byte C-ABI PJRT_Client. These three steps and the wrapper layout are owned by Client, Device & Topology; this page stops at the unique_ptr<TpuClient> that GetTpuPjRtClient produces.
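
The wrapping step can be pictured as ordinary decorator composition. This is a sketch under assumptions: the Toy* types are hypothetical, the two booleans stand in for the real engage conditions, and the nesting order (MegaScale inside, Tf outside) is assumed from the order in which the two decorators are listed above, not confirmed from the binary.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal interface standing in for xla::PjRtClient.
struct ToyPjRtClient {
  virtual ~ToyPjRtClient() = default;
  virtual std::string Describe() const = 0;
};

struct ToyTpuClient : ToyPjRtClient {
  std::string Describe() const override { return "TpuClient"; }
};

// A decorator owns the client it wraps and implements the same interface.
class ToyDecorator : public ToyPjRtClient {
 public:
  ToyDecorator(std::string name, std::unique_ptr<ToyPjRtClient> inner)
      : name_(std::move(name)), inner_(std::move(inner)) {}
  std::string Describe() const override {
    return name_ + "(" + inner_->Describe() + ")";
  }

 private:
  std::string name_;
  std::unique_ptr<ToyPjRtClient> inner_;
};

// Applies the optional wrappers; each one takes ownership of the previous
// outermost client.
inline std::unique_ptr<ToyPjRtClient> Decorate(
    std::unique_ptr<ToyPjRtClient> client, bool megascale, bool use_tf) {
  if (megascale) {
    client = std::make_unique<ToyDecorator>("MegaScale", std::move(client));
  }
  if (use_tf) {
    client = std::make_unique<ToyDecorator>("Tf", std::move(client));
  }
  return client;
}
```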


TpuClient — The Adapter Client Class

Purpose

xla::TpuClient is the concrete PjRtClient for TPU. It is the object every C-ABI client slot ultimately virtual-calls into (after unwrapping any MegaScale/Tf decorator and the PJRT_Client POD). It overrides roughly 75 of the CommonPjRtClient framework methods — buffer creation, executable load, device assignment, transfer plumbing — and owns the per-process runtime handles a StreamExecutor client would have split across LocalDeviceStates.

Class Hierarchy

xla::PjRtClient                 (abstract base; vtable 0x21CA9C98, typeinfo 0x21CA9E28)
   ▲
xla::CommonPjRtClient           (async-value framework; vtable 0x2178A108, typeinfo 0x21789EB0)
   │   BufferFromHostBuffer, DefineBuffer, PrepareArguments, CreateOutputs,
   │   TrackFuture, ShouldRetryOnOom, GetTransposePlan, … (~79 methods)
   ▲
xla::TpuClient                  (the adapter; ctor 0xF801980, vtable 0x2177B598, typeinfo 0x2177BA40)
       overrides ~75 methods; holds devices, HostContext, HostMemoryAllocator,
       TpuSystemState, TpuStatesManager, Semaphore

Two optional decorators sit outside this hierarchy as wrappers, not subclasses:

| Decorator | Role | Engage condition | Symbol |
|---|---|---|---|
| xla::MegaScalePjRtClient | Multi-slice overlay over a unique_ptr<TpuClient> + MultiSliceConfig; forwards/augments compile+execute across slices | FLAGS_megascale_port >= 0, not skipped | CreateMegaScalePjRtClient @ 0xE6EA680; typeinfo 0x215FD068 |
| xla::TfPjRtClient | TF-thread-safety decorator (DestroyWrappedBuffersAndClient); owns wrapped buffers | use_tf_pjrt_client == 1 (default) | CreateTfPjRtClient @ 0x108524E0; typeinfo 0x217EFED0 |

xla::AbstractTpuClient (typeinfo 0x215FD080) is a thin base class holding topology helpers, such as GetSubsliceTopologyForCompilation (0xE6EE120), that TpuClient and the MegaScale path both use.

The TFRT Constructor

The constructor signature is byte-confirmed from the demangled symbol at 0xF801980. It is the TFRT shape — async-value-native, taking pre-built infrastructure rather than constructing executors:

TpuClient::TpuClient(                                    // 0xF801980
    const PjRtTpuClientConfig&        config,
    int                               process_index,
    std::string                       platform_version,
    std::vector<std::unique_ptr<TpuDevice>> devices,     // takes ownership of the enumerated devices
    std::shared_ptr<tfrt::HostContext> host_context,     // the process-wide singleton
    std::unique_ptr<HostMemoryAllocator> host_allocator, // host-staging allocator
    std::unique_ptr<TpuSystemState, MaybeOwningDeleter> system_state,  // may own or borrow
    std::shared_ptr<TpuStatesManager> states_manager)

NOTE — the MaybeOwningDeleter on the TpuSystemState is the hinge of single-process-many-clients behaviour. When use_global_tpu_system is set, the system state is borrowed from the global singleton (the deleter is a no-op); otherwise the client owns it. This is why the a6 boolean flows from GetTpuPjRtClient into GetTpuPjRtClientInternal — it selects the own-vs-borrow construction of the MaybeOwningDeleter.
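
One plausible shape for such a deleter is a flag-carrying functor. This is a sketch only: the real MaybeOwningDeleter's layout was not recovered, and ToySystemState, ToyMaybeOwningDeleter, Own, and Borrow are hypothetical names.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for TpuSystemState; tracks live instances.
struct ToySystemState {
  static int& Live() {
    static int n = 0;
    return n;
  }
  ToySystemState() { ++Live(); }
  ~ToySystemState() { --Live(); }
};

// Frees the object only when told it owns it; a borrow makes this a no-op.
struct ToyMaybeOwningDeleter {
  bool owns = true;
  void operator()(ToySystemState* p) const {
    if (owns) delete p;
  }
};

using StatePtr = std::unique_ptr<ToySystemState, ToyMaybeOwningDeleter>;

// The client owns the state and destroys it with the pointer.
inline StatePtr Own(ToySystemState* p) {
  return StatePtr(p, ToyMaybeOwningDeleter{true});
}

// The state belongs to a global singleton; the pointer only borrows it.
inline StatePtr Borrow(ToySystemState* p) {
  return StatePtr(p, ToyMaybeOwningDeleter{false});
}
```

Both variants have the same static type, so a constructor taking StatePtr does not need to know which case it received.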

Object Field Layout — Not Fully Decoded

The constructor body's exact field stores were not dereferenced from the binary. From the ctor signature and the call sites, the object holds: the device vector, the HostContext shared_ptr, the HostMemoryAllocator, the TpuSystemState pointer (with its maybe-owning deleter), the TpuStatesManager shared_ptr, and a per-client Semaphore (the in-flight-computation throttle). The concrete byte offsets of these members are LOW confidence — a reimplementer needs a full decompile of 0xF801980 to place them. What is certain is the membership set and that async_work_runner() and devices() are accessible immediately after construction (they are read in GetTpuPjRtClientInternal's SetClient loop).

Function Map

| Function | Addr | Role |
|---|---|---|
| TpuClient::TpuClient (ctor) | 0xF801980 | Build the adapter from enumerated devices + infra |
| TpuClient::LookupDevice | 0xF8033A0 | global device id → PjRtDevice* |
| TpuClient::platform_name | 0xF816D20 | "tpu" |
| TpuClient::platform_id | 0xF816D00 | TPU platform id constant |
| TpuClient::process_index | 0xF816C40 | host process index |
| TpuClient::memory_spaces | 0xF816CE0 | HBM + pinned/unpinned host spaces |
| TpuClient::AllocateBuffer | 0xF7FC5A0 | device HBM alloc → tpu::AllocateBuffer |
| TpuClient::AllocateRawBuffer | 0xF7FB1E0 | host/CPU-space alloc dispatch |
| TpuClient::CompileAndLoad (XlaComputation) | 0xF804F20 | compile entry → jellyfish JIT |
| TpuClient::CompileAndLoad (MLIR) | 0xF8068E0 | MLIR compile entry |
| TpuClient::TrackFuture | 0xF7FAD60 | tsl::AsyncValue → PJRT_Event future |
| TpuClient::CreateProfiledFuture | 0xF7FAE80 | profiled completion future |
| TpuClient::CreateErrorEvent | 0xF808420 | tpu::TpuEvent set to error status |
| TpuClient::CreateDeviceEventSet | 0xF813D40 | mint wait/define event sets |

NOTE — the compile, execute, buffer, and event surfaces above are documented in depth on their owning pages (Executable & Execution, Buffer & Memory, Events & Async). They appear here only to show what the adapter client exposes; this page owns their wiring into the SE→PJRT bridge, not their slot-level details.


TpuDevice — Wrapping tpu::System Under Each PjRtDevice

Purpose

xla::TpuDevice is the PjRtDevice the C-ABI device slots wrap. It is the object that, in the upstream SE model, would have held a StreamExecutor and a stream pool. Here it holds no executor: it points at the shared tpu::System, names the one physical core it maps to, and carries a launch-throttle semaphore. This is the crux of the "SE → PJRT adapter" claim — the device-level binding is to an async runtime, not to a stream-based executor.

What a TpuDevice Holds

The TpuDevice object is 0x1E0 (480) bytes (the operator new(0x1E0) in the enumeration loop). Its members, recovered from the ctor signature (0xF7FDC40) and SetClient (0xF7FE520):

| Member | Source | Role |
|---|---|---|
| tpu::System* | ctor arg 4 | The shared device runtime; same pointer in every TpuDevice |
| tpu::TpuCoreLocation | from the enumeration loop's core_loc | The physical core this PjRt device maps to |
| xla::TpuDeviceDescription | ctor arg 1 | Static identity: id/process/kind/attributes/memory |
| xla::TpuClient* | SetClient arg 1 | Back-pointer to the owning client |
| xla::AsyncWorkRunner* | SetClient arg 2 | Host scheduler for this device's continuations |
| xla::Semaphore | constructed; fenced by ExecutablesStart/Complete | In-flight-computation throttle (bounded by max_inflight_computations) |

The tpu::System* being identical across all devices is the structural difference from SE. In upstream XLA, device[i]->local_device_state()->executor() returns a distinct StreamExecutor per device. Here, device[i] and device[j] share one tpu::System; what distinguishes them is the TpuCoreLocation, which tpu::System::Execute and LoadProgram consume to target the right core.
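
As a toy illustration of that sharing (ToySystem, ToyDevice, and MakeDevices are hypothetical, and a plain int stands in for TpuCoreLocation), every device holds the same runtime pointer and differs only in the core it names:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical shared runtime: one instance serves every device.
struct ToySystem {
  std::map<int, int> launches_per_core;
  void Execute(int core) { ++launches_per_core[core]; }
};

// Hypothetical device: no private executor, only a core id plus the shared
// runtime pointer.
struct ToyDevice {
  ToySystem* system;
  int core;
  void Launch() const { system->Execute(core); }
};

inline std::vector<ToyDevice> MakeDevices(ToySystem* system, int n) {
  std::vector<ToyDevice> devs;
  for (int core = 0; core < n; ++core) {
    devs.push_back(ToyDevice{system, core});
  }
  return devs;
}
```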

The SetClient Back-Edge

function TpuDevice::SetClient(client, work_runner):      // 0xF7FE520
    this->client       = client                          // back-pointer used by C-ABI device→client lookups
    this->work_runner  = work_runner                     // the client's AsyncWorkRunner (host scheduler)

This runs once per device immediately after the TpuClient is constructed. It is the only way a TpuDevice learns its scheduler. The client() accessor that the C-ABI PJRT_Device slots use to translate an implementation device back to its wrapper (via the client's device_map) reads this field, so it is invalid before SetClient. That is why the construction order in GetTpuPjRtClientInternal (build the client, then loop SetClient) is load-bearing.

The Launch Throttle — The Closest Thing to a Stream

TpuDevice carries a per-device xla::Semaphore bounded by max_inflight_computations (default 1, from the config). ExecutablesStart (0xF800300) acquires it before a launch; ExecutablesComplete (0xF800740) releases it when the launch's completion event fires. This is the functional stand-in for an SE stream's depth limit: it is what stops the host from queueing unboundedly many executions ahead of the device. It is not a FIFO — ordering between launches is enforced by the TpuEventIssuer sequence points, not by the semaphore. The semaphore only bounds concurrency.

QUIRK — with max_inflight_computations == 1 (the default), a single TpuDevice behaves much like a single SE stream: at most one execution in flight, the next blocked until the previous completes. But the resemblance is shallow — raising the knob makes the device accept multiple concurrent launches with no FIFO ordering between them, which an SE stream could never do. A reimplementer must treat the throttle as a concurrency cap, not as a serialising queue.
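
The "cap, not queue" distinction can be sketched with a small counting semaphore. ToySemaphore and AdmittedLaunches are hypothetical and do not reflect xla::Semaphore's actual interface; the acquire is non-blocking so the example stays single-threaded.

```cpp
#include <cassert>
#include <mutex>

// Minimal counting semaphore (hypothetical; not xla::Semaphore's API).
class ToySemaphore {
 public:
  explicit ToySemaphore(int capacity) : available_(capacity) {}

  // Non-blocking acquire: returns false when the cap is reached.
  bool TryAcquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (available_ == 0) return false;
    --available_;
    return true;
  }

  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    ++available_;
  }

 private:
  std::mutex mu_;
  int available_;
};

// Returns how many of `requested` launches are admitted before any of them
// completes. TryAcquire plays the role of ExecutablesStart here.
inline int AdmittedLaunches(int max_inflight, int requested) {
  ToySemaphore sem(max_inflight);
  int admitted = 0;
  for (int i = 0; i < requested; ++i) {
    if (sem.TryAcquire()) ++admitted;
  }
  return admitted;
}
```

Nothing in the semaphore records which launch was admitted first; it only counts. Any ordering between admitted launches has to come from elsewhere, which in the text above is the TpuEventIssuer.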

Device Memory Spaces

Each TpuDevice attaches three memory spaces (the per-device AttachMemorySpace calls):

  • xla::TpuHbmMemorySpace (kind "tpu_hbm") — device HBM.
  • xla::PinnedHostMemorySpace / xla::UnpinnedHostMemorySpace — host staging.

The canonical kind constants are xla::kBuiltinTpuMemorySpaces / xla::kTpuHbmMemorySpaceKind. The slot-level memory accessors and the buffer objects that live in these spaces are owned by Buffer & Memory.

Function Map

| Function | Addr | Role |
|---|---|---|
| TpuDevice::TpuDevice (ctor) | 0xF7FDC40 | Build device from desc + name + ordinal + tpu::System* |
| TpuDevice::SetClient | 0xF7FE520 | Install client back-pointer + AsyncWorkRunner |
| TpuDevice::ExecutablesStart | 0xF800300 | Acquire the in-flight semaphore before launch |
| TpuDevice::ExecutablesComplete | 0xF800740 | Release the in-flight semaphore on completion |

NOTE — the TpuDeviceDescription (vtable 0x21787AC0, typeinfo 0x21787C38) is the PjRt device description carrying id / process_index / device_kind / Attributes / memory. The SE-flavoured PjRtStreamExecutorDeviceDescription (vtable 0x2177D950, typeinfo 0x2177D9B0) is GPU-only dead code in this binary — its presence is the only residual SE-PjRt symbol, and it is never reached on the TPU path. Do not confuse the two when reading the binary.


The Runtime Backing — tpu::System as the StreamExecutor Analogue

Purpose

tpu::System (Initialize @ 0x1D0AE420) is the object the whole adapter sits on — the TFRT-native TPU device runtime. It is what the page title calls "StreamExecutor," mapped forward: it is the surface that offers launch, transfer, allocate, and event primitives, but async-value-shaped instead of stream-shaped. One tpu::System is shared by every TpuClient and every TpuDevice in the process (it is reached through the TpuStatesManager / TpuSystemState singleton built in GetTpuPjRtClient).

The API Surface (the SE-Stream-equivalent operations)

| Operation | VA | SE equivalent |
|---|---|---|
| Initialize(shared_ptr<TpuTopology>, InitOptions, ConcurrentWorkQueue*) | 0x1D0AE420 | StreamExecutor::Init |
| Execute(AsyncValueRef<ProgramHandle>, ExecuteOptions, Span inputs, outputs, wait, define) | 0x1D0B33E0 | Stream::ThenLaunch |
| LoadProgram(TpuCoreLocation, shared_ptr<TpuCoreProgram>) | 0x1D0B2240 | StreamExecutor::GetKernel (load once) |
| TransferToDevice(AsyncValueRef<TpuBuffer>, Span<uint8>, opt<TpuSyncFlagOnChip>) | 0x1D0AFA20 | Stream::ThenMemcpy (H2D) |
| TransferFromDevice(Span<uint8>, AsyncValueRef<TpuBuffer>, opt<…>) | 0x1D0B0160 | Stream::ThenMemcpy (D2H) |
| AllocateHostBuffer / MakeTpuHostBuffer | 0x1D0AF180 / 0x1D0AF660 | host-pinned alloc |
| EnqueueInfeed / DequeueOutfeed | 0x1D0B5D00 / 0x1D0B5F00 | infeed/outfeed |

Execute takes the TpuCoreLocation (indirectly, via the ProgramHandle loaded for that core) — this is how one shared tpu::System services many TpuDevices without per-device executors. The internal hardware command-stream / ring layout behind Execute is private to tpu::System::Impl and the TPU driver core; it was not traced (it is a separate concern from this bridge layer).

NOTE — the device-side ordering primitive is tpu::TpuEventIssuer (AddDepsNoReserve, AggregateDeps, RunWhenDepsReady, NextSequencePoint, FulfillArgs, ChainScope). It is the sequence-point engine that orders Execute / Transfer / Allocate against each other — the functional replacement for an SE per-stream FIFO plus cross-stream Events, expressed as a DAG of tpu::TpuEvent (each a tsl::AsyncValueRef). Its exact sequencing algorithm was not byte-traced. The host-side scheduler that drives dependent continuations is xla::ThreadPoolAsyncWorkRunner (AsyncWorkRunner::ExecuteWhenReady), fed by the HostContext's tfrt::ConcurrentWorkQueue. Both are reached through TpuClient::async_work_runner().


Considerations for a Reimplementer

  • Do not build a LocalDeviceState. The single most likely mistake is to reproduce upstream XLA's per-device StreamExecutor + stream pool. The TPU adapter has none. Build one shared tpu::System, hand its pointer to every TpuDevice, and distinguish devices by TpuCoreLocation alone.
  • Singletons are process-wide, not per-client. Both the HostContext (via CreateDefaultHostContext::kInstance) and the tpu::System (via TpuStatesManager) are __cxa_guard / mutex one-shots. A second client in the process must reuse them, not rebuild them. The MaybeOwningDeleter on TpuSystemState encodes the borrow.
  • The device↔client wiring is two-phase. Construct the client with the device vector, then loop SetClient to install the back-pointer and the AsyncWorkRunner. Devices are not fully usable until that loop runs.
  • Concurrency comes from a semaphore, ordering from an event DAG. The per-device Semaphore caps in-flight launches; TpuEventIssuer sequence points order them. These are orthogonal — conflating the throttle with a FIFO will produce wrong ordering at max_inflight_computations > 1.
  • The legacy SE stack is a decoy. TpuExecutor / TpuStream / TpuTransferManager exist in the same binary but serve LocalClient / TF kernels, never PJRT. A reimplementer of the PJRT path should ignore every ExecutorApiFn / SE_* symbol.

| Component | Relationship |
|---|---|
| PJRT_Client_Create (slot 15) | Calls GetTpuPjRtClient; owns the option parse and the C-ABI wrapper above this bridge |
| xla::CommonPjRtClient | The async-value framework TpuClient derives from; supplies buffer/executable/event plumbing |
| MegaScalePjRtClient / TfPjRtClient | Optional decorators stacked on the bare TpuClient by PJRT_Client_Create |
| tpu::System | The shared device runtime every TpuDevice points at; the SE-StreamExecutor analogue |
| tpu::TpuEventIssuer | Device-side sequence-point engine; replaces SE per-stream FIFO + events |
| Legacy SE TpuPlatform | The parallel, non-PJRT device layer; never referenced by TpuClient |

Cross-References

  • Client, Device & Topology: owns PJRT_Client_Create, the option-kv ingest, and the 208-byte PJRT_Client wrapper that boxes the TpuClient this page builds
  • StreamExecutor & Host Interpreter: the legacy SE TpuPlatform registration and host-interpreter path that runs parallel to (never under) this adapter
  • Executable & Execution: PJRT_LoadedExecutable_Execute and the TpuExecutable / TpuLoadedExecutable / TpuRawLoadedExecutable framework the adapter client loads programs into
  • Execute Async On Stream: the per-execution path down to tpu::System::Execute, the runtime side of the launch primitive this page maps from Stream::ThenLaunch
  • Buffer & Memory: TpuRawBuffer over tpu::TpuBuffer, the memory spaces each TpuDevice attaches, and the allocator glue
  • Events & Async: PJRT_Event → tsl::Future<void>, the TpuTrackedDeviceEvent / TpuEventIssuer model this page references for ordering
  • Overview: the PJRT plugin entry, extension chain, and GetPjrtApi population