Infeed / Outfeed Queues
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes). The image is not stripped; demangled C++ symbol names are quoted verbatim. VA == file analysis address. Other versions will differ.
Abstract
Infeed and outfeed are TPU's streaming host↔device channels: a running program issues Infeed/Outfeed HLO ops that block on a hardware FIFO, and the host concurrently pushes input literals into the infeed queue and drains result literals out of the outfeed queue. This is structurally different from a bulk buffer transfer (a PJRT_Buffer copy, documented on Host↔Device DMA): a buffer copy targets a specific HBM allocation and completes once; an infeed/outfeed transfer targets a named per-core queue identified only by a TpuCoreLocation + a small integer queue index, and the device side consumes/produces entries in program order as the executable runs. The XLA reference frame is the same one upstream uses — TransferManager::TransferLiteralToInfeed / TransferLiteralFromOutfeed and PjRtDevice::TransferToInfeed / TransferFromOutfeed — but libtpu ships two parallel implementations of it, and a reimplementer must not conflate them.
The modern path is xla::TpuDevice::TransferToInfeed(const LiteralSlice&) / TransferFromOutfeed(MutableBorrowingLiteral) (learning/45eac/research/pjrt/tpu_pjrt_client.cc), the PJRT device surface over the TFRT-native tpu::System runtime. It linearizes the host literal into device-layout buffers, chops each buffer into hardware spans sized by TransferSizeUtil, and enqueues each span through tpu::System::EnqueueInfeed / DequeueOutfeed — a TpuCoreLocation-keyed, callback-based, fully blocking transfer. The legacy path is tensorflow::tpu::TpuTransferManager::TransferLiteralToInfeed / TransferLiteralFromOutfeed(StreamExecutor*, …), a thin C-ABI shim that marshals the literal through ApiConverter::ToC and calls the TPU driver's TfTpu_ExecutorApiFn table (EnqueueInfeed slot +560, TransferLiteralFromOutfeed slot +576), which lands in deepsea::executor::DeepseaExecutor::EnqueueInfeed. Both ultimately reach the same on-chip queue hardware; they differ only in the host-side abstraction (TFRT async runtime vs. StreamExecutor C-shim).
This page owns the infeed/outfeed queue mechanism, the two transfer-manager entry points, the shape/layout linearization contract, and the blocking/async semantics. The on-chip queue driver (TpuInfeedQueue / TpuOutfeedQueue with their Pxc/Jxc generation impls) is described only to the depth needed to understand the contract; the general bulk-buffer DMA mechanism is on Host↔Device DMA; the execute path that interleaves with these queues is on Execute Async on Stream.
For reimplementation, the contract is:
- The two entry points: PJRT TpuDevice::TransferToInfeed/FromOutfeed over tpu::System, vs. legacy TpuTransferManager over the ExecutorApiFn C-shim, and the fact that they are independent code paths, not one wrapping the other.
- The queue handle: a transfer names its target by tpu::TpuCoreLocation + an int queue index, not by a device address. The queue object is resolved per-call (TpuChipConfig::GetInfeedQueues / tpu::System::EnqueueInfeed's topology walk), never held by the caller.
- The layout/linearization contract: the host literal's shape is converted to device shape (TransferSizeUtil::HostShapeToDeviceShape), tiled, and linearized (LiteralLinearizer::LinearizeToBuffers); outfeed runs the inverse (Delinearize). Element types are gated by HardwareLayout::SupportedPrimitiveType.
- The span-chunking + blocking semantics: a linearized buffer is split into TensorCoreInfeedSpanSizeBytes / TensorCoreMaxOutfeedSpanSizeBytes chunks, each enqueued as a callback-completed task into an AsyncTaskGroup / BlockableAsyncTaskGroup, and the call blocks (WaitTillDone / Mutex::LockWhenCommon on a remaining-count predicate) until every span completes.
| Item | Symbol / address |
|---|---|
| PJRT infeed entry | xla::TpuDevice::TransferToInfeed(const LiteralSlice&) @ 0xf7ff540 (tpu_pjrt_client.cc:2153) |
| PJRT outfeed entry | xla::TpuDevice::TransferFromOutfeed(MutableBorrowingLiteral) @ 0xf7ffca0 (:2194) |
| PJRT outfeed helper | xla::TransferFromOutfeedHelper(TpuCoreLocation, Layout, System*, MutableBorrowingLiteral*) @ 0xf8436e0 (outfeed_utils.cc) |
| PJRT infeed span loop | tpu::TransferLinearizedBufferToInfeed(Span<uint8>, TpuCoreLocation, AnyInvocable, System*) @ 0xf8d5cc0 |
| Driver enqueue/dequeue | tpu::System::EnqueueInfeed @ 0x1d0b5d00, tpu::System::DequeueOutfeed @ 0x1d0b5f00 |
| Legacy SE infeed entry | tensorflow::tpu::TpuTransferManager::TransferLiteralToInfeed(StreamExecutor*, LiteralSlice&) @ 0xe9721c0 |
| Legacy SE outfeed entry | tensorflow::tpu::TpuTransferManager::TransferLiteralFromOutfeed(StreamExecutor*, MutableBorrowingLiteral) @ 0xe972660 |
| Legacy C-shim slots | ExecutorApiFn()+560 (infeed enqueue), +576 (outfeed dequeue); status +392/+400/+408/+384 |
| Legacy driver leaf | TpuExecutor_EnqueueInfeed @ 0xeab9680 → deepsea::executor::DeepseaExecutor::EnqueueInfeed; TpuExecutor_DequeueOutfeed @ 0xeab96c0 |
| Queue handle | tpu::TpuCoreLocation + int queue index (NOT a device address) |
| Queue objects | tpu::TpuInfeedQueue / tpu::TpuOutfeedQueue, Pxc/Jxc per-generation impls |
| Evidence grade | Reimplementation-grade / byte-confirmed against IDA decompile (both paths traced end-to-end) |
1. Two Transfer Managers, One Queue Hardware
libtpu carries the full XLA transfer-manager surface twice. The byte-confirmed split:
| Aspect | PJRT path | Legacy StreamExecutor path |
|---|---|---|
| Infeed entry | xla::TpuDevice::TransferToInfeed @ 0xf7ff540 | tensorflow::tpu::TpuTransferManager::TransferLiteralToInfeed @ 0xe9721c0 |
| Outfeed entry | xla::TpuDevice::TransferFromOutfeed @ 0xf7ffca0 | …::TpuTransferManager::TransferLiteralFromOutfeed @ 0xe972660 |
| Caller surface | PJRT_Device_* / PjRtDevice (JAX, PyTorch-XLA) | xla::LocalClient / xla::Service, TF-TPU op kernels |
| Device runtime | tpu::System (TFRT async-value native) | TfTpu_ExecutorApiFn C-ABI table (SE_StreamExecutor*) |
| Marshalling | LiteralLinearizer in-process | ApiConverter::ToC → C-shim → driver |
| Source root | learning/45eac/research/pjrt/ | third_party/tensorflow/compiler/xla/stream_executor/tpu/ |
| Driver leaf | tpu::System::EnqueueInfeed / DequeueOutfeed | DeepseaExecutor::EnqueueInfeed / DequeueOutfeed |
GOTCHA: these are not layered. The PJRT TpuDevice path has zero references to ExecutorApiFn / SE_StreamExecutor / TpuExecutor (byte-confirmed by a range scan over the TpuClient text); it talks to tpu::System directly. A reimplementer who assumes PJRT infeed forwards through TpuTransferManager will be wrong: PJRT linearizes and enqueues itself. The two paths converge only inside the TPU driver core (the on-chip queue), which each reaches through its own last hop. Confidence: CONFIRMED.
There is a third, generation-specific variant — xla::DeepseaTransferManager::TransferLiteralToInfeed @ 0xeac3cc0 / TransferLiteralFromOutfeedLocked @ 0xeac4b80 — which is the jellyfish/Deepsea TransferManager subclass; it shares the linearizer/TransferSizeUtil machinery with the PJRT path. The CPU backend ships its own (xla::CpuTransferManager @ 0xf93fda0/0xf93fde0 → TransferLiteralToInfeedOnCpu @ 0xf940080), used only when HLO lands on the host CPU device. This page documents the TPU infeed/outfeed; CPU infeed is a posix-style host-memory ring outside scope.
2. The Queue Handle and the Layout Contract
2.1 What names a queue
A transfer does not carry a device address. The infeed/outfeed target is named by:
- a tpu::TpuCoreLocation: which physical TensorCore the queue belongs to; and
- an int queue index: which of that core's several infeed/outfeed queues to use.
tpu::System::EnqueueInfeed(const TpuCoreLocation&, int queue_id, Span<const uint8>, AnyInvocable<void(const Status&)>) (0x1d0b5d00) resolves the actual queue object from these on every call:
// tpu::System::EnqueueInfeed sub_1D0B5D00
// a2 = TpuCoreLocation, a3 = queue_id, a4/a5 = Span<uint8>, a6 = completion callback
chip = TpuCoreLocation::Chip(core_loc); // @0x1d0b5d2f
topo = system.topology()->chip_to_node(chip); // vtable +80
node = topo->core_for(core_loc.core_index); // vtable +32 (core_loc+0x30)
queue = node->infeed_queue(queue_id); // vtable +48
status = queue->Validate(); // vtable +32
if (status != OkStatus) { invoke callback(status); return; }
// resolve the per-core TpuAllocator by its shared-memory location
alloc = system.allocators().at(core_loc.LocalSharedMemory(0)); // flat_hash_map<TpuSharedMemoryLocation, unique_ptr<TpuAllocator>>
view = alloc->MapDmaBuffer(span_ptr, span_len); // vtable +168
if (view.mapped) queue->Enqueue(view, …, callback); // vtable +56 (mapped DMA buffer)
else queue->EnqueueImpl(span, callback); // vtable +40 (raw span)
So the queue handle is ephemeral: the caller holds a TpuCoreLocation and an index, and the runtime walks topology → chip-node → core → infeed_queue[id] every enqueue. There is no persistent queue pointer in the PJRT TpuDevice. DequeueOutfeed (0x1d0b5f00) is the mirror over outfeed_queue(queue_id).
NOTE: queue index 0 in practice. Both PJRT span loops (TransferLinearizedBufferToInfeed @ 0xf8d5cc0 and the outfeed loop in TransferFromOutfeedHelper) pass queue_id = 0 for every span; the index argument exists for multi-queue cores, but the literal-transfer path uses queue 0. The per-core list of infeed queues comes from TpuChipConfig::GetInfeedQueues(TpuCoreType) @ 0x20afcc80, and TpuDevice::TransferToInfeed RET_CHECKs !infeed_queues.empty() (tpu_pjrt_client.cc:2158) before linearizing. Confidence: CONFIRMED.
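The per-call resolution described above can be sketched as a lookup keyed by the {core location, queue index} pair. This is a minimal illustration only: CoreLocation, InfeedQueue, Topology and ResolveInfeedQueue are hypothetical stand-ins invented for the sketch, not libtpu types, and the real walk goes through vtable calls on the topology objects rather than a map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for tpu::TpuCoreLocation: a chip id plus a core index.
struct CoreLocation {
  int chip;
  int core_index;
};

// Hypothetical queue object; only its index is modelled.
struct InfeedQueue {
  int id;
};

// Hypothetical topology: (chip, core_index) -> that core's infeed queues.
class Topology {
 public:
  void AddCore(CoreLocation loc, int num_queues) {
    assert(num_queues > 0);  // analogous to the RET_CHECK on a non-empty queue list
    std::vector<InfeedQueue> queues;
    for (int i = 0; i < num_queues; ++i) queues.push_back(InfeedQueue{i});
    cores_[std::make_pair(loc.chip, loc.core_index)] = std::move(queues);
  }

  // Walks chip -> core -> queue[id] on every call and returns nullptr when the
  // pair does not name a queue. Callers are expected not to cache the result.
  const InfeedQueue* ResolveInfeedQueue(CoreLocation loc, int queue_id) const {
    auto it = cores_.find(std::make_pair(loc.chip, loc.core_index));
    if (it == cores_.end()) return nullptr;
    if (queue_id < 0 ||
        static_cast<std::size_t>(queue_id) >= it->second.size()) {
      return nullptr;
    }
    return &it->second[static_cast<std::size_t>(queue_id)];
  }

 private:
  std::map<std::pair<int, int>, std::vector<InfeedQueue>> cores_;
};
```

The point of the shape is that a transfer request carries only the small value pair; nothing that outlives the call refers to the queue object itself.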
2.2 The shape → device-layout linearization
The host literal cannot go to the queue as-is; the device wants its own tiled layout. The contract, identical on both PJRT entry points:
- Host shape → device shape. xla::jellyfish::TransferSizeUtil::HostShapeToDeviceShape(out_shape, system, host_shape, 0, 0, 1) computes the on-device shape (tiling, padding). Infeed: TpuDevice::TransferToInfeed line ~0xf7ff540+…. Outfeed: TransferFromOutfeed @ 0xf7ffca0.
- Element-type gate. Outfeed verifies xla::jellyfish::HardwareLayout::SupportedPrimitiveType(element_type); an unsupported dtype returns Unimplemented("Attempted to transfer array of shape %s from a TPU device. Transferring data with element type %s has not been implemented on TPUs.") (outfeed_utils.cc). Infeed relies on the linearizer's own checks.
- Linearize (infeed) / delinearize (outfeed). Infeed calls xla::jellyfish::LiteralLinearizer::LinearizeToBuffers(system, literal, device_shape, &buffers, …), producing a vector of device-layout byte buffers; the last argument v12 = (GetInfeedQueues()[0]+64 == 1) selects a packing variant. Outfeed allocates a posix_memalign-aligned receive buffer, dequeues into it, then xla::jellyfish::LiteralLinearizer::Delinearize(topology, device_shape, raw_buffer, byte_count/4, layout, system) reshapes the raw device bytes back into the caller's MutableBorrowingLiteral.
- Tuple handling (outfeed). TransferFromOutfeed walks ShapeUtil::TupleElementCount and calls TransferFromOutfeedHelper once per tuple leaf (each leaf gets its own MutableBorrowingLiteral view at index i); a non-tuple shape is a single call. Outfeed also has a fast TransferX64OrX128FromOutfeedHelper arm for 64-bit and 128-bit element widths (ElementHasBitWidth(64)/(128)).
GOTCHA: the layout must be tile-populated. TransferFromOutfeedHelper first runs LayoutUtil::ValidateLayoutForShape, then TransferSizeUtil::GetCompactTiles. If the device layout has no tiles it fails with "Device layout of an array needs to be populated with tiles, got <layout>" (outfeed_utils.cc:92). A reimplementer feeding an untiled layout into outfeed gets an InvalidArgument, not a silent raw copy. Confidence: CONFIRMED.
3. The PJRT Path — Span Chunking and Blocking
3.1 Infeed: TransferToInfeed → TransferLinearizedBufferToInfeed
After linearizing the literal into a buffer vector, TpuDevice::TransferToInfeed schedules every buffer into a tpu::BlockableAsyncTaskGroup (sized to the buffer count) under the device's infeed mutex (this+376), invokes tpu::TransferLinearizedBufferToInfeed per buffer, then blocks the calling thread on WaitTillDone:
// xla::TpuDevice::TransferToInfeed sub_F7FF540 (tpu_pjrt_client.cc:2153)
RET_CHECK(!infeed_queues.empty()); // :2158
device_shape = TransferSizeUtil::HostShapeToDeviceShape(system, literal.shape());
linear_ok = LiteralLinearizer::LinearizeToBuffers(system, literal, device_shape,
&buffers, …, pack_variant);
if (linear_ok != Ok) return InitRep(linear_ok); // :2176
mutex.lock(this+376); // infeed serialization lock
group = BlockableAsyncTaskGroup{ total = buffers.count }; // util.h:124 CHECK total>0
for (buf : buffers) {
CHECK(--unscheduled_count_ >= 0); // util.h:138 capacity guard
next_done = group.NextDone(); // per-task completion callback
queue = walk to first ready infeed queue node; // skip nodes with (flags&3)!=0
tpu::TransferLinearizedBufferToInfeed(buf.ptr, buf.len, core_location,
next_done, queue /*+64*/); // @0xf8d5cc0
}
group.WaitTillDone(); // Mutex::LockWhenCommon on count==0
// destroy group, release buffers, unlock
tpu::TransferLinearizedBufferToInfeed (0xf8d5cc0) is the per-buffer span loop. It splits the device buffer into hardware spans of TransferSizeUtil::TensorCoreInfeedSpanSizeBytes bytes and enqueues each:
// tpu::TransferLinearizedBufferToInfeed sub_F8D5CC0
// a1 = buffer bytes, a2 = byte_count, a3 = TransferSizeUtil*, a4 = callback, a5 = System*
span = TransferSizeUtil::TensorCoreInfeedSpanSizeBytes(util, core, 1, callback); // @0xf8d5d…
n_full = byte_count / span;
n_tasks = n_full + (byte_count % span != 0); // ceil-div task count
group = AsyncTaskGroup(operator new(64,16), n_tasks, callback);
ptr = a1; remaining = byte_count;                  // bytes still to enqueue
while (remaining >= span) { // full spans: enqueue in place
CHECK(--unscheduled_count_ >= 0); // util.h:52
tpu::System::EnqueueInfeed(system, core, /*queue=*/0, ptr, span, group.NextDone());
ptr += span; remaining -= span;
}
if (remaining > 0) { // tail span: pad to span size
pad_len = TensorCoreInfeedSpanSizeBytes(util, remaining-core, 1, …);
posix_memalign(&tail, 0x20, pad_len); // 32-byte aligned
memcpy(tail, ptr, remaining);
memset(tail + remaining, 0, pad_len - remaining); // zero-pad the partial span
tpu::System::EnqueueInfeed(system, core, 0, tail, pad_len, owning_callback); // frees tail
}
Two details a reimplementer must reproduce: (a) every span enqueue is a separate EnqueueInfeed with its own completion callback fed by the AsyncTaskGroup; (b) a partial trailing span is copied into a fresh posix_memalign(32) buffer and zero-padded to a full span width before enqueue (the device reads whole spans). The owning variant of the callback frees that padded buffer on completion.
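The two details can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the library's code: SpanSink and EnqueueInSpans are hypothetical names, the sink is synchronous (so a std::vector can hold the padded tail instead of a 32-byte-aligned posix_memalign buffer freed from the completion callback), and the span size is passed in rather than queried from TransferSizeUtil.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical sink standing in for one EnqueueInfeed call; it receives
// exactly one span per invocation.
using SpanSink = std::function<void(const std::uint8_t* data, std::size_t len)>;

// Splits `bytes` into spans of `span_size` bytes. Full spans are passed in
// place; a trailing partial span is copied into a scratch buffer and
// zero-padded to a full span width. Returns the number of spans handed to the
// sink, which equals ceil(byte_count / span_size).
std::size_t EnqueueInSpans(const std::uint8_t* bytes, std::size_t byte_count,
                           std::size_t span_size, const SpanSink& sink) {
  assert(span_size > 0);
  std::size_t sent = 0;
  std::size_t remaining = byte_count;
  const std::uint8_t* ptr = bytes;
  while (remaining >= span_size) {  // full spans: no copy
    sink(ptr, span_size);
    ptr += span_size;
    remaining -= span_size;
    ++sent;
  }
  if (remaining > 0) {  // tail: copy, then zero-pad
    // Assumption: the sink is synchronous, so the scratch buffer may be
    // released on return. An asynchronous sink would need the completion
    // callback to own and free this buffer instead.
    std::vector<std::uint8_t> tail(span_size, 0);
    std::memcpy(tail.data(), ptr, remaining);
    sink(tail.data(), tail.size());
    ++sent;
  }
  return sent;
}
```

For example, a 10-byte buffer with a 4-byte span yields two in-place spans and one padded span whose last two bytes are zero.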
3.2 Outfeed: TransferFromOutfeed → TransferFromOutfeedHelper
The outfeed mirror computes the padded device byte count, then dequeues span-by-span into one aligned receive buffer, blocks on a count predicate, and delinearizes:
// xla::TransferFromOutfeedHelper sub_F8436E0 (outfeed_utils.cc)
ValidateLayoutForShape(core, layout); // :82 on failure
device_shape = HostShapeToDeviceShape(...); GetCompactTiles(...); // tile check :92
if (!HardwareLayout::SupportedPrimitiveType(elt)) return Unimplemented(...);
padded = TransferSizeUtil::ShapeSizeCompactForDma(topology, device_shape);
RET_CHECK(padded % sizeof(uint32_t) == 0); // :143
if (elt is x64/x128 array && fast-path layout) // direct into literal
dst = MutableLiteralBase::untyped_data(literal);
else { posix_memalign(&dst, 0x20, padded); } // staging buffer
max_span = TransferSizeUtil::TensorCoreMaxOutfeedSpanSizeBytes(topology);
remaining = padded; off = 0; outstanding = 0;
while (remaining > 0) {
chunk = min(remaining, max_span);
++outstanding; // count guarded by mutex
tpu::System::DequeueOutfeed(core, system, /*queue=*/0, dst+off, chunk, on_chunk_done);
off += chunk; remaining -= chunk;
}
Mutex::LockWhenCommon(predicate: outstanding == 0); // block until all spans done
// then reshape device bytes back into the literal
LiteralLinearizer::Delinearize(topology, device_shape, dst, padded/4, layout, system); // :231
free(dst);
TransferFromOutfeed @ 0xf7ffca0 holds the device's outfeed mutex (this+0x180) across the whole tuple-leaf loop and calls TransferFromOutfeedHelper for each leaf. On a layout-validate failure the helper returns at outfeed_utils.cc:82; on a dequeue error it surfaces CreateStatusAndConditionallyLog(212, …); a delinearize failure surfaces :231.
QUIRK: outfeed dequeues into a staging buffer, then delinearizes; infeed linearizes, then enqueues. The two are not symmetric in their buffer handling. Infeed produces N device buffers up front and streams each as spans. Outfeed receives raw device bytes into one aligned buffer (or, for the x64/x128 fast path, straight into the literal's untyped_data) and afterwards runs Delinearize to lay the bytes out as the host literal. A reimplementer who tries to delinearize per-span will corrupt multi-span shapes. Confidence: CONFIRMED.
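A small sketch of the outfeed chunk plan makes the asymmetry concrete: chunks are cut at min(remaining, max_span) with no padding of the last one, and all of them land at offsets inside a single staging buffer that is delinearized once afterwards. Chunk and PlanOutfeedChunks are hypothetical names for this illustration; the maximum span size is a parameter here rather than a value read from the topology.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One dequeue request: where in the staging buffer it lands and how long it is.
struct Chunk {
  std::size_t offset;
  std::size_t length;
};

// Plans the dequeue chunks for a padded device byte count. Unlike the infeed
// tail, the last outfeed chunk is simply shorter; it is not padded.
std::vector<Chunk> PlanOutfeedChunks(std::size_t padded_bytes,
                                     std::size_t max_span) {
  assert(max_span > 0);
  assert(padded_bytes % 4 == 0);  // mirrors the check against sizeof(uint32_t)
  std::vector<Chunk> chunks;
  std::size_t offset = 0;
  std::size_t remaining = padded_bytes;
  while (remaining > 0) {
    const std::size_t length = std::min(remaining, max_span);
    chunks.push_back(Chunk{offset, length});
    offset += length;
    remaining -= length;
  }
  return chunks;
}
```

Because every chunk is only a slice of one contiguous device-layout image, the inverse layout transform is meaningful only once the whole image has arrived.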
3.3 Blocking / async semantics
Both PJRT entry points are synchronous to the caller: TransferToInfeed blocks in BlockableAsyncTaskGroup::WaitTillDone and TransferFromOutfeed blocks in Mutex::LockWhenCommon until every span's completion callback has fired. The device side is asynchronous — each EnqueueInfeed/DequeueOutfeed returns immediately after handing the span to the queue, and a TPU-driver completion fires the AnyInvocable<void(const Status&)> callback that decrements the outstanding-task count. The host thread is parked on an absl::Mutex condition, not spinning. A device-side error is delivered as the Status argument to the callback and propagated out as the transfer's return status. Per-device serialization is enforced by the infeed mutex (TpuDevice+376) and outfeed mutex (TpuDevice+0x180): two concurrent TransferToInfeed calls on one device cannot interleave their spans.
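The blocking pattern can be sketched with the standard library. This is an approximation under assumptions: BlockingTaskGroup and CompleteAll are hypothetical names, std::mutex and std::condition_variable stand in for absl::Mutex and its condition wait, and an empty std::string stands in for an OK Status. The demo helper fires the callbacks inline, so the wait returns immediately; with a real driver they would fire from other threads while the caller is parked.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a status value: empty string means OK.
using Status = std::string;

class BlockingTaskGroup {
 public:
  explicit BlockingTaskGroup(int total) : outstanding_(total) {
    assert(total > 0);
  }

  // Returns the completion callback for one span. The first non-OK status is
  // remembered; the last callback to fire wakes the waiter.
  std::function<void(const Status&)> NextDone() {
    return [this](const Status& s) {
      std::lock_guard<std::mutex> lock(mu_);
      if (!s.empty() && first_error_.empty()) first_error_ = s;
      if (--outstanding_ == 0) cv_.notify_all();
    };
  }

  // Parks the caller until every callback has fired, then returns the first
  // error seen (or an empty string when all spans succeeded).
  Status WaitTillDone() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return outstanding_ == 0; });
    return first_error_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int outstanding_;
  Status first_error_;
};

// Demo helper: one task per entry, completed inline with the given status.
Status CompleteAll(const std::vector<Status>& span_results) {
  BlockingTaskGroup group(static_cast<int>(span_results.size()));
  std::vector<std::function<void(const Status&)>> callbacks;
  for (std::size_t i = 0; i < span_results.size(); ++i) {
    callbacks.push_back(group.NextDone());
  }
  for (std::size_t i = 0; i < span_results.size(); ++i) {
    callbacks[i](span_results[i]);
  }
  return group.WaitTillDone();
}
```

Keeping the first error rather than the last is a choice made for this sketch; the source only states that a device-side error propagates out as the transfer's return status.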
4. The Legacy StreamExecutor Path
tensorflow::tpu::TpuTransferManager is the xla::TransferManager subclass used by LocalClient/Service and TF-TPU op kernels. Its infeed/outfeed are pure C-ABI marshalling — no linearization in libtpu's own code; the driver does it behind the shim.
// tensorflow::tpu::TpuTransferManager::TransferLiteralToInfeed sub_E9721C0
fn = ExecutorApiFn();
ctx = fn[360](); // make SE_Status / status carrier (slot +360)
ApiConverter::ToC(literal_slice); // marshal LiteralSlice → XLA_Literal C struct
fn[560]( this->se_handle, // ExecutorApiFn slot +560 = infeed enqueue
executor->se_executor, // *(StreamExecutor+7)
&c_literal, ctx );
ApiConverter::Destroy(&c_literal);
status = fn[408](ctx) ? Ok // slot +408 = StatusOk?
: MakeRep(fn[400](ctx)*4|1, // slot +400 = code, +392 = message ptr
fn[392](ctx), strlen, // status_helper.h:38
38, "…/status_helper.h");
fn[384](ctx); // slot +384 = free status carrier
return status;
Outfeed (TransferLiteralFromOutfeed @ 0xe972660) is identical except it marshals two C structs (ApiConverter::ToC(literal.shape()) into a shape carrier and ApiConverter::ToC(LiteralSlice(literal)) into a literal carrier) and calls slot +576 instead of +560. The leaf exports TpuExecutor_EnqueueInfeed @ 0xeab9680 and TpuExecutor_DequeueOutfeed @ 0xeab96c0 forward into the deepsea::executor::DeepseaExecutor::EnqueueInfeed (@ 0x1d0dbbe0) / DequeueOutfeed (@ 0x1d0dc160) layer in this same binary:
// TpuExecutor_EnqueueInfeed sub_EAB9680 (the C-ABI export the +560 slot targets)
result = deepsea::executor::DeepseaExecutor::EnqueueInfeed(*executor);
// result is an absl::Status; refcount-managed into *out_status
NOTE: slot numbers are the contract. The legacy path's only libtpu-visible knobs are the TfTpu_ExecutorApiFn table offsets: +560 = infeed enqueue, +576 = outfeed dequeue, +360 = status-carrier ctor, +392/+400/+408/+384 = message/code/ok/destroy. These are part of the TPU-driver C-ABI (documenting that ABI is a separate task; this is not the full table). A reimplementer of the legacy path reproduces the marshalling and these slot calls, not the linearizer: the driver owns layout. Confidence: CONFIRMED for the slot numbers and call shape. The DeepseaExecutor::EnqueueInfeed(int, Span<uint8>) body is present in this binary (0x1d0dbbe0; the callback-carrying overload is at 0x1d0dbe20): it takes a DeepseaPlatform::ScopedRef, resolves the per-index queue object through the executor's queue table (vtable +48), and calls that queue's enqueue (vtable +48) on the already-marshalled span. What stays opaque from libtpu.so is the layout/linearization the driver performs before this point and the silicon FIFO write below the queue object, not the executor enqueue itself.
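The call shape of the legacy shim (create a status carrier, call the slot, read ok/code/message, free the carrier) can be sketched with a plain function-pointer table. Everything here is hypothetical scaffolding for illustration: ExecutorApi, StatusCarrier, CallInfeed and MakeFakeApi are invented names, the enqueue slot is simplified to two arguments, and the fake driver just rejects a null literal with an arbitrary code. Only the slot offsets in the comments come from the text above.

```cpp
#include <cassert>
#include <string>

// Hypothetical opaque status carrier; the real layout is not described here.
struct StatusCarrier {
  int code;
  std::string message;
};

// Hypothetical function table. Comments give the slot offsets quoted above.
struct ExecutorApi {
  StatusCarrier* (*status_new)();                                     // +360
  void (*status_free)(StatusCarrier*);                                // +384
  const char* (*status_message)(const StatusCarrier*);                // +392
  int (*status_code)(const StatusCarrier*);                           // +400
  bool (*status_ok)(const StatusCarrier*);                            // +408
  void (*enqueue_infeed)(const void* c_literal, StatusCarrier* out);  // +560
};

struct Result {
  bool ok;
  int code;
  std::string message;
};

// Same order as the pseudocode: make carrier, call slot, convert, free.
Result CallInfeed(const ExecutorApi& api, const void* c_literal) {
  StatusCarrier* st = api.status_new();
  api.enqueue_infeed(c_literal, st);
  Result r{true, 0, ""};
  if (!api.status_ok(st)) {
    // Copy the message out before the carrier is destroyed.
    r = Result{false, api.status_code(st), api.status_message(st)};
  }
  api.status_free(st);
  return r;
}

// Fake driver for the sketch: fails with code 3 when the literal is null.
ExecutorApi MakeFakeApi() {
  ExecutorApi api{};
  api.status_new = []() -> StatusCarrier* { return new StatusCarrier{0, ""}; };
  api.status_free = [](StatusCarrier* s) { delete s; };
  api.status_message = [](const StatusCarrier* s) { return s->message.c_str(); };
  api.status_code = [](const StatusCarrier* s) { return s->code; };
  api.status_ok = [](const StatusCarrier* s) { return s->code == 0; };
  api.enqueue_infeed = [](const void* lit, StatusCarrier* out) {
    if (lit == nullptr) {
      out->code = 3;
      out->message = "null literal";
    }
  };
  return api;
}
```

The caller never inspects the carrier directly; every read goes through a table slot, which is why the offsets are the stable part of the contract.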
5. On-Chip Queue Objects (Driver Depth)
Below tpu::System::EnqueueInfeed/DequeueOutfeed sits a generation-specific queue object. The symbol set confirms a tpu::TpuInfeedQueue / tpu::TpuOutfeedQueue abstract base with per-silicon impls:
| Class | Generation | Key methods (confirmed symbols) |
|---|---|---|
| tpu::TpuInfeedQueuePxcDriverImpl | Pxc (Pufferfish/BarnaCore) | EnqueueImpl(Span<uint8>, AnyInvocable) @ 0xe802600 |
| tpu::TpuOutfeedQueuePxcDriverImpl | Pxc | RawDequeue(Span<uint8>/DmaBuffer*, AnyInvocable) @ 0xe803de0/0xe803e80 |
| tpu::TpuInfeedQueueJxcDriverImpl | Jxc (Jellyfish) | EnqueueImpl(TpuMappedDmaBufferView/Span, …) @ 0xe7380e0/0xe737c60 |
| tpu::TpuOutfeedQueueJxcDriverImpl | Jxc | RawDequeue(...) @ 0xe73a100/0xe73a320, MapDmaBuffer @ 0xe73a520 |
| tpu::TpuPxcDriver | Pxc driver | EnqueueInfeed(TpuInfeedQueueOnChip, Span, AnyInvocable) @ 0xe8109c0, DequeueOutfeed @ 0xe811380 |
| tpu::JfSoftwareInfeedQueueController | Jxc software queue | ThrottledSharedMemoryWrite(...) @ 0xe738800/0xe738900 |
The Pxc infeed enqueue shows the queue-index dispatch and a sparsecore offload fork:
// tpu::TpuInfeedQueuePxcDriverImpl::EnqueueImpl sub_E802600
queue_index = this[16]; // signed int
driver = this[136]; // TpuPxcDriver*
if (queue_index < 0) // negative index = BarnaCore/sparsecore
TpuPxcDriver::EnqueueBarnaCoreHmf(driver); // HMF (host-memory-feed) offload
else
TpuPxcDriver::EnqueueInfeed(driver, on_chip_queue, queue_index, span_ptr, span_len, cb);
A reimplementer needs from this layer only: (a) the queue is selected by an integer index resolved through the topology, (b) a negative index reroutes to the BarnaCore/sparsecore host-memory-feed path rather than the TensorCore infeed FIFO, and (c) the actual silicon enqueue/dequeue (DMA buffer mapping, FIFO write) is inside the TPU driver core and is opaque from this binary. The TpuChipConfig::InfeedQueue::PerCore / Hardware::PerCore variant types (0xe7274c0, 0xe7386c0) confirm the queues are configured per-core with a hardware-or-software representation. Confidence: HIGH on the class structure and dispatch; LOW on the silicon FIFO mechanics.
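Point (b) reduces to a sign test on the stored queue index. The sketch below shows only that decision; InfeedRoute and RouteForQueueIndex are hypothetical names, and nothing about what either route does afterwards is modelled.

```cpp
#include <cassert>

// Hypothetical labels for the two branches of the Pxc enqueue.
enum class InfeedRoute {
  kTensorCoreFifo,  // non-negative index: regular on-chip infeed queue
  kBarnaCoreHmf     // negative index: BarnaCore/sparsecore host-memory feed
};

// Mirrors the `queue_index < 0` test in the pseudocode above.
InfeedRoute RouteForQueueIndex(int queue_index) {
  return queue_index < 0 ? InfeedRoute::kBarnaCoreHmf
                         : InfeedRoute::kTensorCoreFifo;
}
```

Index 0, the one used by the literal-transfer path, takes the TensorCore branch.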
6. Reimplementation Notes
- Pick the right entry. A PJRT consumer (JAX/PyTorch-XLA) uses TpuDevice::TransferToInfeed/FromOutfeed and never touches TpuTransferManager. Only LocalClient/Service/TF-TPU kernels use the legacy TpuTransferManager + C-shim. Implementing one does not implement the other.
- The handle is {TpuCoreLocation, int}. Do not model an infeed/outfeed queue as a device pointer. The runtime re-resolves topology → chip-node → core → queue[id] on every enqueue; the literal-transfer path always uses queue_id = 0.
- Layout is mandatory, both directions. Infeed: HostShapeToDeviceShape → LinearizeToBuffers. Outfeed: dequeue raw → Delinearize. Outfeed additionally requires a tile-populated layout (GetCompactTiles) and a supported element type (SupportedPrimitiveType); the byte count must be % 4 == 0.
- Spans, not buffers, hit the queue. Chop each device buffer by TensorCoreInfeedSpanSizeBytes (infeed) / TensorCoreMaxOutfeedSpanSizeBytes (outfeed). Zero-pad the trailing partial infeed span to full span width in a 32-byte-aligned buffer.
- Blocking is the API. Both PJRT entries block the caller until all span callbacks fire (BlockableAsyncTaskGroup::WaitTillDone / Mutex::LockWhenCommon). The device side is async via per-span AnyInvocable<void(const Status&)> callbacks; errors arrive as the callback's Status. Per-device infeed/outfeed mutexes serialize concurrent transfers.
- This is a streaming channel, not a buffer copy. Contrast Host↔Device DMA: a bulk PJRT_Buffer copy targets one HBM allocation and completes once; infeed/outfeed feed a program-ordered FIFO that the running executable consumes/produces as it hits Infeed/Outfeed HLO ops.
Cross-References
- Host↔Device DMA: the bulk-buffer transfer path (TpuRawBuffer copies, UHI/HIB DMA spans); contrast: infeed/outfeed is the streaming-queue path, distinct from bulk transfer
- Execute Async on Stream: the execute path whose running program drives the Infeed/Outfeed HLO ops these queues feed
- Stream Semantics: the tpu::System ordering / sequence-point model the enqueue/dequeue callbacks complete against
- Host Callbacks: the sibling host↔device control channel (DoHostCallbackWithStatus), the other way a running program reaches the host
- UHI Host-Interface DMA: the host-interface DMA band that carries the on-chip side of these transfers in the profiler
- Runtime Overview: where the transfer managers and tpu::System sit in the libtpu runtime stack