Error/Status String Templates

All offsets, symbols, and counts on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (libtpu_lts_20260413_b_RC00, ELF x86-64, 781,691,048 bytes, BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped). Other builds will differ.

Abstract

libtpu builds its diagnostics out of format-string templates baked into .rodata and filled at the call site. There are two formatting idioms and one wrapping convention. The dominant idiom is C-printf-style — absl::StrFormat("…%s…%d…", args) and LOG(...) << absl::StrFormat(...) — which accounts for the large majority of the surface; the minority idiom is absl positional substitution — absl::Substitute("…$0…$1…", args) — confined mostly to the megascale DCN runtime, the collective buffer-size validators, and a statically-linked protobuf descriptor validator. Most of these templates are then wrapped in an absl::Status by one of three factory families (see Status-Code Mapping), so the same string is simultaneously the human-readable error and the payload of a StatusOr<T> that propagates back through PJRT to a JAX/XLA user.

This page is a reference catalog, not an algorithm trace. Its value is the grouped, deduplicated table of real templates with their placeholders explained and their absl::Status code attributed where that attribution is byte-confirmed. The recovered surface is roughly 2,937 distinct error/status templates; this page does not reproduce all of them — it documents the structure of the space (the placeholder grammar, the subsystem partition, the three Status-construction idioms, the byte-confirmed argument types) and gives a representative, source-anchored sample per subsystem so a reimplementer can predict the shape of any template not listed.

Placeholder semantics are the through-line. A %s is almost never a raw C string: it is some object's ToString() / AbslStringify — a Shape, a Layout, an HloInstruction name, a device name, an opcode mnemonic. A %d/%lld/%zu is a dimension, an ordinal, a count, or a byte size. Pointers (%p) and floats (%f) appear only in low-level driver and cost-model paths. The catalog calls out, per template, what each placeholder means, because the printf type alone (%s = char*) hides the real C++ argument.

NOTE — the error/status templates here are distinct from user-facing hint strings (suggestions phrased as "try…", "consider…", not failures), which live on hint-strings.md, and from the internal pass-name strings (pipeline-stage identifiers, not diagnostics) on internal-pass-names.md. This page owns the error/status format-string catalog, the placeholder semantics, and the status-code mapping.

| Item | Value |
|---|---|
| Distinct error/status templates | ~2,937 (~2,799 printf-style, ~138 positional $N/%v) |
| Status-construction idioms | 3 (<Code>StrCat factory, MakeErrorStream, prose absl::<Code>Error) |
| Byte-confirmed arg-type factories | 99 xla::<Code>StrCat<Types…> instantiations |
| MakeErrorStream return-type ops | 390 MakeErrorStreamWithOutput<T> (one per StatusOr<T>) |
| Top placeholders | %s (~2778), %d (~2031), %u (116), %zu (84), %x (55), %lld (52) |
| Largest subsystem block | Compile / HLO / verifier (~875 candidates) |
| Spot-confirmed against decompile | 12 of 12 representative templates (see report) |

How To Read This Catalog

The two formatting idioms

printf-style   "Argument to Cholesky must have rank >= 2; shape was %s"
               filled by  absl::StrFormat(template, shape.ToString())
               or         LOG(ERROR) << absl::StrFormat(template, …)

positional     "ALL_TO_ALL not supported when buffer size is not divisble
                by number of endpoints. Buffer Size: $0 Number of
                endpoints: $1. MegascaleInfo: $2"
               filled by  absl::Substitute(template, size, n, info)

The printf surface uses %-specs; the positional surface uses $0, $1, … (and a handful of %v, the absl stringify-any sigil). The two never mix inside one template. A reimplementer keying off %-specs alone will miss the ~138 positional templates entirely.
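As a rough illustration of why a scanner has to look for both sigils, the sketch below sorts a template into one of the two idioms. It is a minimal heuristic, not the method used to build this catalog: the function name `classify` and both regular expressions are assumptions made for this example, the space flag of printf is deliberately left out to avoid matching prose such as "50% done", and `%v` is treated as a %-spec because that is how it appears syntactically, even though this page tallies it with the positional family.

```python
import re

# A printf conversion: % [flags] [width] [.precision] [length] conversion.
# Assumption: the ' ' flag is omitted so that "50% done" is not a false hit.
PRINTF_SPEC = re.compile(
    r"%[-+#0]*\d*(?:\.\d+)?(?:hh|h|ll|l|z|j|t|L)?[diouxXeEfgGcspv]"
)
# A positional sigil as used by absl::Substitute: $0 .. $9.
POSITIONAL = re.compile(r"\$\d")


def classify(template: str) -> str:
    """Return 'printf', 'positional', 'mixed' or 'literal' for one template."""
    # Drop the escaped forms first so "%%" and "$$" never count as sigils.
    text = template.replace("%%", "").replace("$$", "")
    has_printf = PRINTF_SPEC.search(text) is not None
    has_positional = POSITIONAL.search(text) is not None
    if has_printf and has_positional:
        return "mixed"
    if has_positional:
        return "positional"
    if has_printf:
        return "printf"
    return "literal"


if __name__ == "__main__":
    print(classify("Argument to Cholesky must have rank >= 2; shape was %s"))
    print(classify("Extra barrier participant. Expected: $0 Message $1"))
```

A scanner that only tested `PRINTF_SPEC` would report the second template as a plain literal, which is exactly the miss described above.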

Placeholder grammar

The full %-spec occurrence distribution across all templates is heavily skewed toward strings and ints:

| Spec | Count | C type | Real C++ argument (typical) |
|---|---|---|---|
| %s | ~2778 | char* | a ToString() / AbslStringify result: Shape, Layout, instruction name, device name, opcode |
| %d | ~2031 | int | a dimension index, ordinal, count, status/state enum |
| %u | 116 | unsigned | a count or index that cannot be negative |
| %zu / %zd | 84 / 19 | size_t / ssize_t | a byte size or element count |
| %x / %#x / %02x | 55 / 19 / 11 | unsigned (hex) | an address fragment, a register/bit mask, a chip ID byte |
| %lld / %ld / %lu / %llu | 52 / 48 / 40 / 4 | long long / long / unsigned long / unsigned long long | a large byte size or 64-bit count |
| %p | 32 | void* | a buffer/driver-object pointer (driver + cost-model paths only) |
| %c | 24 | char | a brace/bracket literal or an axis letter ('{', '}', 'X') |
| %f | 19 | double/float | a ratio/fraction (cost model, hbm-fraction config) |
| %v | 6 | absl-stringify | an object with an AbslStringify overload (e.g. TensorCoreBundle) |

GOTCHA — %s does not tell you the argument is a string in the program. It tells you the call site passed a const char*, which is overwhelmingly the product of an object's ToString(). When reimplementing, the real type behind a %s is recoverable only at the call site (e.g. whether the operand of "%s vs. %s" is a Shape::ToString() or an HloInstruction::name()); the catalog notes the likely type but does not byte-confirm it for the printf family.
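A tally of the kind shown in the table can be sketched in a few lines. This is an illustration only: `spec_histogram` and its regular expression are assumptions for this example, and unlike the table it folds flag and width variants together (`%#x` and `%02x` are both counted as `%x`).

```python
import re
from collections import Counter

# Capture only length modifier + conversion; flags, width and precision are
# skipped, so "%02x" and "%#x" are both tallied as "%x".
SPEC = re.compile(
    r"%[-+#0]*\d*(?:\.\d+)?((?:hh|h|ll|l|z|j|t)?[diouxXeEfgGcspv])"
)


def spec_histogram(templates):
    """Count %-spec occurrences across an iterable of template strings."""
    counts = Counter()
    for template in templates:
        # "%%" is a literal percent sign, not a placeholder.
        for match in SPEC.finditer(template.replace("%%", "")):
            counts["%" + match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Broadcast dimension %d mismatch: %d != %d; %s and %s.",
        "Allocation (size=%lld) would exceed memory (size=%lld) :: %s :: %s",
        "Failed to allocate %zu bytes. Memory limit: %zu bytes. Used: %zu bytes.)",
    ]
    for spec, n in spec_histogram(sample).most_common():
        print(spec, n)
```

The histogram says nothing about the real C++ type behind each `%s`; that still has to come from the call site, as the GOTCHA above explains.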

Confidence and attribution

Templates marked CERTAIN were spot-confirmed verbatim in the decompiled call site. The subsystem grouping is a keyword classification of the prose, not a per-template call-site trace; it is HIGH confidence for the prose content and MEDIUM for the exact owning pass. The status code column is CERTAIN only for the 99 <Code>StrCat factories (the mangled symbol names the code) and for the 390 MakeErrorStream sites (the macro names the code); for every other template the code is inferred from the prose and the dominant idiom of its subsystem, and is marked MEDIUM or LOW accordingly.


Compile / HLO / Codegen / Verifier — ~875 templates

The single largest block. These are the shape-inference and HLO-verifier diagnostics a JAX/XLA user hits when a program is malformed: rank checks, dimension-range checks, shape-equality checks, layout mismatches. Almost all are %s/%d-heavy and almost all wrap InvalidArgument (a few are RET_CHECK → Internal). The verifier's signature is the "vs." comparison style ("%d vs. %d", "%s vs. %s"), which appears wherever two quantities must match.

| Offset | Template (%-spec) | Placeholders | Code |
|---|---|---|---|
| 0x857d595 | Argument to Cholesky must have rank >= 2; shape was %s | %s=Shape | InvalidArgument |
| 0x857c6ef | Argument to symmetrize must have >= 2 dimensions, got %s | %s=Shape | InvalidArgument |
| 0x8580369 | All reduced tensors must have the same dimension. Tensor 0 has shape %s, Tensor %d has shape %s | %s=Shape, %d=index, %s=Shape | InvalidArgument |
| 0x85803c9 | All operands to AfterAll must be tokens; operand %d has shape %s | %d=operand idx, %s=Shape | InvalidArgument |
| 0x872a259 | broadcast_dimensions contains invalid value %d for result with rank %d | %d=value, %d=rank | InvalidArgument |
| 0xa02f654 | Broadcast dimension %d mismatch: %d != %d; %s and %s. | %d=dim, %d/%d=sizes, %s/%s=Shapes | InvalidArgument |
| 0xa02c26f | Cannot concatenate arrays that differ in dimensions other than the one being concatenated. Dimension %d in both shapes must be equal (or compatible): %s vs %s. | %d=dim, %s/%s=Shapes | InvalidArgument |
| 0xa02f9ba | Cannot bitcast types with undivisible bit-widths: %s => %s. | %s/%s=PrimitiveType | InvalidArgument |
| 0xa01cdc3 | Bitcast requires a new on-device shape to have the same size of %d bytes, but got %d bytes. | %d/%d=byte sizes | InvalidArgument |
| 0xa030ae8 | Cannot infer shape: attempting to index into non-tuple: %s. | %s=Shape | InvalidArgument |
| 0x857b3fa | sharding's tile count and device count does not match: %d vs. %d; shape=%s, sharding=%s | %d/%d=counts, %s/%s=Shape/Sharding | InvalidArgument |
| 0x858b317 | Arguments to TriangularSolve have shapes with different ranks: %s vs. %s | %s/%s=Shapes | InvalidArgument |
| 0xa0a2a82 | Binary op shape inference: %s; lhs: %s; rhs: %s is not implemented. | %s=op, %s/%s=Shapes | Unimplemented |
| 0x8728000 | Binary op expects 2 operands, but got %d | %d=count | RET_CHECK→Internal |
| 0x858400c | Bad scalar opcode in slot 0, opcode: %d bundle: %v, bits: %s | %d=opcode, %v=TensorCoreBundle, %s=bits | InvalidArgument |
| 0x857b280 | Cannot feed constants into bundle packer. Copy them to registers first. instr=%s | %s=instruction | Internal |
| 0x8584e00 | Cannot find a free bundle slot in bundle %s: %s | %s/%s=bundle/reason | Internal |

NOTE — 0x858400c ("Bad scalar opcode in slot 0…") resolves in the decompile to platforms_deepsea::jellyfish::isa::DecoderBcsDf::DecodeScalar0Slot (and a per-gen DecoderJf twin). Its %v is the TensorCoreBundle's AbslStringify, and it is one of the 6 TensorCoreBundle-bearing <Code>StrCat factories (see Arg-Type Decode). The per-generation decoder variants (gxc/gfc, gxc/glc, vxc) each emit their own copy, so near-identical templates with the same prose are separate rodata strings.


Scheduler / Fuel / FIFO — ~15 templates

Latency-hiding scheduler, annotation-range checks, the --xla_fuel budget, and the on-device FIFO push/pop ordering. The annotation templates lean on %c for the literal brace/bracket characters; the FIFO templates chain %s :: %s to attach instruction context.

| Offset | Template | Placeholders | Code |
|---|---|---|---|
| 0x8796ba8 | annotation arg must be in correct order as given; expected %c{%d%c but got %c{%d%c | %c=brace, %d=id | Internal |
| 0x8571b33 | annotation %c{%d%c is out of bounds | %c=brace, %d=id | Internal |
| 0x858a654 | annotation range was not closed; expected %c}%c: %s | %c=brace, %s=context | Internal |
| 0x857fa91 | async-done for %s must be scheduled before %s | %s/%s=instructions | RET_CHECK→Internal |
| 0x857fabf | async-done for %s must be scheduled on core %d before %s | %s=instr, %d=core | RET_CHECK→Internal |
| 0x858a9b8 | Cannot schedule FIFO pop instruction when the FIFO is empty %s :: %s | %s :: %s=instr context | Internal |
| 0x857ad7e | Cannot schedule FIFO push instruction when the FIFO is full. FIFO name: %s. (element count %d vs %d). %s :: %s | %s=name, %d vs %d=counts, %s :: %s=instr context | Internal |
| 0xa02bf0c | Conflicting schedule type requirements in computation rooted at %s. | %s=computation | Internal |
| 0xa086733 | Reference instruction %s was not found in the schedule. | %s=instruction | Internal |
| 0x862868f | GVN: Not replacing %s because GVN is out of fuel | %s=instruction | (LOG, not Status) |
| 0x8628660 | halt before %s because lowering is out of fuel | %s=instruction | (LOG, not Status) |
| 0xa03ca6a | Illegal value for --xla_fuel. Saw %s, but expected token %s to be an integer. | %s/%s=value/token | InvalidArgument |

NOTE — 0xa03ca6a ("Illegal value for --xla_fuel…") was confirmed inside the xla::MakeDebugOptionsFlags flag-parsing closure — it is a flag-value validator, the error counterpart to the --xla_fuel flag documented on the flag-name side. The two "out of fuel" strings at 0x8628660/0x862868f are diagnostic LOG output, not Status payloads (no factory wraps them); they are catalogued for completeness but should not be assumed to be recoverable as a Status code.


MSA / Memory / Allocation — ~79 templates

Memory-space assignment, prefetch/alternate-memory, HBM defragmentation, and the heap allocator. Byte-size placeholders dominate (%lld, %zu). The over-budget templates wrap ResourceExhausted; mismatch/verification templates wrap Internal.

| Offset | Template | Placeholders | Code |
|---|---|---|---|
| 0xa030cd9 | AllocateBufferForMemorySpace: Unsupported memory space: %s. | %s=memory space | InvalidArgument |
| 0x858aa58 | Allocation (size=%lld) would exceed memory (size=%lld) :: %s :: %s | %lld/%lld=sizes, %s :: %s=context | ResourceExhausted |
| 0xa083c62 | BufferAllocation::Slice for instruction %s at index %s cannot be determined at compile-time. | %s/%s=instr/index | Internal |
| 0x857ce50 | DefineBuffer: Mismatch in memory spaces: %s vs %s | %s vs %s=spaces | Internal |
| 0x8584ecd | Error defragmenting HBM %s: %s | %s/%s=region/reason | Internal |
| 0xa1300d0 | Failed to allocate %zu bytes. Memory limit: %zu bytes. Used: %zu bytes.) | %zu×3=req/limit/used | ResourceExhausted |
| 0x8728f50 | Invalid HBM offset %d | %d=offset | InvalidArgument |
| 0x872cb26 | Invalid memory space for input memory space colors: %d | %d=color | InvalidArgument |
| 0xa01cf08 | Out of memory allocating %d bytes. | %d=byte size | ResourceExhausted |
| 0xa09a63e | Number of bytes %lld allocated must be a multiple of chunk size %lld. | %lld/%lld=size/chunk | InvalidArgument |
| 0x857eed1 | Register allocator verification failure: live range %s; instruction %s | %s/%s=range/instr | Internal |
| 0xa02b12f | Scoped allocation with size %s and limit %s exceeded scoped %s limit by %s. | %s×4=sizes/labels | ResourceExhausted |

NOTE — do not attribute the near-identical template 0xa13e8e5 ("Failed to allocate node (%zu bytes). Memory limit: %zu [bytes]. Used: %zu [bytes].)") to TPU memory-space assignment. It lives in perfetto::protovm::RwProtoCursor::CreateNodeFromField — the Perfetto tracing library's arena allocator, not the TPU MSA path. The genuine TPU heap-allocator over-budget string is 0xa1300d0 ("Failed to allocate %zu bytes…", row above), confirmed in xla::AlignedAllocator::Allocate.


ICI / Collective — ~90 templates

Inter-chip-interconnect link health, routing, GTC synchronization, and the collective (all-reduce / all-gather / reduce-scatter / all-to-all) buffer-size validators. This block mixes printf-style ICI driver errors with the positional $N collective validators (which are megascale-tagged — see the next section). Codes lean Internal and DeadlineExceeded.

| Offset | Template | Idiom | Placeholders | Code |
|---|---|---|---|---|
| 0xa0b3abc | Cannot find unicast link next hop routing table for link port %d. | printf | %d=port | Internal |
| 0xa030c51 | Coordinate assignment failed for the slice's target %s ICI network because there are chips disconnected from the rest of the slice: %s. | printf | %s/%s=target/chips | Internal |
| 0xa0d5412 | Detected ICI link failures along %d dimensions, but only 1-dimensional link fault is allowed.. | printf | %d=dim count | FailedPrecondition |
| 0xa05e59a | Failed to add link information: chip %d already has a %c direction link. | printf | %d=chip, %c=axis | Internal |
| 0x855f8c6 | Failed to detect GTC reset before timeout %s expires | printf | %s=duration | DeadlineExceeded |
| 0x8727005 | Failed to turn down ICI link %d during slice reset, state=%d | printf | %d=link, %d=state | Internal |
| 0x871106b | GTC failed to converge (max diff %d > %d) before timeout (%s) expired | printf | %d > %d=diff, %s=timeout | DeadlineExceeded |
| 0x872a345 | Hop ID %d is out of bound of ICI route path with length %d | printf | %d/%d=hop/len | Internal |
| 0x8583d20 | ICI Probe failed. local port: %d name: %s took %d us. status: %s | printf | %d=port, %s=name, %d=us, %s=status | Internal |
| 0xa0a81a9 | ICI resiliency only allow 1-dimensional link failures, but link failures along %d dimensions are discovered. | printf | %d=dim count | FailedPrecondition |
| 0xa0ba533 | ICI routing failed to retrieve %dth hop dimension from bit encoded cache data. | printf | %d=hop | Internal |
| 0x9a573e9 | ALL_REDUCE Output buffer size is not == Input buffer size. Input size: $0 Output size: $1 Group Size $2 Key: $3 Module: $4 MegascaleInfo: $5 | positional | $0..$5=sizes/key/module/info | InvalidArgument |
| 0x9c142f9 | ALL_GATHER Input buffer size is not (Output buffer size / group size). Input size: $0 … | positional | $0…=sizes/info | InvalidArgument |
| 0x9c14380 | REDUCE_SCATTER Output buffer size is not (Input buffer size / group size). … | positional | $0…=sizes/info | InvalidArgument |
| 0x9d1493e | ALL_TO_ALL not supported when buffer size is not divisble by number of endpoints. Buffer Size: $0 Number of endpoints: $1. MegascaleInfo: $2 | positional | $0/$1/$2 | Unimplemented |

NOTE — 0x8583d20 ("ICI Probe failed…") was confirmed in asic_sw::driver::deepsea::ici::SliceConfiguration::GetLocalTopology; 0x9d1493e ("ALL_TO_ALL not supported…") in xla::megascale::runtime::HostCommandSchedulerFactory…GenerateCommunicationIrsFromTransferRegistry. The collective buffer-size validators are positional ($N) even though they sit on the ICI/collective path — they are emitted from the megascale runtime, which is the positional-idiom stronghold.
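For readers unfamiliar with the positional idiom, the fill step can be emulated in a few lines. The sketch below follows the documented behaviour of absl::Substitute ($0 through $9 select an argument, $$ produces a literal dollar sign); the Python function `substitute` is an illustration written for this page, and the argument values in the usage line are made up.

```python
import re

# "$$" must be tried before "$<digit>" so an escaped dollar is never
# mistaken for the start of a positional sigil.
_SIGIL = re.compile(r"\$\$|\$\d")


def substitute(template: str, *args) -> str:
    """Fill $0..$9 from args by position; '$$' yields a literal '$'."""
    def fill(match):
        token = match.group(0)
        if token == "$$":
            return "$"
        # Raises IndexError if the template names an argument not supplied.
        return str(args[int(token[1])])
    return _SIGIL.sub(fill, template)


if __name__ == "__main__":
    template = (
        "ALL_TO_ALL not supported when buffer size is not divisble by "
        "number of endpoints. Buffer Size: $0 Number of endpoints: $1. "
        "MegascaleInfo: $2"
    )
    # Hypothetical values, chosen only to show the fill.
    print(substitute(template, 10, 4, "<info>"))
```

Because arguments are selected by index rather than consumed in order, a positional template can repeat or reorder its arguments, which a %-spec template cannot.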


Megascale (DCN Runtime / Aggregator) — ~21 templates

Cross-host data-center-network coordination: barrier-participant accounting, the corrupted-buffer detector, launch-id timeouts, and the coordinator's error digest. This is the heart of the positional $N idiom. The coordinator's hang-digest emits one prose variant per cause branch.

| Offset | Template | Placeholders | Code |
|---|---|---|---|
| 0x9e6f85e | Extra barrier participant. Expected: $0 Message $1 | $0=expected, $1=msg | Internal |
| 0x9e6fa4d | Mismatched number of barrier participants: Expected: $0 Msg: $1 | $0=expected, $1=msg | Internal |
| 0x9d149cb | MegaScale Corrupted Buffer Detected. Key: $0 Checksum at Sender: $1 Current checksum: $2 | $0=key, $1/$2=checksums | DataLoss/Internal |
| 0x9b273a4 | Timed out waiting for $0 graphs to complete at launch_id $1. Already completed: $2. StepGloballyInProgress: $3 Timeout: $4 | $0..$4=count/id/state | DeadlineExceeded |
| 0xa122d03 | MegaScale devices cannot be queried except from jax. (%d) | %d=error code | FailedPrecondition |

The coordinator's hang digest emits the prose "Megascale detects a hang that is likely caused by …" once per cause branch (BAD_TPU_CHIP, BAD_SC_CHIP, DATA_INPUT_STALL, DIFFERENT_MODULE, FINGERPRINT_MISMATCH, NETWORKING_ISSUE, PROGRAM_NOT_QUEUED, UNKNOWN_CAUSE), and appends the operator-actionable follow-ups ("Please remove the hosts from the fleet and restart the workload", "Please check the workers to make sure the data input pipeline is working properly"). The abort path is in Fatal / Abort Surface.


SparseCore / Embedding — ~68 templates

SparseCore (xla_sc_) and BarnaCore embedding configuration: alignment requirements, table/feature counts, the partitioner objective enum, and the SMEM row-pointer budget. User-facing (a model-config error) and almost all InvalidArgument.

| Offset | Template | Placeholders | Code |
|---|---|---|---|
| 0xa0a9715 | barna_core_infeed_queue_hbm_address must be %d-byte aligned. | %d=alignment | InvalidArgument |
| 0xa0b6320 | barna_core_infeed_queue_hbm_size must be a multiple of %d. | %d=multiple | InvalidArgument |
| 0xa030dd4 | Could not find valid TPU batch of length at least %d at position %d for row %d. The embedding work in one sample exceeds what the BarnaCore can process: %s. %s. | %d×3=len/pos/row, %s/%s=detail | InvalidArgument |
| 0x872d950 | Dynamic learning rate tag: %d not found in the TPU embedding configuration, instead found: %d. tag set size: %d | %d×3=tag/found/size | InvalidArgument |
| 0xa02cd93 | Embedding table is expected to have element type %s or %s. | %s/%s=types | InvalidArgument |
| 0x86fa1ad | Failed to parse TPU embedding partitioner optimization objective "%s". Valid options: performance, hbm_usage, hybrid | %s=value | InvalidArgument |
| 0xa11773a | hbm_limits_for_embeddings.min_fraction (%f) must be <= hbm_limits_for_embeddings.max_fraction (%f) | %f/%f=fractions | InvalidArgument |
| 0xa0d36d1 | Invalid num_features: %d found for table: %s in the TPU embedding configuration. Valid values are >0. | %d=count, %s=table | InvalidArgument |
| 0xa0b7076 | Logical replicas must evenly divide the SparseCores in the system. logical_replicas = %d, physical_sparse_cores = %d. | %d/%d=counts | InvalidArgument |
| 0xa069f4e | Number of TPU tables on row: %d exceeds what the BarnaCore hardware supports: %d > %d. This is mostly likely a result of incorrect partitioning. | %d×3=row/got/max | InvalidArgument |
| 0xa0ff40e | Row pointers would exceed available SCS Smem (%d bytes > %d bytes) | %d/%d=used/avail | ResourceExhausted |
| 0xa07d172 | Scatter operand has %d elements, which exceeds the 32-bit limit. Unsupported on SparseCore. | %d=count | Unimplemented |

NOTE — 0xa0d36d1 ("Invalid num_features…") was confirmed in tensorflow::PopulateMissingFieldsInTPUEmbeddingConfig. The %f floats in 0xa11773a are a rare case where %f is genuinely a configuration ratio, not a cost-model internal.


Runtime / Driver / PJRT — ~177 templates

The driver state machine, device/ordinal validation, firmware-queue transitions, DMA-buffer accounting, and the PJRT C-API boundary. The most idiom-mixed block: %p and errno (%d) appear here, and codes split between FailedPrecondition (state-machine guards) and Internal.

| Offset | Template | Placeholders | Code |
|---|---|---|---|
| 0xa0b7366 | Attempted to register programmable interrupt with bad index: %d. Number of programmable interrupts: %d. | %d/%d=index/count | InvalidArgument |
| 0xa0430a3 | Cannot remove a driver for %s, was not found in map. | %s=driver name | NotFound |
| 0xa077a78 | Cannot transition to %s: the firmware queues are not in %s state; they are in %s state. | %s×3=states | FailedPrecondition |
| 0x96c33b2 | Can't close driver while in state %s; are multiple threads trying to open / close? | %s=state | FailedPrecondition |
| 0x94b68ce | Can't get the optimized program for executable `%s`: MPMD execution is not supported by PJRT C API | %s=executable | Unimplemented |
| 0xa09fdcd | Chip count (%d) is not supported. | %d=count | InvalidArgument |
| 0x872d1a0 | Close of core dump fd failed with errno: %d | %d=errno | Internal |
| 0xa0a9937 | %d DMA buffers were still outstanding when the driver was re-opened. These buffers must be unmapped before the driver can be re-opened. | %d=count | FailedPrecondition |
| 0xa0d10bd | Device id '$0' is out of bound. Number of devices is $1. | $0/$1=id/count | InvalidArgument |
| 0x8679159 | device ordinal value (%d) must be non-negative | %d=ordinal | InvalidArgument |
| 0xa1a96ba | executable is built for device %s of type "%s"; cannot run it on device %s of type "%s" | %s×4=device/type | InvalidArgument |
| 0xa00ab4b | Expected %d chips per tray, actually found a tray with %d chips. | %d/%d=expected/found | FailedPrecondition |
| 0x858a3de | failed initializing StreamExecutor for device ordinal %d: %s | %d=ordinal, %s=reason | Internal |
| 0xa09b555 | Failed to convert multipod chip id %d to single-pod chip id. | %d=chip id | Internal |

NOTE — 0x8679159 ("device ordinal value (%d) must be non-negative") was confirmed in stream_executor::StreamExecutorAddressAllocator::GetStreamExecutor. Many runtime templates feed the executor's async stream path; see execute-async-on-stream.md for where these surface during enqueue.

PJRT-C-API / protobuf-descriptor (linked library, not TPU)

The positional $N family is also populated by a statically-linked protobuf extension-declaration validator. These are catalogued for completeness but are not TPU code and should not be attributed to libtpu's own surface:

| Offset | Template | Note |
|---|---|---|
| 0xa0cf40b | "$0" extension field $1 is expected to be $2. | protobuf descriptor validator |
| 0xa0eba97 | "$0" extension field $1 is expected to be type "$2", not "$3". | protobuf descriptor validator |
| 0xa0d0fab | $0 cannot declare both `metadata` and `declaration` as extension declaration for extension #$1. | protobuf descriptor validator |

CHECK / RET_CHECK / Self-Check Templates

The absl CHECK/QCHECK/DCHECK family and XLA's TPU_RET_CHECK do not carry full message templates — they emit a fixed prefix and then append the stringified source expression and any streamed << "msg". The comparison macros (CHECK_EQ/NE/GE) append the operand values with the "%d vs. %d" / "%s vs. %s" format that recurs across the verifier block.

| Offset | Template | Macro / origin |
|---|---|---|
| 0xa1a64de | Check failed: ' | absl CHECK/QCHECK prefix (quoted) |
| 0xa1f4a87 | Check failed in | absl CHECK with file/line |
| 0xa285fb1 | Check failed: | absl CHECK prefix (no quote) |
| 0xa183292 | TPU_RET_CHECK failure ( | XLA TPU RET_CHECK macro |
| 0xa0ab3ca | Hostname Verification Check failed. | gRPC TLS hostname-verify CHECK |
| 0xa2300c5 | MakeErrorStream destructed without getting absl::Status: | XLA status_macros self-check |
| 0xa2300ff | MakeErrorStream shift called after getting absl::Status: | XLA status_macros self-check |
| 0xa27f1b3 | MakeErrorStream got absl::Status more than once: | XLA status_macros self-check |

GOTCHA — the CHECK(expr) << "msg" macro inlines the source expression text (spmem_buffer_type != nullptr, dynamic_size, nullptr) into .rodata. Large fragments of literal C++ source — even whole lambda bodies — appear in the string table as a side effect. These are CHECK-condition evidence, not error templates in the printf sense, and are excluded from the template count. A reimplementer scanning strings for "templates" must filter them out or be flooded with source fragments.

0xa183292 was confirmed in tpu::internal::RetCheckFailSlowPath; 0xa2300c5 in xla::status_macros::MakeErrorStream::Impl::~Impl (the destructor self-check that fires when a built-but-unconsumed Status is dropped).
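The assembly order described in this section (fixed prefix, then the stringified expression, then operand values for comparison macros, then any streamed text) can be pictured with the sketch below. It is an illustration only: `check_failure_message` is a hypothetical helper, and the exact spacing and parenthesization are assumptions modelled on the usual absl log output rather than recovered from this binary.

```python
def check_failure_message(expr_text, operands=None, streamed=""):
    """Compose a CHECK-style failure line from its parts.

    expr_text : the source expression as the macro stringified it
    operands  : (lhs, rhs) for comparison macros such as CHECK_EQ, else None
    streamed  : text appended via `<< "msg"`, if any
    """
    message = "Check failed: " + expr_text      # fixed prefix + expression
    if operands is not None:
        lhs, rhs = operands
        message += f" ({lhs} vs. {rhs})"        # the "vs." operand style
    if streamed:
        message += " " + streamed
    return message


if __name__ == "__main__":
    print(check_failure_message("spmem_buffer_type != nullptr"))
    # Hypothetical comparison with made-up operand values.
    print(check_failure_message("lhs_rank == rhs_rank", (3, 4), "rank mismatch"))
```

Only the prefix is a template in the sense of this catalog; the expression text is whatever the source happened to contain, which is why those fragments are excluded from the template count.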


Fatal / Abort Surface

The intentional-abort surface is small and centralized — ICI hard failures, the megascale coordinator abort, internal-bug LOG(FATAL)s, and one library guard that is not TPU-specific.

| Offset | Template | Path |
|---|---|---|
| 0xa1e7cb3 | !!!! FATAL ERROR !!!! for | ICI FatalErrorCheck |
| 0x8864055 | !!!! FATAL ERROR !!!! observed errors are: [ | AsyncDriver::HandleFatalError composite |
| 0xa046045 | Fatal error occurred. Data links will go down. | ICI hard-failure marker |
| 0xa1b3810 | FATAL ERROR RECEIVED FROM HARDWARE!!! | hardware fatal interrupt |
| 0x8a2941d | Fatal error in creation of RWB Fusion. Please file a bug with XLA-TPU | fusion internal-bug LOG(FATAL) |
| 0xbe7d460 | FATAL ERROR: This binary was compiled with <isa> enabled, but this feature is not available on this processor (go/sigill-fail-fast). | absl CPU-feature startup guard (12 ISA variants) |

The megascale coordinator's abort prose — "Aborting the coordinator after collecting errors from all workers as megascale_error_reporter_abort_on_hang is set to true. All workers will also abort after they detect the coordinator is shutdown." — is a LOG(FATAL) gated on a flag.

NOTE — the 0xbe7d460 family (12 variants: aes, avx, mmx, pclmul, popcnt, sse, sse2, sse3, sse4.1, sse4.2, ssse3, …) is the absl CPU-feature startup guard, a library abort that fires before any TPU code runs if the host CPU lacks an ISA the binary was built for. It is not a TPU diagnostic. 0x8a2941d ("Fatal error in creation of RWB Fusion…") was confirmed in xla::jellyfish::TpuInstructionFusion::RwbFusionHelper — a genuine internal-bug fatal with a "file a bug" marker.


Status-Code Mapping

A template becomes an absl::Status through one of three factory idioms. Only the first two name the code in the binary; the prose family and the bare printf-into-LOG paths require call-site disassembly to confirm the code.

| Idiom | Count | What it is | Code attribution |
|---|---|---|---|
| xla::status_macros::MakeErrorStream | 390 | RET_CHECK(c) << … / return InvalidArgument(…) << …; 390 MakeErrorStreamWithOutput<T> conversion ops, one per StatusOr<T> return type. The dominant XLA idiom by call-site count. | code named by the macro at the site |
| xla::InvalidArgumentStrCat<…> | 38 | interleaved literal/typed-arg factory | CERTAIN → InvalidArgument |
| xla::UnimplementedStrCat<…> | 31 | same | CERTAIN → Unimplemented |
| xla::InternalStrCat<…> | 29 | same | CERTAIN → Internal |
| xla::ResourceExhaustedStrCat<…> | 1 | same | CERTAIN → ResourceExhausted |
| absl::<Code>Error("…") prose | 1+ | direct prose factory (e.g. absl::InternalError("Invalid error type")) | code named by the function |

The <Code>StrCat factory is the unique place where format-arg C++ types are recoverable without disassembly: each instantiation emits an Itanium-mangled symbol _ZN3xla<len><Code>StrCatIJ<typepack>EEC2E… whose <typepack> names every interleaved literal segment and typed argument byte-for-byte.

Arg-Type Decode (mangled type pack)

RA<N>_Kc   const char (&)[N]  — a literal segment of N chars (N counts the
                                 NUL, so the visible literal is N-1 chars)
RA<N>_S1_  another const char (&)[N] — a later literal (S1_ back-references
                                 the already-named const char type)
m / l / i / f          unsigned long(size_t) / long / int / float, by value
Rm / Rl / Ri / Rf      lvalue reference (&) to each of the above
RKm / RKi              const unsigned long& / const int&
NSt…basic_string…      std::string by value;  RKNSt… = const std::string&
NSt…basic_string_view… std::string_view
N3tpu10TpuVersionE     tpu::TpuVersion (enum), by value
N…isa16TensorCoreBundleE  the per-gen ISA TensorCoreBundle (rendered via its
                          AbslStringify → the %v sigil)
N4absl8StatusOrI…E     absl::StatusOr<…>, by value

Worked decode (each a real instantiation symbol):

InvalidArgumentStrCatIJRA74_KcmEE
    → InvalidArgument("<73-char literal>", size_t)
InvalidArgumentStrCatIJRA16_KcRfRA20_S1_EE
    → InvalidArgument("<15ch>", float&, "<19ch>")     (the only float-bearing
      factory family — a cost/ratio message)
InvalidArgumentStrCatIJRA19_KcRlRA38_S1_RKiEE
    → InvalidArgument("<18ch>", long&, "<37ch>", const int&)
InvalidArgumentStrCatIJRA26_KcN3tpu10TpuVersionEEE
    → InvalidArgument("<25ch>", tpu::TpuVersion)
InvalidArgumentStrCatIJRA65_Kc…isa16TensorCoreBundleEEE
    → InvalidArgument("<64ch>", const TensorCoreBundle&)   (the "Bad scalar
      opcode … bundle: %v" family — per-gen gxc/gfc, gxc/glc, vxc variants)
UnimplementedStrCatIJRA253_KcEE
    → Unimplemented("<252-char literal>")   (the longest single-literal
      Unimplemented message — a "not supported" explanation block)
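The decode rules above are mechanical enough to sketch as a small parser. The version below handles only the literal-segment and builtin tokens from the table (RA<N>_Kc, RA<N>_S1_, m/l/i/f and their R/RK forms); anything else, such as the std::string or TpuVersion encodings, is reported as undecoded rather than guessed. `split_factory` and `decode_pack` are hypothetical helper names for this illustration, not tools used to build the catalog.

```python
import re

# Builtin type codes listed in the decode table above.
BUILTINS = {"m": "unsigned long", "l": "long", "i": "int", "f": "float"}

# One token: a char-array reference, or a builtin by value / & / const &.
TOKEN = re.compile(
    r"RA(?P<n>\d+)_(?:Kc|S1_)"
    r"|(?P<ref>RK|R)?(?P<b>[mlif])"
)
REF_FORMAT = {None: "{}", "R": "{}&", "RK": "const {}&"}


def split_factory(short_name):
    """Split '<Code>StrCatIJ<pack>EE' into (code, pack)."""
    m = re.fullmatch(r"(\w+?)StrCatIJ(.*)EE", short_name)
    if m is None:
        raise ValueError("not a <Code>StrCat instantiation: " + short_name)
    return m.group(1), m.group(2)


def decode_pack(pack):
    """Decode a type pack into a list of human-readable argument slots."""
    slots, pos = [], 0
    while pos < len(pack):
        m = TOKEN.match(pack, pos)
        if m is None:
            # Outside the subset handled here: keep the raw remainder.
            slots.append("<undecoded: " + pack[pos:] + ">")
            break
        if m.group("n") is not None:
            # N counts the trailing NUL, so the visible literal is N-1 chars.
            slots.append('"<%d-char literal>"' % (int(m.group("n")) - 1))
        else:
            slots.append(REF_FORMAT[m.group("ref")].format(BUILTINS[m.group("b")]))
        pos = m.end()
    return slots


if __name__ == "__main__":
    code, pack = split_factory("InvalidArgumentStrCatIJRA19_KcRlRA38_S1_RKiEE")
    print(code, decode_pack(pack))
```

Run on the third worked example, this yields an 18-char literal, `long&`, a 37-char literal, and `const int&`, matching the hand decode above.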

Byte-confirmed arg-type frequency across the 99 factories: std::string 29, std::string_view 15, long 13, TensorCoreBundle 6, size_t 5, StatusOr<…> 3, const int& 2, float 2, TpuVersion 1. Even at the byte-confirmed level, args are overwhelmingly string-ish + long; float and pointer are vanishingly rare — consistent with the printf-spec distribution above.

StatusCode keyword distribution

String-table keyword occurrence (noise-filtered), an upper-bound prose signal of which codes the surface favors — not a per-template count:

InvalidArgument 300 | Unimplemented 169 | NotFound 75 | FailedPrecondition 56
ResourceExhausted 21 | Unavailable 15 | OutOfRange 14 | Aborted 8
DeadlineExceeded 3 | AlreadyExists 3 | PermissionDenied 2 | DataLoss 1

GOTCHA — the raw Internal keyword count (~16,823) is dominated by the /internal/ source-path component in inlined __FILE__ strings, not by error templates. Do not read it as an Internal-status count. The byte-confirmed Internal total is the 29 InternalStrCat factories plus the MakeErrorStream Internal sites.
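One way to picture the noise filter is to blank out path-like tokens before counting. The sketch below is an assumption about how such a filter could work, not the filter actually used for the figures above; `keyword_hits` is a hypothetical helper, and the sample strings are made up for the demonstration.

```python
import re

# Any whitespace-delimited run that contains a '/' is treated as a path.
PATHLIKE = re.compile(r"[\w.+-]*/[\w./+-]*")


def keyword_hits(strings, keyword):
    """Return (raw, filtered) case-insensitive occurrence counts.

    raw      counts every occurrence of the keyword
    filtered counts occurrences that survive removal of path-like tokens
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    raw = filtered = 0
    for s in strings:
        raw += len(pattern.findall(s))
        filtered += len(pattern.findall(PATHLIKE.sub(" ", s)))
    return raw, filtered


if __name__ == "__main__":
    # Hypothetical string-table entries: one __FILE__-style path, one message.
    sample = [
        "some/library/internal/helper.cc",
        "Internal error: %s",
    ]
    print(keyword_hits(sample, "Internal"))
```

The gap between the two numbers is the share of hits contributed by source paths, which is the effect the GOTCHA warns about.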


User-Facing vs Internal Split

The prose itself signposts the audience. User-facing (operator / JAX-user actionable, no bug id): the HLO verifier shape/rank/dimension block, the --xla_fuel flag-value error, the embedding-config errors with their valid-value hints, the megascale digest follow-ups, and the device/executable mismatch. Internal (XLA-bug markers): strings carrying b/<id>, go/<link>, "please file a bug", or "should not happen".

| Offset | Internal-bug marker template | Marker |
|---|---|---|
| 0x858c562 | Kernel body fingerprint collision detected for key: … Please file a bug with the XLA team … | "file a bug" |
| 0x8a2941d | Fatal error in creation of RWB Fusion. Please file a bug with XLA-TPU | "file a bug" |
| 0x96c1211 | XLA has not implemented dynamic sized slice with non-trival stride yet. Please file a bug against XLA | "file a bug" |
| 0x96c12d4 | Unimplemented reduce-window in fusion cost modeling. Please file a bug with XLA | "file a bug" |
| 0x9fd476d | tightened domains are empty. This should not happen except if we proven infeasibility or optimality. | "should not happen" |
| 0xa0c9452 | Encountered unexpected layout … This should not happen - please file a bug against XLA. | "should not happen" + "file a bug" |
| 0x99e1405 | Close() appears to be hanging, this might be a deadlock see b/147787375 | b/<id> |

NOTE — b/<id> and go/<link> tokens are the strongest internal signal. They also tag known-gap TODOs ("TODO(b/157237781) support VFIO device %s", "TODO: b/475913712 - Expected Gather indices to be bitpacked …"), which are not live errors but become Unimplemented messages when their guarded path is hit.
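The four markers named in this section translate directly into a small detector. The sketch below is a plain regex pass written for this page; `internal_markers` and the exact patterns are assumptions, and a template with no hit is merely a candidate for the user-facing side, not proof of it.

```python
import re

# Marker name -> pattern. Word boundaries keep "lib/123" or "cargo/x"
# from being mistaken for b/<id> or go/<link>.
MARKERS = [
    ("b/<id>", re.compile(r"\bb/\d+")),
    ("go/<link>", re.compile(r"\bgo/[\w-]+")),
    ("file a bug", re.compile(r"file a bug", re.IGNORECASE)),
    ("should not happen", re.compile(r"should not happen", re.IGNORECASE)),
]


def internal_markers(template):
    """Return the names of all internal-bug markers found in a template."""
    return [name for name, pattern in MARKERS if pattern.search(template)]


if __name__ == "__main__":
    for t in (
        "Close() appears to be hanging, this might be a deadlock see b/147787375",
        "Fatal error in creation of RWB Fusion. Please file a bug with XLA-TPU",
        "Chip count (%d) is not supported.",
    ):
        print(internal_markers(t), "<-", t)
```

Note that the detector also fires on the TODO strings mentioned above, so a caller that wants live errors only has to filter those separately.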


At-a-Glance: Templates per Subsystem

| Subsystem | Candidates | Dominant idiom | Dominant code | Audience |
|---|---|---|---|---|
| Compile / HLO / verifier | ~875 | StrFormat %s/%d + RET_CHECK | InvalidArgument | user-facing |
| Runtime / driver / PJRT | ~177 | StrFormat %s/%d/%p + errno | Internal / FailedPrecondition | mixed |
| ICI / collective | ~90 | StrFormat %d/%s + $N (msc) | Internal / DeadlineExceeded | mixed |
| MSA / memory / allocation | ~79 | StrFormat %zu/%lld | ResourceExhausted / Internal | mixed |
| SparseCore / embedding | ~68 | StrFormat %d/%s/%f | InvalidArgument | user-facing |
| Megascale (DCN runtime) | ~21 | Substitute $0/$1 | Internal / LOG(FATAL) | mixed |
| Scheduler / fuel / FIFO | ~15 | StrFormat %s/%d/%c | Internal / RET_CHECK | mixed |
| PJRT-C-API / protobuf-desc | (lib) | Substitute $0/$1 | InvalidArgument | (not TPU) |
| Total distinct templates | ~2,937 | 2,799 printf + 138 positional | | |

GOTCHA — the subsystem counts are first-match-wins over a prose keyword set (ordering: compile → MSA → ICI → SparseCore → runtime → scheduler → megascale). A template that matches two groups is counted once, under the earlier group. The total (~2,937) carries a ±~50 caveat where the error-prose gate admits a few help strings or rejects a few terse errors — those are properly hint strings (hint-strings.md), not errors.
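The first-match-wins rule is easy to state in code. In the sketch below only the group ordering comes from this page; the keyword patterns are invented for illustration and will not reproduce the partition in the table, and `classify_subsystem` is a hypothetical name.

```python
import re

# Group order follows the page; the keyword patterns are illustrative only.
ORDERED_GROUPS = [
    ("compile", r"shape|rank|broadcast|operand"),
    ("msa", r"memory|allocat|hbm"),
    ("ici", r"\bici\b|link|gtc"),
    ("sparsecore", r"sparsecore|embedding|barnacore"),
    ("runtime", r"driver|device|firmware"),
    ("scheduler", r"schedul|fifo|fuel|annotation"),
    ("megascale", r"megascale|barrier|launch_id"),
]
COMPILED = [(name, re.compile(p, re.IGNORECASE)) for name, p in ORDERED_GROUPS]


def classify_subsystem(template):
    """Return the first group whose keyword pattern matches the template."""
    for name, pattern in COMPILED:
        if pattern.search(template):
            return name          # earlier groups win; later ones never see it
    return "unclassified"


if __name__ == "__main__":
    print(classify_subsystem("Invalid HBM offset %d"))
    # Made-up string that matches both "msa" and "runtime" keywords:
    print(classify_subsystem("device memory exhausted"))
```

The second call shows the consequence spelled out in the GOTCHA: a string matching two groups is counted once, under whichever group comes first in the ordering.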


Cross-References

  • Runtime & Execution Overview — where in the PJRT execution path these Status objects originate and propagate
  • Hint Strings — the suggestion surface (try/consider/recommended), contrasted with the failures here; the ±50 boundary cases live there
  • Internal Pass Names — pipeline-stage identifier strings that the compile/HLO error templates reference by name
  • Execute Async on Stream — the executor enqueue path that surfaces the runtime/driver StatusOr<T> templates at run time
  • Error / Status Codes (appendix) — the absl::StatusCode enumeration these templates are wrapped in
  • Memory-Space Table (appendix) — the memory-space color values named by the Unsupported memory space / memory space colors templates