User-Facing Hint Strings
All offsets on this page are file offsets (hex) into libtpu.so from the libtpu-0.0.40-cp314 wheel (libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, ELF x86-64 DYN, not stripped). In this build .rodata VMA == file offset (1:1, base 0x84a0000), so a file offset is also the load address. Other wheels will differ.
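Because file offset and VMA coincide in this build, checking any string cited on this page reduces to a seek plus a read up to the terminating NUL. The following is a minimal sketch, assuming a local copy of the same libtpu.so; the helper name `read_cstring` and the `max_len` cap are illustrative and not part of any libtpu tooling.

```python
def read_cstring(path, file_offset, max_len=4096):
    """Return the NUL-terminated string stored at file_offset in path."""
    with open(path, "rb") as f:
        f.seek(file_offset)
        chunk = f.read(max_len)
    end = chunk.find(b"\x00")
    if end == -1:
        end = len(chunk)  # no terminator within max_len: return what was read
    return chunk[:end].decode("utf-8", errors="replace")
```

For example, `read_cstring("libtpu.so", 0x858bbcf)` would be expected to return the rwb-fusion hint listed under Flag-Suggestion Hints, but only for this exact build.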
Abstract
This page catalogs libtpu's hint strings: the advisory prose the runtime and the XLA:TPU compiler emit alongside an error to tell the operator or JAX user what to do about it. They are the actionable companions to the raw error/status templates catalogued on error-templates.md. Where an error template states the fact ("X must have rank >= N: got %s"), a hint appends the remedy ("Please use --xla_tpu_rwb_fusion=false", "Reduce TPU memory usage", "Please remove the hosts from the fleet"). The two live adjacent in .rodata and are joined at a .text absl::StrCat/StrFormat callsite; this page owns the hint half and its trigger/knob mapping, not the templates.
The hint surface is signposted by five marker tokens, and the choice of token is itself a taxonomy of who fixes the problem. --flag=value, "Reduce … memory", and "Please remove the hosts …" are operator-actionable — the user has a self-service remedy. b/<id>, go/<link>, and "please file a bug" point the user back at the XLA/TPU team — they signal an internal limitation or a known issue, not a knob the user can turn. This split (operator-actionable vs file-it-upstream) is the page's primary axis, refined into seven actionability classes: flag-suggestion, doc-link, bug-report, capacity/OOM, deprecation, perf-tuning, and operator-action.
This is a reference catalog, not an algorithm. Each section groups hints by trigger and, where the binary shows it, pairs the hint with the condition that fires it and the flag/knob it names. Flag names are byte-confirmed real absl::Flag globals (each has an AbslFlagHelpGenFor<name> symbol); their help-text bodies live in the config flag families and are out of scope here. The page does not claim to know which error template each hint is concatenated onto — that link is a .text callsite not yet traced (see Not Traced).
The catalog contract:
- The five marker tokens and how they partition operator-actionable from file-it-upstream.
- The seven actionability classes, each with its hints grouped by subsystem.
- For each hint: the verbatim string, its file offset, and the flag/knob or tracker coordinate it references.
- The direction-of-default caveat: whether the advised flag value is the workaround or the on-switch — directional, not byte-confirmed.
| Section | .rodata of libtpu.so (base 0x84a0000, file-offset == VMA) |
| Marker tokens | --<flag>=<val> · go/<link> · b/<id> · "file a bug" · "Reduce … memory" |
| Distinct go/ links | 54 (514 occurrences are per-gen ISA-counter docs, not remedies) |
| Distinct b/<id> ids | 31 (all with recovered context) |
| Flag-suggestion hints | ~28 strings; 10/10 spot-checked flags confirmed as registered absl::Flag |
| Concatenation callsite | .text absl::StrCat/StrFormat — not traced (hint ↔ template link open) |
At a glance, by trigger:
| Actionability class | Trigger (what fires the hint) | Primary token | Count (TPU/XLA subset) |
|---|---|---|---|
| flag-suggestion | a knob is unset / mis-set, or a fallback path was hit | --<flag>=<val> | ~28 strings |
| doc-link (action) | a failure with a documented resolution playbook | go/<link> (non-counter) | ~14 |
| doc-link (ISA) | reading the ISA spec for a perf counter | go/<gen>-isa#anchor | 514 occ / ~16 distinct |
| bug-report | an internal invariant broke / unimplemented path | "file a bug" · b/<id> | ~40 prose + 31 b/<id> |
| capacity / OOM | HBM/VMEM/Smem pressure, allocation failure | "Reduce … memory" · OOM | ~15 |
| deprecation | a removed/migrating flag or feature is used | "deprecated" · "no longer" | ~7 |
| perf-tuning | a suboptimal-but-correct code path was taken | "suboptimal" · "inefficient" | ~6 |
| operator-action | fleet/topology/sequencing remedy, no bug-id | "Please …" · "Ensure …" | ~25 |
The Five Marker Tokens
The hint surface was located by rg over the string table for five token families. They are not just search anchors — the token a hint carries is a reliable predictor of its actionability class, because libtpu's authors use each token consistently.
--<flag>=<value> OPERATOR-ACTIONABLE → set this knob and retry
"Reduce … memory" OPERATOR-ACTIONABLE → free HBM / raise a limit
"Please remove …" OPERATOR-ACTIONABLE → fleet/topology remedy
go/<link> FILE-IT-UPSTREAM → read the team's doc/playbook
b/<id> FILE-IT-UPSTREAM → known issue, comment/track
"please file a bug" FILE-IT-UPSTREAM → internal invariant broke
NOTE — the split is a heuristic, not a hard partition: a single hint can carry both ("…suboptimal MXU throughput … Please file a bug with XLA:TPU compiler team, and use --xla_tpu_accumulate_into_mrb=false in the meantime." at 0xa083064 is both a bug-report and a flag-suggestion). When a hint mixes tokens it is catalogued under its primary remedy and cross-noted.
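The token-to-audience mapping can be approximated as a first-pass classifier. This is a sketch only: the regular expressions are illustrative approximations of the marker tokens (they are assumptions, not patterns recovered from the binary), and the function returns a set so that mixed hints report both audiences.

```python
import re

# Ordered (pattern, audience) pairs. The patterns are hypothetical
# approximations of the marker tokens, not strings taken from libtpu.
MARKERS = [
    (re.compile(r"--[A-Za-z0-9_]+=\S+"), "operator-actionable"),
    (re.compile(r"Reduce .*memory"), "operator-actionable"),
    (re.compile(r"Please remove "), "operator-actionable"),
    (re.compile(r"\bgo/[A-Za-z0-9_-]+"), "file-it-upstream"),
    (re.compile(r"\bb/\d{6,}"), "file-it-upstream"),
    (re.compile(r"file a bug", re.IGNORECASE), "file-it-upstream"),
]

def audiences(hint):
    """Return the set of audiences whose marker tokens occur in hint."""
    return {label for pattern, label in MARKERS if pattern.search(hint)}
```

A hint that yields both labels is a mixed hint and still needs a manual decision about its primary remedy.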
GOTCHA — the go/ and b/ tokens are dominated by non-actionable uses. 514 of ~640 go/ occurrences are per-gen ISA-spec links embedded in hardware-perf-counter description strings ([counter](http://go/glc-sc-isa#anchor)), not error remedies. A reimplementer scanning for "actionable hints" by grepping go/ alone will drown in counter documentation; filter to the non-counter set (Doc-Link Hints).
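One way to apply that filter is to drop links that appear as Markdown link targets before collecting go/ tokens. This sketch assumes every counter-description link uses the `[name](http://go/...)` form shown above; that assumption rests on the quoted examples only, and the function name `remedy_links` is hypothetical.

```python
import re

GO_LINK = re.compile(r"\bgo/[A-Za-z0-9_-]+")
# Counter descriptions embed the link as a Markdown target: [name](http://go/...)
COUNTER_LINK = re.compile(r"\]\(http://go/[^)]*\)")

def remedy_links(strings):
    """Collect go/ links from strings, skipping Markdown-embedded counter links."""
    found = set()
    for s in strings:
        stripped = COUNTER_LINK.sub("", s)  # remove counter-doc link targets
        found.update(GO_LINK.findall(stripped))
    return found
```

Links written as bare URLs in prose, such as the http://go/llvm-crash-bug report address, are kept because they are not Markdown targets.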
Flag-Suggestion Hints
The self-service remedy class: an error or log line names a specific --xla_* / --megascale_* / --tpu_* / --deepsea_* flag and the value to set. Every flag named below was confirmed to be a registered absl::Flag global by the presence of its AbslFlagHelpGenFor<name> symbol (10/10 spot-checked resolved 1:1; flag help-text bodies belong to the config section).
Fusion / MXU
| Offset | Hint string (verbatim) | Flag referenced |
|---|---|---|
| 0x858bbcf | "(1) Please use --xla_tpu_rwb_fusion=false (and --xla_tpu_dot_dot_fusion=false if failure persists), Reason: found fallback window config while lowering fusion." | xla_tpu_rwb_fusion, xla_tpu_dot_dot_fusion |
| 0xa2b7034 | (duplicate of 0x858bbcf — same rwb/dot_dot fusion remedy; both copies byte-identical) | (same) |
| 0x96c35ed | "PartialReduce is designed to be used with fusion. Did you forget to set `--xla_tpu_nested_dot_fusion=true`?" | xla_tpu_nested_dot_fusion |
| 0xa083064 | "…suboptimal MXU throughput on this HLO. Please file a bug with XLA:TPU compiler team, and use --xla_tpu_accumulate_into_mrb=false in the meantime." | xla_tpu_accumulate_into_mrb |
SparseCore / embedding
| Offset | Hint string (verbatim) | Flag referenced |
|---|---|---|
| 0x9fe2e58 | "If you are seeing this error when attempting to compile a distributed embedding model try running with --xla_tpu_embedding_table_oblongness_threshold=1 to ensure the embedding table is not given a tiled layout." | xla_tpu_embedding_table_oblongness_threshold |
| 0xa073044 | "Skipping reduce-scatter decomposition as SC ND RS needs --xla_tpu_enable_sparse_core_reduce_scatter_v2=true." | xla_tpu_enable_sparse_core_reduce_scatter_v2 |
| 0xa07b05e | "Number of sparse cores for scatter offloading should not be 0. To disable, please set xla_tpu_enable_offloading_scatter_to_sparsecore to false." | xla_tpu_enable_offloading_scatter_to_sparsecore |
| 0xa07b0ee | "Number of SparseCores for gather offloading should not be 0. To disable, please set xla_tpu_enable_offloading_gather_to_sparsecore to false." | xla_tpu_enable_offloading_gather_to_sparsecore |
| 0x85c0718 | "Unsupported: computing max_nz_per_row requires --xla_sc_fused_scatter_in_compute_loop" | xla_sc_fused_scatter_in_compute_loop |
| 0x85c076e | "Unsupported: computing max_unique_nz_per_row requires --xla_sc_fused_scatter_in_compute_loop" | xla_sc_fused_scatter_in_compute_loop |
| 0x85c07cb | "Unsupported: computing max_unique_nz_per_row requires --xla_sc_fused_gather_in_compute_loop" | xla_sc_fused_gather_in_compute_loop |
| 0x86678da | "…To avoid this failure, try to set a larger number to '--fdo_config_sparsecore_allowed_dropped_id_count_per_epoch' flag" | fdo_config_sparsecore_allowed_dropped_id_count_per_epoch |
Scheduler / collective / infeed
| Offset | Hint string (verbatim) | Flag referenced |
|---|---|---|
| 0xa008ce4 | "…has gaps on the way from that operand to itself. You can use --xla_tpu_scheduling_annotation_deannotate_unsupported_groups=true to deannotate the unsupported groups." | xla_tpu_scheduling_annotation_deannotate_unsupported_groups |
| 0xa060e03 | "Attempted to use profile-guided latency estimator without the real cost model. Either enable --xla_tpu_scheduler_using_real_cost_model or unset --xla_tpu_impure_latency_hiding_scheduler_profile_path." | xla_tpu_scheduler_using_real_cost_model, xla_tpu_impure_latency_hiding_scheduler_profile_path |
| 0xa04a55e | "Sinking DCN collectives requires --xla_tpu_use_megascale_host_reduction." | xla_tpu_use_megascale_host_reduction |
| 0xa05844c | "Sharded infeed with non-uniform layouts is not supported. Try turning off the infeed layout optimization (--transpose_tpu_infeed=false) and report to XLA team." | transpose_tpu_infeed |
Memory / OOM / runtime / debug
| Offset | Hint string (verbatim) | Flag referenced |
|---|---|---|
| 0x9e6e39e | "…Aborting compilation early because it's unlikely to have enough device memory. Requires %s, has %s available. If more detailed logging is desired, set --xla_tpu_impure_oom_fast_exit_threshold=-1" | xla_tpu_impure_oom_fast_exit_threshold |
| 0xa074100 | ". Reduce TPU memory usage or set --jellyfish_executor_max_wait_time_for_releasing_memory_on_oom to a larger value." | jellyfish_executor_max_wait_time_for_releasing_memory_on_oom |
| 0xa01af59 | "Spilling sregs to HBM on megacore is implemented but untested; use --xla_enable_megacore_hbm_spill=true to activate, and please update b/177274769 with findings." | xla_enable_megacore_hbm_spill |
| 0xa07a6ba | ". Restart the job or disable set --megascale_use_numa_aware_threadpool=false." | megascale_use_numa_aware_threadpool |
| 0x96deddc | "…bytes. Pass --xla_hbm_logging_buffer_size_bytes= | xla_hbm_logging_buffer_size_bytes |
| 0xa06c059 | "Unable to log from the tile because tile log is NOT enabled. Use --xla_tpu_enable_tile_log_recorder=true to enable logging." | xla_tpu_enable_tile_log_recorder |
| 0xa06c0d5 | "…was invoked but logging was not enabled. Use --xla_tpu_enable_sc_log_recorder=true to enable logging." | xla_tpu_enable_sc_log_recorder |
| 0xb52d9a0 | "You need to set --deepsea_chips_per_host_bounds, --deepsea_host_bounds to match your setup. Contact tfrt-devs@ if you have any questions." | deepsea_chips_per_host_bounds, deepsea_host_bounds |
| 0x96c2015 | "…the threadpool for tensorflow operations is too small. Try increasing --45eac_num_operation_threads and please notify barna-core-devs@" | _45eac_num_operation_threads (BarnaCore) |
| 0x9fff409 | "Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results." | xla_dump_to |
| 0xa00da29 | "…if you would like to reliably retrieve the error message, try running it again with --notpu_use_continuations." | tpu_use_continuations (negated) |
| 0x87fb08b | "[enable stack trace via --xla_jf_collect_llo_stack_trace]" | xla_jf_collect_llo_stack_trace |
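The "Flag referenced" column can be approximated mechanically by pulling `--name` tokens out of each hint. The sketch below is an assumption-laden helper, not the method used to build the tables: `flags_in_hint` and its `known_flags` parameter are hypothetical, and the "no" prefix is stripped only when the remainder is a known flag, mirroring the --notpu_use_continuations row.

```python
import re

FLAG = re.compile(r"--([A-Za-z0-9_]+)")

def flags_in_hint(hint, known_flags=frozenset()):
    """Return flag names written as --name in hint, in order of appearance.

    A leading "no" is removed only when the remaining name is in known_flags.
    """
    names = []
    for name in FLAG.findall(hint):
        if name.startswith("no") and name[2:] in known_flags:
            name = name[2:]  # negated boolean form: --noflag -> flag
        names.append(name)
    return names
```

Hints that name a flag without the leading dashes, such as the two offloading rows at 0xa07b05e and 0xa07b0ee, are not matched by this pattern and need separate handling.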
Direction-of-default caveat
GOTCHA — the value a hint tells you to set is not necessarily the opposite of the default; read the verb. The "use --flag=false … in the meantime" / "Please use --flag=false" hints (rwb_fusion, dot_dot_fusion, accumulate_into_mrb) name the workaround (a non-default toggle to escape a broken path). The "use --flag=true to enable" hints (tile_log_recorder, sc_log_recorder, sparse_core_reduce_scatter_v2, nested_dot_fusion) name a feature that is off by default, where =true is the on-switch. Treat these as direction-of-default; only the ~13 rodata-evidenced defaults are byte-confirmed (config section). Do not assume the advised value equals "the default".
Doc-Link Hints (go/<link>)
The go/ shortlink token splits cleanly into a small actionable playbook set and a large ISA-counter-documentation family. Only the first is a remedy; the second is reference material the profiler attaches to perf counters.
Actionable / playbook links (TPU/XLA-relevant)
| go/ link | Hint string (offset) | Subsystem |
|---|---|---|
| go/scoped-vmem | 0xa011573 ". See go/scoped-vmem for more details." | MSA / scoped VMEM |
| go/sc-dynamic-bounded-slice | 0xa050247 "DynamicBoundedSlice lowering is only supported inside of custom fusions, see go/sc-dynamic-bounded-slice for more information." | SparseCore lowering |
| go/xla_compile_runtime_flag_error | 0xa047216 "Runtime flags … must be set consistently between compile options and ABSL flags. … See go/xla_compile_runtime_flag_error for resolution." | compile-env flag consistency |
| go/megascale-debug-playbook | 0xa21a318 "Debug dumping triggered. Refer to go/megascale-debug-playbook for further debugging. megascale_debug_dir = " | megascale runtime |
| go/jfc-errata | 0x8644f10 / 0xa0d46d9 / 0xa0ee662 "…See errata entry go/jfc-errata# | JXC ISA errata |
| go/llvm-crash-bug | 0xa033687 "invalid profile created. Please file a bug at: http://go/llvm-crash-bug and include the profraw files that caused this error." | PGO / LLVM crash report |
NOTE — these are the only go/ links a user can act on to resolve a TPU/XLA failure. A second cluster (go/protection-key-fault, go/general-protection-fault, go/stacktraces, go/cppstackoverflow, go/lsan, go/prod-naming-1-5, go/grpc-without-init-google-exemption, go/streamz-force-collection, go/no_file_or_rpc_during_init, go/redact-debug-string) comes from statically-linked Google library prose (signal handlers, sanitizers, gRPC, telemetry), not from TPU-specific code paths. They are present in the binary but are not libtpu remedies; treat them as library noise (MEDIUM relevance).
Per-gen ISA-counter doc links (the bulk: 514 occurrences)
The dominant go/ use is not a remedy. Each hardware performance counter the profiler exposes carries a Markdown link to its ISA-spec section inside the counter's description string, e.g. 0xa0dcfac "This counter counts hold-scalar-issue cycles for [yN_reserved](http://go/vfc-sc-isaa#yn-reserved)." The link prefix encodes the TPU generation:
go/jfc-isa, go/jfc-errata JXC (Jellyfish / Dragonfish)
go/pfc-isa, go/pfc-sc-isa, go/pfc-bc-isa,
go/pfc-memory-system PXC (Pufferfish)
go/vfc-sc-isa (94+ occ), go/vfc-mem VXC vector-fetch-core (Viperfish)
go/vxc-isa, go/vxc-sc-isa VXC ISA (Viperfish / Ghostlite vector)
go/glc-isa, go/glc-sc-isa (94 occ) GXC general-load-core (Ghostlite)
go/gfc-isa, go/gfc-sc-isa (97 occ), go/gfc-power GXC general-fetch-core
go/core (70 occ) generic core-doc anchor
QUIRK — one of these links is misspelled in the shipped binary: go/vfc-sc-isaa (confirmed at 0xa0dcfac, double a). A reimplementer reproducing the counter-description corpus should preserve the typo to byte-match this build, not "fix" it. The full counter→anchor table is a separate large extraction and is not enumerated here.
Bug-Report Hints
The file-it-upstream class. These prose strings point the user at a tracker, a team, or a b/<id> known issue. They signal that the user has hit an internal limitation, not a knob to turn.
"file a bug" prose with tracker coordinates
| Offset | Hint string (verbatim, abbreviated) | Destination |
|---|---|---|
| 0x9f1d09c | "DeepseaPlatform and --tpu_use_tfrt=false are now deprecated. … please file a bug at: https://b.stripped_domain/issues/new?component=670280" | tracker component 670280 (TFRT) |
| 0xa039ed8 | "RoutingTableAnalyzer detects a potential deadlock! File a bug against SliceBuilder (https://b.stripped_domain/issues/new?component=503036). Please attach core dumps retrieved from Coroner." | tracker component 503036 (SliceBuilder) |
| 0xaf0bf90 | "Please report a bug at: https://github.com/google/jax/issues/new?assignees=apaszke" | github.com/google/jax (public) |
| 0xab9e630 (also 0xaba4a30/0xaba4b70/0xaba4c50) | "Please file a bug under Platforms > Performance > BarnaCore > Software with repro instructions" | BarnaCore component path (×4) |
| 0x858c562 | "Kernel body fingerprint collision detected for key: %016x%016x. Please file a bug with the XLA team and provide the colliding kernel bodies." | XLA team |
| 0x8a2941d | "Fatal error in creation of RWB Fusion. Please file a bug with XLA-TPU" | XLA-TPU |
| 0x96c1211 | "XLA has not implemented dynamic sized slice with non-trival stride yet. Please file a bug against XLA" | XLA |
| 0x96c12d4 | "Unimplemented reduce-window in fusion cost modeling. Please file a bug with XLA" | XLA |
| 0x96c1fd3 / 0xa0c9b76 | "Allocated address is not aligned, please file a bug to tfrt-devs@" / "Duplicate allocation found, please file a bug to tfrt-devs@." | tfrt-devs@ |
| 0xa036ccd | "… is an unsupported memory space in TpuCustomCallScopedVmemAdjuster. Please file a P2 feature request, against the XLA-TPU team …" | XLA-TPU (P2 FR) |
| 0xa03eb45 / 0xa03eb8b | "Manual sub-axis isn't supported. Please file a bug with a reproducer." / "Non-divisible sharding with unreduced axes isn't supported. …" | XLA (Shardy) |
| 0xa1b0507 | "The auto-sharding solver could not find a valid solution within the given time limit. Please report this as a bug!" | XLA (auto-sharding) |
The TPU-lowering invariant block (0xa0c8b1e..0xa0c9758)
A contiguous run of "should not happen" templates, one per HLO op the TPU backend expects to have been legalised away before lowering. These are pure internal-bug markers — there is no user remedy. The block is contiguous but not strictly homogeneous: the first string in the run, at 0xa0c8b1e, is a sibling variant phrased about layouts rather than an un-eliminated op — "Encountered mismatched layouts for select-and-scatter. This should not happen - please file a bug against XLA." The "Encountered <Op> op during TPU lowering…" template proper begins at 0xa0c8b8d (Dot).
0xa0c8b1e Encountered mismatched layouts for select-and-scatter.
This should not happen - please file a bug against XLA.
0xa0c8b8d Encountered <Op> op during TPU lowering that should have been eliminated
during an earlier phase of compilation.
This should not happen - please file a bug against XLA.
<Op> ∈ { Dot, Call, BatchNormTraining, BatchNormInference, BatchNormGrad,
Pad, Reverse, select-and-scatter, custom-fusion, OutputFusion, … } (~14 ops)
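A log scraper can separate the per-op invariant messages from the layout-phrased sibling by requiring both the shared suffix and the "Encountered <Op> op during TPU lowering" prefix. The sketch below is illustrative: the function name `invariant_op` is hypothetical, and it matches on substrings so it does not depend on the exact whitespace between the template's lines, which this page does not record.

```python
import re

OP_RE = re.compile(r"Encountered (\S+) op during TPU lowering")
SUFFIX = "This should not happen - please file a bug against XLA."

def invariant_op(message):
    """Return the op named by a lowering-invariant message, else None."""
    if SUFFIX not in message:
        return None
    match = OP_RE.search(message)
    return match.group(1) if match else None
```

The mismatched-layouts string at 0xa0c8b1e carries the suffix but not the prefix, so it yields None and can be bucketed separately.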
b/<id> known-issue / TODO references (31 distinct)
These pin a limitation to a tracked bug. They are not actionable beyond "comment/track" — the most a user does is leave a note on the bug. Representative rows from the 31 distinct ids follow.
| b/<id> | Context (offset) | Class |
|---|---|---|
| b/167392593 | 0xa00f6f4 (×4) "TODO(b/167392593): Support {bitcasts merging dims, sliced prefetches, …}." | MSA/prefetch TODO |
| b/177274769 | 0xa01af59 "…use --xla_enable_megacore_hbm_spill=true … please update b/177274769 with findings." | megacore HBM spill |
| b/433785288 | 0xa041d37 "…fixed by Shardy partitioner in the future, tracked in b/433785288. Contact Shardy or XLA team for help." | Shardy partitioner |
| b/36072659 | 0xa0af6ba "Infeed buffer size … bug in the infeed operation (b/36072659). See the bug for a workaround …" | infeed deadlock |
| b/147787375 | 0x99e1405 / 0x99e144d "Close() appears to be hanging, this might be a deadlock see b/147787375" / "Disable() appears to be hanging, …" | runtime deadlock |
| b/282055166 | 0x9929df3 "Error may indicate firmware queue fullness: b/282055166" | runtime/firmware |
| b/30481585 | 0xa12ba16 "Device reset is not yet supported on this platform (b/30481585)" | runtime/device-reset |
| b/488336614 | 0xa0a0fb0 "b/488336614 Only single-SC element scatter add is supported." | SparseCore scatter |
| b/422762004 | 0xa12c03b "Gather with implicit convert is not supported. (b/422762004)" | gather lowering |
| b/343490729 | 0x87050df "Not implemented: TODO(b/343490729): stores packed int2 are not supported" | int2 lowering |
NOTE — b/<id> ids that point a user at a workaround (b/36072659 infeed, b/494604538 cancellables) are the closest this class gets to actionable; the rest are progress-tracking only. The 6-digit-minimum rg threshold means shorter b/<id> or cl/<id> changelist references may exist below it (LOW confidence on completeness).
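The 6-digit-minimum search can be reproduced over an already-extracted string list as follows. This is a sketch: the regular expression is an assumed equivalent of the rg pattern (the original pattern is not recorded on this page), and `bug_ids` is a hypothetical helper name.

```python
import re

# b/ followed by at least six digits; shorter ids are deliberately ignored.
BUG_ID = re.compile(r"\bb/(\d{6,})\b")

def bug_ids(strings):
    """Return the distinct b/<id> numbers found in strings, sorted numerically."""
    ids = {m for s in strings for m in BUG_ID.findall(s)}
    return sorted(ids, key=int)
```

Lowering the `{6,}` bound is the direct way to probe for the shorter references the completeness caveat mentions, at the cost of more false positives.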
Capacity / OOM Hints
The operator-actionable memory-pressure class: free memory, raise a limit, or wait for compaction. Grouped by where the pressure is detected.
Compile-time and runtime OOM
| Offset | Hint string (verbatim, abbreviated) | Remedy |
|---|---|---|
| 0x9e6e39e | "Aborting compilation early because it's unlikely to have enough device memory. Requires %s, has %s available. … set --xla_tpu_impure_oom_fast_exit_threshold=-1" | compile-time OOM abort + logging knob |
| 0xa074100 | ". Reduce TPU memory usage or set --jellyfish_executor_max_wait_time_for_releasing_memory_on_oom to a larger value." | runtime OOM → free memory or wait longer |
| 0xa07b361 | "Not enough HBM spill stack available, please increase." | increase HBM spill stack |
| 0xa04bde4 | "Allocation fails. Try again after compaction. Please note that compaction can be slow. If you want to achieve better performance, please manage TPU buffers carefully to avoid this compaction." | retry-after-compaction (capacity + perf) |
Embedding capacity
| Offset | Hint string (verbatim, abbreviated) | Trigger |
|---|---|---|
| 0xa01cc4f | "TPU embedding: out of memory allocating %lld bytes." | embedding HBM allocation |
| 0xa030f0a | "The current embedding configuration does not fit on the TPU due to HBM space constraints or more optimization algorithms than supported: aggregate_min_resource_count (%d) > total_resource_count (%d)." | embedding config exceeds HBM/optimizer resources |
| 0xa056bd4 | "The number of unique optimizers in the TPU embedding configuration exceeds the capacity of the TPU system." | too many distinct optimizers |
| 0xa07d172 | "Scatter operand has %d elements, which exceeds the 32-bit limit. Unsupported on SparseCore." | SparseCore 32-bit index limit |
Fusion / VMEM / Smem capacity guards
| Offset | Hint string (verbatim) | Limit |
|---|---|---|
| 0x84b1b50 / 0x84b1b7d / 0x9fd4955 | "Nested dot fusion would exceed vmem capacity" / "Custom Fusion would exceed vmem capacity" / "' exceeds VMEM capacity." | VMEM-capacity fusion guard |
| 0x84b1a2b / 0xa1b5992 | "Smem high-water mark exceeds memory capacity" / "SCS high-Smem usage exceeds Smem capacity" | Smem capacity |
| 0x9ff331b | "The input tensor is not on HBM/VMEM and it exceeds the HBM logging buffer limit." | HBM logging-buffer cap (paired with --xla_hbm_logging_buffer_size_bytes) |
| 0xa1eed4d | "Too many buffers are colored in the alternate memory. Could not reserve alternate memory for colored output of instruction " | alternate-memory coloring |
| 0xa122204 / 0xa122248 | "The size of next-hop routing table (%d) exceeds the chip limit (%d)" / "The size of routing table (%d) exceeds the chip limit (%d)" | ICI routing-table cap |
NOTE — the VMEM/Smem/routing-table guards are capacity rejections, not remedies — they tell the user the limit was hit but name no flag. The actionable OOM hints (xla_tpu_impure_oom_fast_exit_threshold, jellyfish_executor_max_wait_time_for_releasing_memory_on_oom) are also flag-suggestions and cross-appear above.
Deprecation Notices
Sparse, and mostly flag-migration. Library "constexprs are no longer supported"-style noise is excluded; only TPU/XLA notices are listed.
| Offset | Notice (verbatim, abbreviated) | Migration |
|---|---|---|
| 0x9f1d09c | "DeepseaPlatform and --tpu_use_tfrt=false are now deprecated. This flag is slated for removal. …" | remove; file bug if it was needed |
| 0x9feecc8 | "--xla_tpu_impure_enable_packed_bf16_math_ops is deprecated. Please use --xla_tpu_bf16_emission_mode in TpuCompilationEnvironment." | → xla_tpu_bf16_emission_mode |
| 0x8a293e0 | "Non-SPMD model parallelism is no longer supported by XLA:TPU" | use SPMD |
| 0xa26350a | "[DEPRECATED_XLA_TPU_FLAG_USE] Deprecated TpuCompilationEnvironment flags were overridden: " | runtime marker (flags overridden) |
| 0xa2ad13b | "[DEPRECATED_XLA_TPU_FLAG_USE] Deprecated TpuCompilationEnvironment flags were present and not matching their default values:" | runtime marker (non-default deprecated flag) |
| 0xa0b19dd (also 0xa0b1a9a) | "The SegmentShardingHelperDivSimple class can only be used when … BarnaCores per row per task is equal to 1 … Use the SegmentShardingHelperDiv class instead." (also …ModSimple → …Mod) | BarnaCore class migration |
| 0xa0b1c7c | "DescriptionForDevice.* is not supported for TPU. Please use one of the GetExecutor methods instead." | → GetExecutor methods |
NOTE — the two [DEPRECATED_XLA_TPU_FLAG_USE] strings are runtime markers the runtime prints when a deprecated TpuCompilationEnvironment flag is set; they prefix the offending flag list rather than naming a single flag. The deprecated-flag names themselves belong to the config section.
Perf-Tuning Suggestions
"Suboptimal but correct" notices: the code path works, but the user can do better.
| Offset | Suggestion (verbatim, abbreviated) |
|---|---|
| 0xa083064 | "…suboptimal MXU throughput on this HLO. … use --xla_tpu_accumulate_into_mrb=false in the meantime." (also a flag-suggestion + bug-report) |
| 0x9fd3a5e | ". Switch to allocate_output to avoid performance penalty." |
| 0x9ff1304 | "Concatenate fusion is inefficient." |
| 0xa041d37 | "…SPMD will replicate the tensor and then partition it … which is inefficient. … tracked in b/433785288." |
| 0xa29fd14 | "Layout inefficient dot whose output shape has small lane/sublane dimensions" |
| 0xa04bde4 | "…If you want to achieve better performance, please manage TPU buffers carefully to avoid this compaction." (also a capacity hint) |
NOTE — the MSA knob xla_tpu_msa_inefficient_use_to_copy_ratio (a proto field at 0xbfc053e) is the heuristic behind the "inefficient use-to-copy" perf path; it is named in the config layer, not in a hint string (LOW confidence that any user-facing hint prints it directly).
Operator-Action Prose
Self-service remedies that carry no bug-id: fleet/topology actions, op-sequencing protocols, init/lifecycle guidance, and team pointers.
Megascale hang digest (8 cause-branch remedies)
Each "Megascale detects a hang that is likely caused by <cause>. <remedy>." string names a different operator action keyed to the detected cause. The cause→remedy dispatch (which Cause enum selects which string) is a .text switch and is not traced.
| Offset | Cause → remedy (verbatim, abbreviated) |
|---|---|
| 0xa058553 | bad TPU sparse core chips → "Please remove the hosts from the fleet and restart the workload. If problem persists please contact Megascale XLA team." |
| 0xa05862f | bad TPU tensor core chips → "Please remove the hosts from the fleet and restart the workload. …" |
| 0x9ffc2f2 | networking issue → "Please examine the underlying networking stack for the following hosts." |
| 0x9fd7519 | data-input stall → "Please check the workers to make sure the data input pipeline is working properly." |
| 0xa06d7d2 | different modules on different devices → "Please confirm that all workers is running the exact same program. …" |
| 0xa079562 | inconsistent HLO module compilation → "This is likely a bug in JAX tracing or XLA compiler. Please inspect the HLO dumps …" |
| 0xa0d7cec | worker not queuing programs → "Check if your application is blocked/crashing and preventing JAX to queue the next TPU program (jitted functions)." |
| 0x9fe02ae | unknown cause → "Megascale detects a hang but cannot determine the root cause. Please inspect the full digest below." |
Associated aborts: 0xa045c93 "Aborting the coordinator as an unrecoverable error is reported …"; 0xa045d37 "Aborting the coordinator after collecting errors … as megascale_error_reporter_abort_on_hang is set to true …".
QUIRK — the bad-SC-chips (0xa058553) and bad-TC-chips (0xa05862f) remedies are near-identical English with a one-word grammatical drift ("If problem persists" vs "If the problem persists"). They are two distinct rodata strings, one per cause-branch, not a single shared template — a reimplementer building the digest must emit the matching variant per cause, not deduplicate them.
TPU embedding op-sequencing protocol (0xa009da1..0xa04698b)
~10 "Ensure that … before …" hints enforce the embedding-parameter op ordering (load_ → retrieve_ → enqueue_/infeed). They describe a strict op-protocol the JAX/TF-TPU user must follow; the underlying state machine is not traced here.
"Cannot retrieve embedding parameters … until the previous retrieve/load operation
has completed. Ensure that the previous … ops have completed before …"
"TPU embedding infeeds cannot be run in the middle of retrieving/loading the embedding
parameters. Please quiesce all infeeds before retrieving/loading parameters and restart
the infeeds after … is complete. Ensure that the retrieve_/load_tpu_embedding_*_parameters
ops have completed before the enqueue_tpu_embedding_*_batch ops are run." (0xa046624 / 0xa046853)
"Embedding parameters can be loaded … only after the TPU system has been first initialized.
Ensure that the load_… ops are run only after tpu.initialize_system() is complete."
Init / driver-lifecycle hints
| Offset | Hint (verbatim, abbreviated) |
|---|---|
| 0x9fe0d47 | "Failed to get tpu system. Please call tf.tpu.experimental.initialize_tpu_system() before running any ops on tpu." |
| 0x9fd4688 | "PjRtClient already exists for TPU. This probably means you have already implicitly initialized PJRT. … call tf.tpu.experimental.initialize_tpu_system() explicitly …" |
| 0xa094bb2 | "No TPU_SYSTEM device found. Please ensure that you're connected to a host with a TPU_SYSTEM device." |
| 0x9fe65a1 | "Failed to initialize TPU system, please contact Cloud TPU Support." |
| 0xa07188f | "WARNING: could not determine %s, please set env var `%s` manually, otherwise libtpu.so may not properly initialize." |
| 0xa1e3f95 | "Are you using the right LibTPU version? This LibTPU is built for <…>" |
| 0xa1a96ba | "executable is built for device %s of type "%s"; cannot run it on device %s of type "%s"" |
| 0xa0012c5 | "… DMA buffers were still outstanding when the driver was destroyed. … please ensure all buffers are destroyed before destroying driver objects." |
| 0xa079d96 | "TPU driver close was incomplete; kernel reported %d device handles outstanding. … Ensure that all userspace access to the device has ended before invoking driver close." |
| 0x9ffb075 | "Lost connection to the SliceBuilder controller task (normally worker task 0). Please check controller task status." |
| 0xa1fd28c | "Please ensure that you have only created one TPUEmbedding" |
Contact-team pointers
tfrt-devs@ alignment / allocation / topology-bounds bugs
barna-core-devs@ embedding hot-id-replication threadpool sizing
Shardy team / XLA manual-axes / unreduced-axes sharding
Megascale XLA team persistent hang after host removal
Cloud TPU Support TPU-system init failure (the one external channel)
The CPU-Feature Startup Guard (go/sigill-fail-fast)
A fatal pre-main guard that aborts if the host CPU lacks an ISA feature the build requires. Twelve variants at 0xbe7d460+, one per feature:
FATAL ERROR: This binary was compiled with <isa> enabled, but this feature is not
available on this processor (go/sigill-fail-fast).
<isa> ∈ { aes, avx, mmx, pclmul, popcnt, sse, sse2, sse3,
sse4.1, sse4.2, ssse3, cmpxchg16b } (12 variants)
NOTE — this is the absl CPU-feature startup guard from statically-linked Google code, not a TPU-specific path — it tells the operator their host CPU is too old for the build. It is included here because it is user-facing fatal prose with a go/ remedy link, but it is not a libtpu remedy. The earlier error-template catalog named 11 of the 12 variants; the twelfth, cmpxchg16b, is confirmed present at 0xbe7d460+ (CERTAIN).
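For matching these lines in startup logs, the twelve variants can be regenerated from the template and the feature list above. This is a sketch under one stated assumption: the wrapped template is joined with a single space, which this page does not byte-confirm, so the generated strings should be treated as search patterns and not as byte-exact copies.

```python
ISA_FEATURES = [
    "aes", "avx", "mmx", "pclmul", "popcnt", "sse", "sse2", "sse3",
    "sse4.1", "sse4.2", "ssse3", "cmpxchg16b",
]

# Assumption: the line wrap in the listing corresponds to one space.
TEMPLATE = (
    "FATAL ERROR: This binary was compiled with {isa} enabled, but this "
    "feature is not available on this processor (go/sigill-fail-fast)."
)

def guard_messages():
    """Return one guard message per ISA feature, in list order."""
    return [TEMPLATE.format(isa=isa) for isa in ISA_FEATURES]
```

Searching a log for any of these strings identifies a host-CPU mismatch before any TPU-specific diagnosis is attempted.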
Not Traced
This page is the hint half of the diagnostic surface; several links to the rest of the surface are deliberately left open because they require disassembly this catalog did not do.
- Hint ↔ error-template link. Hints sit adjacent to their parent error template in .rodata, but the concatenation is a .text absl::StrCat/StrFormat callsite. Which of the error-template surface each hint is appended to — and the resulting absl::StatusCode — is not traced.
- Direction-of-default. Whether a flag-suggestion's advised value is the default or the workaround is a DefaultDebugOptions() / FLAGS_* ctor question, not byte-confirmed for every flag (only ~13 rodata-evidenced defaults are). The verb-based heuristic above is directional, not authoritative.
- Log severity. The severity each hint is emitted at (WARNING / ERROR / FATAL) is set at the callsite, except where the prose itself says "FATAL" / "Aborting" / "WARNING".
- Megascale cause→remedy dispatch. The 8 hang-digest strings are recovered, but the Cause-enum → string switch is not traced.
- go/<link> content resolves to internal Google URLs not present in the binary; only the link tokens and referencing prose are recovered.
- go/<gen>-isa#anchor → counter map. The 514-occurrence ISA-doc-link family is summarised by gen-codename mapping only; the per-counter anchor table is a separate large extraction.
Cross-References
- Error Templates — the raw error/status templates these hints are appended to; the fact, where the hint is the remedy.
- Runtime Overview — where in the runtime the diagnostic surface sits.
- Internal Pass Names — the pass/lowering names that appear in many "should not happen during TPU lowering" bug-report hints.
- Flag Families (config) — the registered absl::Flag globals these hints name, with their help-text bodies and defaults.
- XLA Flag Atlas (config) — the full --xla_tpu_* / --megascale_* knob catalog the flag-suggestion hints point to.