xla_* Flag Atlas

Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes). Other versions differ.

Abstract

libtpu registers exactly 2048 absl::Flag<T> globals (the AbslFlagHelpGenFor<name> symbol count). Every one of them is settable through a single funnel — LIBTPU_INIT_ARGS parsed by absl::ParseCommandLine — so the flag surface is, in effect, libtpu's entire command line. This page is the grouped atlas of that surface: not a flat 2048-row dump (that would be the anti-pattern this wiki exists to avoid) but a per-family taxonomy with per-subsystem deep-dives into the ~100 highest-signal knobs, each tagged with its inferred type, the proto field it backs where known, and a confidence label.

The reference frame is XLA's own flag system. The non-TPU xla_* flags are fields of xla::DebugOptions, registered by xla::MakeDebugOptionsFlags @ 0x1e66ce80 (confirmed: it takes a vector<tsl::Flag>* and a DebugOptions* and binds each field to a --xla_foo flag). The TPU-private families (xla_tpu_*, xla_jf_*, xla_sc_*, megascale_*, barna_core_*) are standalone absl::Flag globals whose values land in TpuCompilationEnvironment (TCE) via OverrideTpuCompEnvByCmdLineFlags @ 0x1d73e640, not in DebugOptions. Which proto a name lands in is the single most consequential structural fact for a reimplementer; that taxonomy is owned by flag-families.md and the protos by debugoptions-proto.md and tpu-compilation-environment.md. This page owns the catalog: the grouped name space and per-flag type/effect.

The authoritative name enumeration is the mangled helper-symbol set: every absl::Flag<T> FLAGS_<name> emits an _ZN<len>AbslFlagHelpGenFor<name>8NonConstEv symbol, so that symbol set is a 1:1 census of registered flags (length-prefix parsing recovers each <name> exactly); nm | rg -o 'AbslFlagHelpGenFor...8NonConstEv' | sort -u | wc -l returns 2048. Every catalogued name on this page resolves to an AbslFlagHelpGenFor<name> symbol in the binary. Types are convention-inferred for ~99% of flags (the enable_/use_/allow_ ⇒ bool, _ms/_kib/_count ⇒ int, _ratio/_factor ⇒ float, _file/_path ⇒ string, _mode/_level ⇒ enum heuristic XLA itself uses to register them); only a handful of defaults and types are byte-evidenced — most from =value clauses in error strings, plus xla_tpu_embedding_table_oblongness_threshold recovered from its AbslFlagDefaultGenFor initializer. Treat every type and default below as HIGH unless a row says CERTAIN (byte-evidenced) or LOW (ambiguous suffix).
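The length-prefix recovery described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the symbol names have already been extracted (for example from `nm` output) into a list of strings; the function names `flag_name_from_symbol`, `mangle_helper_symbol`, and `census` are hypothetical helpers, not part of any tool referenced on this page.

```python
import re
from typing import Iterable, List, Optional

HELPER_PREFIX = "AbslFlagHelpGenFor"
HELPER_SUFFIX = "8NonConstEv"  # mangled "NonConst" + "E" (end of nested name) + "v" (void)


def flag_name_from_symbol(symbol: str) -> Optional[str]:
    """Return the flag name encoded in one mangled helper symbol, else None."""
    m = re.match(r"_ZN(\d+)", symbol)
    if m is None:
        return None
    length = int(m.group(1))
    start = m.end()
    ident = symbol[start:start + length]   # exactly <len> characters
    rest = symbol[start + length:]         # must be the NonConst tail
    if len(ident) != length or rest != HELPER_SUFFIX:
        return None
    if not ident.startswith(HELPER_PREFIX):
        return None
    return ident[len(HELPER_PREFIX):] or None


def mangle_helper_symbol(name: str) -> str:
    """Inverse of the parser, used here only to build sample input."""
    ident = HELPER_PREFIX + name
    return f"_ZN{len(ident)}{ident}{HELPER_SUFFIX}"


def census(symbols: Iterable[str]) -> List[str]:
    """Deduplicated, sorted flag names (the equivalent of sort -u)."""
    names = {flag_name_from_symbol(s) for s in symbols}
    names.discard(None)
    return sorted(names)


sample = [
    mangle_helper_symbol("xla_tpu_rwb_fusion"),
    mangle_helper_symbol("xla_tpu_rwb_fusion"),      # duplicate collapses
    mangle_helper_symbol("megascale_use_numa_aware_threadpool"),
    "_ZN4absl5flags3FooEv",                          # unrelated symbol, ignored
]
print(census(sample))
```

Parsing by the declared length, rather than by a regular expression over the name, is what keeps names containing digits or the substring `NonConst` from being truncated.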

For navigation, the contract is:

  • The family taxonomy — prefix → owning proto → count, so a reader knows where a name's field lives before chasing it.
  • The per-subsystem high-signal catalog — the ~100 flags a reimplementer of the TPU pipeline actually needs, grouped by scheduler / fusion / MSA / collectives / SparseCore / layout / numerics / autotune / debug / runtime.
  • The certainty boundary — which rows are byte-confirmed, and the err-string direction-of-default trap on the rest.

| Item | Value |
| --- | --- |
| Name census | 2048 registered absl::Flag (AbslFlagHelpGenFor* symbols, sort -u) |
| Enumeration symbol | _ZN<len>AbslFlagHelpGenFor<name>8NonConstEv (1 per flag) |
| DebugOptions registrar | xla::MakeDebugOptionsFlags @ 0x1e66ce80 (binds xla_* fields) |
| TCE flag→field bridge | OverrideTpuCompEnvByCmdLineFlags @ 0x1d73e640 (TPU families) |
| Funnel | LIBTPU_INIT_ARGS (str @ file 0x918c880) → absl::ParseCommandLine |
| Type split (inferred, all 2048) | ≈ bool 68% · int 21% · string · float · enum (suffix-convention, not byte-typed) |
| Byte-confirmed types/defaults | ~18 (most from =value error strings; oblongness from AbslFlagDefaultGenFor) |
| Confidence | HIGH (convention-inferred) unless a row says CERTAIN (byte-evidenced) or LOW |
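Because every flag goes through the LIBTPU_INIT_ARGS funnel, setting one from a host program amounts to building a command-line string and exporting it before libtpu is loaded. The sketch below assumes the usual space-separated `--name=value` form; `build_init_args` is a hypothetical helper, and the two values shown are purely illustrative, not recommended settings.

```python
import os
from typing import Dict, Union

FlagValue = Union[bool, int, float, str]


def build_init_args(flags: Dict[str, FlagValue]) -> str:
    """Render a flag dict as a space-separated --name=value string."""
    parts = []
    for name, value in flags.items():
        if isinstance(value, bool):          # check bool before int: bool is an int subclass
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        parts.append(f"--{name}={rendered}")
    return " ".join(parts)


init_args = build_init_args({
    "xla_tpu_rwb_fusion": False,             # illustrative value only
    "xla_tpu_scoped_vmem_limit_kib": 16384,  # illustrative value only
})

# The variable has to be in the environment before the library parses it,
# so in practice this runs before the framework that loads libtpu is imported.
os.environ["LIBTPU_INIT_ARGS"] = init_args
print(init_args)
```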

1. Family Taxonomy — At a Glance

The prefix is the routing key: it decides which proto consumes the flag and which compiler/runtime subsystem owns its semantics. The counts below are per-prefix AbslFlagHelpGenFor* symbol counts and sum to the 2048 registered total. The Lands in column is the central distinction — xla_tpu_* are not DebugOptions fields, a trap overview.md §3 flags as the GOOD/BAD divide.

| Family | Count | Type-dominant | Lands in | Subsystem owner |
| --- | --- | --- | --- | --- |
| xla_tpu_* | 909 | bool / int / float | TCE (standalone) | TPU compiler + runtime |
| (other + xla_vf_/xla_pf_) | 429 | mixed | n/a (vendored libs) | absl / grpc / protobuf / OR-tools |
| megascale_* | 150 | bool 73 / int 47 / str 14 | standalone absl::Flag | DCN collective runtime |
| xla_jf_* | 148 | bool 109 / int 23 | TCE | Jellyfish XLA backend |
| xla_* (plain) | 121 | bool / int / enum | xla::DebugOptions | generic XLA |
| xla_sc_* | 92 | bool 73 / int 13 | TCE | SparseCore LLVM backend |
| tpu_* | 69 | bool / int / str | runtime/cache/driver | TPU runtime |
| barna_core_* | 61 | float / int / duration | standalone absl::Flag | BarnaCore embedding HW |
| xla_msa_* | 22 | bool / int / float | TCE + DebugOptions mix | Memory-Space Assignment |
| tf_* | 20 | bool | runtime | TF-TPU bridge |
| xla_gf_* | 14 | bool / int / enum | TCE | 6acc60406/v7x VMEM/MSA |
| xla_mosaic_* | 8 | bool / enum | TCE | Mosaic MLIR dialect |
| xla_ior_* | 4 | bool / enum | TCE | "IOR" fast-mem MSA variant |
| xla_llo_* | 1 | enum | TCE | LLO annotation lifecycle |

GOTCHA — the 429 (other) registered flags are almost all non-TPU flags: 412 are absl, gRPC, protobuf, OR-tools, and cp_model library flags statically linked into the 745 MiB binary (781,691,048 bytes); the remaining 17 are the tiny gen-codename mirrors (xla_vf_* 16, xla_pf_* 1) folded in here rather than given their own rows. (The owner partition on flag-catalog-full.md breaks xla_vf_ out separately and reports the pure vendored-lib bucket as 412.) A reimplementer enumerating AbslFlagHelpGenFor* symbols must filter to the xla* / tpu* / megascale* / barna_core* prefixes or pull in OR-tools' entire flag surface (absl_flags_*, cp_model_*). The vendored flags are still settable through LIBTPU_INIT_ARGS, but they configure the vendored solvers, not the TPU compiler.
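The prefix routing and the vendored-flag filter can be expressed as a longest-prefix lookup. The sketch below transcribes the "Lands in" column of the family table; the function name `lands_in` is hypothetical, anything that matches no TPU family is treated as vendored and returns `None`, and names carrying a leading marker (such as the DANGEROUS_ runtime flag) would need extra handling that is not shown.

```python
from typing import Optional

# (prefix, destination) pairs taken from the family table on this page.
FAMILY_ROUTES = [
    ("xla_tpu_", "TCE"),
    ("xla_jf_", "TCE"),
    ("xla_sc_", "TCE"),
    ("xla_gf_", "TCE"),
    ("xla_mosaic_", "TCE"),
    ("xla_ior_", "TCE"),
    ("xla_llo_", "TCE"),
    ("xla_msa_", "TCE + DebugOptions mix"),
    ("xla_vf_", None),   # gen-codename mirrors, counted in the (other) row
    ("xla_pf_", None),
    ("xla_", "xla::DebugOptions"),
    ("megascale_", "standalone absl::Flag"),
    ("barna_core_", "standalone absl::Flag"),
    ("tpu_", "runtime"),
    ("tf_", "runtime"),
]

# Longest prefix first, so xla_tpu_ wins over the plain xla_ catch-all.
_ROUTES_BY_LENGTH = sorted(FAMILY_ROUTES, key=lambda r: len(r[0]), reverse=True)


def lands_in(flag_name: str) -> Optional[str]:
    """Return the destination for a flag name, or None for vendored/other flags."""
    for prefix, destination in _ROUTES_BY_LENGTH:
        if flag_name.startswith(prefix):
            return destination
    return None


for name in ("xla_tpu_rwb_fusion", "xla_enable_async_all_gather",
             "megascale_use_numa_aware_threadpool", "cp_model_example"):
    print(name, "->", lands_in(name))
```

Here `cp_model_example` is a made-up name standing in for any vendored solver flag.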

NOTE — there are zero xla_gpu_* flags registered (no AbslFlagHelpGenForxla_gpu_* symbol exists), yet 17 GPU/CPU fields survive in the shared DebugOptions descriptor as proto-only, flag-less fields. The TPU build strips the GPU flag wiring but keeps the GPU fields in the proto. The proto-only set is enumerated on debugoptions-proto.md.

The xla_tpu_* family — the bulk of the surface — itself splits across the subsystems the rest of this page deep-dives:

| xla_tpu_* subsystem | Count | Deep-dive |
| --- | --- | --- |
| misc / uncategorized | 229 | (long tail; not catalogued individually) |
| ICI / collectives | 174 | §5 |
| fusion | 101 | §3 |
| debug / dump / log | 77 | §9 |
| MSA / memory-space | 55 | §4 |
| SparseCore | 50 | §6 |
| scheduler | 47 | §2 |
| auto-sharding / SPMD | 40 | §7 |
| layout | 29 | §7 |
| memory / allocation | 27 | §8 |
| dot / conv | 24 | (representative rows in §3) |
| autotune / AutoFDO | 24 | §10 |
| numerics / precision | 21 | §3 |
| cost-model | 8 | §2 |
| runtime | 3 | §8 |
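Both count tables in this section can be cross-checked arithmetically: the family counts should sum to the 2048 registered flags, and the xla_tpu_* subsystem counts should sum to the family's 909. The following Python check simply transcribes the two tables; the dictionary names are illustrative.

```python
FAMILY_COUNTS = {
    "xla_tpu_*": 909, "(other + xla_vf_/xla_pf_)": 429, "megascale_*": 150,
    "xla_jf_*": 148, "xla_* (plain)": 121, "xla_sc_*": 92, "tpu_*": 69,
    "barna_core_*": 61, "xla_msa_*": 22, "tf_*": 20, "xla_gf_*": 14,
    "xla_mosaic_*": 8, "xla_ior_*": 4, "xla_llo_*": 1,
}

XLA_TPU_SUBSYSTEMS = {
    "misc / uncategorized": 229, "ICI / collectives": 174, "fusion": 101,
    "debug / dump / log": 77, "MSA / memory-space": 55, "SparseCore": 50,
    "scheduler": 47, "auto-sharding / SPMD": 40, "layout": 29,
    "memory / allocation": 27, "dot / conv": 24, "autotune / AutoFDO": 24,
    "numerics / precision": 21, "cost-model": 8, "runtime": 3,
}

family_total = sum(FAMILY_COUNTS.values())
subsystem_total = sum(XLA_TPU_SUBSYSTEMS.values())

assert family_total == 2048                            # matches the symbol census
assert subsystem_total == FAMILY_COUNTS["xla_tpu_*"]   # matches the 909 row
print(family_total, subsystem_total)                   # 2048 909
```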

2. Scheduler (47 xla_tpu_)

Purpose

libtpu advertises five distinct latency-hiding scheduler engines behind separate gates — a reimplementer who assumes a single scheduler will mis-model the pipeline. The master gate is xla_tpu_enable_latency_hiding_scheduler; the four alternatives (ilp, brkga, dozer, lem) are independent variants. The BRKGA engine (Biased Random-Key Genetic Algorithm) carries its own population-tuning sub-family. The generic xla_* and xla_jf_* siblings (76 in the full scheduler group) carry the LHS resource model and the BRKGA fallback knobs.

Catalog — TPU scheduler gates

| Flag | Type | Default | Effect |
| --- | --- | --- | --- |
| xla_tpu_enable_latency_hiding_scheduler | bool | (unrec) | master LHS gate |
| xla_tpu_enable_ilp_latency_hiding_scheduler | bool | (unrec) | ILP-formulated LHS |
| xla_tpu_enable_brkga_latency_hiding_scheduler | bool | (unrec) | genetic (BRKGA) scheduler |
| xla_tpu_enable_dozer_latency_hiding_scheduler | bool | (unrec) | "Dozer" variant |
| xla_tpu_enable_lem_scheduler | bool | (unrec) | LEM variant |
| xla_tpu_consider_lp_llo_scheduler | bool | (unrec) | LP-based LLO scheduler |
| xla_tpu_enable_latency_hiding_layer_scheduler | bool | (unrec) | per-layer LHS |
| xla_tpu_enable_multi_compute_overlap_in_layer_scheduler | bool | (unrec) | multi-compute overlap |
| xla_tpu_aggressive_flexible_annotation_scheduling | bool | (unrec) | annotation aggressiveness |
| xla_tpu_scheduling_annotation_deannotate_unsupported_groups | AutoOr<bool> | false (AUTO→off) | deannotate annotation gaps |
| xla_tpu_enable_all_experimental_scheduler_features | bool | (unrec) | turns on all experimental sched features |

Catalog — BRKGA tuning + generic LHS

| Flag | Type | Effect |
| --- | --- | --- |
| xla_tpu_brkga_latency_hiding_scheduler_generation_limit | int | BRKGA generations |
| xla_tpu_brkga_latency_hiding_scheduler_num_chromosomes | int | BRKGA population |
| xla_tpu_brkga_latency_hiding_scheduler_num_top_heap_computations | int | BRKGA elite set |
| xla_tpu_brgka_latency_hiding_scheduler_no_progress_limit | int | BRKGA stall cutoff (note brgka typo) |
| xla_hlo_scheduling_brkga_generation_limit | int | generic BRKGA generations |
| xla_hlo_scheduling_brkga_enable_as_fallback | bool | use BRKGA only as fallback |
| xla_latency_hiding_scheduler_rerun | bool | re-run LHS pass |
| xla_latency_hiding_scheduler_resource_serializing | bool | serialize resource use |
| xla_latency_hiding_scheduler_enable_selective_resources | bool | selective resource tracking |
| xla_lhs_prioritize_async_depth_over_stall | bool | async-depth priority |
| xla_lhs_make_all_gather_selective | bool | selective AG overlap |
| xla_lhs_threshold_for_applying_output_fusion_latency_multiplier | float | output-fusion latency mult. threshold |
| xla_jf_vliw_scheduler | bool | Jellyfish VLIW post-scheduler |
| xla_jf_critical_path_scheduler | bool | critical-path scheduler |
| xla_hlo_parse_memory_schedule_from_file | string | replay a fixed schedule |

The 8 cost-model flags feed the scheduler's latency estimates. Named members include xla_tpu_emitter_learned_cost_model_options (string/proto; a learned-cost proto with no shipped ML client), xla_tpu_enable_instruction_cycle_checking (bool), xla_tpu_hbm_initial_cycle_penalty (int), and xla_tpu_break_of_accum_cost_heuristic (bool), plus the generic siblings xla_jf_random_latency and xla_jf_use_cost_based_memory_coloring.

QUIRK — the brgka spelling in xla_tpu_brgka_latency_hiding_scheduler_no_progress_limit is a typo in the flag name itself, distinct from the correctly-spelled brkga knobs. A reimplementer copying the BRKGA family by pattern will silently drop this knob unless they match both spellings. The typo is in the registered AbslFlagHelpGenFor symbol — it is the real flag name, not an extraction artifact.
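A pattern that accepts both spellings avoids dropping the misspelled knob. The sketch below uses the four TPU BRKGA tuning names from the catalog; the regular expression and the variable names are illustrative choices, not something recovered from the binary.

```python
import re

# Accept both "brkga" (correct) and "brgka" (the registered typo).
BRKGA_TUNING = re.compile(r"^xla_tpu_br(?:kg|gk)a_latency_hiding_scheduler_")
NAIVE_BRKGA = re.compile(r"^xla_tpu_brkga_latency_hiding_scheduler_")

names = [
    "xla_tpu_brkga_latency_hiding_scheduler_generation_limit",
    "xla_tpu_brkga_latency_hiding_scheduler_num_chromosomes",
    "xla_tpu_brkga_latency_hiding_scheduler_num_top_heap_computations",
    "xla_tpu_brgka_latency_hiding_scheduler_no_progress_limit",  # typo spelling
    "xla_tpu_enable_lem_scheduler",                              # unrelated gate
]

tolerant = [n for n in names if BRKGA_TUNING.match(n)]
naive = [n for n in names if NAIVE_BRKGA.match(n)]

print(len(tolerant), len(naive))                 # 4 3
print(sorted(set(tolerant) - set(naive)))        # the knob a naive match drops
```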


3. Fusion (101 xla_tpu_)

Purpose

Fusion is the second-largest xla_tpu_ subsystem and carries the only cluster of byte-evidenced defaults on the whole surface — four of the five =value error strings live here. The gates control read-write-buffer (RWB) fusion, dot→dot chaining, nested-dot (PartialReduce) fusion, MRB accumulation, and the numerical tolerances for deep fusion. The generic xla_jf_* conv/multi-output fusion knobs and the SparseCore fusion gate round out the group.

Catalog — fusion gates (byte-evidenced cluster)

| Flag | Type | Default | Effect |
| --- | --- | --- | --- |
| xla_tpu_rwb_fusion | bool | true | read-write-buffer fusion |
| xla_tpu_dot_dot_fusion | bool | true | dot→dot fusion |
| xla_tpu_nested_dot_fusion | bool | true | nested-dot (PartialReduce) fusion |
| xla_tpu_accumulate_into_mrb | bool | true | MRB accumulation fusion |
| xla_tpu_allow_deeply_nested_fusion_numerical_diff | bool | true | tolerate deep-fusion numerics |
| xla_tpu_fusion_debugger_instrument_inputs | AutoOr<bool> | false (Gen movw $0→AUTO; off if consumer AUTO→off) | fusion-debugger input instrumentation |
| xla_tpu_allow_input_fusion_in_certain_reduce_ops | bool | (unrec) | reduce-op input fusion |
| xla_tpu_allow_conv_input_fusion_with_downcast_convert | bool | (unrec) | conv input fusion w/ downcast |
| xla_tpu_wrap_fusion_lowerable_hlos_in_loop_fusion | bool | (unrec) | wrap lowerable HLOs |
| xla_tpu_enable_experimental_fusion_cost_model | bool | (unrec) | experimental fusion cost model |

Catalog — generic fusion + dot/conv + numerics

| Flag | Type | Effect |
| --- | --- | --- |
| xla_jf_enable_multi_output_fusion | bool | multi-output fusion |
| xla_jf_enable_producer_consumer_multi_output_fusion | bool | producer/consumer MOF |
| xla_jf_fusion_max_vmem_mib | int | per-fusion VMEM cap (MiB) |
| xla_sc_enable_instruction_fusion | bool | SparseCore instruction fusion |
| xla_tpu_enable_dot_strength_reduction | bool | dot → cheaper op |
| xla_tpu_enable_ragged_dot_kernel | bool | ragged-dot kernel |
| xla_tpu_choose_faster_windowed_einsum_over_mem | bool | windowed-einsum speed/mem tradeoff |
| xla_jf_conv_full_precision | bool | full-precision conv |
| xla_jf_auto_assign_mxu | bool | auto MXU assignment |
| xla_tpu_accurate_exp / _log1p / _logistic | bool | accurate transcendental family |
| xla_tpu_bf16_emission_mode | enum | bf16 emission policy |
| xla_tpu_experimental_enable_dynamic_int8_quantization | bool | dynamic int8 quant (experimental) |

GOTCHA — the help/error-string =value clause is the value the message tells you to set, not the registered default. The byte-authoritative default is the FLAGS_<name> inline literal at FlagImpl+0x48, and for this cluster it is 01 00 00 00 = true in every case: rwb_fusion, dot_dot_fusion, accumulate_into_mrb, nested_dot_fusion, and allow_deeply_nested_fusion_numerical_diff are all true by default. The error strings (e.g. in PartialReduceEmitter::ValidateShapes @ 0x10eaa120, AssignMrbEntriesToChains @ 0x10f4ac60) offer =false/=true as a workaround to flip an on-by-default knob, so reading the suggested value as the default inverts it. Trust the +0x48 union, never the prose; see tce-field-offsets-defaults.md.
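Reading the inline literal is a plain little-endian decode once the four bytes at FlagImpl+0x48 have been extracted. The sketch below assumes the bytes are already in hand and that the literal is a 4-byte little-endian word, as the `01 00 00 00` and `0x0a` examples on this page suggest; `decode_inline_default` is a hypothetical helper and does not locate the field in the binary.

```python
def decode_inline_default(raw: bytes, kind: str):
    """Decode a 4-byte little-endian inline default as bool or signed int."""
    if len(raw) < 4:
        raise ValueError("expected at least 4 bytes")
    word = raw[:4]
    if kind == "bool":
        return int.from_bytes(word, "little") != 0
    if kind == "int":
        return int.from_bytes(word, "little", signed=True)
    raise ValueError(f"unsupported kind: {kind}")


# Bytes as quoted on this page.
rwb_fusion_default = decode_inline_default(bytes.fromhex("01000000"), "bool")
oom_threshold_default = decode_inline_default(bytes.fromhex("0a000000"), "int")

print(rwb_fusion_default)      # True
print(oom_threshold_default)   # 10
```

The signed decode matters for integer flags: a -1 sentinel would appear as `ff ff ff ff`, which an unsigned read would report as 4294967295.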


4. Memory-Space Assignment (MSA)

Purpose

MSA controls where buffers live (VMEM / CMEM / HBM), how async copies prefetch across the memory hierarchy, and how the scoped-memory allocator (telamalloc) packs them. The knobs split three ways: the xla_tpu_* MSA family (55), the dedicated xla_msa_* namespace (22), and the per-generation xla_gf_vmem_* (6acc60406) / xla_ior_fast_mem_* overlays. Many MSA fields resolve through the AUTO tri-state rather than carrying a flat default — see autoproto-autoor-resolution.md.

Catalog — xla_tpu_* MSA

| Flag | Type | Effect |
| --- | --- | --- |
| xla_tpu_alternate_memory_benefit_scaling_factor_for_large_buffers | float | MSA benefit scaling |
| xla_tpu_async_copy_bandwidth_scaling_factor | float | async-copy BW model |
| xla_tpu_allocate_scoped_vmem_at_same_offset | bool | scoped VMEM offset reuse |
| xla_tpu_allocate_scoped_cmem_at_same_offset | bool | scoped CMEM offset reuse |
| xla_tpu_allow_in_cmem_copy | bool | permit copies into CMEM |
| xla_tpu_scoped_cmem_for_all_reduce | bool | scoped CMEM for all-reduce |
| xla_tpu_vmem_scavenging_mode | enum | VMEM scavenger policy |
| xla_tpu_vmem_use_telamalloc | bool | telamalloc VMEM allocator |
| xla_tpu_scoped_vmem_limit_kib | int | scoped-VMEM byte budget (KiB) |

Catalog — xla_msa_* namespace (22)

| Flag | Type | Effect |
| --- | --- | --- |
| xla_msa_enable | bool | MSA master gate |
| xla_msa_max_outstanding_prefetches | int | prefetch concurrency cap |
| xla_msa_max_outstanding_evictions | int | eviction concurrency cap |
| xla_msa_max_cross_program_prefetches | int | XPP count cap |
| xla_msa_max_repacks / _max_retries | int | repack / retry budgets |
| xla_msa_min_overlap_to_async_copy_ratio | float | min overlap ratio |
| xla_msa_preferred_overlap_to_async_copy_ratio | float | preferred overlap ratio |
| xla_msa_max_overlap_to_mem_size_async_copy_ratio | float | overlap-vs-memsize ratio |
| xla_msa_enable_window_prefetch | bool | window prefetch |
| xla_msa_enable_sync_copy_replacement | bool | sync→async copy replacement |
| xla_msa_expanded_scoped_alternate_memory_mode | enum | scoped-alt-mem mode |
| xla_msa_experimental_ior_algorithm | enum | "IOR" eviction algorithm (experimental) |
| xla_msa_use_bundle_aware_cost_model | bool | bundle-aware cost model |
| xla_msa_cost_model_options | string | cost-model config string |

Per-generation overlays: xla_gf_vmem_max_outstanding_evictions / _max_repacks / _max_retries (int, 6acc60406), xla_gf_vmem_use_ior_algorithm (enum), xla_ior_fast_mem_* (4 flags, the fast-mem round-trip MSA variant). The generic xla_enable_cross_program_prefetch and xla_default_cross_program_prefetch_heuristic gate XPP at the DebugOptions level.


5. Collectives / ICI (174 xla_tpu_)

Purpose

The largest xla_tpu_ subsystem. It covers the inter-chip-interconnect (ICI) collective emitters (all-reduce, all-gather, reduce-scatter, all-to-all), the resilient/fault-aware route selection, the sflag (sync-flag) wait watchdogs and hang-attribution telemetry, and the ICI-SDC (silent-data-corruption) test harness. The megascale_* family, catalogued separately, is the DCN runtime layer above these.

Catalog — collective emitters + sflag watchdogs

| Flag | Type | Default | Effect |
| --- | --- | --- | --- |
| xla_tpu_enable_sparse_core_reduce_scatter_v2 | AutoOr<bool> | true (AUTO→on, but TpuVersion+second-field composite at EnableSparseCoreReduceScatterV2 @ 0x1d6b8660) | SC ND reduce-scatter v2 |
| xla_tpu_all_gather_collective_matmul_mode | enum | (unrec) | collective-matmul AG mode |
| xla_tpu_all_gather_step_count | int | (unrec) | AG ring step count |
| xla_tpu_all_reduce_vmem_contingency_kib | int | (unrec) | AR VMEM reserve (KiB) |
| xla_tpu_all_to_all_max_rdma_size_kib | int | (unrec) | A2A RDMA chunk cap (KiB) |
| xla_tpu_1d_uni_direction_ring_min_input_size_chunks | int | (unrec) | 1-D ring threshold |
| xla_tpu_use_resilient_collective_emitter | bool | (unrec) | fault-aware route table |
| xla_tpu_add_barriers_around_aggregated_collectives | bool | (unrec) | barrier wrapping |
| xla_tpu_force_startup_barrier_in_binomial_all_reduce | bool | (unrec) | startup barrier |
| xla_tpu_combine_quantized_all_reduce_operands | bool | (unrec) | quantized-AR operand combine |
| xla_tpu_checksum_all_reduce_transfers | bool | (unrec) | AR transfer checksum |
| xla_tpu_debug_sflag_wait_timeout_ms | int | (unrec) | TC sflag-wait watchdog |
| xla_tpu_debug_sc_sflag_wait_timeout_ms | int | (unrec) | SC sflag-wait watchdog |
| xla_tpu_collect_sflag_wait_stats | bool | (unrec) | sflag-wait stats master |
| xla_tpu_collect_sflag_wait_hang_core | bool | (unrec) | hang-attribution: core |
| xla_tpu_collect_sflag_wait_hang_rate | float | (unrec) | hang-rate stat |

Catalog — generic collectives + ICI-SDC harness

| Flag | Type | Effect |
| --- | --- | --- |
| xla_enable_async_all_gather | bool | async AG (DebugOptions) |
| xla_enable_async_all_reduce | bool | async AR (DebugOptions) |
| xla_enable_async_reduce_scatter_fusion | bool | async RS fusion |
| xla_all_gather_combiner_threshold_count | float | AG combiner threshold |
| xla_all_reduce_latency_bound_threshold_in_bytes | float | AR latency-bound threshold |
| xla_enable_all_gather_2d_emitter / _3d_emitter | bool | 2D/3D AG emitter |
| xla_tpu_ici_sdc_test_iterations | int | ICI-SDC test iterations |
| xla_tpu_ici_sdc_test_packet_size_chunks | int | ICI-SDC packet size |
| xla_tpu_ici_sdc_test_inject_mismatch_for_testing_only | bool | inject ICI mismatch (testonly) |
| xla_tpu_ici_sdc_test_run_on_program_start | bool | run harness at program start |

The ICI-SDC test sub-family has 10 members (_iterations, _packet_size_chunks, _buffer_size_chunks, _delay_mask, _pipeline_depth, _max_distance, _emit_compact_code, _run_on_program_start, _inject_mismatch_for_testing_only, _sflag_wait_timeout_ms) — a self-test harness, not production tuning.


6. SparseCore + BarnaCore Embedding

Purpose

Two families serve the SparseCore (SC) embedding path: xla_sc_* (92) are the SparseCore LLVM-backend compiler/codegen knobs, and barna_core_* (61) are the BarnaCore HW embedding-accelerator runtime tunables. The xla_tpu_* side (50) carries the SC offload gates and the SC SDC checker. SC compiler flags land in TCE; BarnaCore flags are standalone runtime absl::Flag globals.

Catalog — xla_tpu_* SC offload + xla_sc_* compiler

| Flag | Type | Default | Effect |
| --- | --- | --- | --- |
| xla_tpu_enable_offloading_gather_to_sparsecore | bool | false | gather offload to SC |
| xla_tpu_enable_offloading_scatter_to_sparsecore | enum (Tristate) | ENABLED (Gen movb $2) | scatter offload to SC |
| xla_tpu_enable_sc_log_recorder | AutoOr<bool> | false (AUTO→off) | SC log recorder |
| xla_tpu_embedding_table_oblongness_threshold | float | 50.0 | embedding-table oblongness cutoff |
| xla_tpu_enable_sc_sdc_checker | bool | (unrec) | SparseCore SDC checker |
| xla_tpu_aggregate_data_dependent_sc_ops | bool | (unrec) | data-dependent SC aggregation |
| xla_sc_enable_instruction_fusion | bool | (unrec) | SC instruction fusion |
| xla_sc_enable_latency_hiding_scheduler | bool | (unrec) | SC LHS |
| xla_sc_enable_tile_overlays / _scs_overlays | bool | (unrec) | tile / SCS overlays |
| xla_sc_enable_stack_eliding | bool | (unrec) | stack eliding |
| xla_sc_enable_hbm_optimization_mode | enum | (unrec) | SC HBM optimization mode |
| xla_sc_detect_nan | bool | (unrec) | SC NaN detection |
| xla_sc_assert_level | enum | (unrec) | SC assertion level |
| xla_sc_dump_llvm_ir_to | string | (unrec) | dump SC LLVM IR |
| xla_sc_use_legacy_embeddings_loop_configs | bool | (unrec) | legacy embeddings loop configs |

Catalog — barna_core_* embedding runtime (61)

| Flag | Type | Effect |
| --- | --- | --- |
| barna_core_max_hbm_fraction_for_embeddings | float | HBM fraction cap for embeddings |
| barna_core_override_tpu_table_limit_fraction | float | per-table limit override |
| barna_core_software_row_sharding_hbm_usage_fraction_limit | float | SW row-sharding HBM cap |
| barna_core_master_partitioner_thread_count | int | partitioner threads |
| barna_core_hot_id_profiler_top_n_multiple | int | hot-ID profiler top-N |
| barna_core_file_operation_timeout | duration/int | file-op timeout |
| barna_core_embedding_common_config_proto_path | string | embedding config proto path |
| barna_core_partitioner_optimization_objective | enum | partitioner objective |

7. Layout + Auto-Sharding

Purpose

Layout knobs (29 xla_tpu_) control tiling, the "large 2nd-minor" layout per element width (x16/x8/x4), relayout, and layout negotiation. Auto-sharding / SPMD (40 xla_tpu_ + 8 generic) controls the auto-SPMD partitioner's memory budget and solver, plus user-sharding preservation.

Catalog — layout + sharding

| Flag | Type | Effect |
| --- | --- | --- |
| xla_tpu_allow_layout_negotiation | bool | layout negotiation gate |
| xla_tpu_enable_large_2nd_minor_layout | int | large 2nd-minor layout master |
| xla_tpu_allow_large_2nd_minor_layout_for_x16 | int | per-x16 variant |
| xla_tpu_allow_large_2nd_minor_layout_for_x8 | int | per-x8 variant |
| xla_tpu_allow_large_2nd_minor_layout_for_x4 | int | per-x4 variant |
| xla_tpu_allow_sharding_on_minor_dim | int | minor-dim sharding |
| xla_tpu_auto_spmd_partitioning_memory_budget_gb | int | auto-SPMD memory budget (GB) |
| xla_tpu_auto_spmd_partitioning_memory_budget_ratio | float | budget ratio |
| xla_tpu_auto_spmd_partitioning_solver_timeout_seconds | int | solver wall-clock cap |
| xla_tpu_auto_spmd_keep_all_user_shardings | bool | preserve user shardings |
| xla_tpu_auto_spmd_remove_all_user_shardings | bool | strip user shardings |
| xla_tpu_autotune_shardings | bool | sharding autotune |
| xla_jf_spmd_threshold_for_windowed_einsum_mib | float | windowed-einsum SPMD threshold (MiB) |
| xla_jf_bf16_propagation | bool | bf16 propagation |

GOTCHA — xla_tpu_allow_large_2nd_minor_layout_for_x16 and its _x8 / _x4 siblings are typed int, not bool, despite the allow_ prefix that elsewhere signals a boolean. The _for_x16 suffix implies a tri-state-or-count integer, not an on/off gate (LOW confidence — the type was inferred from suffix, not byte-confirmed). A reimplementer must not assume every allow_* flag is boolean.


8. Memory / Allocation + Runtime

Purpose

Allocation knobs (27 xla_tpu_ + generic) control OOM handling, HBM/VMEM/SMEM spilling, defragmentation, and allocation backtraces. The runtime/cache/driver family (tpu_*, 69) controls the compilation cache, driver watchdogs, and core-dump behavior — these are runtime, not compile-time, knobs.

Catalog — allocation + runtime/cache

| Flag | Type | Default | Effect |
| --- | --- | --- | --- |
| xla_tpu_impure_oom_fast_exit_threshold | int | 10 (+0x48=0x0a) | OOM fast-exit threshold |
| xla_enable_megacore_hbm_spill | bool | true | megacore HBM spill |
| xla_tpu_always_spill_to_default_memory | bool | (unrec) | always spill to default mem (proto field) |
| xla_jf_poison_vmem_allocations | bool | (unrec) | poison VMEM allocs (debug) |
| xla_jf_memory_allocator_include_backtrace | bool | (unrec) | alloc backtraces |
| xla_jf_lsra_v2_spill_reporter_threshold | int | (unrec) | LSRA spill-report threshold |
| xla_hbm_logging_buffer_size_bytes | int | (unrec) | HBM log buffer size |
| tpu_compilation_cache_persists_in_riegeli | bool | (unrec) | cache persistence format |
| tpu_persistent_compilation_cache_location | string | (unrec) | cache location path |
| tpu_persistent_compilation_cache_ttl_secs | int | (unrec) | cache TTL |
| tpu_driver_callback_watchdog_timeout | int | (unrec) | driver watchdog timeout |
| tpu_core_dump_directory | string | (unrec) | core-dump directory |
| tpu_log_allocations_on_oom | bool | (unrec) | log allocations on OOM |
| DANGEROUS_tpu_runtime_abi_verification_disabled | bool | (unrec) | disables ABI verification |

QUIRK — xla_tpu_impure_oom_fast_exit_threshold defaults to 10 (byte-evidenced: inline FlagImpl+0x48 = 0x0a, no Gen reloc) — a positive count, not a -1 "disabled" sentinel. The impure_ prefix is a libtpu naming convention marking ~30 non-deterministic / logging / side-effecting knobs (impure_cost_model_logging_options, impure_llo_lifecycle_log_mode, impure_probability_of_host_offloading). A reimplementer should treat impure_ flags as runtime-observable side channels, not pure compile decisions.


9. Debug / Dump / Log / Trace

Purpose

77 xla_tpu_ debug/dump knobs plus the generic xla_jf_dump_* and xla_enable_*_trace families (181 in the full group). These control HLO/LLO/MLIR dumps, tracing, NaN/SDC checking, and the log recorders. The xla_jf_dump_* family is the Jellyfish-backend dump surface; xla_sc_dump_* is the SparseCore equivalent.

Catalog — dump / trace / verify

| Flag | Type | Default | Effect |
| --- | --- | --- | --- |
| xla_tpu_enable_tile_log_recorder | bool | false | tile log recorder |
| xla_jf_debug_level | int | 1 | Jellyfish debug verbosity |
| xla_jf_run_verifier | bool | false | run HLO verifier |
| xla_jf_dump_to | string | (unrec) | Jellyfish dump directory |
| xla_jf_dump_hlo_text | bool | (unrec) | dump HLO text |
| xla_jf_dump_llo_html | bool | (unrec) | dump LLO HTML |
| xla_jf_dump_isa_program_proto | string | (unrec) | dump ISA program proto |
| xla_jf_dump_extended_fingerprint | string | (unrec) | extended fingerprint dump |
| xla_jf_collect_llo_stack_trace | bool | (unrec) | collect LLO stack trace |
| xla_sc_dump_llvm_ir_to | string | (unrec) | dump SC LLVM IR |
| xla_sc_dump_mlir_to | string | (unrec) | dump SC MLIR |
| xla_enable_hlo_trace | bool | (unrec) | HLO trace |
| xla_enable_mxu_trace | bool | (unrec) | MXU trace |
| xla_enable_transpose_trace | bool | (unrec) | transpose trace |
| xla_dump_hlo_memory_schedule_info | bool | (unrec) | dump memory schedule info |

Catalog — LLVM-emitter dumps (xla_llvm_*, 4)

| Flag | Type | Effect |
| --- | --- | --- |
| xla_llvm_isa_emitter | bool | enable LLVM→ISA emitter |
| xla_llvm_isa_emitter_bundles | bool | emit instruction bundles |
| xla_llvm_isa_emitter_force | bool | force the LLVM ISA emitter |
| xla_llvm_generate_xla_compatible_dwg | bool | XLA-compatible debug-with-graph |

10. Autotune / AutoFDO

Purpose

23 xla_tpu_autofdo_* flags drive profile-guided optimization: fingerprint-keyed loading of pre-tuned flags, layouts, schedules, and shardings, plus the FlagNet predictor. AutoFDO is a fingerprint→tuning cache: a module's fingerprint keys a stored set of decisions that bypass the live cost models.
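The fingerprint→tuning-cache idea can be illustrated with a small lookup structure. Everything in this sketch is hypothetical: the real fingerprint function, storage format, and fallback path inside libtpu are not described on this page, so SHA-256 over the module text and the `TuningCache` class merely stand in for them.

```python
import hashlib
from typing import Callable, Dict


def fingerprint(module_text: str) -> str:
    """Stand-in fingerprint: SHA-256 of the module text (an assumption)."""
    return hashlib.sha256(module_text.encode("utf-8")).hexdigest()


class TuningCache:
    """Maps a module fingerprint to a stored set of tuning decisions."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, str]] = {}

    def record(self, module_text: str, decisions: Dict[str, str]) -> None:
        self._store[fingerprint(module_text)] = dict(decisions)

    def decide(self, module_text: str,
               live_cost_model: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
        """Return stored decisions on a hit; otherwise consult the live model."""
        hit = self._store.get(fingerprint(module_text))
        if hit is not None:
            return dict(hit)          # cache hit bypasses the live model
        return live_cost_model(module_text)


cache = TuningCache()
cache.record("module A", {"layout": "tuned"})


def fallback_model(text: str) -> Dict[str, str]:
    return {"layout": "computed"}


print(cache.decide("module A", fallback_model))   # stored decisions
print(cache.decide("module B", fallback_model))   # falls back to the model
```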

Catalog — AutoFDO

| Flag | Type | Effect |
| --- | --- | --- |
| xla_tpu_autofdo | bool | AutoFDO master gate |
| xla_tpu_autofdo_profile_file | string | profile file path |
| xla_tpu_autofdo_load_module_layout_fingerprint | string | per-module layout fingerprint |
| xla_tpu_autofdo_load_module_flag_fingerprint | string | per-module flag fingerprint |
| xla_tpu_autofdo_module_flags / _module_layouts | bool | apply flag / layout tunings |
| xla_tpu_autofdo_flagnet | enum | FlagNet predictor mode |
| xla_tpu_autofdo_flagnet_confidence_threshold | int | FlagNet confidence cutoff |
| xla_tpu_autofdo_hlo_module_size_threshold | int | size threshold for AutoFDO |
| xla_tpu_autotune_layouts / _schedules / _shardings | bool | autotune layouts/schedules/shardings |
| xla_tpu_autofdo_proposed_layout_file | string | proposed-layout file |

11. The Certainty Boundary

The entire catalog above rests on two extraction methods with different trust levels, and a reimplementer must respect the seam.

Names — CERTAIN. The 2048 registered names come from the AbslFlagHelpGenFor<name> mangled-symbol set, which is a 1:1 enumeration of absl::Flag globals (sort -u | wc -l = 2048). Every name catalogued on this page resolves to such a symbol. Additional flag-like strings appear in .rodata (deprecated aliases / error-message references) but are not registered flags; they are not counted in the 2048 and are out of scope here.

Types — HIGH, mostly inferred. The suffix convention alone (XLA's own registration convention) types ~99% of flags. The ambiguous suffixes (_threshold int-or-float, _mode/_level int-enum-or-string) are marked LOW per row. Byte-confirming a type needs the absl::Flag<T> template argument from the FLAGS_<name> symbol's RTTI, which was not done here.
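The convention-based typing can be sketched as a small rule table. The rules below are the ones this page states (prefix verbs for bool, unit and kind suffixes for the rest); the function `infer_type`, the rule ordering (suffix before prefix), and the list of family prefixes to strip are assumptions made for the illustration.

```python
from typing import Tuple

# Family prefixes are stripped before looking for a leading verb.
FAMILY_PREFIXES = ("xla_tpu_", "xla_jf_", "xla_sc_", "xla_msa_", "xla_gf_",
                   "megascale_", "barna_core_", "tpu_", "tf_", "xla_")

SUFFIX_RULES = [
    (("_ms", "_kib", "_count"), "int", "HIGH"),
    (("_ratio", "_factor"), "float", "HIGH"),
    (("_file", "_path"), "string", "HIGH"),
    (("_mode", "_level"), "enum", "LOW"),        # int-enum-or-string ambiguity
    (("_threshold",), "int-or-float", "LOW"),
]

BOOL_VERBS = ("enable_", "use_", "allow_")


def infer_type(flag_name: str) -> Tuple[str, str]:
    """Return (inferred type, confidence) from the naming convention alone."""
    for suffixes, inferred, confidence in SUFFIX_RULES:
        if flag_name.endswith(suffixes):
            return inferred, confidence
    stem = flag_name
    for prefix in FAMILY_PREFIXES:
        if flag_name.startswith(prefix):
            stem = flag_name[len(prefix):]
            break
    # Appending "_" lets a bare verb such as xla_msa_enable match "enable_".
    if (stem + "_").startswith(BOOL_VERBS):
        return "bool", "HIGH"
    return "unknown", "LOW"


for name in ("xla_tpu_scoped_vmem_limit_kib", "xla_tpu_vmem_scavenging_mode",
             "xla_tpu_enable_latency_hiding_scheduler",
             "xla_tpu_hbm_initial_cycle_penalty"):
    print(name, infer_type(name))
```

A heuristic like this would label xla_tpu_allow_large_2nd_minor_layout_for_x16 as bool, whereas the catalog types it int at LOW confidence, so exceptions of that kind need per-row overrides on top of the rule table.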

Defaults — only 18 are CERTAIN. Most come from =value clauses in help/error strings; xla_tpu_embedding_table_oblongness_threshold is recovered directly from its AbslFlagDefaultGenFor initializer (movl $0x42480000 = 50.0f @ 0x1d7068c0), which overrides the =1 workaround value its error string suggests. Everything else lives in .text initializers (xla::DefaultDebugOptions() and the per-flag FLAGS_* static ctors) not recoverable from strings. The full byte-evidenced set:

| Flag | Default |
| --- | --- |
| xla_tpu_accumulate_into_mrb | true (+0x48=01) |
| xla_tpu_rwb_fusion | true (+0x48=01) |
| xla_tpu_dot_dot_fusion | true (+0x48=01) |
| xla_tpu_nested_dot_fusion | true |
| xla_tpu_allow_deeply_nested_fusion_numerical_diff | true |
| xla_tpu_fusion_debugger_instrument_inputs | AUTO (Gen movw $0) → off |
| xla_tpu_scheduling_annotation_deannotate_unsupported_groups | false (AutoOr, AUTO→off) |
| xla_tpu_enable_tile_log_recorder | false (+0x48=00) |
| xla_tpu_enable_sc_log_recorder | false (AutoOr, AUTO→off) |
| xla_tpu_enable_sparse_core_reduce_scatter_v2 | true (AutoOr AUTO→on; version composite) |
| xla_tpu_enable_offloading_gather_to_sparsecore | false |
| xla_tpu_enable_offloading_scatter_to_sparsecore | ENABLED (Gen movb $2) |
| xla_tpu_impure_oom_fast_exit_threshold | 10 (+0x48=0x0a) |
| xla_tpu_embedding_table_oblongness_threshold | 50.0 (float) |
| xla_enable_megacore_hbm_spill | true |
| xla_jf_debug_level | 1 |
| xla_jf_run_verifier | false |
| megascale_use_numa_aware_threadpool | true (+0x48=01) |
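The oblongness default is the one float in this set, and its immediate can be checked by reinterpreting the 32-bit pattern as an IEEE-754 single. The helper name `f32_from_immediate` is illustrative; the immediate value is the one quoted above.

```python
import struct


def f32_from_immediate(imm: int) -> float:
    """Reinterpret a 32-bit integer immediate as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", imm))[0]


oblongness_default = f32_from_immediate(0x42480000)
print(oblongness_default)   # 50.0
```

By hand: the exponent field 0x84 is 132, giving 2^(132-127) = 32, and the mantissa 0x480000 contributes 1 + 0.5625, so the value is 32 × 1.5625 = 50.0.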

NOTE — for the ~330 TCE fields that are AutoProto oneofs, "default" is not even a flat value — it is an AUTO-resolution polarity baked into each consumer, optionally rewritten by a per-TpuVersion MSA overlay. The effective value is flag-default ⊕ AUTO-polarity ⊕ per-version-overlay. That resolution is owned by autoproto-autoor-resolution.md; this atlas catalogs the flag names and inferred types, not their resolved values.
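The three-part resolution can be sketched as a small function over an optional boolean. The precedence used here (an explicitly set flag wins, then a per-version overlay, then the consumer's AUTO polarity) is an assumption made for illustration; the authoritative rule is the one documented in autoproto-autoor-resolution.md. The function name and the version label `"version_x"` are hypothetical.

```python
from typing import Dict, Optional


def resolve_auto_or_bool(explicit: Optional[bool],
                         auto_polarity: bool,
                         overlay: Optional[Dict[str, bool]] = None,
                         tpu_version: Optional[str] = None) -> bool:
    """Resolve an AutoOr<bool>-style field to its effective value.

    explicit      -- value set on the command line, or None for AUTO
    auto_polarity -- what AUTO means to this particular consumer
    overlay       -- optional per-version rewrite of the AUTO result
    """
    if explicit is not None:
        return explicit                      # assumed: explicit setting wins
    if overlay is not None and tpu_version in overlay:
        return overlay[tpu_version]          # assumed: overlay beats polarity
    return auto_polarity


# AUTO with an off-polarity consumer resolves to False.
print(resolve_auto_or_bool(None, auto_polarity=False))
# The same AUTO field, rewritten by an overlay for one hypothetical version.
print(resolve_auto_or_bool(None, False, {"version_x": True}, "version_x"))
# An explicit value is returned unchanged.
print(resolve_auto_or_bool(True, auto_polarity=False))
```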


| Component | Relationship |
| --- | --- |
| AbslFlagHelpGenFor<name> @ symtab | the 1:1 name-enumeration symbol per flag |
| xla::MakeDebugOptionsFlags @ 0x1e66ce80 | registers the xla_* DebugOptions flags |
| OverrideTpuCompEnvByCmdLineFlags @ 0x1d73e640 | binds the TPU families into TCE |
| GetLibTpuInitArguments @ 0x20ccca20 | the LIBTPU_INIT_ARGS funnel for all flags |
| PartialReduceEmitter::ValidateShapes @ 0x10eaa120 | hosts the nested_dot_fusion=true evidence string |

Cross-References

  • overview.md: the four-stage flag→DebugOptions→TCE→effective-value pipeline this atlas sits inside
  • flag-families.md: the prefix→owner taxonomy in full; which proto each family lands in
  • env-vars.md: LIBTPU_INIT_ARGS and the env-var roster that feeds the parse
  • debugoptions-proto.md: xla::DebugOptions, the 290-field schema the plain xla_* flags back (full descriptor decode; the earlier "111 wire-fields / 94 flag-wired" figure was a partial sample, superseded there)
  • tpu-compilation-environment.md: the 1121-field TCE proto the xla_tpu_* / xla_jf_* / xla_sc_* flags land in
  • autoproto-autoor-resolution.md: the AUTO tri-state that makes "default" a resolution rule for ~330 fields
  • tce-field-offsets-defaults.md: the byte-exact field→offset→default reference where the non-evidenced defaults are recovered