Flag Catalog (Full)
All counts, symbols, and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, ELF x86-64 DYN, not stripped). Other versions differ.
Abstract
This appendix is the exhaustive prefix index of libtpu's flag surface: every XLA_FLAGS / LIBTPU_INIT_ARGS / xla.DebugOptions name string the binary registers, grouped by prefix, with per-group counts, a representative enumerated subset, the inferred type, and the byte-evidenced default where one survives. It is the machine-style companion to the grouped narrative in xla-flag-atlas.md: the atlas explains what the high-signal knobs do; this page is the complete reference table a reader greps. Where the two disagree on a count, this page wins — its numbers come directly from the binary.
The authoritative name census is the mangled helper-symbol set. Every absl::Flag<T> FLAGS_<name> global emits an _ZN<len>AbslFlagHelpGenFor<name>8NonConstEv helper symbol, so that symbol set is a 1:1 enumeration of registered flags; length-prefix parsing recovers each <name> exactly. The binary carries 2048 such distinct symbols — the registered-flag count. A further set of names appears only in .rodata (deprecated aliases, error-message-only references) that are not backed by a live AbslFlagHelpGenFor symbol; folding those in yields 2107 distinct flag names. The two numbers answer two different questions: 2048 is "how many flags can you set," 2107 is "how many flag names exist as strings in the binary." Both are used below, labelled explicitly per row.
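The length-prefix recovery described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to produce the census: the function names `flag_name_from_symbol` and `census` are hypothetical, and the input is assumed to be a plain list of mangled symbol strings (for example, from a symbol-table dump).

```python
import re
from typing import Iterable, List, Optional

HELPER_PREFIX = "AbslFlagHelpGenFor"  # 18 characters, precedes the flag name
HELPER_SUFFIX = "8NonConstEv"         # 8-char "NonConst", E closes the nested name, v = no parameters


def flag_name_from_symbol(symbol: str) -> Optional[str]:
    """Return the flag name encoded in a helper symbol, or None if it is not one."""
    match = re.match(r"_ZN(\d+)", symbol)
    if match is None:
        return None
    length = int(match.group(1))          # length of "AbslFlagHelpGenFor<name>"
    start = match.end()
    ident = symbol[start:start + length]  # exactly <length> characters, no guessing
    rest = symbol[start + length:]
    if len(ident) != length or not ident.startswith(HELPER_PREFIX):
        return None
    if rest != HELPER_SUFFIX:
        return None
    name = ident[len(HELPER_PREFIX):]
    return name or None


def census(symbols: Iterable[str]) -> List[str]:
    """Distinct flag names recovered from a symbol list, sorted."""
    names = {flag_name_from_symbol(s) for s in symbols}
    names.discard(None)
    return sorted(names)


example = "_ZN36AbslFlagHelpGenForxla_tpu_rwb_fusion8NonConstEv"
print(flag_name_from_symbol(example))
```

Because the identifier length is read from the symbol itself, a flag name that happens to contain the text `8NonConstEv` or a digit run is still recovered exactly, which is why the length-prefix parse is preferred over a greedy regex on the name.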
Every flag is settable through one funnel: LIBTPU_INIT_ARGS (env string @ file 0x918c880) is read by GetLibTpuInitArguments @ 0x20ccca20, split argv-style, and handed to absl::ParseCommandLine inside RealInitGoogle @ 0x210ae860. Because the parse is generic, the entire 2048-flag set is reachable through that one variable — there is no init-args-private subset. The xla_* (non-TPU) flags also bind to xla::DebugOptions fields via MakeDebugOptionsFlags @ 0x1e66ce80; the TPU-private families (xla_tpu_*, xla_jf_*, codenames, megascale_*, barna_core_*) are standalone globals that land in the TpuCompilationEnvironment via OverrideTpuCompEnvByCmdLineFlags @ 0x1d73e640, not in DebugOptions. Which proto a name lands in is owned by flag-families.md; this page owns the grouped name index.
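As a usage sketch of the funnel, the snippet below composes a LIBTPU_INIT_ARGS value from a dictionary and exports it. The helper `build_init_args` is hypothetical, and the exact tokenisation rules of the argv-style split are not documented here, so the sketch assumes simple space-separated `--name=value` tokens and rejects values containing whitespace rather than guessing at quoting behaviour. The variable has to be in the environment before libtpu.so is loaded for the parse to see it.

```python
import os


def build_init_args(flags):
    """Render {name: value} as a space-separated --name=value string."""
    parts = []
    for name, value in flags.items():
        # bool must be checked before the generic case: absl-style booleans
        # are spelled true/false, not True/False.
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        if any(ch.isspace() for ch in text):
            # Quoting rules of the splitter are an unknown here, so refuse.
            raise ValueError(f"value for {name!r} contains whitespace")
        parts.append(f"--{name}={text}")
    return " ".join(parts)


init_args = build_init_args({
    "xla_tpu_rwb_fusion": False,   # TPU-private family flag
    "megascale_num_slices": 2,     # standalone runtime flag
})
# Export before any framework import that loads libtpu.so.
os.environ["LIBTPU_INIT_ARGS"] = init_args
print(init_args)
```

Both example flags travel through the same variable even though they land in different owners, which is the point of the single-funnel design.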
This appendix is a pure-reference catalog: it carries no reimplementation contract of its own (the registration mechanism is reimplemented from xla-flag-atlas.md and registry-mediated-flags.md). It provides:
- A per-prefix count table — registered count and rodata-name count for each of the ~14 prefix namespaces, with scope and confidence.
- Per-prefix sections — for each prefix, the subsystem split (where applicable) and a substantial enumerated subset of names with inferred type, the highest-value TPU-specific flags spelled out in full.
- The certainty boundary — the 13 flags whose error-string =value remedy clauses survive (these spell the non-default direction, not the default), and the convention-inference caveat on the type column for the rest.
| Item | Value |
|---|---|
| Registered flags (AbslFlagHelpGenFor symbols) | 2048 |
| Distinct rodata flag names (registered + rodata-only) | 2107 |
| Enumeration symbol | _ZN<len>AbslFlagHelpGenFor<name>8NonConstEv (1 per registered flag) |
| DebugOptions registrar | MakeDebugOptionsFlags @ 0x1e66ce80 (binds generic xla_* fields) |
| TCE flag→field bridge | OverrideTpuCompEnvByCmdLineFlags @ 0x1d73e640 (TPU families) |
| Funnel | LIBTPU_INIT_ARGS (str @ file 0x918c880) → GetLibTpuInitArguments @ 0x20ccca20 → absl::ParseCommandLine (RealInitGoogle @ 0x210ae860) |
| Type split (all 2107 names) | bool 1431 (68%) · int 434 (21%) · string 93 · float 79 · enum/string 70 |
| Error-string =value remedy clauses | 13 (each spells the non-default direction, not the default); actual defaults live in .text initializers |
| xla_gpu_* / xla_cpu_* registered flags | 0 (GPU/CPU flag wiring stripped from this TPU build) |
NOTE — two different xla_tpu_* counts measure two different things. The binary's AbslFlagHelpGenFor symbol set holds 909 registered xla_tpu_* flags — the settable surface. There are 968 distinct xla_tpu_* name strings in .rodata (909 registered + 59 rodata-only references), which is not the count of settable flags. This page splits the two columns per prefix so the distinction is explicit: use 909 for "how many --xla_tpu_*=... you can pass," 968 for "how many xla_tpu_* strings exist."
Per-Prefix Index (at a glance)
Every prefix namespace in the catalog, with the registered-flag count (binary AbslFlagHelpGenFor census) and the rodata-name count (registered + rodata-only). The "rodata names" column matches the prior 2107-union tally; the "registered" column is what is actually settable. Confidence is CERTAIN for counts derived directly from the symbol census, HIGH where the rodata-name delta carries concatenation noise.
| Prefix | Registered | Rodata names | Scope | Owner proto |
|---|---|---|---|---|
| xla_tpu_ | 909 | 968 | TPU-specific compiler + runtime knobs | TCE |
| megascale_ | 150 | 150 | Megascale DCN collective runtime | standalone |
| xla_jf_ | 148 | 148 | Jellyfish TPU XLA backend (all gens) | TCE |
| xla_ (plain) | 121 | 138 | Generic XLA (scheduler / MSA / collective / dump) | DebugOptions |
| xla_sc_ | 92 | 92 | SparseCore compiler (SCS/SCC LLVM backend) | TCE |
| tpu_ | 69 | 69 | TPU runtime / compilation-cache / driver | standalone |
| barna_core_ | 61 | 61 | BarnaCore embedding-engine runtime | standalone |
| xla_msa_ | 22 | 22 | Memory-Space-Assignment (dedicated namespace) | TCE |
| tf_ | 20 | 20 | TensorFlow-TPU bridge (tf_jf_* etc.) | standalone |
| xla_vf_ | 16 | 16 | Gen-specific VMEM/MSA override mirror (vf codename) | TCE |
| xla_gf_ | 14 | 14 | Gen-specific VMEM/MSA override mirror (gf codename) | TCE |
| xla_mosaic_ | 8 | 8 | Mosaic MLIR custom-kernel dialect | TCE |
| xla_ior_ | 4 | 4 | "IOR" fast-mem round-trip MSA variant | TCE |
| xla_pf_ | 1 | 1 | Gen-specific ND-allreduce override (pf codename) | TCE |
| xla_llo_ | 1 | 1 | LLO annotation lifecycle | TCE |
| xla_gpu_ | 0 | — | GPU backend (proto-only, no flag) | DebugOptions |
| xla_cpu_ | 0 | — | CPU backend (proto-only, no flag) | DebugOptions |
| (other / no XLA prefix) | 412 | 395 | abseil / grpc / protobuf / OR-tools library flags | standalone |
| Total | 2048 | 2107 | — | — |
NOTE — "registered total 2048" and "rodata names 2107" are both byte-derived: the registered total is the count of distinct AbslFlagHelpGenFor<name>8NonConstEv symbols (2048, confirmed by rg over the names sidecar); the 2107 is that set unioned with the rodata-only references (deprecated aliases, error-string mentions). The "other / no XLA prefix" registered count is the residual (2048 − the 1636 XLA/TPU-prefixed registered flags = 412); those are statically-linked library flags (alsologtostderr, alarm_on_failure, OR-tools cp_model_*, grpc internals) of no compiler interest.
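The residual arithmetic in the note can be rechecked mechanically. The sketch below transcribes the two count columns from the per-prefix table (the zero-count xla_gpu_ and xla_cpu_ rows are omitted) and derives the "other" row from the two totals; the variable names are illustrative only.

```python
# (registered, rodata_names) per prefix, transcribed from the per-prefix table.
PREFIX_COUNTS = {
    "xla_tpu_": (909, 968),
    "megascale_": (150, 150),
    "xla_jf_": (148, 148),
    "xla_ (plain)": (121, 138),
    "xla_sc_": (92, 92),
    "tpu_": (69, 69),
    "barna_core_": (61, 61),
    "xla_msa_": (22, 22),
    "tf_": (20, 20),
    "xla_vf_": (16, 16),
    "xla_gf_": (14, 14),
    "xla_mosaic_": (8, 8),
    "xla_ior_": (4, 4),
    "xla_pf_": (1, 1),
    "xla_llo_": (1, 1),
}
REGISTERED_TOTAL = 2048  # distinct AbslFlagHelpGenFor symbols
RODATA_TOTAL = 2107      # registered + rodata-only names

prefixed_registered = sum(reg for reg, _ in PREFIX_COUNTS.values())
prefixed_rodata = sum(names for _, names in PREFIX_COUNTS.values())

# The "other / no XLA prefix" row is whatever the totals leave over.
other_registered = REGISTERED_TOTAL - prefixed_registered
other_rodata = RODATA_TOTAL - prefixed_rodata

print(prefixed_registered, other_registered)
print(prefixed_rodata, other_rodata)
```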
xla_tpu_ — TPU Compiler + Runtime Knobs (909 registered / 968 names)
The dominant family — the TPU-private compiler and runtime knob surface, registered as standalone absl::Flag globals that land in TpuCompilationEnvironment, not in DebugOptions. (The sole two exceptions wired to DebugOptions are xla_tpu_detect_nan and xla_tpu_detect_inf — see the xla_ section.) Subsystem split, by keyword classification over all 968 xla_tpu_* name strings (sums to 968, not the 909 registered subset):
| Subsystem | Count | Keyword signature |
|---|---|---|
| misc / uncategorized | 288 | (no dominant keyword) |
| ICI / collectives | 174 | ici, all_reduce, all_gather, reduce_scatter, all_to_all, collective, sflag, dcn, barrier |
| fusion | 101 | fusion, fuse, rwb, dot_dot, nested_dot, multi_output, horizontal |
| debug / dump / log / trace | 77 | dump, debug, log, trace, verify, nan, recorder, assert |
| MSA / prefetch / scoped mem | 55 | msa, memory_space, prefetch, scoped_(v\|c)mem, async_copy, cmem, telamalloc |
| SparseCore (TC-side) | 50 | sparse_core, sparsecore, _sc_, embedding, minibatch, offload |
| scheduler | 47 | schedul, latency_hiding, lhs, ilp, brkga, critical_path |
| auto-sharding / SPMD | 40 | sharding, spmd, partition, shardonnay, propagat |
| layout | 29 | layout, minor_dim, 2nd_minor, transpose, relayout, _x16/_x8/_x4 |
| memory / allocation | 27 | allocat, hbm, vmem, spill, oom, defragment |
| dot / conv | 24 | dot, conv, matmul, mxu, gemm, einsum |
| autotune / autofdo | 24 | autotun, autofdo, flagnet |
| numerics / precision | 21 | accurate_, _exp, _log, precision, bf16, fp8, stochastic |
| cost-model | 8 | cost_model, cycle, learned_cost, roofline |
| runtime | 3 | runtime, init |
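A keyword classification of this kind can be sketched as a first-match substring scan. The sketch below is an assumption-laden illustration, not the classifier that produced the counts: it covers only five of the subsystems, and the precedence order between overlapping keyword sets (for example a name containing both `fusion` and `collective`) is a guess, since the table does not state it.

```python
from collections import Counter

# (subsystem, keywords) in ASSUMED precedence order; first match wins.
# Keyword lists are copied from the table; the ordering is hypothetical.
SUBSYSTEMS = [
    ("fusion", ["fusion", "fuse", "rwb", "dot_dot", "nested_dot",
                "multi_output", "horizontal"]),
    ("MSA / prefetch / scoped mem", ["msa", "memory_space", "prefetch",
                                     "scoped_vmem", "scoped_cmem",
                                     "async_copy", "cmem", "telamalloc"]),
    ("ICI / collectives", ["ici", "all_reduce", "all_gather", "reduce_scatter",
                           "all_to_all", "collective", "sflag", "dcn", "barrier"]),
    ("scheduler", ["schedul", "latency_hiding", "lhs", "ilp", "brkga",
                   "critical_path"]),
    ("debug / dump / log / trace", ["dump", "debug", "log", "trace", "verify",
                                    "nan", "recorder", "assert"]),
]
FALLBACK = "misc / uncategorized"


def classify(flag_name):
    """Return the first subsystem whose keyword occurs in the flag name."""
    for subsystem, keywords in SUBSYSTEMS:
        if any(keyword in flag_name for keyword in keywords):
            return subsystem
    return FALLBACK


def split_counts(flag_names):
    """Per-subsystem tally over a collection of flag names."""
    return Counter(classify(name) for name in flag_names)


sample = [
    "xla_tpu_rwb_fusion",
    "xla_tpu_scoped_cmem_for_all_reduce",
    "xla_tpu_debug_sflag_wait_timeout_ms",
    "xla_tpu_enable_latency_hiding_scheduler",
    "xla_tpu_hbm_initial_cycle_penalty",
]
for name in sample:
    print(name, "->", classify(name))
```

Plain substring matching is blunt (a short keyword such as `ici` can match inside unrelated words), so any reproduction of the exact counts would need the original precedence and matching rules.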
Scheduler — five engine gates
The scheduler family advertises five distinct engines, each behind its own gate. The Default column is filled only where a surviving error string evidences a value; every other default is unrecoverable from strings and is shown as —.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_tpu_enable_latency_hiding_scheduler | bool | — | master LHS gate |
| xla_tpu_enable_ilp_latency_hiding_scheduler | bool | — | ILP-based LHS engine |
| xla_tpu_enable_brkga_latency_hiding_scheduler | bool | — | genetic (BRKGA) scheduler engine |
| xla_tpu_brkga_latency_hiding_scheduler_generation_limit | int | — | BRKGA generation cap |
| xla_tpu_brkga_latency_hiding_scheduler_num_chromosomes | int | — | BRKGA population size |
| xla_tpu_brkga_latency_hiding_scheduler_num_top_heap_computations | int | — | BRKGA elite-set size |
| xla_tpu_brgka_latency_hiding_scheduler_no_progress_limit | int | — | BRKGA stall cutoff (note brgka typo) |
| xla_tpu_enable_dozer_latency_hiding_scheduler | bool | — | "Dozer" scheduler variant |
| xla_tpu_enable_lem_scheduler | bool | — | LEM scheduler variant |
| xla_tpu_consider_lp_llo_scheduler | bool | — | LP-based LLO scheduler |
| xla_tpu_enable_depth_memory_pressure_reduction | bool | — | depth-based memory-pressure reduction |
| xla_tpu_enable_cp_send_done_scheduling | bool | — | collective-permute send/done sched |
| xla_tpu_aggressive_flexible_annotation_scheduling | bool | — | scheduling-annotation aggressiveness |
| xla_tpu_scheduling_annotation_deannotate_unsupported_groups | bool | false (errstr remedy =true) | deannotate annotation gaps |
| xla_tpu_enable_all_experimental_scheduler_features | bool | — | enable all experimental sched features |
QUIRK — the name xla_tpu_brgka_latency_hiding_scheduler_no_progress_limit carries a transposed-letter typo (brgka vs the brkga used by its three siblings). It is a distinct registered flag string, not an alias — a reimplementer must register the misspelt name verbatim or this knob is unreachable.
ICI / Collectives — largest subsystem (174)
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_tpu_debug_sflag_wait_timeout_ms | int | — | TC sflag-wait watchdog |
| xla_tpu_debug_sc_sflag_wait_timeout_ms | int | — | SparseCore sflag-wait watchdog |
| xla_tpu_use_resilient_collective_emitter | bool | — | fault-aware route table |
| xla_tpu_collect_sflag_wait_hang_core | bool | — | hang-attribution telemetry |
| xla_tpu_collect_sflag_wait_hang_rate | float | — | hang-rate statistic |
| xla_tpu_force_startup_barrier_in_binomial_all_reduce | bool | — | startup barrier injection |
| xla_tpu_binomial_all_reduce_use_physical_core_ids | bool | — | physical-core-id binomial AR |
| xla_tpu_all_gather_collective_matmul_mode | enum/string | — | collective-matmul AG mode |
| xla_tpu_all_gather_step_count | int | — | AG ring step count |
| xla_tpu_all_reduce_vmem_contingency_kib | int | — | AR VMEM reserve |
| xla_tpu_all_to_all_max_rdma_size_kib | int | — | A2A RDMA chunk cap |
| xla_tpu_async_ragged_all_to_all_max_rdma_size_kib | int | — | ragged A2A RDMA cap |
| xla_tpu_add_barriers_around_aggregated_collectives | bool | — | barrier wrapping |
| xla_tpu_aggressive_opt_barrier_removal | bool | — | opt-barrier removal |
| xla_tpu_checksum_all_reduce_transfers | bool | — | checksum AR transfers |
| xla_tpu_1d_uni_direction_ring_min_input_size_chunks | int | — | 1-D ring threshold |
The ICI-SDC test harness contributes a 10-flag sub-family: xla_tpu_ici_sdc_test_{iterations, packet_size_chunks, buffer_size_chunks, delay_mask, pipeline_depth, max_distance} (all int), xla_tpu_ici_sdc_test_{emit_compact_code, run_on_program_start, inject_mismatch_for_testing_only} (bool), xla_tpu_ici_sdc_test_sflag_wait_timeout_ms (int).
Fusion (101)
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_tpu_rwb_fusion | bool | true (errstr remedy =false) | read-write-buffer fusion |
| xla_tpu_dot_dot_fusion | bool | true (errstr remedy =false) | dot→dot fusion |
| xla_tpu_nested_dot_fusion | bool | false (errstr remedy =true) | nested-dot (PartialReduce) fusion |
| xla_tpu_accumulate_into_mrb | bool | true (errstr remedy =false) | MRB accumulation fusion |
| xla_tpu_allow_deeply_nested_fusion_numerical_diff | bool | — | tolerate deep-fusion numerics |
| xla_tpu_allow_input_fusion_in_certain_reduce_ops | bool | — | reduce-op input fusion |
| xla_tpu_allow_conv_input_fusion_with_downcast_convert | bool | — | conv input fusion w/ downcast |
| xla_tpu_async_collective_fusion_fuse_multiple_collectives | bool | — | multi-collective async fusion |
| xla_tpu_enable_async_collective_fusion_fuse_all_gather | bool | — | AG async collective-fusion fuse |
| xla_tpu_enable_async_collective_fusion_fuse_all_reduce | bool | — | AR async collective-fusion fuse |
| xla_tpu_copy_fusion_minimum_copy_size_in_bytes | int | — | copy-fusion size floor |
| xla_tpu_enable_experimental_fusion_cost_model | bool | — | experimental fusion cost model |
| xla_tpu_fusion_debugger_instrument_inputs | bool | — | fusion-debugger input instrumentation |
MSA / scoped memory (55)
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_tpu_alternate_memory_benefit_scaling_factor_for_large_buffers | float | — | MSA benefit scaling |
| xla_tpu_async_copy_bandwidth_scaling_factor | float | — | async-copy BW model |
| xla_tpu_allocate_scoped_vmem_at_same_offset | int | — | scoped VMEM offset reuse |
| xla_tpu_allocate_scoped_cmem_at_same_offset | int | — | scoped CMEM offset reuse |
| xla_tpu_allow_in_cmem_copy | bool | — | permit copies into CMEM |
| xla_tpu_scoped_cmem_for_all_reduce | bool | — | AR result in scoped CMEM |
| xla_tpu_cmem_max_outstanding_prefetches | int | — | CMEM prefetch cap |
| xla_tpu_cmem_max_overlap_to_mem_size_async_copy_ratio | float | — | CMEM overlap ratio |
| xla_tpu_vmem_use_telamalloc | bool | — | telamalloc VMEM allocator |
| xla_tpu_scoped_vmem_limit_kib | int | — | scoped VMEM byte limit (KiB) |
| xla_tpu_autotune_memory_space_assignment | bool | — | MSA autotune |
SparseCore TC-side (50)
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_tpu_enable_offloading_gather_to_sparsecore | bool | — | gather offload to SC |
| xla_tpu_enable_offloading_scatter_to_sparsecore | bool | — | scatter offload to SC |
| xla_tpu_enable_offloading_copy_to_sparsecore | bool | — | copy offload to SC |
| xla_tpu_enable_offloading_reduce_to_sparsecore | bool | — | reduce offload to SC |
| xla_tpu_enable_sparse_core_reduce_scatter_v2 | bool | false (errstr remedy =true) | SC ND reduce-scatter v2 |
| xla_tpu_enable_sc_log_recorder | bool | false (errstr remedy =true) | SC log recorder |
| xla_tpu_enable_async_sc_call | bool | — | async SC call |
| xla_tpu_embedding_table_oblongness_threshold | int | — (errstr remedy =1) | embedding-table oblongness cutoff |
| xla_tpu_aggregate_data_dependent_sc_ops | bool | — | data-dependent SC aggregation |
Other high-signal xla_tpu_ knobs
- Numerics: xla_tpu_accurate_{exp, exp2, expm1, log1p, log2, logistic, sigshift} (bool), xla_tpu_bf16_emission_mode (enum), xla_tpu_auto_reduce_precision (bool), xla_tpu_experimental_enable_dynamic_int8_quantization (bool).
- Dot/conv: xla_tpu_enable_dot_strength_reduction, xla_tpu_enable_ragged_dot_kernel, xla_tpu_choose_faster_windowed_einsum_over_mem (all bool), xla_tpu_impure_contract_ragged_conv_with (string).
- Layout: xla_tpu_allow_layout_negotiation (bool), xla_tpu_enable_large_2nd_minor_layout, xla_tpu_allow_large_2nd_minor_layout_for_{x16, x8, x4} (int).
- Auto-sharding: xla_tpu_auto_spmd_partitioning_memory_budget_gb (int), xla_tpu_auto_spmd_partitioning_memory_budget_ratio (float), xla_tpu_auto_spmd_partitioning_solver_timeout_seconds (int), xla_tpu_auto_spmd_keep_all_user_shardings (bool).
- Cost-model: xla_tpu_emitter_learned_cost_model_options (string/proto), xla_tpu_enable_instruction_cycle_checking (bool), xla_tpu_hbm_initial_cycle_penalty (int), xla_tpu_impure_cost_model_logging_options (string).
- Debug/memory: xla_tpu_enable_tile_log_recorder (bool; errstr remedy =true, so default false), xla_tpu_impure_oom_fast_exit_threshold (int; errstr remedy =-1 for verbose logging), xla_tpu_always_spill_to_default_memory (bool).
xla_jf_ — Jellyfish XLA Backend (148)
The jf codename is the TPU XLA backend namespace (shared across generations). Lands in TCE. Subsystem split: misc 63, debug/dump 24, memory/alloc 15, fusion 10, MSA 9, ICI 8, dot/conv 7, sharding 4, scheduler 3, SparseCore 2, cost-model 2, numerics 1.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_jf_debug_level | int | — (errstr remedy =2 for stack traces) | JF backend debug verbosity |
| xla_jf_run_verifier | bool | — | run the JF HLO verifier |
| xla_jf_vliw_scheduler | bool | — | JF VLIW scheduler engine |
| xla_jf_critical_path_scheduler | bool | — | critical-path scheduler |
| xla_jf_conv_{input,output,reshape}_fusion | bool | — | conv fusion variants |
| xla_jf_enable_multi_output_fusion | bool | — | multi-output fusion |
| xla_jf_fusion_max_vmem_mib | int | — | fusion VMEM ceiling (MiB) |
| xla_jf_conv_{full_precision,increased_precision} | bool | — | conv precision controls |
| xla_jf_auto_assign_mxu | bool | — | auto MXU assignment |
| xla_jf_use_cost_based_memory_coloring | bool | — | cost-based memory coloring |
| xla_jf_dump_{hlo_text,debug_info,llo_html} | bool | — | JF dump variants |
| xla_jf_dump_isa_program_proto | string | — | ISA-program-proto dump path |
| xla_jf_experimental_{cmem,vmem}_for_hlo_outputs | bool | — | experimental output placement |
| xla_jf_spmd_threshold_for_windowed_einsum_mib | float | — | windowed-einsum SPMD threshold |
xla_ (plain) — Generic XLA / DebugOptions-Backed (121 registered / 138 names)
The non-codename xla_* flags. Unlike the TPU families, these bind to xla::DebugOptions fields via MakeDebugOptionsFlags @ 0x1e66ce80. Subsystem split over all 138 xla_* (plain) name strings (sums to 138, not the 121 registered subset): misc 32, ICI 23, scheduler 21, SparseCore 17, memory 15, MSA 14, debug 10, others 6.
| Flag | Type | Default | Purpose |
|---|---|---|---|
| xla_enable_megacore_hbm_spill | bool | false (errstr remedy =true, untested) | enable megacore HBM spill |
| xla_enable_cross_program_prefetch | bool | — | cross-program prefetch gate |
| xla_default_cross_program_prefetch_heuristic | bool | — | CPP heuristic default |
| xla_enable_async_{all_gather,all_reduce,collective_permute} | bool | — | async collective gates |
| xla_enable_async_reduce_scatter_fusion | bool | — | async RS fusion |
| xla_{all_gather,all_reduce,all_to_all}_latency_bound_threshold_in_bytes | float | — | latency-bound thresholds |
| xla_all_gather_combiner_threshold_count | float | — | AG combiner threshold |
| xla_enable_all_gather_{2d,3d}_emitter | bool | — | dimensional AG emitters |
| xla_hlo_scheduling_brkga_{computation_limit,generation_limit} | int | — | HLO BRKGA scheduler tuning |
| xla_latency_hiding_scheduler_rerun | int | — | LHS rerun count |
| xla_hbm_logging_buffer_size_bytes | int | — | HBM logging buffer size |
| xla_enable_post_msa_sync_slice_fusion | bool | — | post-MSA sync-slice fusion |
| xla_hlo_parse_memory_schedule_from_file | string | — | external memory schedule path |
GOTCHA — the classic XLA dump/HLO knobs (xla_dump_to, xla_hlo_profile, xla_dump_hlo_as_proto, xla_step_marker_location, xla_disable_hlo_passes) appear as xla.DebugOptions fields and as .rodata strings, but they are not registered absl::Flag globals in this build. A direct cross-match of all 290 DebugOptions field names against the registered-flag set finds exactly two overlaps: xla_tpu_detect_nan (DebugOptions field 135) and xla_tpu_detect_inf (field 136). Every other dump/HLO knob is settable only through the PJRT CompileOptions.debug_options proto path, never through LIBTPU_INIT_ARGS. A reimplementer who exposes --xla_dump_to= as a libtpu command-line flag is wrong about this build.
SparseCore & Embedding — xla_sc_ (92), barna_core_ (61)
xla_sc_* are the SparseCore-compiler LLVM-backend knobs (lands in TCE); barna_core_* are the BarnaCore embedding-engine runtime knobs (standalone).
xla_sc_ representative subset
| Flag | Type | Purpose |
|---|---|---|
| xla_sc_enable_instruction_fusion | bool | SC instruction fusion |
| xla_sc_enable_latency_hiding_scheduler | bool | SC LHS |
| xla_sc_enable_scheduler_memory_pressure_tracking | bool | SC mem-pressure tracking |
| xla_sc_enable_tile_overlays / _scs_overlays | bool | SC tile/SCS overlays |
| xla_sc_enable_stack_eliding | bool | SC stack eliding |
| xla_sc_enable_hbm_optimization_mode | bool | SC HBM optimization mode |
| xla_sc_detect_nan | bool | SC NaN detection |
| xla_sc_assert_level | enum | SC assert level |
| xla_sc_compiler_backtrace_depth | int | SC backtrace depth |
| xla_sc_elementwise_shape_scaling_factor | float | SC elementwise scaling |
| xla_sc_async_wrapper_fusion_type | enum | SC async-wrapper fusion type |
| xla_sc_dump_{llvm_ir_to,mlir_to,bundles_to} | string | SC IR/MLIR/bundle dump paths |
| xla_sc_use_legacy_embeddings_loop_configs | bool | legacy embedding loop configs |
barna_core_ representative subset
| Flag | Type | Purpose |
|---|---|---|
| barna_core_max_hbm_fraction_for_embeddings | int | HBM fraction cap for embeddings |
| barna_core_hbm_savings_threshold_for_optimized_hbm_packing | float | optimized-packing savings threshold |
| barna_core_fraction_batches_to_process_locally | bool | local-batch processing fraction |
| barna_core_master_partitioner_thread_count | int | partitioner thread count |
| barna_core_hot_id_profiler_top_n_multiple | float | hot-id profiler top-N multiple |
| barna_core_enable_software_deduplication | bool | software dedup |
| barna_core_enable_software_row_sharding | bool | software row sharding |
| barna_core_file_operation_timeout | int | file-op timeout |
| barna_core_embedding_common_config_proto_path | string | embedding-config proto path |
| barna_core_partitioner_optimization_objective | enum | partitioner objective |
MSA Namespaces — xla_msa_ (22), xla_vf_ (16), xla_gf_ (14), xla_ior_ (4), xla_pf_ (1), xla_llo_ (1)
The dedicated memory-space-assignment namespaces. xla_msa_* is the generic MSA option set; xla_vf_* and xla_gf_* are gen-specific VMEM/MSA override sets (the vf / gf codename prefixes) carrying the same knob names scoped to that generation; xla_ior_* is the IOR fast-mem round-trip variant; xla_pf_* is a single ND-allreduce override; xla_llo_* is a single LLO-lifecycle flag.
xla_msa_ — full enumeration (22)
| Flag | Type | Purpose |
|---|---|---|
| xla_msa_enable | bool | MSA master gate |
| xla_msa_max_cross_program_prefetches | int | CPP prefetch cap |
| xla_msa_max_outstanding_evictions | int | eviction cap |
| xla_msa_max_outstanding_prefetches | int | prefetch cap |
| xla_msa_max_repacks | int | repack cap |
| xla_msa_max_retries | int | retry cap |
| xla_msa_{min,preferred}_overlap_to_async_copy_ratio | float | overlap-to-async-copy ratios |
| xla_msa_max_overlap_to_mem_size_async_copy_ratio | float | overlap-to-mem-size ratio |
| xla_msa_enable_cross_program_prefetch_freeing | bool | CPP freeing |
| xla_msa_enable_sync_copy_replacement | bool | sync-copy replacement |
| xla_msa_enable_sync_slice_replacement | bool | sync-slice replacement |
| xla_msa_enable_while_redundant_eviction_elimination | bool | redundant-eviction elimination |
| xla_msa_enable_window_prefetch | bool | window prefetch |
| xla_msa_cross_program_prefetch_permissive_mode | bool | permissive CPP mode |
| xla_msa_default_cross_program_prefetch_heuristic | bool | CPP heuristic default |
| xla_msa_expanded_scoped_alternate_memory_mode | enum | expanded scoped-AM mode |
| xla_msa_use_bundle_aware_cost_model | bool | bundle-aware cost model |
| xla_msa_cost_model_options | string | cost-model config |
| xla_msa_experimental_ior_algorithm | enum | experimental IOR algorithm |
| xla_msa_experimental_use_telamalloc | bool | experimental telamalloc |
| xla_msa_allocate_scoped_memory_at_same_offset | bool | scoped-mem offset reuse |
xla_vf_ (16), xla_gf_ (14), xla_ior_ (4), xla_pf_ (1)
xla_gf_vmem_{max_outstanding_evictions, max_repacks, max_retries} (int), xla_gf_vmem_use_ior_algorithm (enum), xla_gf_vmem_enable_while_redundant_eviction_elimination (bool) — the gen-specific VMEM mirror of the xla_msa_* set; xla_vf_* carries the same vmem_* knob set (16 names, including xla_vf_allow_replicated_vmem_writes and xla_vf_allow_split_vmem). xla_ior_{fast_mem_round_trip_production_msa, fast_mem_run_production_msa, stored_solution_path, use_stored_solution} (4) carry the IOR fast-mem round-trip variant. xla_pf_enable_nd_allreduce (1, bool) is the lone xla_pf_* flag; xla_llo_annotation_lifecycle_strict_mode (1, enum) is the lone xla_llo_* flag.
Runtime / Driver — tpu_ (69), megascale_ (150), tf_ (20), xla_mosaic_ (8)
tpu_ — runtime / compilation-cache / driver (69)
Standalone runtime flags (not compiler knobs). Representative subset:
| Flag | Type | Default | Purpose |
|---|---|---|---|
| tpu_use_tfrt | bool | — (errstr deprecates =false) | use TFRT runtime path |
| tpu_compilation_cache_disable_coordination_service | bool | — | disable cache coordination |
| tpu_persistent_compilation_cache_location | string | — | persistent cache path |
| tpu_persistent_compilation_cache_ttl_secs | int | — | cache TTL |
| tpu_local_compilation_cache_size_bytes | int | — | local cache size |
| tpu_program_cache_eviction_policy | enum | — | cache eviction policy |
| tpu_program_proto_compression | bool | — | proto compression |
| tpu_link_up_check_timeout | int | — | link-up check timeout |
| tpu_driver_callback_watchdog_timeout | int | — | driver-callback watchdog |
| tpu_core_dump_directory | string | — | core-dump directory |
| tpu_hbm_report_enable | bool | — | HBM report toggle |
| tpu_log_allocations_on_oom | bool | — | log allocations on OOM |
| tpu_hlo_breakpoint_debugger_server_port | int | — | HLO breakpoint debugger port |
| DANGEROUS_tpu_runtime_abi_verification_disabled | bool | — | disable ABI verification (dangerous) |
NOTE — the lowercase libtpu_* identifiers (libtpu_init_utils, libtpu_lockfile, libtpu_sdk_*, libtpu_telemetry_*, libtpu_version, the libtpu_lts_20260413_b_ build tag) are not flags — they are translation-unit / module name strings in .rodata. They are excluded from the 2107 catalog.
megascale_ (150) — DCN collective runtime
Top knobs: megascale_num_slices (int), megascale_slice_id (int), megascale_coordinator_address (string), megascale_transport_type (enum), megascale_enable_tpu_premapping (bool), megascale_enable_watchdog (bool), megascale_graph_hang_threshold (int), megascale_heartbeat_{interval,timeout}_ms (int), megascale_error_reporter_abort_on_{error,hang} (bool), megascale_use_heartbeat (bool), megascale_grpc_num_channels (int), megascale_use_mtls_for_grpc (bool), megascale_verify_checksums (bool), megascale_use_numa_aware_threadpool (bool; errstr remedy =false, so default true).
tf_ (20) and xla_mosaic_ (8)
tf_* are the TensorFlow-TPU bridge flags (tf_jf_* and similar). xla_mosaic_* are the Mosaic MLIR custom-kernel dialect flags, including the legacy xla_mosaic_deprecated_allow_implicit_single_buffering.
Defaults — the Certainty Boundary
proto3 carries no descriptor-level defaults, and the per-flag defaults live in xla::DefaultDebugOptions() and the FLAGS_<name> static initializers — both in .text, not recoverable from strings. What does survive is a set of help/error strings that spell a --flag=value clause. Critically, every such surviving clause is a remedy — the value the message tells the user to set when something goes wrong ("use --flag=false in the meantime", "set =true to enable") — so the spelled value is the non-default, and the implied actual default is its opposite. 13 flags carry such a =value remedy clause; the remaining flags' defaults leave no string at all.
| Flag | Type | Remedy =value (errstr) | Implied default |
|---|---|---|---|
| xla_tpu_accumulate_into_mrb | bool | =false ("in the meantime") | true |
| xla_tpu_rwb_fusion | bool | =false (reverted-on-fallback) | true |
| xla_tpu_dot_dot_fusion | bool | =false (if failure persists) | true |
| xla_tpu_nested_dot_fusion | bool | =true ("did you forget to set") | false |
| xla_tpu_scheduling_annotation_deannotate_unsupported_groups | bool | =true (to deannotate gaps) | false |
| xla_tpu_enable_tile_log_recorder | bool | =true (to enable logging) | false |
| xla_tpu_enable_sc_log_recorder | bool | =true (to enable logging) | false |
| xla_tpu_enable_sparse_core_reduce_scatter_v2 | bool | =true (SC ND RS needs) | false |
| xla_tpu_impure_oom_fast_exit_threshold | int | =-1 (more detailed logging) | not string-recoverable |
| xla_tpu_embedding_table_oblongness_threshold | int | =1 (avoid tiled layout) | not string-recoverable |
| xla_enable_megacore_hbm_spill | bool | =true (to activate, untested) | false |
| xla_jf_debug_level | int | =2 (enable stack traces) | not string-recoverable |
| megascale_use_numa_aware_threadpool | bool | =false (to disable) | true |
GOTCHA — none of these are byte-confirmed defaults. Each =value is the value the message tells the user to set (the remedy), which is the non-default; the "Implied default" column is the inferred opposite for the booleans, and is genuinely unrecoverable for the int-valued knobs (where the remedy is a specific tuning value, not a sentinel-vs-default flip). Do not read the remedy value as the default — it is the opposite. The authoritative defaults for every flag require disassembling DefaultDebugOptions() and the FLAGS_* ctors in .text. Four further flags sometimes cited with byte-defaults (xla_tpu_allow_deeply_nested_fusion_numerical_diff, xla_tpu_enable_offloading_{gather,scatter}_to_sparsecore, xla_tpu_fusion_debugger_instrument_inputs) carry no =value string at all and have no string-derivable default.
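The remedy-to-default inference can be sketched as follows. The message texts in the example are hypothetical paraphrases built around the clause fragments quoted in the table, not strings copied from the binary, and the names `Remedy` and `parse_remedy` are illustrative. The rule implemented is the one stated above: a boolean remedy implies the opposite default, and a non-boolean remedy implies nothing.

```python
import re
from typing import NamedTuple, Optional

# Matches a "--flag=value" clause; value is a bare token such as false, -1, 2.
REMEDY_RE = re.compile(r"--([A-Za-z_]\w*)=(-?\w+)")


class Remedy(NamedTuple):
    flag: str
    remedy_value: str               # what the message tells the user to set
    implied_default: Optional[str]  # opposite for booleans, None otherwise


def parse_remedy(message: str) -> Optional[Remedy]:
    """Extract the first --flag=value clause and infer the implied default."""
    match = REMEDY_RE.search(message)
    if match is None:
        return None
    flag, value = match.group(1), match.group(2)
    if value == "true":
        implied = "false"
    elif value == "false":
        implied = "true"
    else:
        implied = None  # a tuning value says nothing about the default
    return Remedy(flag, value, implied)


# Hypothetical message texts, for illustration only.
print(parse_remedy("use --xla_tpu_accumulate_into_mrb=false in the meantime"))
print(parse_remedy("set --xla_tpu_impure_oom_fast_exit_threshold=-1 for more detailed logging"))
```

Even for the booleans the result is an inference, so a consumer of this output should carry the "implied" label through rather than store it as a confirmed default.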
For the type column: the 13 types above are byte-corroborated from =value evidence; the rest are convention-inferred from the flag-name suffix (enable_/use_/allow_ ⇒ bool; _ms/_kib/_count/_size/_n ⇒ int; _ratio/_factor/_fraction ⇒ float; _file/_path/_dir/_proto ⇒ string; _mode/_type/_level ⇒ enum). This is XLA's own registration convention, so it is reliable but not per-flag byte-confirmed. A _threshold suffix may be int or float; _mode/_level may be int-enum or string — those are HIGH, not CERTAIN.
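The naming convention can be written down as a small lookup. This is a sketch of the heuristic as stated in the paragraph above, with two assumptions of its own: suffix rules are tried before the enable/use/allow markers, and the markers are matched as whole underscore-separated tokens (so `fuse` does not count as `use`). Names that match no rule, including the ambiguous `_threshold` case, return None instead of a guess.

```python
from typing import Optional

# Suffix rules from the convention; checked in this (assumed) order.
SUFFIX_RULES = [
    (("_ms", "_kib", "_count", "_size", "_n"), "int"),
    (("_ratio", "_factor", "_fraction"), "float"),
    (("_file", "_path", "_dir", "_proto"), "string"),
    (("_mode", "_type", "_level"), "enum"),
]
# Whole-token markers that suggest a boolean gate.
BOOL_TOKENS = {"enable", "use", "allow"}


def infer_type(flag_name: str) -> Optional[str]:
    """Convention-based type guess for a flag name; None if no rule applies."""
    for suffixes, type_name in SUFFIX_RULES:
        if flag_name.endswith(suffixes):  # str.endswith accepts a tuple
            return type_name
    if BOOL_TOKENS & set(flag_name.split("_")):
        return "bool"
    return None


for name in [
    "xla_tpu_scoped_vmem_limit_kib",
    "xla_tpu_async_copy_bandwidth_scaling_factor",
    "barna_core_embedding_common_config_proto_path",
    "xla_tpu_all_gather_collective_matmul_mode",
    "xla_tpu_use_resilient_collective_emitter",
    "xla_tpu_rwb_fusion",
]:
    print(name, "->", infer_type(name))
```

The heuristic is deliberately conservative: several catalog entries carry a type that a pure name rule would not produce, so its output is a starting guess to be checked against the tables, not a replacement for them.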
Cross-References
- xla_* Flag Atlas — the curated sibling: grouped narrative + per-subsystem deep-dive into the ~100 highest-signal knobs (point here for what a flag does)
- Flag Families — prefix → owner routing: which proto (DebugOptions vs TCE vs standalone) each prefix lands in, live-vs-inert verdict per family
- DebugOptions Proto — the xla.DebugOptions message (290 fields, 17 nested enums), the 2-field flag-wiring overlap (xla_tpu_detect_nan/inf)
- Default DebugOptions — where the per-flag defaults live (DefaultDebugOptions() + FLAGS_* static initializers in .text)
- Registry-Mediated Flags — the AbslFlagHelpGenFor registration mechanism and the MakeDebugOptionsFlags / OverrideTpuCompEnvByCmdLineFlags bind sites
- Flag Prefix Dispatch — the TpuVersion-aware prefix-strip/select mechanism for the codename families
- Environment Variables — LIBTPU_INIT_ARGS, LIBTPU_ON_GCE, TPU_LOAD_LIBRARY, the parse funnel into absl::ParseCommandLine