do_init / do_fini
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, ELF x86-64 DYN, not stripped; demangled C++ symbols quoted verbatim). Other versions differ.
Abstract
This page owns the C++ static-constructor iteration libtpu runs at load and the symmetric destructor teardown it runs at unload — the part of the lifecycle that lives entirely below the PJRT C-ABI and is driven by the C runtime, not by any framework call. When the dynamic linker walks .init_array (INIT_ARRAY @ 0x215f26f0, 2900 slots), it calls every translation-unit static-init function in link order; that storm is what fills libtpu's flag registries, protobuf descriptor pools, LLVM/MLIR backend tables, and — the one piece that matters for correctness — the GoogleInitializer module descriptors and their dependency edges. At unload the same machinery runs in reverse through __cxa_finalize, draining the LIFO list of destructors the constructors registered with __cxa_atexit.
The reference frame is the Itanium C++ ABI as any clang/libstdc++ binary implements it: per-TU _GLOBAL__sub_I_<file>.cc functions populate .init_array; function-local statics are made thread-safe with __cxa_guard_acquire/release/abort; non-trivial-destructor globals register a teardown callback with __cxa_atexit(dtor, obj, &__dso_handle); and __cxa_finalize(__dso_handle) drains that list at dlclose/exit. libtpu is unusual in three ways. First, it ships its own libc++abi — __cxa_guard_* at 0x213e9ac0 / 0x213e9be0 / 0x213e9c20 and __cxa_finalize are libtpu-internal, not the host runtime's, so the guard word layout and the finalize list are private to the image. Second, there is no GCC crtstuff-style __do_global_ctors/__do_global_dtors driver at all — this is an lld-linked clang-CRT image; instead two custom byte-guarded stubs, __do_init @ 0xe63c000 (.init_array slot 761, not slot 0) and __do_fini @ 0xe63c020 (.fini_array slot 0), are themselves ordinary array slots among the rest. Third — the whole reason the design holds together — the constructors only register; they run nothing order-critical. Every cross-TU ordering hazard is deferred to the GoogleInitializer DAG, which runs later in topological order (PHASE B, at first PJRT_Plugin_Initialize), so the link order the constructors execute in never decides correctness.
The page is laid out as the two halves of the runtime's lifetime: §1 the load-time constructor walk (what .init_array calls, what the _GLOBAL__sub_I_* set constructs, the __cxa_guard singleton discipline, the ordering guarantees and where they are weak), and §2 the unload-time teardown (__do_fini, the __cxa_finalize(__dso_handle) LIFO drain, and what is not torn down because it is leaked-on-exit). The ELF DT_INIT/DT_FINI/PREINIT_ARRAY tags and the linker's init_proc trampoline live on elf-entry-and-init-proc.md; the GoogleInitializer module DAG and the PJRT_Plugin_Initialize bootstrap live on module-init-plugin-discovery.md and tftpu-initialize-bootstrap.md.
For reimplementation, the static-init/fini contract is:
- The constructor walk model — .init_array is an array of function pointers the linker calls in array order; slot 0 is __cpu_indicator_init (CPU/IFUNC detection), __do_init is a guarded stub at slot 761, and the bulk are per-TU _GLOBAL__sub_I_* functions. Each constructs file-scope globals and, for any with a non-trivial destructor, registers teardown with __cxa_atexit(dtor, obj, &__dso_handle).
- The register-only discipline — none of the ~2900 constructors runs order-critical TPU bring-up. They fill tables and build the GoogleInitializer registry; the order-sensitive work is deferred to the DAG. A reimplementer who runs HAL/platform setup inside a static ctor reintroduces the static-init-order fiasco the design exists to avoid.
- The two byte guards — __do_init's __do_init.__initialized and __do_fini's __do_fini.__finalized make the array-bracketing slots idempotent; the per-singleton __cxa_guard bytes make each function-local static one-shot and thread-safe.
- The symmetric teardown — __do_fini calls __cxa_finalize(__dso_handle), which drains the __cxa_atexit list LIFO, calling each registered destructor once. The PJRT surface is deliberately not on this list — it is a leaked Meyers singleton.
| Constructor array | INIT_ARRAY @ 0x215f26f0 — 23200 B = 2900 × 8, all R_X86_64_RELATIVE |
| Constructor entry stub | __do_init @ 0xe63c000 — guard byte __do_init.__initialized @ 0x224c3880 |
| Destructor entry stub | __do_fini @ 0xe63c020 — guard byte __do_fini.__finalized @ 0x224c3881 |
| Per-TU ctor symbols | _GLOBAL__sub_I_<file>.cc/.cpp — 1885 distinct symbols (1764 distinct base names; 71 names recur across statically-linked components) |
| Grouped / single-var ctors | _GLOBAL__I_NNNNNN (759), __cxx_global_var_init[.N] (89 distinct names / 221 instances) |
| Function-local-static guards | libtpu's own __cxa_guard_acquire/release/abort @ 0x213e9ac0 / 0x213e9be0 / 0x213e9c20 |
| Teardown drain | __cxa_finalize(__dso_handle) (libtpu's own libc++abi) |
| Teardown array | FINI_ARRAY @ 0x215f8190 — 16 B = 2 × 8: __do_fini, rand_thread_state_clear_all @ 0x2063df60 |
| Confidence | CONFIRMED (byte-anchored vs decompile) unless a row or callout says otherwise |
1. Load-Time — the static-constructor walk
Purpose
.init_array is the Itanium-ABI mechanism by which every translation unit's file-scope dynamic initialization runs at dlopen. The dynamic linker reads DT_INIT_ARRAY/DT_INIT_ARRAYSZ (the section at 0x215f26f0, 2900 entries), then calls each pointer in array order after DT_INIT and PREINIT_ARRAY. For libtpu this is where the entire register-only landscape is built: 2900 constructor calls construct globals across abseil, protobuf, LLVM, MLIR, the TPU runtime, and the GoogleInitializer registry. It is the single largest event at load, and it is deliberately shallow — registration only.
Entry Point
dynamic linker (DT_INIT_ARRAY walk) ── one call per slot, array order
 └─ INIT_ARRAY @ 0x215f26f0 (2900 slots, all R_X86_64_RELATIVE)
      [0]   __cpu_indicator_init           0x21211240 ── clang/GCC ifunc + CPU-feature detector (first; addend of slot-0 reloc @ 0x215f26f0)
      [1]   ARGV_INIT_ARRAY::init_wrapper  0x20a0d2b0 ── Rust runtime argv capture
      ...   (slots 2..760 — early CRT/IFUNC + grouped C++ ctors) ...
      [761] __do_init                      0xe63c000  ── guarded array-bracket stub (no work; reloc @ 0x215f3eb8)
      ...   (the long tail) ...
      [n]   _GLOBAL__sub_I_<file>.cc  × 1885 ── per-TU static init (1764 distinct base names)
            _GLOBAL__I_NNNNNN         × 759  ── grouped / priority-tagged ctors
            __cxx_global_var_init[.N] × 221  ── single-global inits (89 distinct names)
NOTE — the relocation step is part of elf-entry-and-init-proc.md: all 2900 in-file slots are zero on disk, and every slot is an R_X86_64_RELATIVE the linker fills with the target VA before the walk begins. This page assumes the array is already relocated and concerns itself only with what the targets do when called.
Algorithm
The linker's loop is trivial; the interesting logic is inside each per-TU function. A representative _GLOBAL__sub_I_* constructs its file-scope globals and, for any global with a non-trivial destructor, registers a teardown callback — the symmetric half that §2 drains. The _GLOBAL__sub_I_tpu_platform_registration.cc @ 0x2121f040 body is the canonical "register-only" shape: it constructs a single GoogleInitializer object and returns.
// the linker's array walk (conceptual; not a libtpu function)
function run_init_array():
    for i in 0 .. DT_INIT_ARRAYSZ/8 - 1:      // 2900 iterations
        (*INIT_ARRAY[i])()                    // call in ARRAY order

// __do_init @ 0xe63c000 — the guarded array-bracket stub
function __do_init():
    if !__do_init.__initialized:              // function-static byte guard
        __do_init.__initialized = 1           // set-and-return; NO ctor body
    // intentionally empty: real ctors are the OTHER array slots

// representative per-TU ctor — register-only, no order-critical work
// _GLOBAL__sub_I_tpu_platform_registration.cc @ 0x2121f040
function _GLOBAL__sub_I_tpu_platform_registration():
    // construct ONE file-scope global: a GoogleInitializer descriptor
    GoogleInitializer(                              // ctor @ 0x210b2780
        &google_initializer_module_tpu_platform,    // the .data object
        "module", "tpu_platform",                   // tag + module NAME
        &google_init_module_tpu_platform)           // fn to RUN LATER (PHASE B)
    // (other registration TUs additionally register FLAGS_tf_jf_* absl flags)
    // the module body does NOT run here — only the descriptor is built

// the destructor-registration pattern every non-trivial global ctor uses
// (observed pervasively across the decompiled _GLOBAL__sub_I_* bodies)
function some_ctor_with_destructible_global():
    construct_in_place(&GetThing()::thing)    // placement-new the global
    __cxa_atexit(&Thing::~Thing,              // teardown callback ...
                 &GetThing()::thing,          // ... bound to the object ...
                 &__dso_handle)               // ... tagged to THIS image
    // __cxa_atexit pushes onto libtpu's own __cxa_finalize LIFO list (§2)
QUIRK — __do_init does nothing but flip a byte. It is not the dispatcher that calls the other constructors; a GCC crtstuff-style __do_global_ctors_aux would be. Here __do_init is simply one more entry in .init_array (slot 761), sitting alongside the 1885 real _GLOBAL__sub_I_* slots. Its only job is to be idempotent: if the array were ever walked twice (re-dlopen of an already-mapped image), the guard makes the second walk of this slot a no-op. The real per-TU constructors carry their own __cxa_guard bytes for the same reason. A reimplementer who treats __do_init as the constructor driver will look for ctor calls inside it and find none — that is correct, not a decompiler failure.
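The point that the guarded stub is a slot rather than a driver can be modeled in a few lines. The following is a minimal sketch, not libtpu code: the namespace, the two register_* functions, and the counters are hypothetical stand-ins chosen for illustration, and the array has three slots instead of 2900.

```cpp
#include <cassert>

namespace init_sketch {

using InitFn = void (*)();

int registrations = 0;             // work done by the ordinary (unguarded) slots
bool do_init_initialized = false;  // models the __do_init.__initialized byte
int guard_flips = 0;               // how many times the stub found the byte clear

// Models the guarded stub: it flips a byte and calls nothing.
void do_init_stub() {
  if (!do_init_initialized) {
    do_init_initialized = true;
    ++guard_flips;
  }
}

void register_flags() { ++registrations; }   // hypothetical register-only slot
void register_module() { ++registrations; }  // hypothetical register-only slot

// The stub is one slot among the others, not their driver.
const InitFn kInitArray[] = {register_flags, do_init_stub, register_module};

// Models the loader's walk: every slot, in array order.
void run_init_array() {
  for (InitFn fn : kInitArray) fn();
}

}  // namespace init_sketch
```

Walking the array twice leaves guard_flips at 1 while registrations doubles: the stub's byte protects only the stub, so any other slot that must be one-shot needs its own guard.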
What the _GLOBAL__sub_I_* set constructs
The 1885 _GLOBAL__sub_I_* constructors are not a flat list to enumerate — that is the anti-pattern. They are better understood by the kinds of registries they populate, all of which share one property: they register into a table or build a descriptor, and run no hardware or order-critical setup. The table below buckets the constructor set by what each TU's globals do, with the count of TUs matching each bucket (a keyword scan over the _GLOBAL__sub_I_*.cc/.cpp symbol set; buckets overlap, so they do not sum to 1885).
| Constructor bucket | What its globals register | Distinct TUs |
|---|---|---|
| TPU/XLA/TSL runtime | module descriptors, factory tables, runtime flags (the largest area) | ~162 (*tpu*) |
| LLVM target backends | *TargetMachine.cpp, *AsmPrinter.cpp, *ISelLowering.cpp, *Subtarget.cpp, *CodeGen* — RegisterTarget/RegisterPass into LLVM's global registries (X86, AArch64, AMDGPU, ARM, TPU) | ~51 |
| GoogleInitializer module descriptors | the _GLOBAL__sub_I_*_registration.cc set — bind module NAME → google_init_module_* fn + dependency edges | ~41 (*registration*/*register*) |
| abseil flag registries | _GLOBAL__sub_I_absl_flags.cc, commandlineflags.cc, *_flags.cc — FLAGS_* into the absl flag registry | ~28 (*[Ff]lags*) |
| MLIR / HLO dialects + passes | mhlo, stablehlo, mlir_bridge_pass, dialect/pass registrations | ~14 |
| Metrics / telemetry | gauge/monitor/metric registries | ~19 |
| protobuf / upb descriptors | proto descriptor pools + the linkarr_upb_AllExts mini-table extension array (0x224c2480..0x224c2920) | ~11 named + linker array |
NOTE — the GoogleInitializer-descriptor bucket (~41 *registration* TUs) is the only one whose registrations are order-critical at run time, and it is precisely the one whose execution is deferred. The _GLOBAL__sub_I_*_registration.cc ctors run at load (building descriptors), but the google_init_module_* functions they point at run later, in the DAG, at first PJRT_Plugin_Initialize. See module-init-plugin-discovery.md for the descriptor → run mapping.
Note: the symbol table (nm -C libtpu.so) carries 1885 distinct _GLOBAL__sub_I_* symbols, each at its own address, and 759 _GLOBAL__I_* symbols. The 1885 figure is distinct symbols; the 1764 figure is distinct base names — 71 names recur because the same source filename is statically linked from multiple components (metrics.cc appears 8×; trace_codec_factory.cc / performance_counters.cc / kernel_firmware_factory.cc / hardware_attributes_factory.cc 6× each), and each recurrence is a genuinely distinct TU initializer at a distinct address. Ground these counts in the deduped nm symbol table, not in a grep over a decompile tree (which inflates duplicate names). The 2900 total slot count is byte-anchored from DT_INIT_ARRAYSZ (0x5aa0 / 8). The full census is on ../forensics/static-init.md.
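The byte-anchored figures can be cross-checked with plain arithmetic. This is a small sketch using only numbers quoted on this page; the constant and function names are illustrative, and the adjacency of the two arrays is simply what the quoted addresses imply, not a separately verified claim.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted on this page; the names are illustrative, not libtpu symbols.
constexpr std::uint64_t kInitArrayVa   = 0x215f26f0;
constexpr std::uint64_t kInitArraySize = 0x5aa0;  // DT_INIT_ARRAYSZ in bytes
constexpr std::uint64_t kFiniArrayVa   = 0x215f8190;
constexpr std::uint64_t kSlotBytes     = 8;       // one 64-bit pointer per slot

// Number of pointer slots in an array of the given byte size.
constexpr std::uint64_t slot_count(std::uint64_t bytes) { return bytes / kSlotBytes; }

// Virtual address of slot `index` in an array starting at `base`.
constexpr std::uint64_t slot_va(std::uint64_t base, std::uint64_t index) {
  return base + index * kSlotBytes;
}

static_assert(kInitArraySize == 23200, "0x5aa0 is 23200 bytes");
static_assert(slot_count(kInitArraySize) == 2900, "23200 / 8 slots");
static_assert(slot_va(kInitArrayVa, 761) == 0x215f3eb8, "slot 761 reloc address");
static_assert(kInitArrayVa + kInitArraySize == kFiniArrayVa,
              "quoted addresses place FINI_ARRAY right after INIT_ARRAY");
```

The same helper gives the slot address for any index, which is handy when matching a relocation entry back to its position in the array.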
The __cxa_guard singleton discipline
Function-local statics with non-trivial initialization (Meyers singletons) are made thread-safe and one-shot by a per-static guard word and the __cxa_guard_acquire/release/abort triple. libtpu links its own libc++abi implementation of these at 0x213e9ac0 / 0x213e9be0 / 0x213e9c20 rather than using the host C++ runtime's (confirmed by decompilation). The implementation is the standard libc++abi futex-backed guard: it CAS-installs an "in-progress" state, blocks contending threads on a futex syscall, and detects recursive initialization.
// __cxa_guard_acquire @ 0x213e9ac0 (libtpu's own libc++abi) — abbreviated
function __cxa_guard_acquire(guard):
    if google_cxa_guard_acquire_begin: google_cxa_guard_acquire_begin(guard)  // hook
    if (guard->byte[0]) return 0                          // already initialized → skip
    prev = CAS8(&guard->byte[1], /*expect*/0, /*set*/2)   // try to claim "in-progress"
    if prev != 0:                                         // someone else is initializing
        loop:
            if prev == 1: return 0                        // became initialized → skip
            tid = syscall(186 /*gettid*/)
            if guard->owner_tid == tid:                   // SAME thread re-entered
                __abort_message("__cxa_guard_acquire detected recursive "
                                "initialization: ...")    // recursion → abort
            mark waiter bit; syscall(202 /*futex*/ wait)  // block until released
            prev = CAS8(&guard->byte[1], 0, 2)            // re-try claim on wake
    guard->owner_tid = syscall(186 /*gettid*/)            // record owner
    return 1                                              // caller runs the initializer
GOTCHA — these are not the host C++ runtime's __cxa_guard_*. libtpu carries its own libc++abi (same image that carries its own __cxa_finalize and __cxa_atexit), so the guard word is libtpu's two-byte {initialized, in-progress} layout and the recursion check uses gettid directly via syscall(186). A reimplementer linking against the host C++ runtime's guard (libstdc++ or a system libc++abi) would get a different guard-word ABI; mixing the two on the same static is undefined. The 17 GetTpuPjrtApi guards (Stage 2 of the lifecycle) and every Meyers singleton in the TPU runtime use this implementation — see get-pjrt-api-thunk.md for the 17-guard chain.
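The acquire/release/abort protocol can be approximated in portable C++. The sketch below is a simplified model under stated assumptions: the struct layout and state values only loosely follow the two-byte description above, waiting is done with std::this_thread::yield instead of a futex, and same-thread recursion is not detected (a recursive caller would spin where the real implementation aborts).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Illustrative guard word; field names and encodings are assumptions.
struct Guard {
  std::atomic<std::uint8_t> initialized{0};  // byte 0: set once init completes
  std::atomic<std::uint8_t> state{0};        // byte 1: 0 idle, 1 done, 2 in progress
};

// Returns 1 if the caller must run the initializer, 0 if it is already done.
int guard_acquire(Guard& g) {
  if (g.initialized.load(std::memory_order_acquire)) return 0;  // fast path
  for (;;) {
    std::uint8_t expected = 0;
    if (g.state.compare_exchange_strong(expected, 2, std::memory_order_acq_rel))
      return 1;                   // claimed the "in progress" state
    if (expected == 1) return 0;  // another thread finished the initialization
    std::this_thread::yield();    // stand-in for the futex wait
  }
}

// Publishes completion after the initializer returned normally.
void guard_release(Guard& g) {
  g.initialized.store(1, std::memory_order_release);
  g.state.store(1, std::memory_order_release);
}

// Rolls the claim back after a failed initializer so a later caller can retry.
void guard_abort(Guard& g) { g.state.store(0, std::memory_order_release); }
```

A caller wraps its initializer as: if guard_acquire returns 1, construct the object and call guard_release, or call guard_abort if construction fails; if it returns 0, the object is already usable.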
Ordering guarantees
The constructor walk gives exactly one ordering guarantee and no more: .init_array entries run in array order, which is the link order of their translation units. There is no cross-TU dependency ordering — if TU A's global depends on TU B's global being constructed, the only thing that makes it work is that the linker happened to place B before A. This is the classic static-initialization-order fiasco, and libtpu's design choice is to not rely on it for anything order-critical:
- Within a TU, declaration order is honored (standard C++).
- Across TUs, only link order is guaranteed.
- __cpu_indicator_init (the clang/GCC ifunc + CPU-feature detector, slot 0) and the Rust ARGV_INIT_ARRAY::init_wrapper (slot 1) are placed first by the linker because nothing C++ may run before CPU-feature detection; the __do_init guard stub sits at slot 761, and the per-TU _GLOBAL__sub_I_* constructors fill the long tail.
- For the order-critical TPU stack (HAL factories, XLA targets, the StreamExecutor platform), the constructors register a GoogleInitializer descriptor with explicit dependency edges and defer execution to the DAG. The DAG runs in topological order at PHASE B regardless of static-ctor order. This is why a tpu_hal_jxc_hardware_impl module can depend on tpu_hal without any constraint on the link order of their _GLOBAL__sub_I_*_registration.cc files.
QUIRK — the reimplementation-critical inversion: the things you would expect to be order-critical (platform/HAL/target bring-up) are the things explicitly removed from static-init ordering, and the things that genuinely run at load (flag tables, descriptor pools, LLVM/MLIR registries) are order-insensitive by construction — each registers into an independent table keyed by name/ID, so the order they register in does not change the result. The design has hollowed out the static-init phase precisely so that its one weak guarantee (link order) never has to be relied upon.
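The register-now, run-later split can be sketched as a small registry with a deferred topological run. This is a hypothetical model: ModuleRegistry, add, and run_all are names invented for illustration and are not the GoogleInitializer API; only the two module names are taken from this page.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class ModuleRegistry {
 public:
  // Called from static constructors: records a descriptor, runs nothing.
  void add(std::string name, std::vector<std::string> deps,
           std::function<void()> body = nullptr) {
    registered_.push_back(name);
    modules_[std::move(name)] = Module{std::move(deps), std::move(body)};
  }

  // Called once, later: runs each module after all of its dependencies.
  // Module bodies are assumed not to register further modules.
  std::vector<std::string> run_all() {
    std::vector<std::string> order;
    std::set<std::string> done, active;
    for (const std::string& name : registered_) visit(name, done, active, order);
    return order;
  }

 private:
  struct Module {
    std::vector<std::string> deps;
    std::function<void()> body;
  };

  // Depth-first post-order walk: dependencies first, then the module itself.
  void visit(const std::string& name, std::set<std::string>& done,
             std::set<std::string>& active, std::vector<std::string>& order) {
    if (done.count(name)) return;
    if (!active.insert(name).second) throw std::runtime_error("cycle at " + name);
    const Module& m = modules_.at(name);  // throws if a dependency was never registered
    for (const std::string& dep : m.deps) visit(dep, done, active, order);
    active.erase(name);
    done.insert(name);
    if (m.body) m.body();
    order.push_back(name);
  }

  std::vector<std::string> registered_;    // registration (link) order
  std::map<std::string, Module> modules_;  // descriptors keyed by module name
};
```

Registering tpu_hal_jxc_hardware_impl before tpu_hal still runs tpu_hal first, which is the property that makes the registration order irrelevant.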
2. Unload-Time — __do_fini and the __cxa_finalize drain
Purpose
At dlclose or process exit, the C runtime must run the destructors that were registered during the constructor walk. The Itanium ABI mechanism is symmetric to __cxa_atexit: __cxa_finalize(dso_handle) drains the registered-destructor list in LIFO order, calling each callback registered against this DSO exactly once, then clears them. libtpu drives this through FINI_ARRAY, whose first slot is the guarded __do_fini stub. The teardown is deliberately thin — the constructors registered far fewer destructors than they ran constructors, because the largest objects (the PJRT surface, the extension chain) are intentionally leaked.
Entry Point
dynamic linker (DT_FINI_ARRAY walk, reverse of init)
 └─ DT_FINI (.fini @ 0xe63553c) ── empty stub (sub/add/ret)
 └─ FINI_ARRAY @ 0x215f8190 (2 slots, R_X86_64_RELATIVE)
      [0] __do_fini                    0xe63c020  ── guarded __cxa_finalize(__dso_handle)
      [1] rand_thread_state_clear_all  0x2063df60 ── per-thread BoringSSL/RNG cleanup
NOTE — the array bracket is asymmetric to init in one way: FINI_ARRAY has only 2 slots versus INIT_ARRAY's 2900, because per-TU teardown is not a _GLOBAL__sub_D_* array. Instead, every destructor was registered dynamically with __cxa_atexit during the constructor walk, and the single __do_fini slot drains all of them through __cxa_finalize. The linker walks FINI_ARRAY in reverse slot order, so rand_thread_state_clear_all (slot 1) runs before __do_fini (slot 0).
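The reverse walk and the one-shot drain slot can be modeled together. This is an illustrative sketch: the function names echo this page, but their bodies only append to a trace, and the drain is a placeholder for the real __cxa_finalize call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace fini_sketch {

using FiniFn = void (*)();

// Records the order in which the modeled slots actually did work.
std::vector<std::string>& trace() {
  static std::vector<std::string> t;
  return t;
}

bool finalized = false;  // models the __do_fini.__finalized byte

// Slot 0: guarded drain; the guard is set before the drain starts.
void do_fini() {
  if (finalized) return;
  finalized = true;
  trace().push_back("drain");  // placeholder for the destructor-list drain
}

// Slot 1: stand-in for the per-thread state cleanup.
void clear_thread_state() { trace().push_back("thread-state"); }

const FiniFn kFiniArray[] = {do_fini, clear_thread_state};

// Models the unload-time walk: highest slot first.
void run_fini_array() {
  for (std::size_t i = sizeof(kFiniArray) / sizeof(kFiniArray[0]); i-- > 0;)
    kFiniArray[i]();
}

}  // namespace fini_sketch
```

After one walk the trace reads thread-state, then drain; a second walk adds only another thread-state entry because the guarded slot has already run.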
Algorithm
__do_fini is the mirror of __do_init: a guard byte plus, this time, a real call. The guard makes the drain one-shot; the body calls __cxa_finalize(__dso_handle) guarded by a weak-symbol presence check.
// __do_fini @ 0xe63c020 — the guarded teardown stub (byte-exact from decompile)
function __do_fini():
    int result                                    // uninitialized return (see GOTCHA)
    if !__do_fini.__finalized:                    // function-static byte guard
        __do_fini.__finalized = 1                 // set BEFORE the call → reentrancy-safe
        if &__cxa_finalize:                       // weak-symbol presence check
            return __cxa_finalize(__dso_handle)   // drain THIS image's atexit LIFO
    return result

// __cxa_finalize(dso) — libtpu's own libc++abi (conceptual, standard Itanium drain)
function __cxa_finalize(dso):
    // walk the __cxa_atexit list NEWEST-FIRST (LIFO)
    for entry in reverse(cxa_atexit_list):
        if entry.dtor != NULL and                        // skip already-consumed entries
           (dso == NULL or entry.dso_handle == dso):     // only THIS image's dtors
            d = entry.dtor; entry.dtor = NULL            // mark consumed (one-shot)
            d(entry.obj)                                 // run the destructor
    // entries are cleared so a second finalize is a no-op
GOTCHA — the int result in __do_fini is read uninitialized on the already-finalized path (__do_fini.__finalized already 1) and on the no-__cxa_finalize path. This is a decompiler artifact of a void-semantics tail-call function whose return register is simply not written on those paths — the caller (the linker's fini walk) ignores the return value, so the garbage eax is harmless. A reimplementer should model __do_fini as returning void; do not propagate the spurious int.
QUIRK — __do_fini sets its guard before it drains: it writes __do_fini.__finalized = 1 before calling __cxa_finalize, so if a registered destructor (running inside __cxa_finalize) somehow re-enters __do_fini, the guard is already set and the re-entry is a no-op. The if (&__cxa_finalize) weak-symbol check is the standard crtstuff guard for the case where the image was linked without a finalizer; in libtpu the symbol is always present (libtpu carries its own), so the branch is effectively always taken. Both details mirror GCC crtstuff's __do_global_dtors_aux, but the __cxa_finalize here is libtpu's internal one, draining libtpu's private atexit list.
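The drain semantics (newest first, filtered by image, each entry consumed once) can be written as a small list class. This is a conceptual sketch, not libtpu's or any libc's implementation; AtexitList, push, finalize, and the logging callback are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One registered destructor; dtor becomes nullptr once the entry is consumed.
struct AtexitEntry {
  void (*dtor)(void*);
  void* obj;
  const void* dso;  // identifies the image that registered the entry
};

class AtexitList {
 public:
  // Models the registration call: append to the list, return 0 for success.
  int push(void (*dtor)(void*), void* obj, const void* dso) {
    entries_.push_back({dtor, obj, dso});
    return 0;
  }

  // Models the drain: newest first, matching image only (nullptr matches all).
  void finalize(const void* dso) {
    for (std::size_t i = entries_.size(); i-- > 0;) {
      if (entries_[i].dtor == nullptr) continue;                // already consumed
      if (dso != nullptr && entries_[i].dso != dso) continue;   // other image
      void (*d)(void*) = entries_[i].dtor;
      void* obj = entries_[i].obj;
      entries_[i].dtor = nullptr;  // mark consumed before the call
      d(obj);
    }
  }

 private:
  std::vector<AtexitEntry> entries_;
};

// Example callback: logs the integer it was bound to.
std::vector<int>& drain_log() {
  static std::vector<int> log;
  return log;
}
void log_dtor(void* obj) { drain_log().push_back(*static_cast<int*>(obj)); }
```

Marking the entry consumed before invoking it means a destructor that triggers another drain cannot run itself twice, the same ordering the guarded stub uses for its byte.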
What is NOT torn down
The teardown is thin by design. The largest and most expensive objects built during the lifecycle are leaked-on-exit function-local statics, the normal Meyers-singleton lifetime for a plugin .so — they are never registered with __cxa_atexit, so __cxa_finalize never touches them:
| Object | Storage | Torn down at exit? | How it is actually released |
|---|---|---|---|
| GetTpuPjrtApi()::pjrt_api (140-slot table) | .lbss @ 0x227BA840 | No — leaked | never; process death reclaims it |
| The 16 .bss extension nodes | .bss @ 0x224c3880+ | No — leaked | never; process death reclaims it |
| xla::PjRtClient / TpuPlatform / executors | heap | No at exit | explicit PJRT_*_Destroy C-API calls |
| Per-thread BoringSSL/RNG state | TLS | Yes | rand_thread_state_clear_all @ 0x2063df60 (FINI slot 1) |
| Globals with non-trivial dtors (caches, flag stores, APFloat constants, StringMaps) | .data/.bss | Yes | __cxa_finalize LIFO drain (FINI slot 0 → __do_fini) |
libtpu also provides its own atexit / __cxa_thread_atexit shims (0x21217360 / 0x2120f1e0) and a threadlogger::FlushLogsAtExit @ 0x20f3dfe0 for log flushing at exit. These feed the same LIFO list that __cxa_finalize drains.
GOTCHA — a reimplementer cannot assume dlclose frees the PJRT surface. The PJRT_Api table at 0x227BA840 and the extension chain are leaked Meyers singletons: their destructors were never registered, so __cxa_finalize does not call them, and a host that dlopens, uses, dlcloses, and re-dlopens libtpu in the same process will find the table already built on the second load (the __cxa_guard bytes in .bss survive because the image stays resident under PJRT's reference-counted plugin lifetime). Clients, executables, and buffers must be released through their explicit PJRT_*_Destroy calls before unload; nothing at fini does it for you.
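One way a function-local static ends up off the teardown list is for its type to be trivially destructible: the compiler still guards its one-time initialization but has no destructor to register. The sketch below shows that pattern with a hypothetical four-slot ApiTable; it is an assumption for illustration and does not establish how the PJRT table itself is declared.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical function-pointer table with no user-declared destructor.
struct ApiTable {
  int (*slots[4])(void);
};

// Nothing to destroy, so no exit-time callback is needed for this type.
static_assert(std::is_trivially_destructible<ApiTable>::value,
              "trivially destructible: no teardown callback to register");

int answer() { return 42; }  // illustrative entry point

// Function-local static: built once on first call, never torn down.
ApiTable& api_table() {
  static ApiTable table = [] {
    ApiTable t{};           // value-initialized: every slot starts null
    t.slots[0] = &answer;   // fill in the one entry this sketch provides
    return t;
  }();
  return table;
}
```

Every call returns the same object, and because nothing is queued for exit, the storage is reclaimed only when the process ends or the image is actually unmapped.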
Related Components
| Component | Relationship |
|---|---|
| INIT_ARRAY @ 0x215f26f0 | The 2900-slot constructor array the linker walks at dlopen |
| __do_init @ 0xe63c000 | Guarded array-bracket stub at .init_array slot 761, sets __do_init.__initialized |
| _GLOBAL__sub_I_* (1885 symbols) | The per-TU static-init functions that do the actual registration |
| GoogleInitializer ctor @ 0x210b2780 | Constructed by *_registration.cc ctors; binds module name → run-later fn |
| __cxa_guard_acquire/release/abort @ 0x213e9ac0 / 0x213e9be0 / 0x213e9c20 | libtpu's own libc++abi function-local-static guards |
| __cxa_atexit / __dso_handle | The destructor-registration call every non-trivial global ctor emits |
| FINI_ARRAY @ 0x215f8190 | The 2-slot teardown array (__do_fini, rand_thread_state_clear_all) |
| __do_fini @ 0xe63c020 | Guarded __cxa_finalize(__dso_handle) — drains the atexit LIFO |
| rand_thread_state_clear_all @ 0x2063df60 | FINI slot 1 — per-thread BoringSSL/RNG cleanup |
Cross-References
- overview.md — the full load-to-unload timeline; this page owns Stage 0's constructor walk and Stage 5's teardown
- elf-entry-and-init-proc.md — the ELF DT_INIT/DT_FINI/PREINIT_ARRAY tags, array relocation, and the init_proc CRT trampoline that drives .init_array
- module-init-plugin-discovery.md — what the *_registration.cc ctors register (the GoogleInitializer descriptors) and how the DAG runs them at PHASE B
- tftpu-initialize-bootstrap.md — Stage 3, where the deferred google_init_module_* functions the constructors only registered finally execute
- get-pjrt-api-thunk.md — the 17 __cxa_guard Meyers-singleton builders that use the same libc++abi guard implementation documented here