Compilation Cache
Addresses, build-id, and symbol names apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions differ; treat every VA as version-pinned.
Abstract
Compiling a TPU program is expensive — the full HLO→MLIR→LLO descent, MSA, scheduling, and bundle packing run end to end (see compile-phases.md). libtpu therefore never recompiles a program it has already built if it can recognize the request as identical. Recognition is a cache-key problem: the runtime must reduce the entire compilation request — the program text, the compiler flags, and the physical target — to a short comparable token, look that token up, and on a miss compile once and store the result under it. This page owns that machinery: what hashes into the key, the lookup/store paths, and the cache-entry lifecycle. The serialized executable that ends up inside a hit entry — the TpuExecutable / TpuProgram wire format — is owned by tpu-program-serialization.md and is treated here as an opaque payload.
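The recognize-then-reuse loop described above can be sketched in a few lines. This is a minimal illustration, not libtpu code: the `Request` tuple, `CompileOnceCache`, and the `repr`-based key are all hypothetical stand-ins for the real request type and fingerprint.

```python
from typing import Callable, Dict, Tuple

# Hypothetical request shape: (program_text, flags, target). Not a libtpu type.
Request = Tuple[str, Tuple[str, ...], str]


class CompileOnceCache:
    """Reduce a request to a key, look it up, and compile only on a miss."""

    def __init__(self, compile_fn: Callable[[Request], bytes]) -> None:
        self._compile_fn = compile_fn
        self._entries: Dict[str, bytes] = {}
        self.compile_calls = 0

    @staticmethod
    def make_key(request: Request) -> str:
        # Stand-in token: the repr of the whole request, not a real fingerprint.
        return repr(request)

    def get(self, request: Request) -> bytes:
        key = self.make_key(request)
        if key not in self._entries:  # miss: compile once and store
            self.compile_calls += 1
            self._entries[key] = self._compile_fn(request)
        return self._entries[key]  # hit: reuse the stored result


cache = CompileOnceCache(lambda req: f"binary-for-{req[2]}".encode())
req = ("add(x, y)", ("-O2",), "v4-2x2x1")
first = cache.get(req)
second = cache.get(req)  # identical request, so no recompile
other = cache.get(("add(x, y)", ("-O2",), "v5e-2x2"))  # new target, new entry
```

Changing any component of the request (text, flags, or target) changes the key, which is the property the three layers below each enforce in their own way.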
There are three distinct cache layers, and conflating them is the first reimplementation trap. The outermost is the XLA JIT cache, tensorflow::DeviceCompilationCache<T> (from device_compilation_cache.h): an in-process flat_hash_map keyed by a DeviceCompilationClusterSignature — a function name plus the canonicalized argument shapes/values — that decides whether an XLA cluster needs (re)compilation at all. Beneath it sits the TPU C-API key, tensorflow::tpu::TpuCompilationCacheKey, built by TpuCompile_CreateCompilationCacheKey (0xf6a2080) in tpu_util_c_api.cc: a Fingerprint2011 digest over a StrCat-assembled string of thirteen named inputs (HLO/MLIR fingerprints, replica count, topology bounds, guaranteed-constants size, shapes prefix, embedding-partitions fingerprint) plus a conditional device-assignment tail. The innermost is the TFRT group cache, tensorflow::tfrt_tpu::TpuCompilationCache (tpu_compilation_cache.cc): the layer that actually holds compiled programs, resolves a key through a three-tier lookup (in-memory → coordination service → on-disk persistent cache), and manages entry refcounts and eviction.
The page is organized by layer, outermost first. Each layer gets the same ### Purpose / ### Cache Key / ### Lookup & Store / ### Algorithm / ### Function Map grammar, so a reader can compare the three keys side by side. The closing units cover the persistent on-disk path format and the entry lifecycle (refcount, eviction, restore-from-serialized).
For reimplementation, the contract is:
- The TPU cache key is a Fingerprint2011 over a deterministic, ordered string of thirteen named fields plus a conditional device-assignment tail. Reproduce the field set, the concatenation order (which differs from the dump_vars label order: shapes_prefix is appended last), the separators (: and ,), and the two-stage hash (fingerprint the embedding-partitions proto first, then fingerprint the whole prefix). Any field omitted or reordered silently changes the key and causes spurious misses.
- The XLA-side key is a signature, not a hash. DeviceCompilationClusterSignature stores the function name plus a per-argument canonical form (constant tensors carried by value, non-constants carried as (dtype, shape)), compared by a custom Hash/==. It gates whether TpuCompile_CreateCompilationCacheKey is even called.
- Lookup is three-tiered. TpuCompilationCache::LookUpInternal (0xf7a72c0) tries the in-memory map first, then the coordination service, then the persistent on-disk cache; a miss at every tier is the only path that triggers a real compile.
- Entries are reference-counted and LRU-evictable. The cache holds CompiledSubgraph / TpuCompilationCacheEntry objects with explicit Release / DiscardEntryRef / MarkOldestEntryForEviction; an executing program pins its entry.
- The persistent path is a filesystem prefix, not a hash directory: JoinPath(dir, "CL" + guaranteed_const_fp + "_" + base_key), built by ConstructCacheEntryFilepathPrefix (0xe8cab00).
| TPU key builder | TpuCompile_CreateCompilationCacheKey — 0xf6a2080 (tpu_util_c_api.cc:149-193) |
| Key digest | Fingerprint2011 (0x20d6cd40) → decimal string; two-stage |
| XLA signature | tensorflow::DeviceCompilationClusterSignature::Build — 0xfb01f40 |
| XLA in-mem cache | DeviceCompilationCache<T>::{LookupOrCreate,Store} — 0xe9932e0 / 0xe994a60 |
| TFRT group cache | tensorflow::tfrt_tpu::TpuCompilationCache (tpu_compilation_cache.cc) |
| Three-tier lookup | TpuCompilationCache::LookUpInternal — 0xf7a72c0 |
| Persistent path | ConstructCacheEntryFilepathPrefix — 0xe8cab00 ("CL" + fp + "_" + key) |
| Cache entry type | tfrt::tpu::TpuCompilationCacheEntry (ctor 0xf7bd100) |
| Source files | tpu_util_c_api.cc, device_compilation_cache.h, tpu_compilation_cache.cc |
Layer 1 — XLA Cluster Signature (DeviceCompilationCache)
Purpose
This is the outermost gate, and the one a JAX/XLA caller hits first. When an XLA cluster (an HLO computation derived from one NameAttrList plus its argument list) is about to be compiled, DeviceCompilationCache<T> decides whether an equivalent cluster has already been compiled in this process. It is templated on the executable type — both xla::PjRtLoadedExecutable (0xe9932e0) and xla::LocalExecutable (0xe986a20) instantiations are present in the binary, sharing identical LookupOrCreate/Store code. It is a pure in-memory cache from third_party/tensorflow/compiler/jit/device_compilation_cache.h (the Store log site cites device_compilation_cache.h:259); it has no persistence of its own and delegates the actual compile to DeviceCompiler<T>::CompileStrict/CompileAsynchronous.
Cache Key
The key is a tensorflow::DeviceCompilationClusterSignature, built — not hashed — into a structured value by DeviceCompilationClusterSignature::Build (0xfb01f40):
// DeviceCompilationClusterSignature::Build (sub_0xfb01f40)
//   canonical_function : a DeviceCompilationCanonicalFunction (name + attrs)
//   args               : Span<const XlaArgument>
function Build(canonical_function, args):
    sig.name = canonical_function.name   // copied string, +0..+15
    AppendArguments(sig, args)           // sub_0xfb01a80
    return sig
AppendArguments (0xfb01a80) walks each XlaArgument and appends a std::variant: a compile-time-constant argument is captured by value as a tensorflow::Tensor (so two calls with different constant values get different signatures), while a non-constant argument is captured as a pair<DataType, InlinedVector<int64,4>> — its dtype and shape only. The signature is compared by DeviceCompilationClusterSignature::Hash and equality, which is why the map below is a FlatHashMapPolicy<DeviceCompilationClusterSignature, unique_ptr<Entry>>, Hash> rather than a string-keyed map.
GOTCHA: the signature is value-sensitive for constants and shape-sensitive for variables. A reimplementation that keys only on shapes will incorrectly share entries across calls whose constant arguments differ (e.g. a slice start index folded as a guaranteed constant). Capturing the full Tensor for constants is deliberate, and argument order matters: AppendArguments preserves it.
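The value-versus-shape distinction can be modelled with hashable frozen dataclasses. This is a sketch under stated assumptions: `ConstArg`, `VarArg`, `ClusterSignature`, and `build_signature` are hypothetical names, and dtypes are plain strings rather than real DataType values.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class ConstArg:
    """Compile-time constant: captured by value, so contents affect equality."""
    dtype: str
    shape: Tuple[int, ...]
    values: Tuple[float, ...]


@dataclass(frozen=True)
class VarArg:
    """Non-constant argument: only dtype and shape participate."""
    dtype: str
    shape: Tuple[int, ...]


@dataclass(frozen=True)
class ClusterSignature:
    name: str
    args: Tuple[Union[ConstArg, VarArg], ...]  # argument order is preserved


def build_signature(name: str, args: Sequence[Union[ConstArg, VarArg]]) -> ClusterSignature:
    return ClusterSignature(name=name, args=tuple(args))


x = VarArg("f32", (8, 128))
sig_a = build_signature("slice_fn", [x, ConstArg("i32", (), (0,))])
sig_b = build_signature("slice_fn", [x, ConstArg("i32", (), (4,))])  # other constant
sig_c = build_signature("slice_fn", [VarArg("f32", (8, 128)), ConstArg("i32", (), (0,))])
sig_d = build_signature("slice_fn", [ConstArg("i32", (), (0,)), x])  # reordered
```

Because the dataclasses are frozen, Python derives `__hash__` and `__eq__` from the fields, which plays the role the custom Hash/== plays for the real signature.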
Lookup & Store
LookupOrCreate (0xe9932e0) and Store (0xe994a60) both take the single absl::Mutex guarding the map. LookupOrCreate emplaces a fresh empty Entry (allocated with operator new(0x30), 48 bytes) only if the signature is absent, then returns a snapshot of the entry's current state. Store re-finds the entry by signature, drops the map lock, and then, under the entry's own mutex, writes back whichever of the four optional fields the compiler produced.
// Entry layout (48 bytes, operator new(0x30); offsets from the Entry base, byte-confirmed in Store)
// +0 : absl::Mutex mu (per-entry lock; Store locks Entry+0)
// +8 : DeviceCompileState compile_state (DWORD; kUncompiled / kCompiling / kCompiled)
// +16 : int64 request_count (bumped on each LookupOrCreate; init 0)
// +24 : absl::Status compilation_status (StatusRep*, refcounted; init 1 = inline OK)
// +32 : XlaCompilationResult* compilation_result (owned; deleted on overwrite)
// +40 : T* executable (owned; vtable-deleted on overwrite)
// DeviceCompilationCache<T>::LookupOrCreate (sub_0xe9932e0)
function LookupOrCreate(signature) -> EntrySnapshot:
    lock(mutex_)
    entry = map_.try_emplace(signature, new Entry{})   // EmplaceDecomposable
    entry.request_count += 1                           // +16
    snapshot.compile_state = entry.compile_state       // +8
    snapshot.status = entry.compilation_status         // +24, Ref()'d
    snapshot.request_count = entry.request_count
    snapshot.{result,exe} = entry.{result,exe}         // +32,+40 (aliased, not owned)
    unlock(mutex_)
    return snapshot

// DeviceCompilationCache<T>::Store (sub_0xe994a60)
function Store(signature, opt_state, opt_status, opt_result, opt_exe):
    lock(mutex_)
    entry = map_.try_emplace(signature, new Entry{})   // re-find or create
    unlock(mutex_)
    lock(entry.mutex)                                  // per-entry mutex at Entry+0
    if opt_state.has_value(): entry.compile_state = *opt_state
    if opt_status.has_value(): entry.compilation_status = *opt_status   // swap+Unref old
    if opt_result.has_value(): delete entry.result; entry.result = release(*opt_result)
    if opt_exe.has_value(): vtable_delete entry.exe; entry.exe = release(*opt_exe)
    unlock(entry.mutex)
    VLOG(4) << "Added/updated cache entry: key=" << sig.HumanString()
            << ", entry=" << entry.DebugString()
QUIRK: Store takes four std::optional fields and updates only the ones present. The compiler first stores compile_state = kCompiling (claiming the slot), and only later stores compilation_result + executable + compile_state = kCompiled. A concurrent LookupOrCreate that observes kCompiling is expected to wait on the status future rather than recompile; the cache itself does not block, it only surfaces the state.
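The claim-then-fill protocol can be sketched as follows. This is an illustrative model only: `ClusterCache`, `Entry`, and `CompileState` are hypothetical names, `None` stands in for an absent std::optional, and a single lock replaces the map mutex plus per-entry mutex pair described above.

```python
import threading
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict, Hashable, Optional


class CompileState(Enum):
    UNCOMPILED = 0
    COMPILING = 1
    COMPILED = 2


@dataclass
class Entry:
    compile_state: CompileState = CompileState.UNCOMPILED
    request_count: int = 0
    status: str = "OK"
    result: Optional[str] = None
    executable: Optional[str] = None


class ClusterCache:
    def __init__(self) -> None:
        self._mu = threading.Lock()  # simplification: one lock for map and entries
        self._map: Dict[Hashable, Entry] = {}

    def lookup_or_create(self, sig: Hashable) -> Entry:
        with self._mu:
            entry = self._map.setdefault(sig, Entry())  # emplace only if absent
            entry.request_count += 1
            return replace(entry)  # snapshot copy, not the live entry

    def store(self, sig: Hashable, state: Optional[CompileState] = None,
              status: Optional[str] = None, result: Optional[str] = None,
              executable: Optional[str] = None) -> None:
        with self._mu:
            entry = self._map.setdefault(sig, Entry())
            # Update only the fields the caller actually supplied.
            if state is not None:
                entry.compile_state = state
            if status is not None:
                entry.status = status
            if result is not None:
                entry.result = result
            if executable is not None:
                entry.executable = executable


cache = ClusterCache()
first = cache.lookup_or_create("sig")
cache.store("sig", state=CompileState.COMPILING)  # claim the slot
second = cache.lookup_or_create("sig")  # a concurrent caller would see COMPILING
cache.store("sig", state=CompileState.COMPILED, result="hlo", executable="exe")
third = cache.lookup_or_create("sig")
```

Returning a copy from `lookup_or_create` mirrors the snapshot behaviour: a caller inspects the state it observed and decides for itself whether to compile, wait, or reuse.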
Function Map
| Function | Address | Role |
|---|---|---|
| DeviceCompilationClusterSignature::Build | 0xfb01f40 | Build name+args signature |
| AppendArguments | 0xfb01a80 | Append per-arg variant (const→Tensor, var→dtype/shape) |
| DeviceCompilationCache<PjRt…>::LookupOrCreate | 0xe9932e0 | Emplace + snapshot under mutex |
| DeviceCompilationCache<PjRt…>::Store | 0xe994a60 | Write-back optional fields |
| DeviceCompilationCache<LocalExe>::LookupOrCreate | 0xe986a20 | Same, LocalExecutable instantiation |
| DeviceCompilationCache<LocalExe>::Store | 0xe988d00 | Same, LocalExecutable instantiation |
| DeviceCompiler<…>::CompileStrict | 0xe993860 / 0xe986fa0 | Miss path: do the real compile |
| DeviceCompiler<…>::CompileAsynchronous | 0xe993420 / 0xe986b20 | Async miss path |
Layer 2 — TPU Compilation Cache Key
Purpose
When Layer 1 misses and the compiler reaches the TPU backend, the request is reduced to a tensorflow::tpu::TpuCompilationCacheKey — the token under which the TFRT group cache (Layer 3) and the persistent on-disk cache are indexed. The single builder is TpuCompile_CreateCompilationCacheKey (0xf6a2080), exported through the C-ABI shim layer in learning/45eac/tfrc/executor/stream_executor/tpu_util_c_api.cc (the LOG(FATAL) and VLOG sites cite lines 149, 152, 178, 193). It is the most important function on this page, because its field set is the cache's notion of program identity.
Cache Key
The key is a decimal string: the to_string rendering of the 64-bit Fingerprint2011 of a StrCat-assembled prefix string. The builder carries a 13-entry dump_vars label table for its VLOG dump (tpu_util_c_api.cc, the $_1 lambda, dump_vars<13ul, 13ul, …>); those thirteen labels name the program-identity inputs, and a fourteenth (device_assignment) is appended conditionally but is not in the dump table. The table below lists the inputs by name; the # column is the dump-table label order, which is not identical to the concatenation order (see the note after the pseudocode).
| # | Field (dump_vars label) | Source |
|---|---|---|
| 1 | function_name | std::string(property.function_name) |
| 2 | function_library_fingerprint | property.function_library_fingerprint |
| 3 | mlir_module_fingerprint | property.mlir_module_fingerprint |
| 4 | num_replicas | property.num_replicas |
| 5–7 | chip_bounds.{x,y,z} | topology.chip_bounds().{x,y,z} (topo+0x58 region) |
| 8–10 | wrap.{x,y,z} | topology.wrap().{x,y,z} (topo+0xA0/+0xA2) |
| 11 | shapes_prefix | std::string(property.shapes_prefix) |
| 12 | guaranteed_constants_size | property.guaranteed_constants_size |
| 13 | embedding_partitions_fingerprint | Fingerprint2011(embedding_partitions_proto) |
| — | device_assignment (conditional, not in dump table) | :device_assignment: + joined ids, or :default_device_assignment |
// TpuCompile_CreateCompilationCacheKey (sub_0xf6a2080, tpu_util_c_api.cc:149)
//   property : compilation request (function name/lib fp, mlir fp, replicas, shapes, …)
//   mesh     : TpuMeshCommonState* (carries topology + embedding partitions)
function CreateCompilationCacheKey(property, mesh) -> TpuCompilationCacheKey:
    CHECK(mesh != nullptr)                                           // line 149
    // (a) fingerprint the embedding-partitions proto first
    eb_str = mesh.embedding_partitions_proto().SerializeToString()  // CHECK ok, line 152
    eb_fp = Fingerprint2011(eb_str)
    topo = mesh.tpu_topology()
    // (b) assemble the ordered prefix string
    //     (StrCat<...,int,...,int,...,int,...,int,...,bool,...,bool,...,bool,
    //      ...,char const*,...,string> in the decompile)
    prefix = StrCat(property.function_name,                   // field 1 (string, v103/a10)
                    ":", property.function_library_fingerprint,  // 2 (FastIntToBuffer)
                    ":", property.mlir_module_fingerprint,       // 3 (FastIntToBuffer)
                    ":", property.num_replicas,                  // 4 (int, a16)
                    ":", topo.chip_bounds.x, ",", .y, ",", .z,   // 5-7 (3×int, topo+0x58 region)
                    ",", topo.wrap.x, ",", .y, ",", .z,          // 8-10 (3×bool, topo+0xA0/+0xA2)
                    ":", property.guaranteed_constants_size,     // 12 (int, a9)
                    ":", to_string(eb_fp))                       // 13 (embedding fp, v87)
    // (c) device-assignment tail (field 14): explicit ids vs. default
    if device_grid_matches_replicas(property):                // decompile line ~220 shape check
        if property.device_assignment == nullptr:
            StrAppend(prefix, ":default_device_assignment")
        else:
            StrAppend(prefix, ":device_assignment:", Join(ids, ","))
    // outside the tail block: shapes_prefix is always the final append
    StrAppend(prefix, property.shapes_prefix)                 // field 11 (string, a8), appended last
    VLOG(1) << "prefix_raw = " << prefix                      // line 178
    // (d) the key proper is the fingerprint of the whole prefix
    key.fingerprint = to_string(Fingerprint2011(prefix))
    key.prefix = prefix
    VLOG(1) << "CompilationCacheKey:" << key                  // line 193
    return key                                                // {fingerprint, prefix} pair
CONCATENATION ORDER: the dump_vars label table is the builder's display order, which is not the concatenation order. From the decompile, the main StrCat runs function_name, then :-joined function_library_fingerprint, mlir_module_fingerprint, num_replicas, then ,-joined chip_bounds.{x,y,z} and wrap.{x,y,z}, then :-joined guaranteed_constants_size and the embedding-partitions fingerprint. The device-assignment tail is StrAppended next, and shapes_prefix is StrAppended last of all (it is the final StrAppend after the device-assignment block). A reimplementation must match this concatenation order, not the dump-label order, or the fingerprint diverges.

The exact int-vs-bool rendering of individual chip_bounds / wrap components (the decompiled StrCat template instantiates four int and three bool AlphaNum slots) is a Hex-Rays artifact and was not pinned per-component; the field names and the overall order above are byte-anchored to the dump_vars labels and the StrCat / StrAppend control flow.
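The concatenation order can be made concrete with a small sketch. Several things here are assumptions and are flagged as such: `stand_in_fp` is a BLAKE2b-based substitute, not Fingerprint2011, so its values will never match libtpu; wrap flags are rendered as 0/1 because the real rendering was not pinned; `Property`, `build_key`, and the `append_tail` switch (standing in for the device-assignment guard) are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


def stand_in_fp(data: bytes) -> int:
    """Stand-in 64-bit digest. NOT Fingerprint2011; values differ from libtpu."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


@dataclass
class Property:
    function_name: str
    function_library_fingerprint: int
    mlir_module_fingerprint: int
    num_replicas: int
    guaranteed_constants_size: int
    shapes_prefix: str
    device_assignment: Optional[Sequence[int]] = None


def build_key(prop: Property, chip_bounds: Tuple[int, int, int],
              wrap: Tuple[bool, bool, bool], embedding_partitions: bytes,
              append_tail: bool) -> Tuple[str, str]:
    eb_fp = stand_in_fp(embedding_partitions)  # stage 1: embedding partitions
    prefix = "".join([
        prop.function_name,
        ":", str(prop.function_library_fingerprint),
        ":", str(prop.mlir_module_fingerprint),
        ":", str(prop.num_replicas),
        ":", ",".join(str(v) for v in chip_bounds),
        ",", ",".join(str(int(w)) for w in wrap),  # assumed 0/1 rendering
        ":", str(prop.guaranteed_constants_size),
        ":", str(eb_fp),
    ])
    if append_tail:  # conditional device-assignment tail
        if prop.device_assignment is None:
            prefix += ":default_device_assignment"
        else:
            prefix += ":device_assignment:" + ",".join(map(str, prop.device_assignment))
    prefix += prop.shapes_prefix  # always the final append
    return str(stand_in_fp(prefix.encode())), prefix  # stage 2: whole prefix


prop = Property("cluster_0", 11, 22, 4, 16, "[f32:8x128]")
wrap = (True, True, False)
fingerprint, prefix = build_key(prop, (2, 2, 1), wrap, b"", append_tail=True)
other_topology, _ = build_key(prop, (4, 2, 1), wrap, b"", append_tail=True)
_, no_tail_prefix = build_key(prop, (2, 2, 1), wrap, b"", append_tail=False)
```

Note that `shapes_prefix` lands after the tail even though it sits at position 11 in the label table, which is exactly the ordering trap the note above warns about.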
The guaranteed-constants contribution is itself a fingerprint, computed separately by TpuCompile_CreateGuaranteedConstFingerprint (0xf6a2040) and combined with FingerprintCat2011:
// TpuCompile_CreateGuaranteedConstFingerprint (sub_0xf6a2040)
function CreateGuaranteedConstFingerprint(running_fp, const_bytes) -> uint64:
    return FingerprintCat2011(running_fp, Fingerprint2011(const_bytes))
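A running fingerprint of this shape is naturally used as a fold over a sequence of constants. The sketch below assumes that usage (the source shows only the single combine step), and both `stand_in_fp` and `stand_in_cat` are BLAKE2b-based substitutes rather than the real Fingerprint2011 / FingerprintCat2011.

```python
import hashlib
from typing import Iterable


def stand_in_fp(data: bytes) -> int:
    """Stand-in 64-bit digest; not Fingerprint2011."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def stand_in_cat(a: int, b: int) -> int:
    """Stand-in order-sensitive combiner; not FingerprintCat2011."""
    return stand_in_fp(a.to_bytes(8, "big") + b.to_bytes(8, "big"))


def guaranteed_const_fingerprint(constants: Iterable[bytes], seed: int = 0) -> int:
    running = seed
    for const_bytes in constants:
        # One step of the fold: combine the running value with this constant's digest.
        running = stand_in_cat(running, stand_in_fp(const_bytes))
    return running


fp_ab = guaranteed_const_fingerprint([b"\x00\x01", b"\x02\x03"])
fp_ba = guaranteed_const_fingerprint([b"\x02\x03", b"\x00\x01"])  # reordered
fp_same_size = guaranteed_const_fingerprint([b"\x00\x01", b"\x02\x04"])  # same byte count
```

The last case is the interesting one: the total constant size is unchanged, yet the fingerprint differs, which a size-only field could not detect.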
GOTCHA: the key folds in both content identity (HLO/MLIR module fp, field 3; constants size, field 12; embedding-partitions fp, field 13) and the physical target (topology chip bounds + wrap, fields 5–10; the device-assignment tail). Two byte-identical programs compiled for different topologies or device assignments get different keys. That is correct, because the compiled TpuProgram is target-specific. A reimplementation that keys on program text alone will serve a v4-2x2x1 binary to a v5e-2x2 request and crash at load.
QUIRK: the device-assignment tail is gated by a check at decompile line 220. The tail is appended when (replicas.lo × replicas.hi == num_cores) || (replicas.lo ∉ {1, num_cores}), where num_cores is read from topology+0x80 and the two replicas halves come from the 64-bit replica field; otherwise the explicit/default-assignment field is omitted entirely, not defaulted, so the key length is variable. Reproduce the predicate exactly, or single-replica and multi-replica runs of the same program will collide or diverge unexpectedly. (The grid/replica field identities here are inferred from the arithmetic shape of the guard, not from labeled symbols: MEDIUM confidence on which field is which; the predicate structure itself is byte-anchored.)
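The guard is easy to get subtly wrong, so here is the predicate transcribed directly. The function and parameter names are hypothetical; only the boolean structure comes from the text above.

```python
def tail_is_appended(replicas_lo: int, replicas_hi: int, num_cores: int) -> bool:
    """True when the device-assignment tail is appended to the key prefix."""
    grid_matches = replicas_lo * replicas_hi == num_cores
    lo_is_unusual = replicas_lo not in (1, num_cores)
    return grid_matches or lo_is_unusual


cases = {
    (2, 4, 8): tail_is_appended(2, 4, 8),  # product matches the core count
    (1, 1, 8): tail_is_appended(1, 1, 8),  # no match and lo == 1
    (8, 2, 8): tail_is_appended(8, 2, 8),  # no match and lo == num_cores
    (3, 1, 8): tail_is_appended(3, 1, 8),  # no match, but lo is neither 1 nor 8
}
```

The two false cases are the ones where the field is omitted from the key entirely rather than replaced with a default marker.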
Function Map
| Function | Address | Role |
|---|---|---|
| TpuCompile_CreateCompilationCacheKey | 0xf6a2080 | Assemble prefix + fingerprint → key |
| TpuCompile_CreateGuaranteedConstFingerprint | 0xf6a2040 | FingerprintCat2011(fp, Fingerprint2011(const)) |
| TpuCompile_DestroyCompilationCacheKey | 0xf6a2e60 | Free the key struct |
| Fingerprint2011 | 0x20d6cd40 | 64-bit non-crypto digest of a string_view |
| FingerprintCat2011 | 0x20d6d0e0 | Combine two 64-bit fingerprints |
| TpuProgram_GetFingerprint | 0xe8bed60 | Read fingerprint off a compiled TpuProgram (+160→+648) |
| TpuExecutable_Fingerprint | 0xeabea40 | Fingerprint of a serialized executable |
NOTE: Fingerprint2011 is the Farmhash-derived TF fingerprint, not a cryptographic hash. The cache trusts it for identity; an adversary could in principle force a collision, but the threat model is recompilation-avoidance, not integrity. TpuProgram_GetFingerprint reads the fingerprint stored inside an already-compiled program (the inline-string-optimized field at program+160 → +648), used to validate a deserialized program against its key on load.
Layer 3 — TFRT Group Cache (tfrt_tpu::TpuCompilationCache)
Purpose
This is the layer that actually stores compiled programs and decides compile-vs-reuse at runtime. tensorflow::tfrt_tpu::TpuCompilationCache (learning/45eac/tfrt/tf_tpu/tpu_compilation_cache.cc) keys on the Layer-2 key, holds tfrt::tpu::TpuCompilationCacheEntry objects grouped by program group, and resolves a lookup through three tiers in priority order. A "group" is a set of related entries (e.g. the per-core programs of one sharded computation) inserted and evicted together; EmplaceGroupEntry/RemoveGroup/RemoveGroupEntries operate at group granularity, while individual entries carry their own refcounts.
Cache Key
Layer 3 indexes on the Layer-2 fingerprint string, but first normalizes it with ConvertToBaseKey (0xf7bf760), which strips the per-core suffix so all cores of a group share one base key. The persistent on-disk index uses a further-derived path prefix (see Persistent On-Disk Cache below).
Lookup & Store
The hit path is LookUpInternal (0xf7a72c0), a three-tier resolver:
// TpuCompilationCache::LookUpInternal (sub_0xf7a72c0, tpu_compilation_cache.cc:~1097)
function LookUpInternal(base_key) -> AsyncValueRef<Entry>:
    // TIER 1: in-memory map (shared lock; walk tombstone chain)
    lock_shared(cache_mutex_)
    node = entries_head_; while node.flags & 3: node = node.next   // skip dead slots
    result = GetTpuCompilationCacheEntry(node)                      // hit if present
    unlock_shared(cache_mutex_)
    if result.is_present: return result
    // TIER 2: coordination service (multi-host: peer may already hold it)
    if coord_service_ && coord_service_.is_available():
        return LookUpWithCoordService(base_key)
    // TIER 3: persistent on-disk cache
    if persistent_cache_enabled_:                                   // entries_[20]
        st = LoadGroupFromPersistentCache()                         // sub: read RecordReader
        if st == kOk:
            node = entries_head_; while node.flags & 3: node = node.next
            return GetTpuCompilationCacheEntry(node)                // now in-memory
        else:
            return MakeErrorAsyncValueRef(st)                       // propagate load error
    // total miss with no persistence → hard error (the compile path runs elsewhere)
    return MakeErrorAsyncValueRef(
        Internal("Cannot access the coordination service or persistent "
                 "cache to fetch the compilation cache entry."))    // line 1097
The store/compile path is CompileGroup (0xf7a4f60) → MaybeCompileGroup (0xf7a6200) → EmplaceGroupEntry (0xf7a3a00). MaybeCompileGroup is the dedup point: it re-checks the cache under an exclusive lock before launching a compile, so two concurrent first-time requests for the same key compile once and share the result. RestoreFromSerialized (0xf7a2ca0) is the persistent-cache loader's counterpart — it reconstructs entries from a serialized blob (TpuCompilationEntrySource) rather than compiling.
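The dedup behaviour attributed to MaybeCompileGroup can be illustrated with a check-under-lock sketch. `GroupCache` and its members are hypothetical, and this version simplifies by holding the lock for the whole compile, which a real implementation would likely avoid.

```python
import threading
from typing import Callable, Dict, List


class GroupCache:
    def __init__(self, compile_fn: Callable[[str], str]) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, str] = {}
        self._compile_fn = compile_fn
        self.compiles = 0

    def maybe_compile_group(self, base_key: str) -> str:
        with self._lock:
            # Re-check under the exclusive lock: another caller may have
            # inserted the entry between our lookup miss and this point.
            entry = self._entries.get(base_key)
            if entry is None:
                self.compiles += 1
                entry = self._compile_fn(base_key)
                self._entries[base_key] = entry
            return entry


cache = GroupCache(lambda key: f"compiled:{key}")
results: List[str] = []


def worker() -> None:
    results.append(cache.maybe_compile_group("base-key"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All eight callers receive the same entry, and only the first one to take the lock pays for the compile.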
QUIRK: the in-memory tier walks an intrusive node chain (node = node.next at +16) and skips any node whose low two flag bits at +4 are set. Those bits mark tombstoned / being-evicted entries; a naive flat_hash_map::find reimplementation will return an entry that is mid-eviction. The shared-lock walk plus flag check is the correct read protocol.
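The tier ordering and the tombstone check can be sketched together. This is a simplified model, not the real data structure: a dict plus a flag test replaces the intrusive chain walk, the coordination service and disk loader are hypothetical callables, and locking is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

TOMBSTONE_MASK = 0x3  # low two flag bits mark dead or being-evicted nodes


@dataclass
class Node:
    payload: str
    flags: int = 0


class ThreeTierCache:
    def __init__(self, coord_lookup: Optional[Callable[[str], str]] = None,
                 disk_load: Optional[Callable[[str], str]] = None) -> None:
        self.memory: Dict[str, Node] = {}
        self.coord_lookup = coord_lookup
        self.disk_load = disk_load

    def _in_memory(self, key: str) -> Optional[str]:
        node = self.memory.get(key)
        if node is None or node.flags & TOMBSTONE_MASK:
            return None  # absent, or present but mid-eviction
        return node.payload

    def look_up(self, key: str) -> str:
        hit = self._in_memory(key)  # tier 1: in-memory
        if hit is not None:
            return hit
        if self.coord_lookup is not None:  # tier 2: coordination service
            return self.coord_lookup(key)
        if self.disk_load is not None:  # tier 3: persistent cache
            payload = self.disk_load(key)  # a load error propagates as an exception
            self.memory[key] = Node(payload)  # now resident in memory
            return payload
        raise LookupError("no coordination service or persistent cache available")


cache = ThreeTierCache(disk_load=lambda key: "from-disk:" + key)
cache.memory["k"] = Node("stale", flags=1)  # tombstoned entry must be skipped
first = cache.look_up("k")
second = cache.look_up("k")  # served from memory after the disk load

try:
    ThreeTierCache().look_up("missing")
    raised = False
except LookupError:
    raised = True
```

As in the pseudocode, a configured coordination service short-circuits the disk tier, and a lookup with no fallback tier at all is an error rather than a trigger to compile.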
Function Map
| Function | Address | Role |
|---|---|---|
| TpuCompilationCache::LookUpInternal | 0xf7a72c0 | Three-tier lookup (mem→coord→disk) |
| TpuCompilationCache::LookUpWithCoordService | 0xf7a74a0 | Tier-2 multi-host lookup |
| TpuCompilationCache::LoadGroupFromPersistentCache | 0xf7a7de0 | Tier-3 on-disk read |
| TpuCompilationCache::CompileGroup | 0xf7a4f60 | Miss path: compile a group |
| TpuCompilationCache::MaybeCompileGroup | 0xf7a6200 | Re-check under lock, dedup compiles |
| TpuCompilationCache::EmplaceGroupEntry | 0xf7a3a00 | Insert compiled group into map |
| TpuCompilationCache::GetEntriesFrom | 0xf7a1f80 | Resolve group → per-core entries |
| TpuCompilationCache::RemoveGroup | 0x21389f60 | Drop a group |
| TpuCompilationCache::RemoveGroupEntries | 0xf7a85c0 | Drop N entries of a group |
| TpuCompilationCache::RestoreFromSerialized | 0xf7a2ca0 | Rebuild entries from serialized blob |
| TpuCompilationCache::ConvertToBaseKey | 0xf7bf760 | Strip per-core suffix → group base key |
| tfrt::tpu::TpuCompilationCacheEntry::Program ctor | 0xf7bc500 | Build the Program payload of an entry |
| tfrt::tpu::TpuCompilationCacheEntry ctor | 0xf7bd100 | Build a full cache entry |
Persistent On-Disk Cache
The persistent cache turns a TpuCompilationCacheKey into a filesystem path prefix, not a hash-bucket directory. ConstructCacheEntryFilepathPrefix (0xe8cab00, in an anonymous namespace) builds it:
// ConstructCacheEntryFilepathPrefix (sub_0xe8cab00)
//   dir           : persistent cache directory (string_view)
//   key           : TpuCompilationCacheKey
//   topology      : tpu::TpuTopology
//   guaranteed_fp : int64 guaranteed-const fingerprint
function ConstructCacheEntryFilepathPrefix(dir, key, topology, guaranteed_fp) -> path:
    base = ConstructPersistentCompilationCacheKey(key, topology)   // derived stable key
    // "CL" is fused to the fingerprint; the only "_" sits between fp and base
    file = StrCat("CL", to_string(guaranteed_fp), "_", base)        // "CL<fp>_<base>"
    return JoinPath(dir, file)                                      // tsl::io::JoinPathImpl
The on-disk record itself is read/written by tfrt_tpu::RecordReader (ctor 0xf7beda0) and RecordWriter (ctor 0xf7be740) — the TFRecord-style framed-blob format also used for the serialized TpuProgram payload. The cache directory and its read-only / read-write mode are governed by absl flags (the binary carries a tfrt_tpu_pjrt_client_* flag family); the exact flag names for the persistent directory were not individually traced (LOW confidence on flag identity), but the path-prefix construction above is byte-confirmed.
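The path-prefix shape is simple enough to show directly. In this sketch the function name and arguments are hypothetical, and `base_key` is treated as an opaque string because the derivation inside ConstructPersistentCompilationCacheKey is not described here.

```python
import posixpath


def cache_entry_filepath_prefix(cache_dir: str, guaranteed_const_fp: int,
                                base_key: str) -> str:
    # "CL" is fused to the fingerprint; a single "_" separates it from the base key.
    file_name = f"CL{guaranteed_const_fp}_{base_key}"
    return posixpath.join(cache_dir, file_name)


path_a = cache_entry_filepath_prefix("/var/cache/tpu", 123, "456")
path_b = cache_entry_filepath_prefix("/var/cache/tpu", 124, "456")  # other constants
```

Two requests that share a base key but differ in guaranteed-constant values resolve to different files under the same directory.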
NOTE: the "CL" literal is the persistent-cache record-type tag (the file is a compiled program). The guaranteed-const fingerprint is hoisted into the filename, separate from the in-key field 12 (size), so that a change in constant values (not just their byte size) produces a distinct on-disk file even when the in-memory key would otherwise look stable. This is the on-disk analogue of Layer 1 capturing constant Tensors by value.
Entry Lifecycle
A cache entry is reference-counted across its whole life: created on a miss, pinned while a program executes, and evicted under LRU pressure. The TFRT layer's lifecycle primitives live on TpuCompilationCacheInterface:
LookUpInternal miss
│
▼
CompileGroup ──► MaybeCompileGroup (dedup) ──► compile ──► EmplaceGroupEntry
│ │
│ (or) RestoreFromSerialized / LoadGroupFromPersistentCache
▼ ▼
InsertEntry ───────────────────────────────────────► entry in map (refcount=1)
│
▼
execution pins entry ──► DoTpuExecute holds AsyncValueRef<TpuCompilationCacheEntry>
│
▼
Release(uid) / DiscardEntryRef ──► refcount-- ; if 0 and LRU-oldest:
│
▼
MarkOldestEntryForEviction ──► flag node (+4 bits) ; RemoveEntry frees payload
| Operation | Address | Role |
|---|---|---|
| TpuCompilationCacheInterface::InsertEntry | 0xeaacde0 | Add a CompiledSubgraph to the cache |
| TpuCompilationCacheInterface::RemoveEntry | 0xeaac840 | Remove by key |
| TpuCompilationCacheInterface::Release | 0xeaabde0 | Drop a held reference by uid |
| TpuCompilationCacheInterface::DiscardEntryRef | 0xeaac500 | Decrement entry refcount |
| TpuCompilationCacheInterface::MarkOldestEntryForEviction | 0xeaac340 | LRU eviction candidate |
| TpuCompilationCacheExternal::InitializeEntry | 0xe977c40 | Populate an entry via compile callback |
| DoTpuExecute (consumes entry) | 0xe6fa5a0 | Holds AsyncValueRef<TpuCompilationCacheEntry> during run |
A tfrt::tpu::TpuCompilationCacheEntry (ctor 0xf7bd100) wraps an AsyncValueRef<Program> plus the metadata an executor needs without touching the program bytes: the HloInputOutputAliasConfig, the input/output xla::Shape vectors, the CompilerMetadata, the HostTransferProto list, an optional DeviceAssignment, the FDO config, and the PjrtExecutableContext. The Program itself (ctor 0xf7bc500) is the shared_ptr<const tpu::TpuCoreProgram> payload — the object that tpu-program-serialization.md documents.
GOTCHA: the entry holds the program as a shared_ptr<const TpuCoreProgram>, and DoTpuExecute takes an AsyncValueRef to the entry, not to the program. Eviction therefore cannot free a program that is mid-execution: the executing reference keeps the entry (and transitively the program) alive even after MarkOldestEntryForEviction flags it. A reimplementation that frees the program on eviction-flag rather than on last-ref will use-after-free under concurrent execute+evict.
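The free-on-last-reference rule can be demonstrated with a small refcounted LRU model. All names here are hypothetical, the cache is assumed to hold one reference of its own per resident entry, and setting `program` to `None` stands in for freeing the payload.

```python
from collections import OrderedDict
from typing import Optional


class Entry:
    def __init__(self, program: bytes) -> None:
        self.program: Optional[bytes] = program
        self.refcount = 1  # assumption: the cache itself holds one reference
        self.marked_for_eviction = False

    def unref(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            self.program = None  # payload is freed only on the last reference


class RefCountedCache:
    def __init__(self) -> None:
        self._entries: "OrderedDict[str, Entry]" = OrderedDict()

    def insert(self, key: str, program: bytes) -> None:
        self._entries[key] = Entry(program)

    def acquire(self, key: str) -> Entry:
        entry = self._entries[key]
        entry.refcount += 1  # an executing program pins the entry
        self._entries.move_to_end(key)  # most recently used goes last
        return entry

    def evict_oldest(self) -> Entry:
        _, entry = self._entries.popitem(last=False)  # least recently used
        entry.marked_for_eviction = True  # flag it and hide it from lookups
        entry.unref()  # drop only the cache's own reference
        return entry


cache = RefCountedCache()
cache.insert("k0", b"program-0")
running = cache.acquire("k0")  # execution starts: refcount 2
evicted = cache.evict_oldest()  # flagged while still executing: refcount 1
alive_while_running = running.program
running.unref()  # execution ends: refcount 0, payload freed
```

Eviction removes the entry from the lookup path immediately but defers the actual free to whichever holder releases last, which is what keeps concurrent execute and evict safe.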
How the Layers Compose
A single PJRT_Client_Compile call (the PJRT entry point; see client-and-device.md and the phased compile in ext-compile-phasecompile.md) threads through all three layers:
PJRT_Client_Compile(program, options)
│
├─ Layer 1: DeviceCompilationClusterSignature::Build(fn, args) // 0xfb01f40
│ LookupOrCreate(sig) ─ hit? ─► return cached executable
│ │ miss → compile_state=kCompiling, fall through
│ ▼
├─ Layer 2: TpuCompile_CreateCompilationCacheKey(property, mesh) // 0xf6a2080
│ key = to_string(Fingerprint2011(prefix_of_13_fields + device_assignment))
│ ▼
├─ Layer 3: TpuCompilationCache::LookUpInternal(ConvertToBaseKey(key)) // 0xf7a72c0
│ tier1 in-mem ─ tier2 coord-svc ─ tier3 persistent("CL<fp>_<base>")
│ │ all miss
│ ▼
│ CompileGroup → MaybeCompileGroup (re-dedup) → compile → EmplaceGroupEntry
│ (optionally write persistent record via RecordWriter)
│ ▼
└─ Layer 1: Store(sig, kCompiled, status, compilation_result, executable) // 0xe994a60
return PjRtLoadedExecutable
The redundancy is intentional: Layer 1 catches re-compiles of the same XLA cluster cheaply (no fingerprinting), Layer 2 produces the target-specific identity, and Layer 3 catches cross-process and cross-run reuse via coordination service and disk. Each layer's miss is the next layer's lookup.
Related Components
| Name | Relationship |
|---|---|
| TpuExecutable / TpuProgram serialization | The opaque payload stored in a hit entry; wire format |
| DeviceCompiler<T>::CompileStrict | The miss path Layer 1 delegates the real compile to |
| Fingerprint2011 / FingerprintCat2011 | The hash primitives every key digest is built on |
| Coordination service | Tier-2 multi-host cache sharing in LookUpWithCoordService |
| PJRT client compile | The public entry point that drives all three layers |
Cross-References
- TpuProgram Serialization: the TpuExecutable / TpuProgram wire format stored inside a cache hit; this page treats it as an opaque payload
- Compile Phases: the expensive end-to-end pipeline a cache hit lets you skip
- Compiler Overview: the IR-layer stack the cached program is the output of
- Custom-Call Lowering: tpu_custom_call carries its own MosaicMlirCacheEntry sub-cache, distinct from the program cache
- PJRT Client and Device: the PJRT_Client_Compile entry point that consults Layer 1
- PJRT Phased Compile Extension: the multi-phase compile API that re-serializes between phase boundaries
- Executable Execution: DoTpuExecute pins a cache entry for the duration of a run