Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Per-Gen Accuracy Tables

All values, blob offsets, and constants on this page were decoded byte-exactly from the five embedded accuracy_table_*.binarypb blobs and the select_accuracy.cc consumer functions in libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

Each TPU generation ships an embedded accuracy_table_<codename>.binarypb proto blob that records the numerical accuracy of the hardware transcendental approximationssin, cos, tan, exp, log, rsqrt, sqrt, divide, erf, logistic, tanh, and their fast/accurate variants. These are not raw minimax polynomial coefficients; they are measured worst-case relative-error and ULP bounds, plus a piecewise input→error curve, that the compiler consults to decide whether the fast hardware approximation (the XLU/EUP path) meets the user-requested ResultAccuracy, or whether it must emit the slower accurate software polynomial.

Five blobs exist — jellyfish (TPU v2), dragonfish (TPU v3), pufferfish (TPU v4), viperfish (TPU v5), ghostlite (TPU v6 lite); external names are per TpuVersionToExternalName (see the codename matrix). The newest generation, 6acc60406 (external TPU7x), ships no accuracy table: its embed:// lookup fails and the impl-type selector falls through to a warning-driven fallback. (Trillium/Ironwood are marketing names with zero occurrences in the binary — 6acc60406 is the only internal name for this generation.) The first three generations carry byte-identical content (same 181,284-byte blob modulo the version stamp); viperfish and ghostlite each tighten the fast-path bounds by one to four orders of magnitude, in lockstep with the per-gen transcendental cycle-cost improvements.

The decode is reimplementation-grade: all five blobs were carved byte-exact from .lrodata and their md5 verified against the FileWrapper fingerprints, the proto schema was recovered from message typeinfo plus serializer wire tags, and the four-function consumer chain (GetTargetAccuracy → GetAccuracyEntry → CompareAccuracy → GetImplType) was disassembled.

Blobs5 × accuracy_table_<codename>.binarypb in .lrodata, md5-verified vs FileWrapper fp
Formatraw uncompressed serialized protobuf (xla.service.jellyfish.accuracy.AccuracyTable)
Entries / table19 (15 distinct op names; 6 ops carry a fast+accurate variant pair)
Gens with tablejellyfish, dragonfish, pufferfish, viperfish, ghostlite — not 6acc60406
Version stampproto field 3 = tpu::TpuVersion + 1 (1..5)
Consumerxla::jellyfish::{GetTargetAccuracy, GetAccuracyEntry, CompareAccuracy, GetImplType}
Source fileplatforms/xla/service/jellyfish/select_accuracy.cc
ConfidenceCONFIRMED unless a cell is annotated otherwise

Per-Gen Blob Mapping

Each blob is keyed by codename in its embed:// URI and stamped with proto field 3 = TpuVersion + 1. Two version numberings coexist and must not be conflated: the runtime tpu::TpuVersion (0-based, read from Target+0x398) and the proto-internal field-3 stamp (1-based).

codenametpu::TpuVersionproto f3blob filedata_va / offsize (B)md5 (== FileWrapper fp) · Conf
jellyfish01accuracy_table_jellyfish.binarypb0x040c8990181,284e3a2a768…d4c42cf4
dragonfish12accuracy_table_dragonfish.binarypb0x040703d0181,2844d2efd10…748bc019
pufferfish23accuracy_table_pufferfish.binarypb0x040f4dc0181,284e8dbf7b1…a185d4c4
viperfish34accuracy_table_viperfish.binarypb0x041211f0181,139354963c9…3746b900
ghostlite45accuracy_table_ghostlite.binarypb0x0409c800180,6178acb76b8…fb433612
6acc604065— (none)no table — fallback

The embed:// URI is assembled as embed://accuracy_table_<codename>_memfile/accuracy_table_<codename>.binarypb from the path pieces "embed://accuracy_table_" + codename + "_memfile/" + "accuracy_table_" + codename + ".binarypb"; the codename comes from tpu::TpuVersionToString(version) (string table @ 0x22011bf0, 0→jellyfish … 5→6acc60406, FATAL above 5).

GOTCHA — jellyfish, dragonfish, and pufferfish are byte-identical modulo the field-3 stamp (same 181,284 B, same per-op values, same 1,187-row exp curve) — their XLU/EUP units are the same design. A reimplementer can ship one table for v2/v3/v4 and only needs distinct data for viperfish and ghostlite.


The Proto Schema

message AccuracyTable {                  // the blob
  repeated AccuracyEntry entries = 1;     // tag 0x0a; 19 per table
}
message AccuracyEntry {
  string name           = 1;  // 0x0a  "cosine","exponential",…
  PrimitiveType type    = 2;  // 0x10  always 11 == F32
  int32  version        = 3;  // 0x18  TpuVersion + 1
  int32  variant        = 4;  // 0x20  0 = fast/NoEup, 1 = EUP/accurate
  double max_input      = 5;  // 0x29  domain upper bound
  double worst_rel_err  = 6;  // 0x31  worst-case relative error over domain
  double worst_ulp      = 7;  // 0x39  worst-case ULP over domain
  repeated ErrorInfo curve = 8; // 0x42 piecewise input→error breakpoints
}
message AccuracyEntry.ErrorInfo {
  double input   = 1;  // 0x09  input-magnitude breakpoint
  double rel_err = 2;  // 0x11  relative error guaranteed at/below
  double ulp     = 3;  // 0x19  ULP bound at/below
}

type (field 2) is uniformly 11 = F32 on every entry: the XLU computes in F32 internally regardless of bf16 I/O, so one entry governs the bf16-cast variant too. The curve (field 8) is monotone — as |input| grows away from 0/subnormal, the approximation's relative error shrinks; worst_rel_err/worst_ulp are the values at the smallest-input breakpoint.

The 19 entries (15 op names) are: trig cosine, sine, tan; exp family exponential (×2), exponential-minus-one (×2); log family log (×2), log-plus-one (×2); roots/division rsqrt, sqrt, divide; special/activation erf, logistic (×2), tanh (×2). The six dual-variant ops carry both a variant-0 (fast / "NoEup" XLU) and a variant-1 (high-precision "EUP") entry. These match the StableHLO ops that carry a result_accuracy attribute (vhlo.cosine_v2, vhlo.exponential_v2, …). There is no cbrt entry (binary string: "Accuracy analysis for cbrt is not implemented yet.").


Cross-Gen Accuracy Constants

Top-level worst-case relative error (field 6) per (op, variant) — lower = more accurate. j/d/p = jellyfish/dragonfish/pufferfish (identical). Confidence: every value below is CONFIRMED — read byte-exact from the field-6/field-7 doubles of the carved blobs (the data_vas in Per-Gen Blob Mapping); the per-cell source is the proto, not an inference.

opvarjf/df/pfviperfishghostlite
cosine02.38e-072.38e-072.38e-07
sine02.38e-072.38e-072.38e-07
tan04.77e-074.77e-074.77e-07
divide01.01.01.0
rsqrt02.38e-072.38e-071.19e-07
sqrt02.38e-072.38e-072.38e-07
erf04.77e-079.54e-072.38e-07
exponential0 / 11.0 / 1.01.0 / 1.01.0 / 1.0
exp-minus-one0 / 12.44e-04 / 1.02.44e-04 / 1.01.0 / 1.0
log0 / 14.88e-04 / 4.77e-073.82e-06 / 4.77e-072.38e-07 / 4.77e-07
log-plus-one0 / 14.88e-04 / 1.19e-072.44e-04 / 1.19e-072.44e-04 / 1.19e-07
logistic0 / 14.0 / 1.02.0 / 1.01.0 / 1.0
tanh0 / 11.22e-04 / 4.77e-077.63e-06 / 4.77e-071.19e-07 / 4.77e-07

Top-level worst-case ULP (field 7), same layout:

opvarjf/df/pfviperfishghostlite
cosine0444
sine0444
tan0888
divide0256641
rsqrt0221
sqrt0442
erf0881
exponential0 / 1inf / 1inf / 1inf / 1
exp-minus-one0 / 12048 / 1.68e72048 / 1.68e71.68e7 / 1.68e7
log0 / 14096 / 464 / 42 / 4
log-plus-one0 / 14096 / 14096 / 14096 / 1
logistic0 / 16.71e7 / 2563.36e7 / 1281.68e7 / inf
tanh0 / 12048 / 8128 / 81 / 8

NOTE — exponential variant-0 worst-ULP stays inf on every generation because the smallest breakpoint sits at a subnormal input where the relative error is 1.0 (100%). The fast exp is only usable above the subnormal range; the worst_* figures are deliberately the worst-case (smallest-input) values, not the typical accuracy.


Piecewise Error Curves (representative)

The field-8 curve gives per-input-magnitude accuracy. Selected jellyfish (= df = pf) fast-path (variant 0) transitions, by input threshold:

divide:      input>=0         rel=1.0       ulp=256    (subnormal: 100% error)
             input>=2.35e-38  rel=1.53e-05  ulp=128
             input>=4.70e-38  rel=1.19e-07  ulp=1      (normal range: near-exact)
exponential: input>=0         rel=1.0       ulp=inf
             input>=8.97e-44  rel=1.0       ulp=128
             input>=2.35e-38  rel=1.53e-05  ulp=128    (caps at ~bf16 precision)
logistic:    input>=0         rel=4.0       ulp=6.71e+07
             input>=9.54e-07  rel=2.0       ulp=3.36e+07
             input>=3.82e-06  rel=1.0       ulp=1.68e+07
             input>=3.05e-05  rel=4.88e-04  ulp=8192
tan:         input>=0..32 over 7 segments: rel 4.77e-07→5.96e-08, ulp 8→1

Ghostlite (best gen) tightens these dramatically, e.g. divide input>=2.35e-38 → rel=1.19e-07 ulp=1 (vs 1.53e-05/128 on jf); log input>=1.53e-05 → rel=2.38e-07 ulp=1 (vs 4.88e-04/4096); logistic input>=1.19e-07 → rel=1.19e-07 ulp=1 (vs 4.0/6.7e7).


Consumer Chain — the precision decision

Four functions in select_accuracy.cc turn the table into an implementation choice:

  1. GetTargetAccuracy(HloOpcode, Target&, bool variant) @ 0x1d60dea0 — the loader. Reads Target+0x398 = tpu_version(), builds the embed:// URI via TpuVersionToString, calls tsl::ReadBinaryProto to parse the blob, then GetAccuracyEntry. Returns StatusOr<AccuracyEntry>; a failed blob read produces a Status via CreateStatusAndConditionallyLog at select_accuracy.cc:143 (the table-missing path taken for 6acc60406).
  2. GetAccuracyEntry(string_view name, AccuracyTable&, bool variant) @ 0x1d60e040 — linear scan over table.entries, matching (entry.name == name) (via bcmp) and the variant byte at entry+80 (== variant). On no match it builds an error Status (MakeErrorImpl<5>) with "No accuracy entry found for opcode: " (select_accuracy.cc:129).
  3. CompareAccuracy(ResultAccuracy&, StatusOr<Entry> fast, StatusOr<Entry> accurate, HloOpcode) @ 0x1d60d7a0 — the tolerance check. Reads ResultAccuracy.mode (a2+28); when mode is TOLERANCE it loads the tolerance struct (a2+16, fields at +0x18/+0x20) and vucomisd-compares it against the entry's top-level bound (entry+0x40); it short-circuits to the fast (variant-0) entry if that passes, else the accurate (variant-1) entry. If neither status is OK it emits "Only accurate implementation is available." (:33) or "Only EUP implementation is available." (:35); if no variant meets the request it returns an error Status (MakeErrorImpl<5>, :44) reading "No implementation found for requested accuracy: … for opcode: <op>".
  4. GetImplType(Target&, ResultAccuracy&, HloOpcode) @ 0x1d60db80 — the master decision. Loads both variants (GetTargetAccuracy(...,0) and (...,1)) and branches on ResultAccuracy.mode (a3+28): DEFAULT returns the available impl (warning if the preferred one is absent); HIGHEST forces the EUP path; TOLERANCE (mode == 2) calls CompareAccuracy. An unhandled mode does not abort — it returns an error Status (MakeErrorImpl<3>, :79) reading "Unsupported result accuracy for opcode <op>: <message>".
missing-variant fallback inside GetImplType (also the whole-table-missing case
for 6acc60406, whose embed:// lookup returns NOT_FOUND for both variants):
  EUP entry missing      → WARN "No EUP implementation is available, returning accurate implementation."   (select_accuracy.cc:69)
  accurate entry missing → WARN "No accurate implementation is available, returning default implementation." (select_accuracy.cc:61)

NOTE — CompareAccuracy (0x1d60d7a0) performs direct vucomisd comparisons of the ResultAccuracy tolerance fields against the entry's top-level bound (entry+0x40); there is no ulps × 2^-23 relative-error conversion and no mulsd-by-constant in any of the four consumer bodies. GetImplType returns an error Status for an unhandled mode, not a fatal abort. The decompile-confirmed decision logic (CONFIRMED): both GetTargetAccuracy calls; the DEFAULT / HIGHEST / TOLERANCE (mode == 2) dispatch; the two missing-variant warnings at :69/:61; and the "Unsupported result accuracy" error return at :79.

The optional<ResultAccuracy> originates from the HLO instruction's result_accuracy attribute and threads down into the LloRegionBuilder transcendental emitters (e.g. Sln @ 0x1d534520 computes log(x) = Vlog2(F32, x, accuracy) * ln2), where GetImplType chooses EUP vs the *NoEupF32 software expansion; DecomposeEupInstruction @ 0x126a0340 lowers the chosen EUP op.


Accuracy vs Auto-Cast (different layers)

The accuracy table chooses the transcendental approximation algorithm (fast XLU/EUP vs accurate software polynomial); it never changes the storage dtype (type is always F32). Storage-dtype and matmul-accumulation precision are a separate layer — AutoMixedPrecision, PrecisionConfig, MayIncreaseBF16AllReduceAccumulationAccuracy @ 0x127a22c0. Both feed the cost model: a fast approx that passes CompareAccuracy keeps the op in the cheap XLU path, which is why per-gen accuracy and per-gen transcendental cycle cost improve together (sin/cos 198→154→142, tan 219→170→151 across jf→viperfish→ghostlite). The compiled_with_accuracy_setting flag @ 0x22581ff0 (FEATURE_KIND_RESULTS_ACCURACY_SETTING) records whether the executable was compiled with explicit accuracy settings, for cache-key/replay.


NameRelationship
tpu::TpuVersionToString @ 0x20b3a480resolves codename for the embed:// URI
tsl::ReadBinaryProto @ 0x20d02060parses the embedded blob (re-parsed per call)
LloRegionBuilder::Sln / Vlog2 / *NoEupF32the emitters GetImplType routes between
DecomposeEupInstruction @ 0x126a0340lowers the chosen EUP transcendental

Cross-References