Bias-Add & Quant/Dequant Helpers
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped). The .text and .rodata VMAs equal their file offsets; the .data.rel.ro VMA minus 0x200000 equals its file offset. Other libtpu builds will differ.
Abstract
These are the numerics helpers that bracket the systolic-array matmul/conv on the TensorCore — the layer between the raw int32/f32 accumulator coming out of the MXU and the value finally stored. Concretely: the bias-add, the activation clamp, and the quantize/dequantize sequences (scale application, narrow-float and integer converts, the saturating clamp that closes the round-trip). A reimplementer hunting for an EmitBias or a QuantizeOp will not find one, and that absence is the first finding.
There is no dedicated bias op. A "matmul + bias" HLO is an output fusion whose root is the conv and whose body is an ordinary elementwise add(conv, broadcast(bias)), emitted by the generic fusion vectorizer per output chunk inside MatrixMultiplyAccumulateFunctor::DoOutputFusion. The bias is broadcast to the tile shape (HandleBroadcast), added in f32 (HandleAdd → VaddF32), the activation runs (HandleMaximum/HandleClamp), and the result is requantized (HandleConvert) and stored. The accumulator is keyed on dtype: int8 matmuls accumulate in int32 (VaddS32) and are dequantized to f32 before the float epilogue; bf16/fp32 matmuls accumulate in f32 (VaddF32), with two dedicated mixed-precision adds (0x11e/0x11f) folding bf16 high/low passes into the f32 accumulator.
The quant/dequant leaves are single LLO vector ops built by LloRegionBuilder::Vcvt*/Vpack*/Vunpack*. Round-to-nearest narrow-float quantize goes through the EXMY converts (0x6e/0x6f) or the bf16→fp8 packs; stochastic rounding goes through four VcvtSrF32To* binops (0x70-0x73) whose second operand is the per-lane random dither; int8 quantize is round → VclampSymmetric → VpackcB8. On the device fast path, quantization is symmetric (zero-point implicitly 0); the asymmetric / zero-point model lives only in the host-side MLIR quant dialect. The cleanest in-binary reference implementation of symmetric absmax 8-bit quantization is the collective pincer_utils family, which is documented here op-by-op.
For reimplementation, the contract is:
- Bias-add is fusion, not an op: the AccumulateOutputChunk → DoOutputFusion spine and the VectorizingEmitter Handle* dispatch (HandleBroadcast = bias splat, HandleAdd = bias add).
- The accumulator dtype split: int32 VaddS32 (int8 matmul) vs f32 VaddF32 (bf16/fp32), plus the bf16-in-f32 mixed adds 0x11e/0x11f.
- The convert leaves: EXMY narrow-float (0x6e/0x6f), the four stochastic-round binops (0x70-0x73), VcvtF32ToS32 round-direction, and the lane pack/unpack helpers.
- The saturate layer: VclampSymmetric* (native + min/max fallback 0x15f) vs VclampAsymmetric* (binop 0x4c + min/max), and VclampGez* (ReLU).
- The symmetric absmax quantizer (pincer_utils): absmax → scale (qmax/absmax) → quantize → 8-bit wire → f32 requant-reduce → dequant, with the three per-dtype max constants.
| Topic | Key symbols and opcodes |
|---|---|
| Bias-add path | MatrixMultiplyAccumulateFunctor::DoOutputFusion (0x13124360); HandleBroadcast (0x136f1d60) + HandleAdd (0x13e22e80) |
| Accumulator | int8 → int32 VaddS32 (0x1d51e9c0); bf16/fp32 → f32 VaddF32 (0x1d525160); mixed 0x11e/0x11f |
| RNE narrow-float | VcvtEXMYToE4M3/E5M2 (0x6e/0x6f); VpackCBf16ToE5M2/E4M3Fn (0x126) |
| Stochastic round | VcvtSrF32To{E5M2,E4M3,If8,Bf16} (0x70-0x73), binop, rng-bits 2nd operand |
| f32→int32 | VcvtF32ToS32 (0x1d524760): RoundNearest→VroundF32, else VtruncF32, then 0x5e |
| Clamp | VclampSymmetricF32/Bf16 (native + 0x15f fallback); VclampAsymmetric* (0x4c); VclampGez* (ReLU) |
| Collective quantizer | pincer_utils: UpdateScale (0x137b75c0), GetUnpackFunction (0x137b8720); types {S8, F8E5M2, F8E4M3B11FNUZ} |
Bias-Add Is the Output-Fusion Epilogue
Purpose
A reimplementer needs to know that bias, activation, and output-requantize are not separate passes over a materialized matmul output. They are emitted inline per output chunk, reading the matmul result directly out of the MXU result registers — the whole point of "output fusion".
Algorithm
The post-canonicalization "matmul + bias + activation" HLO is an output fusion:
    fusion {
      conv  = convolution(activations, weights)      // MXU
      bcast = broadcast(bias)                         // bias to output tile shape
      add   = add(conv, bcast)                        // bias add
      [opt] = maximum(add, 0) / clamp(...) / tanh     // activation
      [opt] = convert(...) -> out_dtype               // (re)quantize
    }
lowered by the per-chunk MXU emission spine:
    MatrixMultiplyAccumulateFunctor::operator()        // 0x1310cd80
    -> AccumOutputChunkAndPossiblyDoOutputFusion       // 0x13124100
    -> AccumulateOutputChunk                           // 0x13123a80 : matres -> accumulator
    -> RegisterAccumOutput                             // 0x131196a0
    -> PossiblyDoOutputFusion                          // 0x13122d80
    -> DoOutputFusion                                  // 0x13124360
DoOutputFusion (0x13124360) branches on ShapeUtil::ElementIsIntegral/ElementIsFloating(out), dequantizes an int32 accumulator with VcvtS32ToF32 before the float epilogue (so bias-add and activation run in f32 even for int8 matmuls), then walks the fusion body op-by-op through the VectorizingEmitter Handle* dispatch, and finally emits VreducePrecision (bf16 rounding, unless AllowExcessPrecision) followed by Vst.
The elementwise op chain is JIT-emitted by the handlers:
| Handler | Address | Role |
|---|---|---|
| HandleBroadcast | 0x136f1d60 | Bias broadcast (splat to tile) |
| HandleAdd | 0x13e22e80 | Bias add (VaddF32/VaddS32, dtype-keyed) |
| HandleConvert | 0x13e22040 | In-fusion (re)quant/dequant |
| HandleMaximum | 0x13e21f80 | ReLU / clip-low |
| HandleClamp | 0x13e21f20 | Clamp both sides |
| HandleMultiply | 0x13e228c0 | Scale multiply |
| HandleReducePrecision | 0x13e23a40 | bf16 round-to-nearest-even |
| HandleStochasticConvert | 0x13e22520 | HLO kStochasticConvert |
HandleBroadcast reads the broadcast operand, splats the bias across the output tile (PackOrMakeTuple), and materializes it via OpaqueCopy. The bias operand enters through FusionEmitter::AddInput (0x136d5f20).
NOTE: there is no BiasAddEmitter / EmitBias. Bias-add is the generic fusion vectorizer emitting Add(value, Broadcast(bias)). Treat "bias" as an ordinary fused input edge, not a special op.
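The fused epilogue described above can be sketched numerically in plain Python. This is only an illustration of the dataflow add(conv, broadcast(bias)) followed by an optional ReLU on one output chunk; the function name `fused_epilogue` and the list-of-rows tile layout are hypothetical and do not correspond to any symbol in the binary.

```python
from typing import List

def fused_epilogue(conv_chunk: List[List[float]],
                   bias: List[float],
                   relu: bool = True) -> List[List[float]]:
    """Apply add(conv, broadcast(bias)) and an optional ReLU to one chunk.

    conv_chunk is a rows x cols tile of accumulator values already in f32;
    bias has one entry per column (hypothetical layout, for illustration).
    """
    out = []
    for row in conv_chunk:
        # broadcast(bias): the same bias vector is reused for every row
        added = [acc + b for acc, b in zip(row, bias)]
        if relu:
            # maximum(add, 0): the activation runs inline, no second pass
            added = [max(v, 0.0) for v in added]
        out.append(added)
    return out

chunk = [[1.0, -2.0, 3.0], [-4.0, 5.0, -6.0]]
bias = [0.5, 0.5, -10.0]
print(fused_epilogue(chunk, bias))
```

The point of the sketch is that bias is just another input edge of the elementwise chain: nothing is materialized between the accumulate and the activation.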
Accumulator Dtype and bf16-in-f32 Accumulation
Purpose
The accumulate op is selected by the output element type, and the high-accuracy bf16 matmul modes need dedicated mixed-precision adds. A reimplementer must pick the right accumulate op per dtype or silently lose precision.
Algorithm
AccumulateOutputChunk (0x13123a80) selects:
- int8 / int4 matmul → int32 accumulator → VaddS32 (0x1d51e9c0)
- bf16 / fp32 matmul → f32 accumulator → VaddF32 (0x1d525160)
The bf16-in-f32 multi-pass accumulation (the bf16x2/bf16x3 high-accuracy modes) uses two dedicated mixed adds that fold a bf16 result half into an f32 accumulator:
| Builder method | Address | LLO opcode | Role |
|---|---|---|---|
| VaddF32MixedBF16HighInst | 0x1d55dd40 | 0x11e | Accumulate bf16 HIGH pass into f32 |
| VaddF32MixedBF16LowInst | 0x1d55dd80 | 0x11f | Accumulate bf16 LOW pass into f32 |
| VaddF32 | 0x1d525160 | (binop) | Plain f32 accumulate |
| VaddS32 | 0x1d51e9c0 | (binop) | Plain int32 accumulate |
The decompile confirms that VaddF32MixedBF16HighInst calls CreateVectorBinop(0x11E, …). On the floating-point paths the accumulate is always f32-wide (the systolic array multiplies bf16 but partial products accumulate in an f32 register file), which is why fp32 and bf16 share the same per-matmul MXU latency.
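To see why a second bf16 pass helps, here is a minimal numerical sketch of a high/low split. It assumes the common scheme where an f32 operand is split into a bf16 high part plus a bf16 residual and both passes are summed into a wide accumulator; how the hardware actually derives the two halves is not stated on this page, and every helper name below is hypothetical. Truncation is used instead of rounding to keep the code short.

```python
import struct

def as_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16_trunc(x: float) -> float:
    """Keep only the top 16 bits of the f32 pattern (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split_hi_lo(x: float):
    """Split an f32 value into a bf16 high part and a bf16 residual."""
    x = as_f32(x)
    hi = to_bf16_trunc(x)
    lo = to_bf16_trunc(x - hi)   # the residual fits in f32 exactly
    return hi, lo

def dot_with_passes(xs, ws, passes: int) -> float:
    """Dot product where each x is fed as one or two bf16 passes.

    acc stands in for the f32 accumulator; Python floats are f64, so the
    rounding of the accumulator itself is not modelled here.
    """
    acc = 0.0
    for x, w in zip(xs, ws):
        hi, lo = split_hi_lo(x)
        acc += hi * w            # HIGH pass
        if passes >= 2:
            acc += lo * w        # LOW pass
    return acc

xs = [1.2345678, -0.3333333, 2.7182818]
ws = [1.0, 2.0, 0.5]             # chosen to be exactly representable in bf16
exact = sum(as_f32(x) * w for x, w in zip(xs, ws))
print("one-pass error:", abs(dot_with_passes(xs, ws, 1) - exact))
print("two-pass error:", abs(dot_with_passes(xs, ws, 2) - exact))
```

The two-pass result recovers most of the mantissa bits that a single bf16 pass discards, which is the motivation for having separate HIGH and LOW accumulate ops.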
The Convert Leaves: RNE and Stochastic Rounding
Purpose
Every fp8/int8 quantize and the inverse dequant is a single LLO op. A reimplementer needs the opcode and the creator for each so the conversion is one instruction, not an emulated sequence (on capable targets).
Encoding — round-to-nearest narrow float
    function VcvtEXMYToE4M3(in, exp_bits):                  // 0x1d52b440
        assert target().SupportsFloat8EXMY()
        inst = CreateVectorEXMYConversion(110 /*0x6e*/, …)
        return region->AppendInstruction(inst)
| Method | Address | LLO opcode | Creator |
|---|---|---|---|
| VcvtEXMYToE4M3 | 0x1d52b440 | 0x6e | CreateVectorEXMYConversion (gate SupportsFloat8EXMY) |
| VcvtEXMYToE5M2 | 0x1d52b520 | 0x6f | CreateVectorEXMYConversion |
| VpackCBf16ToE5M2 | 0x1d567a40 | 0x126 | CreateVectorPack (VpackFormat) |
| VpackCBf16ToE4M3Fn | 0x1d567b20 | 0x126 | CreateVectorPack (VpackFormat) |
| VcvtF32ToNarrowFloat | 0x1d560960 | (composite) | VimmIf+VabsF32+SimplifyPackF16 (software RNE) |
Encoding — stochastic rounding
Each Sr convert is one binop whose second operand is the per-lane random dither, so the convert rounds up with probability equal to the fractional part:
    function VcvtSrF32ToE5M2(src_f32, rng_bits):            // 0x1d52b600
        inst = CreateVectorBinop(0x70, src_f32, rng_bits, region)
        return region->AppendInstruction(inst)
| Method | Address | LLO opcode |
|---|---|---|
| VcvtSrF32ToE5M2 | 0x1d52b600 | 0x70 |
| VcvtSrF32ToE4M3 | 0x1d52b640 | 0x71 |
| VcvtSrF32ToIf8 | 0x1d52b680 | 0x72 |
| VcvtSrF32ToBf16 | 0x1d52b6c0 | 0x73 |
| VcvtF32ToS32Stochastic | 0x1d52ade0 | 0x5e (fast) / 0x15c (sw fallback) |
The decompile confirms VcvtSrF32ToE5M2 → CreateVectorBinop(0x70u, …) and VcvtSrF32ToE4M3 → CreateVectorBinop(0x71u, …). There is no software emulation for the fp8/bf16 Sr converts on capable targets — they are one op each. The HLO kStochasticConvert (HandleStochasticConvert 0x13e22520) dispatches to this family for fp8/bf16 outputs and to VcvtF32ToS32Stochastic for integer outputs.
Per-gen stochastic support: Target::SupportsVectorConvertF32Stochastic::kSupportedTypes = {10, 16, 19, 23} = {F16, BF16, F8E5M2, F8E4M3B11FNUZ} on both Ghostlite (0xb530678) and Viperfish (0xb530784). F8E4M3Fn (20) is not in the Sr-supported set — only the B11FNUZ fp8 variant has native f32→fp8 stochastic rounding.
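A stochastic-round convert that takes the random dither as a second operand can be modelled as "add the random bits to the discarded mantissa bits, then truncate". The sketch below does this for f32 → bf16. It is an assumption that the hardware op behaves this way; the page only establishes that the op is a binop with an rng-bits operand. The name `sr_f32_to_bf16` is hypothetical, and NaN/Inf and overflow handling are deliberately left out.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to f32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """f32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def sr_f32_to_bf16(x: float, rng_bits: int) -> float:
    """Stochastically round an f32 to bf16 using 16 random dither bits.

    Adding the dither to the 16 bits about to be dropped produces a carry
    into the kept bits with probability (dropped fraction) / 2**16, so the
    magnitude rounds up with probability equal to the fractional part.
    """
    bits = f32_bits(x) + (rng_bits & 0xFFFF)
    return bits_f32(bits & 0xFFFF0000)

# 1 + 2**-9 sits a quarter of the way from 1.0 to the next bf16 (1.0078125).
x = 1.0 + 2.0 ** -9
ups = sum(1 for r in range(1 << 16) if sr_f32_to_bf16(x, r) > 1.0)
print("round-up fraction:", ups / (1 << 16))
```

Sweeping all 65536 dither values instead of sampling makes the round-up probability exact, which is a convenient way to check an implementation without a random seed.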
Encoding — f32 → int32 round direction
VcvtF32ToS32(value, RoundDirection) (0x1d524760): RoundNearest (1) prepends VroundF32, otherwise (TowardsZero) prepends VtruncF32, then emits opcode 0x5e. Named pseudo-variants: VcvtF32ToS32TiesToEven (0x1d52ab00), ...TowardsZeroPseudo (0x1d52ad20), ...KnownIntegral (0x1d52ad80), ...Stochastic (0x1d52ade0).
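The two-step shape of this convert (pre-round in float, then convert an already-integral value) can be sketched as follows. The sketch assumes VroundF32 is round-to-nearest with ties to even, which Python's built-in `round` also implements; the `TOWARDS_ZERO` ordinal is hypothetical because only RoundNearest (1) is given in the text, and saturation on overflow is not modelled.

```python
import math
from enum import Enum

class RoundDirection(Enum):
    ROUND_NEAREST = 1   # ordinal taken from the text
    TOWARDS_ZERO = 0    # hypothetical ordinal, not confirmed by the text

def cvt_f32_to_s32(value: float, direction: RoundDirection) -> int:
    """Model of: pre-round in float, then convert the integral float."""
    if direction is RoundDirection.ROUND_NEAREST:
        pre = float(round(value))        # stands in for VroundF32 (ties to even)
    else:
        pre = float(math.trunc(value))   # stands in for VtruncF32
    return int(pre)                      # stands in for opcode 0x5e

for v in (2.5, 3.5, -2.5, 2.7, -2.7):
    print(v,
          cvt_f32_to_s32(v, RoundDirection.ROUND_NEAREST),
          cvt_f32_to_s32(v, RoundDirection.TOWARDS_ZERO))
```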
The Saturating Clamp
Purpose
The quantize round-trip and the activation close with a saturating clamp into the destination range. The symmetric vs asymmetric split is the device-symmetric vs host-zero-point distinction surfaced as two builder families.
Encoding
    function VclampSymmetricF32(value, bound):              // 0x1d527fc0
        if target supports native clamp:
            inst = CreateVectorClampSymmetric(value, bound, …)
        else:                                               // fallback
            v    = CreateVectorMinimumF32(value, +bound)    // opcode 0x15f path
            inst = CreateVectorMaximumF32(v, -bound)
        return region->AppendInstruction(inst)
| Method | Address | Body |
|---|---|---|
| VclampSymmetricF32 | 0x1d527fc0 | CreateVectorClampSymmetric + min/max (0x15f) fallback |
| VclampSymmetricBf16 | 0x1d5283a0 | same |
| VclampAsymmetricF32 | 0x1d528480 | binop 0x4c + Minimum/Maximum (lo ≠ −hi) |
| VclampAsymmetricBF16 | 0x1d528740 | same |
| VclampGezF32 / Bf16 | 0x1d527ea0 / 0x1d527ee0 | clamp ≥ 0 (ReLU) |
| VclampS32 / U32 / Float | 0x1d528820 / 0x1d528c40 / 0x1d528fe0 | integer / generic clamp |
The decompile of VclampSymmetricF32 shows both the native CreateVectorClampSymmetric arm and the CreateVectorMinimumF32/CreateVectorMaximumF32 fallback with opcode 0x15f. Symmetric saturates to [-bound, +bound] (zero-point 0) — the int8/fp8 device saturate. Asymmetric saturates to [lo, hi] with lo ≠ -hi — the with-zero-point / unsigned host path.
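The three clamp shapes reduce to min/max compositions. The sketch below mirrors the fallback arm of the symmetric clamp (minimum against +bound, then maximum against -bound) and gives the asymmetric and ReLU variants beside it for comparison. The function names are hypothetical, and NaN propagation is not modelled.

```python
def clamp_symmetric(value: float, bound: float) -> float:
    """Saturate to [-bound, +bound], as in the min/max fallback arm."""
    v = min(value, bound)      # minimum against +bound
    return max(v, -bound)      # maximum against -bound

def clamp_asymmetric(value: float, lo: float, hi: float) -> float:
    """Saturate to [lo, hi] where lo is not required to equal -hi."""
    return max(min(value, hi), lo)

def clamp_gez(value: float) -> float:
    """Clamp to >= 0, i.e. ReLU."""
    return max(value, 0.0)

print(clamp_symmetric(200.0, 127.0), clamp_symmetric(-200.0, 127.0))
print(clamp_asymmetric(300.0, 0.0, 255.0), clamp_asymmetric(-5.0, 0.0, 255.0))
print(clamp_gez(-3.0), clamp_gez(3.0))
```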
VreducePrecision (0x1d5697c0, called from DoOutputFusion and HandleReducePrecision) is the bit-manipulation round-to-nearest-even that truncates an f32 mantissa to bf16: add the rounding bias to the mantissa, mask the low bits, fix the tie-to-even and the NaN/Inf exponent. The LLO op is llo.vround.nearest_even.f32 (VroundF32 0x1d525480).
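The bit-manipulation rounding described above matches the widely used f32 → bf16 round-to-nearest-even trick. The sketch below is that standard technique, not a transcription of the decompiled routine: the constants 0x7FFF and 0xFFFF0000 and the NaN-quieting step are the textbook choices and are assumptions as far as this binary is concerned.

```python
import math
import struct

def reduce_precision_to_bf16(x: float) -> float:
    """Round an f32 value to bf16 precision (result returned as f32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits >> 23) & 0xFF == 0xFF:
        # Inf passes through; a NaN is kept a NaN by forcing a quiet bit
        # inside the 7 mantissa bits that survive the truncation.
        if bits & 0x007FFFFF:
            bits |= 0x00400000
    else:
        lsb = (bits >> 16) & 1                    # lowest kept mantissa bit
        bits = (bits + 0x7FFF + lsb) & 0xFFFFFFFF  # rounding bias, ties to even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(reduce_precision_to_bf16(1.0 + 2.0 ** -8))       # tie, rounds to even
print(reduce_precision_to_bf16(1.0 + 3 * 2.0 ** -8))   # tie, rounds to even
print(reduce_precision_to_bf16(float("inf")))
```

Adding 0x7FFF plus the lowest kept bit makes an exact tie carry only when the kept mantissa is odd, which is what produces ties-to-even without a branch.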
The Collective Symmetric Quantizer (pincer_utils)
Purpose
The quantized all-reduce quantizes each shard to 8 bits before the wire write, reduces in 8-bit-on-wire with a running scale, and dequantizes at the end. These xla::jellyfish::pincer_utils helpers are the cleanest in-binary reference for symmetric absmax 8-bit quantization, and a reimplementer can lift the scale formula directly.
Algorithm
    function UpdateScale(dtype, builder, max_abs_addr, scale_addr):   // 0x137b75c0
        switch dtype:
            case S8(0x2):              dtype_max = 127.0      // dword_84A2A28
            case F8E4M3B11FNUZ(0x17):  dtype_max = 30.0       // dword_84A27FC
            case F8E5M2(0x13):         dtype_max = 57344.0    // dword_84A2530
            default: LogFatal("quantized_type_bound != nullptr") + PrimitiveType_Name(dtype)
        m       = VimmF32(dtype_max)
        max_abs = Vld(max_abs_addr)       // expected in VMEM (checked)
        scale   = VdivF32(m, max_abs)     // scale_factor = qmax / absmax
        Vst(scale_addr, scale)
The decompile fixes the dtype switch and the three .rodata floats exactly, and — importantly — fixes the division operand order as VdivF32(dtype_max, max_abs), i.e. the stored scale factor is qmax / absmax (the quantize multiplier). Quantize is then q = round(x * scale_factor); dequantize is x = q / scale_factor. The per-dtype max constants:
| .rodata address | Value | dtype | PrimitiveType |
|---|---|---|---|
| 0x84a2a28 | 127.0 | int8 max | 0x2 |
| 0x84a27fc | 30.0 | F8E4M3B11FNUZ max finite | 0x17 |
| 0x84a2530 | 57344.0 | F8E5M2 max finite | 0x13 |
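The scale formula and the quantize/dequantize pair can be checked with a few lines of Python. The sketch uses the three per-dtype maxima listed above and the qmax / absmax operand order; the function names, the string dtype keys, and the int8 saturation inside `quantize_s8` are illustrative assumptions. A zero absmax is not handled because the page does not say how the device treats it.

```python
DTYPE_MAX = {              # values from the .rodata constants above
    "S8": 127.0,
    "F8E4M3B11FNUZ": 30.0,
    "F8E5M2": 57344.0,
}

def update_scale(dtype: str, max_abs: float) -> float:
    """scale_factor = qmax / absmax (the quantize multiplier)."""
    return DTYPE_MAX[dtype] / max_abs   # unknown dtype raises KeyError

def quantize_s8(xs, scale):
    """q = round(x * scale), saturated to the symmetric int8 range."""
    return [max(-127, min(127, round(x * scale))) for x in xs]

def dequantize_s8(qs, scale):
    """x = q / scale."""
    return [q / scale for q in qs]

xs = [0.5, -2.0, 1.0, 0.25]
scale = update_scale("S8", max(abs(x) for x in xs))   # 127 / 2.0 = 63.5
qs = quantize_s8(xs, scale)
print(scale, qs)
print(dequantize_s8(qs, scale))
```

Because the stored value is the multiplier, the per-element quantize is a single multiply and the reconstruction error is bounded by half a step, 0.5 / scale.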
The five-stage round trip, plus the unpack lookup (GetUnpackFunction) it relies on:
| Helper | Address | Role |
|---|---|---|
| UpdateMaxLocalChunk | 0x137b73a0 | Running absmax: Vld → VunpackCF32 (if bf16) → VabsF32 → VmaxF32 → Vst |
| UpdateScale | 0x137b75c0 | absmax → scale (qmax/absmax); 3 dtype-max consts |
| SymmetricallyQuantizeShardInPlaceTo8Bits | 0x137b7740 | round(x*scale) → VpackcB16 → VpackcB8 (×4 unrolled) |
| SymmetricallyDequantizeShardInPlace8Bit | 0x137b7fc0 | unpack → VcvtS32ToF32 → /scale → VpackCBf16 |
| ReduceSymmetricallyQuantized8BitShardInPlace | 0x137b8880 | f32 requant-reduce (dequant → merge → re-track absmax) |
| GetUnpackFunction | 0x137b8720 | dtype → unpack fn pointer |
GetUnpackFunction (0x137b8720) maps S8(0x2)→VunpackCS8, F8E5M2(0x13)→VunpackCF8E5M2, F8E4M3B11FNUZ(0x17)→VunpackCF8E4M3B11, else a StatusOr error carrying the PrimitiveType name. The ring reduction (ReduceSymmetricallyQuantized8BitShardInPlace) is done in f32 — dequant → merge functor → re-track absmax — so the 8-bit format is only the wire representation, not the arithmetic. RotatedPincerQuantizedEmitter::kSupportedQuantizationTypes (0xae5b90c) = {2, 19, 23} = {S8, F8E5M2, F8E4M3B11FNUZ}.
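One reduce step of the ring can be sketched as dequantize, merge in floating point, re-track absmax, and requantize for the next hop. The sketch assumes the merge functor is a sum and the wire type is S8; both are illustrative choices, and the function names are hypothetical.

```python
def quantize_shard(xs):
    """Symmetric absmax int8 quantization; returns (codes, scale)."""
    scale = 127.0 / max(abs(x) for x in xs)
    return [round(x * scale) for x in xs], scale

def reduce_quantized_step(local, wire_codes, wire_scale):
    """Merge an incoming 8-bit shard into a local float shard.

    The arithmetic runs in floating point; the 8-bit codes are only the
    wire representation between hops.
    """
    incoming = [q / wire_scale for q in wire_codes]     # dequant
    merged = [a + b for a, b in zip(local, incoming)]   # merge functor (sum assumed)
    return quantize_shard(merged)                       # re-track absmax, requantize

local = [1.0, -3.0]
remote_codes, remote_scale = quantize_shard([2.0, 1.0])
codes, scale = reduce_quantized_step(local, remote_codes, remote_scale)
print(codes, scale)
print([q / scale for q in codes])   # close to [3.0, -2.0]
```

Re-deriving the scale after every merge keeps the full 8-bit range in use as the partial sums grow, at the cost of one extra absmax pass per hop.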
NOTE: there is no zero-point in the device-side symmetric quantizers (pincer_utils + Vcvt*): the on-device fast paths are symmetric (zero_point == 0). Asymmetric / zero-point quantization exists only at the host-side MLIR quant-dialect layer.
Per-Tensor vs Per-Channel vs Sub-Channel Scale (Host)
On the device codegen path quantization is symmetric and the granularity is the shard / vector tile (one scale, implicit zero-point 0). The richer scale + zero-point model lives in the host-side MLIR quant dialect, fully linked in libtpu:
| MLIR type | Scale | Zero-point | Extra fields | Granularity |
|---|---|---|---|---|
| UniformQuantizedType | getScale() | getZeroPoint() | (none) | Per-tensor |
| UniformQuantizedPerAxisType | getScales()[] | getZeroPoints()[] | getQuantizedDimension() | Per-channel |
| UniformQuantizedSubChannelType | getScales()[] | getZeroPoints()[] | getBlockSizes()[], getQuantizedDimensions()[] | Block-wise |
| CalibratedQuantizedType | getMin()/getMax() | (none) | (no scale yet) | Calibration |
These lower through mlir::quant::stablehlo::ConvertUniform{Quantize,Dequantize,Requantize,QuantizedDotGeneral,QuantizedConvolution,QuantizedAdd,QuantizedClipByValue,...}Op (ten distinct ConvertUniform*Op families, exposed as 40 QuantizedStablehloOpConversion::matchAndRewrite instantiations) to integer dot/conv + affine requant round((scale_in/scale_out)·(acc - zp_in)) + zp_out. The guard string "Cannot requantize while changing quantization_axis" confirms a requantize cannot move the per-channel axis. A TPU dynamic per-column int8 quantizer (convert_dynamic_quantize_ops → damax_output → per-column scale) is flag-gated by xla_tpu_experimental_enable_dynamic_int8_quantization.
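The affine requantization formula quoted above can be written out directly. The sketch below evaluates it in floating point for the per-tensor case and, as a second helper, with one input scale and zero-point per channel. The final saturation to an int8 range is an added assumption (the formula in the text has no clamp), the function names are hypothetical, and the fixed-point (M0, shift) form used by real legalizations is not modelled.

```python
def requantize(acc: int, scale_in: float, zp_in: int,
               scale_out: float, zp_out: int,
               qmin: int = -128, qmax: int = 127) -> int:
    """round((scale_in / scale_out) * (acc - zp_in)) + zp_out, then saturate."""
    q = round((scale_in / scale_out) * (acc - zp_in)) + zp_out
    return max(qmin, min(qmax, q))   # saturation is an assumption of this sketch

def requantize_per_channel(accs, scales_in, zps_in, scale_out, zp_out):
    """Same formula with one (scale_in, zp_in) pair per channel."""
    return [requantize(a, s, z, scale_out, zp_out)
            for a, s, z in zip(accs, scales_in, zps_in)]

print(requantize(40, 0.5, 10, 0.25, -5))
print(requantize_per_channel([40, 40], [0.5, 0.25], [10, 10], 0.25, -5))
```

With zero-points set to 0 the formula collapses to a pure scale multiply, which is the symmetric case the device fast path implements.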
Worked Example: int8 Matmul + Bias + ReLU → int8
HLO (post DotCanonicalizer + output fusion):
    fusion {
      conv = convolution(int8 act, int8 weights) : s32   // MXU, int32 accumulate
      bc   = broadcast(f32 bias)
      deq  = convert(conv) : f32
      add  = add(deq, bc)
      relu = maximum(add, 0)
      req  = convert(relu) : s8
    }
Emission per output chunk (inside MatrixMultiplyAccumulateFunctor):
    AccumulateOutputChunk:  int32 accumulator (VaddS32 over K-tiles)
    DoOutputFusion:
        VcvtS32ToF32(acc)                        // int32 -> f32 dequant
        HandleBroadcast(bias) -> OpaqueCopy      // bias splat to tile
        HandleAdd     -> VaddF32                 // + bias
        HandleMaximum -> VclampGezF32            // ReLU
        HandleConvert -> VmulF32(.,scale) + VcvtF32ToS32(RNE)
                         + VclampSymmetricF32(127) + VpackcB16/VpackcB8   // requant s8
        Vst                                      // store int8 output
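The emission sequence above can be followed end to end in a small numerical model. This sketch walks the same stages in order (integer accumulate, convert to float, bias add, ReLU, scale, round, symmetric clamp) on nested Python lists. The function name, the single `scale` parameter, and the choice to clamp after the integer convert are illustrative assumptions; lane packing and tiling are not modelled.

```python
def int8_matmul_bias_relu(act, weights, bias, scale):
    """act: M x K int8 values, weights: K x N int8 values, bias: N floats."""
    out = []
    for row in act:
        out_row = []
        for n in range(len(bias)):
            acc = 0
            for k, a in enumerate(row):
                acc += a * weights[k][n]        # integer accumulate over K
            v = float(acc)                      # int32 -> f32
            v = v + bias[n]                     # bias add (broadcast per column)
            v = max(v, 0.0)                     # ReLU
            q = round(v * scale)                # scale, round to nearest even
            q = max(-127, min(127, q))          # symmetric saturate to int8
            out_row.append(q)
        out.append(out_row)
    return out

act = [[1, 2], [-3, 4]]
weights = [[5, -6], [7, 8]]
bias = [1.0, -20.0]
print(int8_matmul_bias_relu(act, weights, bias, scale=5.0))
```

In this example the accumulators are [[19, 10], [13, 50]]; after bias and ReLU they become [[20, 0], [14, 30]], and the last element saturates once scaled by 5.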
What Is Not Decoded
- The host-side affine-requant constants (M0, shift) inside ConvertUniformRequantizeOp / ConvertUniformQuantizedDotOp: the pattern symbols and the structural formula are recovered; the matchAndRewrite was not unwound op-by-op (upstream open-source legalization).
- The TPU DynamicQuantize custom-op emitter body (per-column damax → scale → int8) and the xla::jellyfish::QuantizationConfig proto field layout.
- The VpackFormat ordinals distinguishing E5M2 / E4M3Fn / E4M3B11 inside 0x126 CreateVectorPack (the three bf16→fp8 pack methods are named; the enum values were not bound here; see Pack/Unpack Precision).
- The exact bit layout of VcvtF32ToNarrowFloat's software RNE composite per fp8 format (exponent bias, subnormal flush, overflow-to-Inf vs saturate).
- The VclampAsymmetric lo/hi operand provenance (quant-type zero-point vs explicit fusion clip constants): confirmed asymmetric (lo ≠ -hi) but the operand origin not traced.
Cross-References
- Pack/Unpack Precision: the VpackFormat enum, set_pack_format_sublane, and the bf16↔f32 lane pack/unpack the quant pack/unpack helpers share
- VPU (Vector-ALU) Slot: the VALU slot carrying the vcvt/vclamp/vadd opcode immediates and the .sr stochastic-round convert family
- EUP / Transcendental Slot: SupportsBf16AluInstructions and the lane-width model behind the bf16 accumulate and unpack
- XLU Op Roster: the broader XLU op-to-factory table this convert/clamp family sits beside
- Bundle Model: the VLIW bundle the convert/clamp/add ops schedule into per generation