Bias-Add & Quant/Dequant Helpers

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped). .text and .rodata VMAs equal their file offsets; .data.rel.ro VMA minus 0x200000 equals its file offset. Other libtpu builds will differ.

Abstract

These are the numerics helpers that bracket the systolic-array matmul/conv on the TensorCore — the layer between the raw int32/f32 accumulator coming out of the MXU and the value finally stored. Concretely: the bias-add, the activation clamp, and the quantize/dequantize sequences (scale application, narrow-float and integer converts, the saturating clamp that closes the round-trip). A reimplementer hunting for an EmitBias or a QuantizeOp will not find one, and that absence is the first finding.

There is no dedicated bias op. A "matmul + bias" HLO is an output fusion whose root is the conv and whose body is an ordinary elementwise add(conv, broadcast(bias)), emitted by the generic fusion vectorizer per output chunk inside MatrixMultiplyAccumulateFunctor::DoOutputFusion. The bias is broadcast to the tile shape (HandleBroadcast) and added in f32 (HandleAdd → VaddF32); the activation then runs (HandleMaximum/HandleClamp), and the result is requantized (HandleConvert) and stored. The accumulator type depends on the matmul dtype: int8 matmuls accumulate in int32 (VaddS32) and are dequantized to f32 before the float epilogue; bf16/fp32 matmuls accumulate in f32 (VaddF32), with two dedicated mixed-precision adds (0x11e/0x11f) folding bf16 high/low passes into the f32 accumulator.

The quant/dequant leaves are single LLO vector ops built by LloRegionBuilder::Vcvt*/Vpack*/Vunpack*. Round-to-nearest narrow-float quantize goes through the EXMY converts (0x6e/0x6f) or the bf16→fp8 packs; stochastic rounding goes through four VcvtSrF32To* binops (0x70-0x73) whose second operand is the per-lane random dither; int8 quantize is round → VclampSymmetric → VpackcB8. On the device fast path, quantization is symmetric (zero-point implicitly 0); the asymmetric / zero-point model lives only in the host-side MLIR quant dialect. The cleanest in-binary reference implementation of symmetric absmax 8-bit quantization is the collective pincer_utils family, which is documented here op-by-op.

For reimplementation, the contract is:

  • Bias-add is fusion, not an op: the AccumulateOutputChunk → DoOutputFusion spine and the VectorizingEmitter Handle* dispatch (HandleBroadcast = bias splat, HandleAdd = bias add).
  • The accumulator dtype split: int32 VaddS32 (int8 matmul) vs f32 VaddF32 (bf16/fp32), plus the bf16-in-f32 mixed adds 0x11e/0x11f.
  • The convert leaves: EXMY narrow-float (0x6e/0x6f), the four stochastic-round binops (0x70-0x73), VcvtF32ToS32 round-direction, and the lane pack/unpack helpers.
  • The saturate layer: VclampSymmetric* (native + min/max fallback 0x15f) vs VclampAsymmetric* (binop 0x4c + min/max), and VclampGez* (ReLU).
  • The symmetric absmax quantizer (pincer_utils): absmax → scale (qmax/absmax) → quantize → 8-bit wire → f32 requant-reduce → dequant, with the three per-dtype max constants.

| Area | Symbols, addresses, opcodes |
| --- | --- |
| Bias-add path | MatrixMultiplyAccumulateFunctor::DoOutputFusion (0x13124360); HandleBroadcast (0x136f1d60) + HandleAdd (0x13e22e80) |
| Accumulator | int8 → int32 VaddS32 (0x1d51e9c0); bf16/fp32 → f32 VaddF32 (0x1d525160); mixed 0x11e/0x11f |
| RNE narrow-float | VcvtEXMYToE4M3/E5M2 (0x6e/0x6f); VpackCBf16ToE5M2/E4M3Fn (0x126) |
| Stochastic round | VcvtSrF32To{E5M2,E4M3,If8,Bf16} (0x70-0x73), binop, rng-bits 2nd operand |
| f32→int32 | VcvtF32ToS32 (0x1d524760): RoundNearest → VroundF32, else VtruncF32, then 0x5e |
| Clamp | VclampSymmetricF32/Bf16 (native + 0x15f fallback); VclampAsymmetric* (0x4c); VclampGez* (ReLU) |
| Collective quantizer | pincer_utils: UpdateScale (0x137b75c0), GetUnpackFunction (0x137b8720); types {S8, F8E5M2, F8E4M3B11FNUZ} |

Bias-Add Is the Output-Fusion Epilogue

Purpose

A reimplementer needs to know that bias, activation, and output-requantize are not separate passes over a materialized matmul output. They are emitted inline per output chunk, reading the matmul result directly out of the MXU result registers — the whole point of "output fusion".

Algorithm

The post-canonicalization "matmul + bias + activation" HLO is an output fusion:

fusion {
  conv  = convolution(activations, weights)   // MXU
  bcast = broadcast(bias)                      // bias to output tile shape
  add   = add(conv, bcast)                     // bias add
  [opt] = maximum(add, 0) / clamp(...) / tanh  // activation
  [opt] = convert(...) -> out_dtype            // (re)quantize
}

lowered by the per-chunk MXU emission spine:

MatrixMultiplyAccumulateFunctor::operator()                       // 0x1310cd80
  -> AccumOutputChunkAndPossiblyDoOutputFusion                    // 0x13124100
       -> AccumulateOutputChunk                                   // 0x13123a80 : matres -> accumulator
       -> RegisterAccumOutput                                     // 0x131196a0
       -> PossiblyDoOutputFusion                                  // 0x13122d80
            -> DoOutputFusion                                     // 0x13124360

DoOutputFusion (0x13124360) branches on ShapeUtil::ElementIsIntegral/ElementIsFloating(out). It first dequantizes an int32 accumulator with VcvtS32ToF32, so bias-add and activation run in f32 even for int8 matmuls. It then walks the fusion body op-by-op through the VectorizingEmitter Handle* dispatch, and finally emits VreducePrecision (bf16 rounding, unless AllowExcessPrecision is set) followed by Vst.

The elementwise op chain is JIT-emitted by the handlers:

| Handler | Address | Role |
| --- | --- | --- |
| HandleBroadcast | 0x136f1d60 | Bias broadcast (splat to tile) |
| HandleAdd | 0x13e22e80 | Bias add (VaddF32/VaddS32, dtype-keyed) |
| HandleConvert | 0x13e22040 | In-fusion (re)quant/dequant |
| HandleMaximum | 0x13e21f80 | ReLU / clip-low |
| HandleClamp | 0x13e21f20 | Clamp both sides |
| HandleMultiply | 0x13e228c0 | Scale multiply |
| HandleReducePrecision | 0x13e23a40 | bf16 round-to-nearest-even |
| HandleStochasticConvert | 0x13e22520 | HLO kStochasticConvert |

HandleBroadcast reads the broadcast operand, splats the bias across the output tile (PackOrMakeTuple), and materializes it via OpaqueCopy. The bias operand enters through FusionEmitter::AddInput (0x136d5f20).

NOTE — there is no BiasAddEmitter / EmitBias. Bias-add is the generic fusion vectorizer emitting Add(value, Broadcast(bias)). Treat "bias" as an ordinary fused input edge, not a special op.
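To make the "bias is an ordinary fused input edge" point concrete, here is a minimal Python sketch of an op-by-op walk over a fusion body. It is not the emitter's code: `run_fusion`, the dictionary layout of the ops, and the lowercase handler names are hypothetical stand-ins for the Handle* dispatch, and nested lists stand in for vector registers.

```python
# Hypothetical mini-interpreter; names only loosely mirror the Handle* handlers.

def handle_broadcast(env, op):
    # Splat a per-column bias vector across every row of the output tile.
    bias = env[op["src"]]
    return [list(bias) for _ in range(op["rows"])]

def handle_add(env, op):
    # Elementwise add of two tiles of the same shape.
    a, b = env[op["lhs"]], env[op["rhs"]]
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def handle_maximum(env, op):
    # Elementwise max against a scalar floor (ReLU when the floor is 0).
    return [[max(x, op["floor"]) for x in row] for row in env[op["src"]]]

HANDLERS = {
    "broadcast": handle_broadcast,
    "add": handle_add,
    "maximum": handle_maximum,
}

def run_fusion(chunk, bias, body):
    # "conv" is the matmul result chunk; "bias" is just another fused input.
    env = {"conv": chunk, "bias": bias}
    for op in body:
        env[op["name"]] = HANDLERS[op["kind"]](env, op)
    return env[body[-1]["name"]]

body = [
    {"name": "bcast", "kind": "broadcast", "src": "bias", "rows": 2},
    {"name": "add", "kind": "add", "lhs": "conv", "rhs": "bcast"},
    {"name": "relu", "kind": "maximum", "src": "add", "floor": 0.0},
]
out = run_fusion([[1.0, -5.0], [-2.0, 3.0]], [0.5, 1.0], body)
print(out)  # [[1.5, 0.0], [0.0, 4.0]]
```

Nothing in the walk treats the bias specially: removing the `broadcast` and `add` entries from `body` yields a plain matmul + ReLU with the same driver loop.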


Accumulator Dtype and bf16-in-f32 Accumulation

Purpose

The accumulate op is selected by the output element type, and the high-accuracy bf16 matmul modes need dedicated mixed-precision adds. A reimplementer must pick the right accumulate op per dtype or silently lose precision.

Algorithm

AccumulateOutputChunk (0x13123a80) selects:

  • int8 / int4 matmul → int32 accumulator → VaddS32 (0x1d51e9c0)
  • bf16 / fp32 matmul → f32 accumulator → VaddF32 (0x1d525160)

The bf16-in-f32 multi-pass accumulation (the bf16x2/bf16x3 high-accuracy modes) uses two dedicated mixed adds that fold a bf16 result half into an f32 accumulator:

| Builder method | Address | LLO opcode | Role |
| --- | --- | --- | --- |
| VaddF32MixedBF16HighInst | 0x1d55dd40 | 0x11e | Accumulate bf16 HIGH pass into f32 |
| VaddF32MixedBF16LowInst | 0x1d55dd80 | 0x11f | Accumulate bf16 LOW pass into f32 |
| VaddF32 | 0x1d525160 | (binop) | Plain f32 accumulate |
| VaddS32 | 0x1d51e9c0 | (binop) | Plain int32 accumulate |

The decompile confirms VaddF32MixedBF16HighInst calling CreateVectorBinop(0x11E, …). The accumulate is always f32-wide (the systolic array multiplies bf16 but partial products accumulate in an f32 register file), which is why fp32 and bf16 share the same per-matmul MXU latency.
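The numeric motivation for a high and a low pass can be sketched in plain Python. This is an illustration under stated assumptions, not the hardware algorithm: it assumes the high part is the f32 value truncated to its top 16 bits and the low part is the bf16-truncated residual, and it emulates f32 accumulation by rounding through `struct`. The helper names (`bf16_trunc`, `split_hi_lo`, `dot_two_pass`) are hypothetical.

```python
import struct

def f32(x):
    # Round a Python float (double) to the nearest f32 value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bf16_trunc(x):
    # Keep only the top 16 bits of the f32 pattern (truncation, for simplicity).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split_hi_lo(x):
    # Assumed split: hi is the bf16 part, lo is the bf16 part of the residual.
    hi = bf16_trunc(x)
    lo = bf16_trunc(f32(x - hi))
    return hi, lo

def dot_one_pass(xs, ws):
    # Single bf16 pass: every input loses its low mantissa bits.
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = f32(acc + bf16_trunc(x) * w)
    return acc

def dot_two_pass(xs, ws):
    # Two passes folded into one f32 accumulator (high, then low).
    acc = 0.0
    for x, w in zip(xs, ws):
        hi, lo = split_hi_lo(x)
        acc = f32(acc + hi * w)   # high pass
        acc = f32(acc + lo * w)   # low pass
    return acc

xs = [f32(1.2345678), f32(3.1415927), f32(-2.7182818)]
ws = [1.0, 2.0, 0.5]              # exactly representable in bf16
exact = sum(x * w for x, w in zip(xs, ws))
one = dot_one_pass(xs, ws)
two = dot_two_pass(xs, ws)
print(abs(one - exact), abs(two - exact))
```

With these inputs the single-pass error is on the order of 1e-3 while the two-pass error is several orders of magnitude smaller, which is the kind of precision loss a wrong accumulate choice would introduce silently.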


The Convert Leaves: RNE and Stochastic Rounding

Purpose

Every fp8/int8 quantize and the inverse dequant is a single LLO op. A reimplementer needs the opcode and the creator for each so the conversion is one instruction, not an emulated sequence (on capable targets).

Encoding — round-to-nearest narrow float

function VcvtEXMYToE4M3(in, exp_bits):                  // 0x1d52b440
    assert target().SupportsFloat8EXMY()
    inst = CreateVectorEXMYConversion(110 /*0x6e*/, …)
    return region->AppendInstruction(inst)

| Method | Address | LLO opcode | Creator |
| --- | --- | --- | --- |
| VcvtEXMYToE4M3 | 0x1d52b440 | 0x6e | CreateVectorEXMYConversion (gate SupportsFloat8EXMY) |
| VcvtEXMYToE5M2 | 0x1d52b520 | 0x6f | CreateVectorEXMYConversion |
| VpackCBf16ToE5M2 | 0x1d567a40 | 0x126 | CreateVectorPack (VpackFormat) |
| VpackCBf16ToE4M3Fn | 0x1d567b20 | 0x126 | CreateVectorPack (VpackFormat) |
| VcvtF32ToNarrowFloat | 0x1d560960 | (composite) | VimmIf+VabsF32+SimplifyPackF16 (software RNE) |

Encoding — stochastic rounding

Each Sr convert is one binop whose second operand is the per-lane random dither, so the convert rounds up with probability equal to the fractional part:

function VcvtSrF32ToE5M2(src_f32, rng_bits):            // 0x1d52b600
    inst = CreateVectorBinop(0x70, src_f32, rng_bits, region)
    return region->AppendInstruction(inst)

| Method | Address | LLO opcode |
| --- | --- | --- |
| VcvtSrF32ToE5M2 | 0x1d52b600 | 0x70 |
| VcvtSrF32ToE4M3 | 0x1d52b640 | 0x71 |
| VcvtSrF32ToIf8 | 0x1d52b680 | 0x72 |
| VcvtSrF32ToBf16 | 0x1d52b6c0 | 0x73 |
| VcvtF32ToS32Stochastic | 0x1d52ade0 | 0x5e (fast) / 0x15c (sw fallback) |

The decompile confirms VcvtSrF32ToE5M2 → CreateVectorBinop(0x70u, …) and VcvtSrF32ToE4M3 → CreateVectorBinop(0x71u, …). There is no software emulation for the fp8/bf16 Sr converts on capable targets; each is a single op. The HLO kStochasticConvert handler (HandleStochasticConvert, 0x13e22520) dispatches to this family for fp8/bf16 outputs and to VcvtF32ToS32Stochastic for integer outputs.

Per-gen stochastic support: Target::SupportsVectorConvertF32Stochastic::kSupportedTypes = {10, 16, 19, 23} = {F16, BF16, F8E5M2, F8E4M3B11FNUZ} on both Ghostlite (0xb530678) and Viperfish (0xb530784). F8E4M3Fn (20) is not in the Sr-supported set — only the B11FNUZ fp8 variant has native f32→fp8 stochastic rounding.
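The "rounds up with probability equal to the fractional part" behaviour of a dither-operand convert can be sketched for the f32 → bf16 case. This is the textbook add-random-bits-then-truncate construction, assumed here for illustration; how the hardware op actually consumes its rng operand is not established by the decompile, and `sr_f32_to_bf16` is a hypothetical name. NaN and overflow handling are deliberately left out.

```python
import struct

def f32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def sr_f32_to_bf16(x, rng_bits):
    # Add 16 random bits below the bf16 cut, then truncate the low 16 bits.
    # A carry into bit 16 happens with probability low16 / 65536, so the
    # magnitude rounds away from zero with exactly the discarded fraction.
    b = (f32_bits(x) + (rng_bits & 0xFFFF)) & 0xFFFFFFFF
    return bits_f32(b & 0xFFFF0000)

# 1 + 2^-9 sits a quarter of the way from 1.0 to the next bf16 value 1 + 2^-7.
x = 1.0 + 2.0 ** -9
ups = sum(sr_f32_to_bf16(x, r) > 1.0 for r in range(1 << 16))
mean = sum(sr_f32_to_bf16(x, r) for r in range(1 << 16)) / (1 << 16)
print(ups / (1 << 16), mean == x)  # 0.25 True
```

Enumerating every 16-bit dither value instead of sampling makes the property exact: a quarter of the dithers round up, and the average of all results equals the input.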

Encoding — f32 → int32 round direction

VcvtF32ToS32(value, RoundDirection) (0x1d524760): RoundNearest (1) prepends VroundF32, otherwise (TowardsZero) prepends VtruncF32, then emits opcode 0x5e. Named pseudo-variants: VcvtF32ToS32TiesToEven (0x1d52ab00), ...TowardsZeroPseudo (0x1d52ad20), ...KnownIntegral (0x1d52ad80), ...Stochastic (0x1d52ade0).
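The two-step shape (pre-round or pre-truncate, then convert) can be sketched as follows. The sketch assumes that the RoundNearest pre-step breaks ties to even, which matches Python's built-in `round`; out-of-range and NaN behaviour of opcode 0x5e is not modeled, and `cvt_f32_to_s32` is a hypothetical name.

```python
import math

def cvt_f32_to_s32(value, round_nearest=True):
    # Step 1: make the value integral in floating point
    # (stand-in for VroundF32 or VtruncF32).
    if round_nearest:
        pre = float(round(value))       # ties go to the even neighbour
    else:
        pre = float(math.trunc(value))  # towards zero
    # Step 2: the convert itself now sees an integral input.
    return int(pre)

print(cvt_f32_to_s32(2.5), cvt_f32_to_s32(3.5))   # 2 4
print(cvt_f32_to_s32(-2.7), cvt_f32_to_s32(-2.7, round_nearest=False))  # -3 -2
```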


The Saturating Clamp

Purpose

The quantize round-trip and the activation close with a saturating clamp into the destination range. The symmetric vs asymmetric split is the device-symmetric vs host-zero-point distinction surfaced as two builder families.

Encoding

function VclampSymmetricF32(value, bound):              // 0x1d527fc0
    if target supports native clamp:
        inst = CreateVectorClampSymmetric(value, bound, …)
    else:                                                // fallback
        v = CreateVectorMinimumF32(value, +bound)        // opcode 0x15f path
        inst = CreateVectorMaximumF32(v, -bound)
    return region->AppendInstruction(inst)

| Method | Address | Body |
| --- | --- | --- |
| VclampSymmetricF32 | 0x1d527fc0 | CreateVectorClampSymmetric + min/max (0x15f) fallback |
| VclampSymmetricBf16 | 0x1d5283a0 | same |
| VclampAsymmetricF32 | 0x1d528480 | binop 0x4c + Minimum/Maximum (lo ≠ −hi) |
| VclampAsymmetricBF16 | 0x1d528740 | same |
| VclampGezF32 / Bf16 | 0x1d527ea0 / 0x1d527ee0 | clamp ≥ 0 (ReLU) |
| VclampS32 / U32 / Float | 0x1d528820 / 0x1d528c40 / 0x1d528fe0 | integer / generic clamp |

The decompile of VclampSymmetricF32 shows both the native CreateVectorClampSymmetric arm and the CreateVectorMinimumF32/CreateVectorMaximumF32 fallback with opcode 0x15f. Symmetric saturates to [-bound, +bound] (zero-point 0) — the int8/fp8 device saturate. Asymmetric saturates to [lo, hi] with lo ≠ -hi — the with-zero-point / unsigned host path.
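The three saturate shapes reduce to min/max compositions, shown here as scalar Python. The function names are hypothetical; only the min-then-max order of the symmetric fallback is taken from the pseudocode above.

```python
def clamp_symmetric(value, bound):
    # Fallback shape: minimum against +bound, then maximum against -bound.
    return max(min(value, bound), -bound)

def clamp_asymmetric(value, lo, hi):
    # Same composition with independent bounds (lo need not equal -hi).
    return max(min(value, hi), lo)

def clamp_gez(value):
    # ReLU: clamp to >= 0.
    return max(value, 0.0)

print(clamp_symmetric(200.0, 127.0), clamp_symmetric(-200.0, 127.0))  # 127.0 -127.0
print(clamp_asymmetric(300.0, 0.0, 255.0), clamp_gez(-1.5))           # 255.0 0.0
```

Note that the symmetric int8 range here is [-127, 127], not [-128, 127]: with zero-point 0 the two bounds are negations of each other.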

VreducePrecision (0x1d5697c0, called from DoOutputFusion and HandleReducePrecision) is the bit-manipulation round-to-nearest-even that truncates an f32 mantissa to bf16: add the rounding bias to the mantissa, mask the low bits, fix the tie-to-even and the NaN/Inf exponent. The LLO op is llo.vround.nearest_even.f32 (VroundF32 0x1d525480).
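The bit-manipulation rounding described above corresponds to a well-known integer trick, sketched here in Python. It is the standard f32 → bf16 round-to-nearest-even recipe, offered as an illustration of the steps named in the text rather than a transcription of VreducePrecision; `reduce_precision_bf16` is a hypothetical name, and the result is returned as an f32 value whose low 16 mantissa bits are zero.

```python
import math
import struct

def f32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def reduce_precision_bf16(x):
    b = f32_bits(x)
    if (b & 0x7F800000) == 0x7F800000:          # exponent all ones: Inf or NaN
        if b & 0x007FFFFF:                      # NaN: keep it a (quiet) NaN
            return bits_f32((b | 0x00400000) & 0xFFFF0000)
        return x                                # Inf passes through
    lsb = (b >> 16) & 1                         # lowest bit that survives
    b += 0x7FFF + lsb                           # rounding bias; lsb breaks ties to even
    return bits_f32(b & 0xFFFF0000)             # mask the discarded bits

print(reduce_precision_bf16(1.0 + 2.0 ** -8))       # tie, even neighbour: 1.0
print(reduce_precision_bf16(1.0 + 3 * 2.0 ** -8))   # tie, even neighbour: 1.015625
```

The two printed cases are exact ties: the first rounds down because the kept mantissa is already even, the second rounds up because rounding down would leave an odd mantissa.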


The Collective Symmetric Quantizer (pincer_utils)

Purpose

The quantized all-reduce quantizes each shard to 8 bits before the wire write, reduces shards that travel as 8-bit data alongside a running scale, and dequantizes at the end. These xla::jellyfish::pincer_utils helpers are the cleanest in-binary reference for symmetric absmax 8-bit quantization, and a reimplementer can lift the scale formula directly.

Algorithm

function UpdateScale(dtype, builder, max_abs_addr, scale_addr):   // 0x137b75c0
    switch dtype:
      case S8(0x2):           dtype_max = 127.0        // dword_84A2A28
      case F8E4M3B11FNUZ(0x17): dtype_max = 30.0       // dword_84A27FC
      case F8E5M2(0x13):      dtype_max = 57344.0      // dword_84A2530
      default: LogFatal("quantized_type_bound != nullptr") + PrimitiveType_Name(dtype)
    m       = VimmF32(dtype_max)
    max_abs = Vld(max_abs_addr)                         // expected in VMEM (checked)
    scale   = VdivF32(m, max_abs)                       // scale_factor = qmax / absmax
    Vst(scale_addr, scale)

The decompile fixes the dtype switch and the three .rodata floats exactly, and — importantly — fixes the division operand order as VdivF32(dtype_max, max_abs), i.e. the stored scale factor is qmax / absmax (the quantize multiplier). Quantize is then q = round(x * scale_factor); dequantize is x = q / scale_factor. The per-dtype max constants:

| .rodata address | Value | dtype | PrimitiveType |
| --- | --- | --- | --- |
| 0x84a2a28 | 127.0 | int8 max | 0x2 |
| 0x84a27fc | 30.0 | F8E4M3B11FNUZ max finite | 0x17 |
| 0x84a2530 | 57344.0 | F8E5M2 max finite | 0x13 |
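The scale formula and the quantize/dequantize pair can be exercised with a few lines of Python. The three constants and the operand order qmax / absmax come from the text above; the function names, the list-based shard, and the S8-only quantizer are illustrative assumptions. The sketch does not guard against an all-zero shard (absmax of 0).

```python
# Per-dtype maxima, as listed in the table above.
QMAX = {"S8": 127.0, "F8E4M3B11FNUZ": 30.0, "F8E5M2": 57344.0}

def update_scale(dtype, max_abs):
    # scale_factor = qmax / absmax; an unsupported dtype raises KeyError here.
    return QMAX[dtype] / max_abs

def quantize_s8(xs, scale):
    # q = round(x * scale), saturated to the symmetric int8 range.
    return [max(-127, min(127, round(x * scale))) for x in xs]

def dequantize(qs, scale):
    # x = q / scale_factor
    return [q / scale for q in qs]

xs = [0.5, -2.0, 1.25, 0.0]
max_abs = max(abs(x) for x in xs)        # 2.0
scale = update_scale("S8", max_abs)      # 63.5
qs = quantize_s8(xs, scale)
back = dequantize(qs, scale)
print(scale, qs)  # 63.5 [32, -127, 79, 0]
```

The element with the largest magnitude maps exactly onto ±qmax, and every other element comes back within half a quantization step, 0.5 / scale, of its original value.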

The helpers that implement the five-stage round trip:

| Helper | Address | Role |
| --- | --- | --- |
| UpdateMaxLocalChunk | 0x137b73a0 | Running absmax: Vld → VunpackCF32 (if bf16) → VabsF32 → VmaxF32 → Vst |
| UpdateScale | 0x137b75c0 | absmax → scale (qmax/absmax); 3 dtype-max consts |
| SymmetricallyQuantizeShardInPlaceTo8Bits | 0x137b7740 | round(x*scale) → VpackcB16 → VpackcB8 (×4 unrolled) |
| SymmetricallyDequantizeShardInPlace8Bit | 0x137b7fc0 | unpack → VcvtS32ToF32 → /scale → VpackCBf16 |
| ReduceSymmetricallyQuantized8BitShardInPlace | 0x137b8880 | f32 requant-reduce (dequant → merge → re-track absmax) |
| GetUnpackFunction | 0x137b8720 | dtype → unpack fn pointer |

GetUnpackFunction (0x137b8720) maps S8(0x2)→VunpackCS8, F8E5M2(0x13)→VunpackCF8E5M2, F8E4M3B11FNUZ(0x17)→VunpackCF8E4M3B11, else a StatusOr error carrying the PrimitiveType name. The ring reduction (ReduceSymmetricallyQuantized8BitShardInPlace) is done in f32 — dequant → merge functor → re-track absmax — so the 8-bit format is only the wire representation, not the arithmetic. RotatedPincerQuantizedEmitter::kSupportedQuantizationTypes (0xae5b90c) = {2, 19, 23} = {S8, F8E5M2, F8E4M3B11FNUZ}.
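One reduction step of that shape (dequant, merge in floating point, re-track absmax) can be sketched as below. The assumptions are labeled: the merge functor is taken to be addition, the result is requantized to S8 immediately with a fresh qmax / absmax scale, and Python floats stand in for f32. `reduce_quantized` is a hypothetical name.

```python
def reduce_quantized(q_a, scale_a, q_b, scale_b, merge=lambda a, b: a + b):
    # 1. Dequantize both 8-bit shards (x = q / scale).
    # 2. Merge in floating point; the 8-bit format is wire-only.
    merged = [merge(qa / scale_a, qb / scale_b) for qa, qb in zip(q_a, q_b)]
    # 3. Re-track the running absmax over the merged values.
    max_abs = max(abs(v) for v in merged)
    # 4. Assumed requant step for the next hop: fresh S8 scale and codes.
    scale = 127.0 / max_abs
    q = [round(v * scale) for v in merged]
    return q, scale, max_abs

# Shard A was quantized with absmax 1.0, shard B with absmax 2.0.
q, scale, max_abs = reduce_quantized([127, -64], 127.0, [64, 127], 63.5)
print(q, max_abs)
```

Because the sum can exceed either input's absmax, the scale has to be recomputed after every merge; reusing an input scale would saturate the first element here.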

NOTE — there is no zero-point in the device-side symmetric quantizers (pincer_utils + Vcvt*): the on-device fast paths are symmetric (zero_point == 0). Asymmetric / zero-point quantization exists only at the host-side MLIR quant-dialect layer.


Per-Tensor vs Per-Channel vs Sub-Channel Scale (Host)

On the device codegen path quantization is symmetric and the granularity is the shard / vector tile (one scale, implicit zero-point 0). The richer scale + zero-point model lives in the host-side MLIR quant dialect, fully linked in libtpu:

| MLIR type | Scale | Zero-point | Extra fields | Granularity |
| --- | --- | --- | --- | --- |
| UniformQuantizedType | getScale() | getZeroPoint() | | Per-tensor |
| UniformQuantizedPerAxisType | getScales()[] | getZeroPoints()[] | getQuantizedDimension() | Per-channel |
| UniformQuantizedSubChannelType | getScales()[] | getZeroPoints()[] | getBlockSizes()[], getQuantizedDimensions()[] | Block-wise |
| CalibratedQuantizedType | getMin()/getMax() | | (no scale yet) | Calibration |

These lower through mlir::quant::stablehlo::ConvertUniform{Quantize,Dequantize,Requantize,QuantizedDotGeneral,QuantizedConvolution,QuantizedAdd,QuantizedClipByValue,...}Op (ten distinct ConvertUniform*Op families, exposed as 40 QuantizedStablehloOpConversion::matchAndRewrite instantiations) to an integer dot/conv followed by the affine requant round((scale_in/scale_out)·(acc - zp_in)) + zp_out. The guard string "Cannot requantize while changing quantization_axis" confirms that a requantize cannot move the per-channel axis. A TPU dynamic per-column int8 quantizer (convert_dynamic_quantize_ops → damax_output → per-column scale) is flag-gated by xla_tpu_experimental_enable_dynamic_int8_quantization.
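The affine requant formula can be written directly in floating point, as in the sketch below. It implements the structural formula from the text and nothing more: the final clamp to an int8 range is an added assumption, the fixed-point (M0, shift) form used by real legalizations is not reproduced, and `requantize` is a hypothetical name.

```python
def requantize(acc, scale_in, zp_in, scale_out, zp_out, qmin=-128, qmax=127):
    # round((scale_in / scale_out) * (acc - zp_in)) + zp_out
    q = round((scale_in / scale_out) * (acc - zp_in)) + zp_out
    # Assumed saturation into the destination integer range.
    return max(qmin, min(qmax, q))

# (0.02 / 0.05) * (40 - (-4)) = 17.6 -> 18, plus zp_out 10 -> 28
print(requantize(40, 0.02, -4, 0.05, 10))   # 28
# Large accumulators saturate at qmax.
print(requantize(1000, 0.5, 0, 0.25, 0))    # 127
```

Setting both zero-points to 0 collapses this to the symmetric case used on the device path: a single multiply by scale_in / scale_out followed by round and clamp.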


Worked Example: int8 Matmul + Bias + ReLU → int8

HLO (post DotCanonicalizer + output fusion):
  fusion {
    conv = convolution(int8 act, int8 weights) : s32   // MXU, int32 accumulate
    bc   = broadcast(f32 bias)
    deq  = convert(conv) : f32
    add  = add(deq, bc)
    relu = maximum(add, 0)
    req  = convert(relu) : s8
  }

Emission per output chunk (inside MatrixMultiplyAccumulateFunctor):
  AccumulateOutputChunk:  int32 accumulator (VaddS32 over K-tiles)
  DoOutputFusion:
    VcvtS32ToF32(acc)                       // int32 -> f32 dequant
    HandleBroadcast(bias) -> OpaqueCopy     // bias splat to tile
    HandleAdd            -> VaddF32         // + bias
    HandleMaximum        -> VclampGezF32    // ReLU
    HandleConvert        -> VmulF32(.,scale) + VcvtF32ToS32(RNE)
                            + VclampSymmetricF32(127) + VpackcB16/VpackcB8   // requant s8
    Vst                                     // store int8 output
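The numeric chain of this worked example can be replayed on a tiny tile in Python. The sketch follows the emission order listed above (int32 accumulate, convert to float, add bias, ReLU, scale, round, symmetric clamp); the function name, the nested-list tile, and the single per-tensor output scale are illustrative assumptions, and Python ints and floats stand in for s32 and f32.

```python
def int8_matmul_bias_relu(act, wts, bias, out_scale):
    rows, inner, cols = len(act), len(wts), len(wts[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = 0                                  # int32 accumulator
            for k in range(inner):
                acc += act[i][k] * wts[k][j]         # accumulate over K
            v = float(acc)                           # int32 -> f32 convert
            v = v + bias[j]                          # bias add (broadcast per column)
            v = max(v, 0.0)                          # ReLU
            q = round(v * out_scale)                 # scale, round to nearest
            q = max(-127, min(127, q))               # symmetric saturate to int8
            row.append(q)
        out.append(row)
    return out

act = [[1, 2], [-3, 4]]
wts = [[5, -6], [7, 8]]
bias = [10.0, -20.0]
result = int8_matmul_bias_relu(act, wts, bias, out_scale=5.0)
print(result)  # [[127, 0], [115, 127]]
```

The accumulators are 19, 10, 13 and 50; after bias and ReLU they become 29, 0, 23 and 30, and the output scale of 5 pushes the first and last values past 127 so that the saturating clamp is visibly exercised.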

What Is Not Decoded

  • The host-side affine-requant constants (M0, shift) inside ConvertUniformRequantizeOp / ConvertUniformQuantizedDotOp: the pattern symbols and the structural formula are recovered; the matchAndRewrite was not unwound op-by-op (upstream open-source legalization).
  • The TPU DynamicQuantize custom-op emitter body (per-column damax → scale → int8) and the xla::jellyfish::QuantizationConfig proto field layout.
  • The VpackFormat ordinals distinguishing E5M2 / E4M3Fn / E4M3B11 inside 0x126 CreateVectorPack (the three bf16→fp8 pack methods are named; the enum values were not bound here — see Pack/Unpack Precision).
  • The exact bit layout of VcvtF32ToNarrowFloat's software RNE composite per fp8 format (exponent bias, subnormal flush, overflow-to-Inf vs saturate).
  • The VclampAsymmetric lo/hi operand provenance (quant-type zero-point vs explicit fusion clip constants): the asymmetry (lo ≠ -hi) is confirmed, but the operand origin was not traced.

Cross-References

  • Pack/Unpack Precision — the VpackFormat enum, set_pack_format_sublane, and the bf16↔f32 lane pack/unpack the quant pack/unpack helpers share
  • VPU (Vector-ALU) Slot — the VALU slot carrying the vcvt/vclamp/vadd opcode immediates and the .sr stochastic-round convert family
  • EUP / Transcendental Slot: SupportsBf16AluInstructions and the lane-width model behind the bf16 accumulate and unpack
  • XLU Op Roster — the broader XLU op-to-factory table this convert/clamp family sits beside
  • Bundle Model — the VLIW bundle the convert/clamp/add ops schedule into per generation