Pack/Unpack Precision

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped). .text and .rodata VMAs equal their file offsets; .data.rel.ro VMA minus 0x200000 equals its file offset. Other libtpu builds will differ.

Abstract

This is the XLU's precision-conversion datapath: the family of LLO ops that move data between f32 and the narrow 16-bit format (bf16) that is packed two-per-lane inside a vector register, plus the per-segment cross-lane reduction that consumes a packed-and-widened embedding tensor. Two distinct hardware mechanisms live here, and the binary keeps them on separate factory chains a reimplementer must not conflate.

The first is lane pack/unpack. A bf16 is bit-identical to the top 16 bits of an IEEE f32, so a 2-element bf16 vector fits in one 32-bit lane and the f32 widen is a pure position-and-zero — no arithmetic rounding. VpackBf16 interleaves two bf16 vregs into one packed lane (kVectorPack 0x126, VpackFormat = 7 InterleavedBf16); the inverse VunpackUpperCF32 / VunpackLowerCF32 select one bf16 half back out (kVectorUnpack 0x109, CompressedBf16, with a uint upper/lower index); and the composite VunpackF32 widens both halves to full f32 by shifting the low lane left 16 and masking the high lane to 0xffff0000. The format and the unpack sub-lane index are packed into one 64-bit instruction word at +0x40. This is the LLO level the EUP AluEp UnpackFOp/PackFOp ultimately lower to (see EUP / Transcendental Slot).

The second is the segmented reduction (the per-segment cross-lane reduce). Three F32-only opcodes — kVector{Max,Min,Add}SegmentReduceF32 0xfa/0xfb/0xfc — reduce within lane segments whose boundaries are programmed by a separate kVectorSetSegmentPattern (0x8c) op built by Vsetspr. It is the TensorCore analogue of an embedding segment-sum: the accumulator resets at each segment boundary the pattern register marks. The RPU re-emit pass fuses a pair of segment reduces by sharing one pattern setup. On the device codegen path this is a Jellyfish/Pufferfish primitive only — Viperfish and Ghostlite return false from SupportsSegmentedReduce and their TensorCore emitters are dead LogFatal stubs, the per-segment embedding reduce having moved to the SparseCore on those generations.

For reimplementation, the contract is:

  • The bf16 pack: VpackBf16 → kVectorPack 0x126, two operands, VpackFormat = 7 InterleavedBf16, gated by SupportsVectorPackOps(7).
  • The bf16 unpack: Vunpack{Upper,Lower}CF32 → kVectorUnpack 0x109, one operand, VpackFormat = 1 CompressedBf16, uint index 1 (upper) / 0 (lower), gated by SupportsVectorUnpackOps.
  • The f32 widen: VunpackF32 = { CastTo(f32, lower<<16), CastTo(f32, upper&0xffff0000) }.
  • The instruction-word format field at +0x40: VpackFormat in bits 0-15, sub-lane/unpack index in bits 16-31, with the has-value / fan-in checks per opcode class.
  • The segment-reduce opcode set {0xfa,0xfb,0xfc} (F32 only), the kVectorSetSegmentPattern 0x8c boundary program, the RPU shared-pattern dispatch, the per-gen VEX-opcode map, and the JF/PF-only support gate.

| Item | Factory chain / value |
|---|---|
| bf16 pack | VpackBf16 (0x1d554680) → VpackBf16Inst (0x1d565940) → CreateVectorPack (0x1d4d3140) → kVectorPack 0x126, InterleavedBf16 (7) |
| bf16 unpack | Vunpack{Upper,Lower}CF32 (0x1d567f20/0x1d567e20) → CreateVectorUnpack (0x1d4d37c0) → kVectorUnpack 0x109, CompressedBf16 (1), index 1/0 |
| f32 widen | VunpackF32 (0x1d554620); lower <<16 (SimplifyShllU32), upper &0xffff0000 (SimplifyAndU32) |
| Format word | set_pack_format_sublane (0x1d4d3440) → QWORD[instr+0x40] bits 0-15 = VpackFormat, bits 16-31 = sub-lane index |
| Segment reduce | kVector{Max,Min,Add}SegmentReduceF32 0xfa/0xfb/0xfc (F32 only); LloOpcodeIsSegmentedReduction (0x1d60c340) = (op-0xfa)<3 |
| Segment pattern | Vsetspr (0x1d52ba60) → CreateVectorSetSegmentPattern (0x1d4d64a0) → kVectorSetSegmentPattern 0x8c (no mode field) |
| Segment-reduce support | JF / PF only; VF / GL SupportsSegmentedReduce = false, TC emitter is LogFatal |

bf16 Pack: InterleavedBf16, Two Lanes into One

Purpose

A reimplementer's entry point is "two bf16 vregs → one packed f32-width vreg". This is the format the EUP/XLU pipeline moves bf16 reductions through, because reducing a packed embedding row in f32 requires first widening it (next section), and storing the result as bf16 requires re-packing it.

Algorithm

VpackBf16(a, b) (0x1d554680) is a peephole front then a tail-call to the real factory:

function VpackBf16(a, b):                               // 0x1d554680
    folded = SimplifyPackF16(a, b, RegisterType=4)      // 0x1d599360, const-fold two bf16 constants
    if folded: return folded
    return VpackBf16Inst(a, b)                          // 0x1d565940 (tail)

function VpackBf16Inst(a, b):                           // 0x1d565940
    target = this->module->target                       // *(this+56)->+16
    if not target->vtable[+0x4c8](7):                   // SupportsVectorPackOps(format=7)
        UpdateStatus("target().SupportsVectorPackOps(format)")   // llo_region_builder.cc:13132
    inst = CreateVectorPack(0x126, 7, a, b, region)     // 0x1d4d3140
    return region->AppendInstruction(inst)

The decompile shows VpackBf16Inst calling CreateVectorPack(294, 7u, …), where 294 = 0x126 = kVectorPack and format 7 = InterleavedBf16. The call follows the gate (*(…+1224))(v7, 7), where 1224 = 0x4c8 is the target sub-object vtable slot for SupportsVectorPackOps. CreateVectorPack (0x1d4d3140) builds a two-operand New(0x126, {a,b}) and calls set_pack_format_sublane(7, nullopt).

NOTE: the bf16 pair-pack is not kVectorWeird (0xae). The 0xae op is the distinct one-operand VweirdBf16 factory (0x1d5546e0 → CreateVectorWeird 0x1d4d4e20), a single-input cross-lane op gated on SupportsBf16AluInstructions (vtable +0x780); the decompile passes a fixed result PrimitiveType 0x10 (F16) to CreateVectorWeird, which builds New(0xae, {in}, 1). The two share no factory.

Function Map

| Function | Address | Role |
|---|---|---|
| LloRegionBuilder::VpackBf16 | 0x1d554680 | Peephole + tail-call entry |
| LloRegionBuilder::VpackBf16Inst | 0x1d565940 | Gate + emit 0x126 |
| LloInstruction::CreateVectorPack | 0x1d4d3140 | New(0x126, {a,b}) + format |
| llo_simplifier::SimplifyPackF16 | 0x1d599360 | Constant-fold two bf16 |
| LloRegionBuilder::VweirdBf16 | 0x1d5546e0 | Distinct 1-op 0xae factory |

bf16 Unpack: CompressedBf16 with an Upper/Lower Index

Purpose

The inverse of the pack: pull one of the two bf16 halves back out of a packed lane. The CF32 form keeps the result in the compressed (16-bit-lane) domain; the F32 form (next section) widens it to a full 32-bit float.

Algorithm

function VunpackUpperCF32(in):                          // 0x1d567f20
    folded = SimplifyUnpack(index=1, format=1, in, 0)   // 0x1d593060
    if folded: return folded
    target = this->module->target
    if not target->vtable[+0x4d8](1, 0):                // SupportsVectorUnpackOps(format=1, TpuCoreType=0)
        UpdateStatus("target().SupportsVectorUnpackOps(format)")  // llo_region_builder.cc:13740
    inst = CreateVectorUnpack(0x109, index=1, format=1, in, region)   // 0x1d4d37c0
    return region->AppendInstruction(inst)

function VunpackLowerCF32(in):                          // 0x1d567e20
    // identical, index = 0; llo_region_builder.cc:13730

The decompile is unambiguous: VunpackUpperCF32 calls CreateVectorUnpack(0x109u, 1, 1u, …) and VunpackLowerCF32 calls CreateVectorUnpack(0x109u, 0, 1u, …) — opcode 0x109 (kVectorUnpack), index 1/0, format 1 (CompressedBf16). The gate vtable slot +0x4d8 (1240) is SupportsVectorUnpackOps, called with (1, 0). Upper = index 1, lower = index 0.

CreateVectorUnpack (0x1d4d37c0) builds a one-operand New(0x109, {in}) and calls set_pack_format_sublane(format, index | engaged) — the index is folded into the sub-lane field (next section).

The bf16 unpack opcode is 0x109 kVectorUnpack, distinct from 0x10f kVectorDynamicUnpack. 0x10f is the separate CreateVectorDynamicUnpack (0x1d4d3d80), gated by SupportsDynamicUnpackOps (vtable +0x4e0, false on Viperfish). The sibling sub-byte unpacks kVectorUnpackAndJoinB2ToB4/B4ToB8 (0x10a-0x10d) and kVectorUnpackEXMY (0x10e) are distinct opcodes again.

Function Map

| Function | Address | Emits |
|---|---|---|
| LloRegionBuilder::VunpackUpperCF32 | 0x1d567f20 | 0x109, idx 1, fmt 1 |
| LloRegionBuilder::VunpackLowerCF32 | 0x1d567e20 | 0x109, idx 0, fmt 1 |
| LloInstruction::CreateVectorUnpack | 0x1d4d37c0 | New(0x109, {in}) + format |
| LloInstruction::CreateVectorDynamicUnpack | 0x1d4d3d80 | 0x10f (separate path) |
| llo_simplifier::SimplifyUnpack | 0x1d593060 | Const-fold peephole |

The f32 Widen: Shift the Low Lane, Mask the High Lane

Purpose

bf16 is the top 16 bits of an f32. Widening therefore needs no rounding — only repositioning. VunpackF32 produces the pair of widened f32 lanes from one packed bf16 vreg.

Algorithm

function VunpackF32(in):                                // 0x1d554620
    lo = CastTo(f32 /*0x12*/, VunpackLowerF32(in))      // 0x1d528660
    hi = CastTo(f32 /*0x12*/, VunpackUpperF32(in))      // 0x1d528580
    return { lo, hi }                                   // composite pair

function VunpackLowerF32(in):                           // 0x1d528660
    u = CreateVectorUnpack(0x109, index=0, format=7 /*InterleavedBf16*/, in)
    return SimplifyShllU32(u, VectorU32Constant(16))    // 0x1d58f8a0 : low bf16 -> high 16 of f32, low 16 = 0

function VunpackUpperF32(in):                           // 0x1d528580
    u = CreateVectorUnpack(0x109, index=1, format=7, in)
    return SimplifyAndU32(VectorU32Constant(0xffff0000), u)  // 0x1d58dac0 : high bf16 already in place, mask low 16

The decompile confirms each step: VunpackLowerF32 builds VectorU32Constant(0x10) and calls SimplifyShllU32 (shift left 16); VunpackUpperF32 builds VectorU32Constant(0xFFFF0000) and calls SimplifyAndU32 (mask). The lower lane's 16 bf16 bits become the high half of an f32 with a zero mantissa-low; the upper lane is already in the high half, so it is masked clean. The result is exact — the round-trip pack/unpack is bit-preserving (bf16 is the f32 high half).
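The shift/mask widen can be checked on the host with plain integer arithmetic. The following is a minimal Python sketch of the net bit effect only, not of the hardware op sequence. The helper names (pack_pair, widen_lower, widen_upper) are hypothetical, and the in-lane layout (element 0 in bits 0-15, element 1 in bits 16-31) is inferred from the shift and mask constants above.

```python
import struct

def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 f32."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def f32_to_bits(x: float) -> int:
    """Reinterpret an f32 value as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def pack_pair(lo_bf16: int, hi_bf16: int) -> int:
    """Assumed layout: element 0 in bits 0-15, element 1 in bits 16-31."""
    return ((hi_bf16 & 0xFFFF) << 16) | (lo_bf16 & 0xFFFF)

def widen_lower(word: int) -> int:
    """Low bf16 moves to the high 16 bits; the low 16 bits become zero."""
    return (word << 16) & 0xFFFFFFFF

def widen_upper(word: int) -> int:
    """High bf16 is already in place; clear the low 16 bits."""
    return word & 0xFFFF0000

# 1.5 is 0x3FC00000 as f32, so its bf16 pattern is 0x3FC0;
# -2.0 is 0xC0000000 as f32, so its bf16 pattern is 0xC000.
word = pack_pair(0x3FC0, 0xC000)
lo = bits_to_f32(widen_lower(word))
hi = bits_to_f32(widen_upper(word))
print(hex(word), lo, hi)
```

Because the widen only repositions bits, taking the top 16 bits of each widened f32 recovers the original bf16 patterns exactly.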

NOTE — the VunpackF32 family widens with VpackFormat = 7 InterleavedBf16 (the interleaved layout, fan-in 2), whereas the CF32 family uses VpackFormat = 1 CompressedBf16. The format selects how the two bf16 lanes are laid out in the source word; the widen arithmetic (shift/mask) is the same either way.


The Instruction-Word Format Field (+0x40)

Purpose

Both pack and unpack stash their format (and, for unpack, the sub-lane index) in one 64-bit field of the LLO instruction. A reimplementer encoding these ops must reproduce the bit layout and the per-opcode validity checks.

Encoding

set_pack_format_sublane(VpackFormat fmt, optional<u16> sublane) (0x1d4d3440) writes QWORD[instr+0x40]. The decompile switches on the opcode:

function set_pack_format_sublane(instr, fmt, sublane):  // 0x1d4d3440
    word = instr[+0x40]                                 // read-modify-write of the 64-bit field
    switch instr.opcode:
      case 0x109, 0x10E:                                // unpack / unpackEXMY -> sublane REQUIRED
          assert sublane.has_value()                    // llo_instruction.cc:3714
          assert *sublane < VpackFormatSublanesIndices(fmt)   // :3715  (index < lane fan-in)
          word = (word & ~0xFFFF0000) | (sublane << 16)
      case 0x10F, 0x126:                                // dynamic-unpack / pack -> sublane FORBIDDEN
          assert not sublane.has_value()                // llo_instruction.cc:3723
      default:
          LogFatal("unexpected instruction: <opcode>")  // :3726
    word = (word & ~0xFFFF) | fmt                       // VpackFormat in bits 0-15
    instr[+0x40] = word

So bits 0-15 = VpackFormat, bits 16-31 = sub-lane/unpack index. The unpack opcodes (0x109, 0x10e) require a sub-lane index and check it is below the format's lane fan-in; the pack and dynamic-unpack opcodes (0x126, 0x10f) forbid one. VpackFormatSublanesIndices(fmt) (0x1d629d60) is a one-line return dword_B53C790[fmt]: a direct index into a .rodata i32 table at 0xb53c790 giving the number of source lanes that fan into one packed slot (2 or 4 per format).
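The field layout and the per-opcode checks can be modelled as below. This is a hedged Python sketch, not the binary's code: the function mirrors the pseudocode above, errors are raised as ValueError rather than asserted, and FAN_IN is a partial, hand-entered stand-in for the .rodata table that holds only the three formats whose fan-in this page states.

```python
# Partial stand-in for the fan-in table: only formats documented on this page.
FAN_IN = {1: 2, 7: 2, 11: 2}  # CompressedBf16, InterleavedBf16, CompressedHf16

SUBLANE_REQUIRED = {0x109, 0x10E}   # kVectorUnpack, kVectorUnpackEXMY
SUBLANE_FORBIDDEN = {0x10F, 0x126}  # kVectorDynamicUnpack, kVectorPack

def set_pack_format_sublane(opcode, word, fmt, sublane=None):
    """Return the updated 64-bit field value for the given opcode."""
    if opcode in SUBLANE_REQUIRED:
        if sublane is None:
            raise ValueError("sub-lane index required for unpack opcodes")
        if not 0 <= sublane < FAN_IN[fmt]:
            raise ValueError("sub-lane index must be below the lane fan-in")
        word = (word & ~0xFFFF0000) | (sublane << 16)
    elif opcode in SUBLANE_FORBIDDEN:
        if sublane is not None:
            raise ValueError("sub-lane index forbidden for this opcode")
    else:
        raise ValueError(f"unexpected instruction: {opcode:#x}")
    word = (word & ~0xFFFF) | (fmt & 0xFFFF)  # VpackFormat in bits 0-15
    return word & 0xFFFFFFFFFFFFFFFF

def decode(word):
    """Split the field back into (format, sub-lane index)."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

print(hex(set_pack_format_sublane(0x109, 0, 1, 1)))  # upper CF32 unpack
print(hex(set_pack_format_sublane(0x126, 0, 7)))     # bf16 pack
```

Note that a lower-half unpack (index 0) and a pack with the same format produce the same field value; the opcode, not the field, distinguishes them.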

VpackFormat (selected values)

The VpackFormat enum names come from VpackFormatString (0x1d629960). The values directly on the bf16↔f32 datapath:

| Value | Name | Lane fan-in | Role |
|---|---|---|---|
| 0 | (invalid) | | Sentinel / unset |
| 1 | CompressedBf16 | 2 | CF32 unpack source |
| 7 | InterleavedBf16 | 2 | VpackBf16 pack format; VunpackF32 widen |
| 11 | CompressedHf16 | 2 | f16 (half) compressed |
| 19-22 | Compressed{U8,S8,U4,S4}ToBf16 | 2 / 4 | int→bf16 dequant (embedding quant) |

The full 26-entry enum extends through the fp8 (F8E5M2/F8E4M3*) and sub-byte dequant formats; those are the province of the quant pack/unpack path. See Bias-Add & Quant/Dequant Helpers.

Per-Generation Pack/Unpack Capability Masks

The gate vtable slots (+0x4c8 pack, +0x4d8 unpack, +0x4e0 dynamic-unpack) on the target's +0x10 sub-object are byte-exact bitmask predicates:

| Target / method | Address | Predicate | Supported formats |
|---|---|---|---|
| ViperfishTarget::SupportsVectorPackOps | 0x1d49b1a0 | (fmt-1) < 0xA | 1..10 |
| ViperfishTarget::SupportsVectorUnpackOps | 0x1d49b1c0 | fmt < 0xE && (0x39FE >> fmt)&1 | 1..8, 11..13 |
| GhostliteTarget::SupportsVectorPackOps | 0x1d498020 | fmt < 0x17 && (0x7807FE >> fmt)&1 | 1..10, 19..22 |
| GhostliteTarget::SupportsVectorUnpackOps | 0x1d498040 | fmt < 0x17 && (0x7839FE >> fmt)&1 | 1..8, 11..13, 19..22 |

InterleavedBf16 (7) is in both generations' pack set and CompressedBf16 (1) in both unpack sets, so the bf16↔f32 datapath is universally available; Ghostlite additionally admits the U8/S8/U4/S4→bf16 dequant pack formats (embedding quantization).
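The supported-format sets follow mechanically from the predicates, so they can be re-derived rather than trusted. A minimal Python sketch, assuming the comparisons are unsigned 32-bit (so that fmt = 0 fails the Viperfish pack test) and using hypothetical helper names:

```python
def viperfish_pack(fmt: int) -> bool:
    # (fmt - 1) < 0xA, modelled as an unsigned 32-bit compare (assumption):
    # fmt = 0 wraps to 0xFFFFFFFF and is rejected.
    return ((fmt - 1) & 0xFFFFFFFF) < 0xA

def masked(limit: int, mask: int):
    """Build a predicate of the form fmt < limit && (mask >> fmt) & 1."""
    def pred(fmt: int) -> bool:
        return 0 <= fmt < limit and ((mask >> fmt) & 1) == 1
    return pred

viperfish_unpack = masked(0xE, 0x39FE)
ghostlite_pack = masked(0x17, 0x7807FE)
ghostlite_unpack = masked(0x17, 0x7839FE)

def supported(pred, n_formats: int = 26):
    """List every VpackFormat value (0..25) the predicate accepts."""
    return [f for f in range(n_formats) if pred(f)]

for name, pred in [("VF pack", viperfish_pack), ("VF unpack", viperfish_unpack),
                   ("GL pack", ghostlite_pack), ("GL unpack", ghostlite_unpack)]:
    print(name, supported(pred))
```

Both unpack masks share the low 16 bits 0x39FE, which has bits 9 and 10 clear, so formats 9 and 10 are pack-only on both generations.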


Segmented Reduction: Per-Segment Cross-Lane Reduce

Purpose

A segment reduce is an embedding-style aggregation: the lanes of a vreg are partitioned into segments, and the reduce produces one accumulator per segment, resetting at each boundary. The boundary pattern is supplied by a separate op. A reimplementer must model this as a two-op sequence — set the pattern, then reduce against it.
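The reset-at-boundary behaviour can be illustrated with a short host-side model. This Python sketch is illustrative only: the boundary encoding (a per-lane flag marking the first lane of each segment) and the compact one-result-per-segment output are assumptions, since this page does not decode how the pattern register encodes boundaries or which lanes receive each segment's result.

```python
import operator
from typing import Callable, List, Sequence

def segment_reduce(values: Sequence[float],
                   starts: Sequence[bool],
                   op: Callable[[float, float], float]) -> List[float]:
    """Reduce `values` within segments, restarting the accumulator at each start.

    starts[i] is True where lane i begins a new segment (assumed encoding);
    lane 0 always begins a segment.
    """
    out: List[float] = []
    for i, (v, is_start) in enumerate(zip(values, starts)):
        if i == 0 or is_start:
            out.append(v)            # accumulator resets at the boundary
        else:
            out[-1] = op(out[-1], v)  # keep accumulating inside the segment
    return out

lanes = [1.0, 2.0, 3.0, 4.0, 5.0]
pattern = [True, False, True, False, False]   # two segments: lanes 0-1 and 2-4

print(segment_reduce(lanes, pattern, operator.add))  # Add variant
print(segment_reduce(lanes, pattern, max))           # Max variant
print(segment_reduce(lanes, pattern, min))           # Min variant
```

The three operators correspond to the Add, Max, and Min variants of the reduce; the pattern is supplied once and reused by each.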

Algorithm

The opcode set is exactly three, F32-only:

bool LloOpcodeIsSegmentedReduction(op):                 // 0x1d60c340
    return (uint16)(op - 0xfa) < 3                       // {0xfa, 0xfb, 0xfc}
  • 0xfa kVectorMaxSegmentReduceF32
  • 0xfb kVectorMinSegmentReduceF32
  • 0xfc kVectorAddSegmentReduceF32

There is no bf16 segment reduce: the surrounding reduce family is 0xf5-0xf9 (plain F32 Min/Max/Add/MaxIndex/MinIndex), then 0xfa-0xfc (F32 segment), then 0xfd-0x101 (plain bf16 reduces); 0x102 (kVectorSublaneId) is the next non-reduce op.
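The single-compare range test relies on unsigned 16-bit wraparound, which Python integers do not provide on their own. A minimal sketch that makes the wrap explicit (the function name is a hypothetical Python spelling of the predicate):

```python
def llo_opcode_is_segmented_reduction(op: int) -> bool:
    """True only for 0xfa, 0xfb, 0xfc.

    The mask models the (uint16) cast: opcodes below 0xfa wrap to large
    values and therefore fail the < 3 test.
    """
    return ((op - 0xFA) & 0xFFFF) < 3

# Neighbouring opcodes from the reduce family, for contrast.
for op in (0xF9, 0xFA, 0xFB, 0xFC, 0xFD):
    print(hex(op), llo_opcode_is_segmented_reduction(op))
```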

The boundary register is programmed by Vsetspr:

function Vsetspr(pattern_src, xlu, source_bus):          // 0x1d52ba60
    inst = CreateVectorSetSegmentPattern(pattern_src, xlu, source_bus, region)   // 0x1d4d64a0
    return region->AppendInstruction(inst)

function CreateVectorSetSegmentPattern(pattern, xlu, bus, region):   // 0x1d4d64a0
    assert opcode_produced_register_type[pattern.opcode] == 4   // "pattern->ProducesVreg()", llo_instruction.cc:918
    inst = New(0x8c /*kVectorSetSegmentPattern*/, {pattern})     // one operand
    ValidateAndSetXluAndSourceBus(xlu, bus, inst)               // XLU index -> WORD[instr+0xb]
    set_annotation(inst)
    return inst

CreateVectorSetSegmentPattern (0x1d4d64a0) asserts the pattern source ProducesVreg() (opcode_produced_register_type == 4), builds the one-operand New(0x8c, {pattern}), and stamps the XLU/source-bus into WORD[instr+0xb] via ValidateAndSetXluAndSourceBus (0x1d4d5180: bits 8-9 = xlu & 3, bit 10 0x400 = the XLU-op flag).
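The XLU stamp reduces to two bit operations on the 16-bit word. The sketch below is a hedged Python model covering only the bits named above (bits 8-9 and bit 10); the source-bus encoding is not described on this page and is left out, and clearing the field before writing it is an assumption.

```python
XLU_INDEX_SHIFT = 8
XLU_INDEX_MASK = 0x3 << XLU_INDEX_SHIFT  # bits 8-9
XLU_OP_FLAG = 0x400                      # bit 10

def stamp_xlu(word: int, xlu: int) -> int:
    """Write the XLU index and set the XLU-op flag in a 16-bit word.

    Assumption: the field is cleared before being written. Source-bus
    bits are not modelled.
    """
    word &= ~(XLU_INDEX_MASK | XLU_OP_FLAG) & 0xFFFF
    return word | ((xlu & 3) << XLU_INDEX_SHIFT) | XLU_OP_FLAG

def read_xlu(word: int) -> int:
    """Recover the XLU index from bits 8-9."""
    return (word >> XLU_INDEX_SHIFT) & 3

print(hex(stamp_xlu(0x0000, 1)))
print(read_xlu(stamp_xlu(0x00FF, 2)))
```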

NOTE — kVectorSetSegmentPattern (0x8c) carries no mode field, unlike its sibling kVectorSetPermutePattern (0x8b), which stores a SetPermuteMode ({one_sublane=0, one_sublanes=1}) at DWORD[instr+0x40]. The segment pattern is a pure per-lane segment-boundary program; the permute pattern additionally selects a permute granularity. See XLU Op Roster.


The RPU Re-emit: Sharing One Pattern Setup

Purpose

The reorder/pipeline (RPU) re-emit pass fuses adjacent XLU reduces so a pair of segment reduces shares one Vsetspr pattern setup — the per-segment cross-lane reduce economy. A reimplementer's scheduler must replicate this fusion to match the emitted bundle stream.

Algorithm

ReemitReorderedCombinedXluOperations caches one pattern result per XLU in a per-XLU state record, then rewires the fused reduce operands onto it:

// per source producer:
if producer is SetPermutePattern (0x8b):
    Vsetperm(mode = variant[+0x40], xlu = rpu[+0x20]); cache -> PerXluState[+0x08]
if producer is SetSegmentPattern (0x8c):
    Vsetspr(xlu = rpu[+0x20]);                          cache -> PerXluState[+0x18]

// reduce body:
if LloOpcodeIsSegmentedReduction(op):                   // segment path
    ReplaceOperandIn(op_a, shared_segment_result)       // DOUBLE rewire ...
    ReplaceOperandIn(op_b, shared_segment_result)       // ... both fused operands
else:                                                    // plain reduce path
    ReplaceOperandIn(op, shared_permute_result)         // single rewire
PopInstruction(); AppendInstruction()                   // re-home both ops

The segment path does a double ReplaceOperandIn (both fused-reduce source operands rewire onto the shared Vsetspr segment-pattern result); the plain reduce path rewires a single operand onto the shared Vsetperm permute result. This is how two segment reduces on the same XLU collapse to one pattern program plus two reduces.


Per-Generation TensorCore Emit and Support

Purpose

The LLO segment ops map to VEX-control opcodes the TensorCore bundle understands, but only on Jellyfish and Pufferfish. A reimplementer targeting Viperfish or later must route per-segment embedding reduction to the SparseCore instead.

Encoding

JellyfishEmitter::EmitVectorSegmentedReduce (0x140b6c80) maps the LLO opcode to a VEX opcode and emits it through EmitVectorExtendedInstruction (one Vregno data operand scheduled into a VEX bundle slot):

function EmitVectorSegmentedReduce(op, src, trf_id):     // 0x140b6c80
    CHECK(trf_id == 0)                                   // jellyfish_emitter.cc:1248
    switch op:
      case 0xfa: vex = 0x1f                              // kVectorMaxSegmentReduceF32
      case 0xfb: vex = 0x20                              // kVectorMinSegmentReduceF32
      case 0xfc: vex = 0x1e                              // kVectorAddSegmentReduceF32
      default:   NoteError("unhandled opcode for vector segmented reduce: %s")
    EmitVectorExtendedInstruction(this, vex, src, 0)

function EmitVectorSetSegmentPattern(src, spr_id):       // 0x140b58e0
    CHECK(spr_id == 0)                                   // jellyfish_emitter.cc:1027
    EmitVectorExtendedInstruction(this, 0xe, src, 0)     // VEX opcode 0xe

The decompile confirms the literal VEX values: 0xfa→0x1f (31), 0xfb→0x20 (32), 0xfc→0x1e (30), SetSegmentPattern→0xe (14). The plain reduce family 0xf5-0xf9 maps to VEX {0x16,0x15,0x14,0x17,0x18} (a separate cross-lane-reduce emitter). The .seg.perm cross-lane mnemonics (vadd.xlane.seg.perm, vmax.xlane.seg.perm, vmin.xlane.seg.perm) in .rodata confirm the segment-pattern-driven reduce semantics.

| LLO opcode | Name | JF VEX opcode |
|---|---|---|
| 0x8c | kVectorSetSegmentPattern | 0xe |
| 0xfc | kVectorAddSegmentReduceF32 | 0x1e |
| 0xfa | kVectorMaxSegmentReduceF32 | 0x1f |
| 0xfb | kVectorMinSegmentReduceF32 | 0x20 |

The JF/PF-Only Support Gate

SupportsSegmentedReduce is a one-line per-target predicate:

| Target | Address | Returns | TC emitter behaviour |
|---|---|---|---|
| JellyfishTarget | 0x1d4909c0 | 1 (true) | Emits VEX 0xe/0x1e/0x1f/0x20 |
| PufferfishTarget | 0x1d494f80 | 1 (true) | Emits (no partial-result drain: SupportsSegmentedReducePartialResults is false) |
| ViperfishTarget | 0x1d49b380 | 0 (false) | LogFatal "Operation not supported." |
| GhostliteTarget | 0x1d4981c0 | 0 (false) | LogFatal "Operation not supported on Ghostlite." |

Both the Viperfish (ViperfishTensorCoreEmitter::EmitVectorSegmentedReduce 0x141dd2c0) and Ghostlite (GhostliteTensorCoreEmitter::EmitVectorSegmentedReduce 0x1429ff60) TensorCore segment-reduce emitters are __noreturn LogMessageFatal stubs. On those generations the per-segment embedding aggregation runs on the SparseCore VectorExtended unit, not the TensorCore XLU — the segment-reduce became a JF/PF legacy primitive.
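A reimplementation can fold the support gate and the opcode map into one lookup. The Python sketch below is one possible shape, not the binary's structure: the target keys and function name are hypothetical, the VEX values are the Jellyfish ones confirmed above, and reusing them for Pufferfish is an assumption this page does not verify.

```python
# LLO opcode -> VEX opcode, as confirmed for JellyfishEmitter.
VEX_BY_LLO = {
    0x8C: 0x0E,  # kVectorSetSegmentPattern
    0xFC: 0x1E,  # kVectorAddSegmentReduceF32
    0xFA: 0x1F,  # kVectorMaxSegmentReduceF32
    0xFB: 0x20,  # kVectorMinSegmentReduceF32
}

# SupportsSegmentedReduce per target (hypothetical string keys).
SUPPORTS_SEGMENTED_REDUCE = {
    "jellyfish": True,
    "pufferfish": True,   # assumed to share the Jellyfish VEX map
    "viperfish": False,
    "ghostlite": False,
}

def vex_for_segment_op(target: str, llo_opcode: int) -> int:
    """Return the VEX opcode, or raise where the TensorCore path does not exist."""
    if not SUPPORTS_SEGMENTED_REDUCE[target]:
        # Stands in for the LogFatal stubs; route to the SparseCore instead.
        raise NotImplementedError(f"segment reduce not supported on {target}")
    try:
        return VEX_BY_LLO[llo_opcode]
    except KeyError:
        raise ValueError(
            f"unhandled opcode for vector segmented reduce: {llo_opcode:#x}")

print(hex(vex_for_segment_op("jellyfish", 0xFC)))
```

Raising instead of aborting lets a caller fall back to a SparseCore lowering, which is a design choice of this sketch rather than behaviour of the original emitters.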


Worked Example: Reduce a Packed bf16 Embedding Row in f32

A bf16 embedding tensor lives two-bf16-per-32-bit-lane (InterleavedBf16). To reduce it in f32:

  1. WIDEN   VunpackF32(packed) -> { CastTo(f32, VunpackLowerF32), CastTo(f32, VunpackUpperF32) }
               lower: kVectorUnpack(0x109, idx0, fmt7) -> <<16   (low bf16 -> high 16 of f32)
               upper: kVectorUnpack(0x109, idx1, fmt7) -> &0xffff0000

  2. REDUCE  (JF/PF) Vsetspr -> kVectorSetSegmentPattern(0x8c) programs the per-lane boundary;
               kVectorAddSegmentReduceF32(0xfc -> VEX 0x1e) sums each segment, resetting at boundaries.
               A second reduce on the same XLU SHARES the one Vsetspr (RPU double ReplaceOperandIn).

  3. RE-PACK VpackBf16(lo, hi) -> kVectorPack(0x126, InterleavedBf16) interleaves the f32 halves
               back to one packed bf16 lane for storage.

  On Viperfish/Ghostlite step 2 has no TensorCore path: the segment reduce runs on the SparseCore.

What Is Not Decoded

  • The exact middle VpackFormat enum indices (the CompressedB8/B4/B2/B1 arms and the fp8/sub-byte dequant arms 12-25) are named in VpackFormatString; their per-arm byte-overlap offsets are not pinned here.
  • The VpackFormatSublanesIndices fan-in semantics for the dequant formats (12-25): the 2/4 table values are dumped, but whether 4 means "4 sub-byte elements per slot" vs "4-bit element width" is not separated per format.
  • The per-gen VEX bundle slot the segment-reduce / SetSegmentPattern op lands in (EmitVectorExtendedInstruction schedules via CurrentBundle/GetPopulatedSlots/FindFreeSlot).
  • The VectorExtendedOpcode proto enum symbolic names per value (descriptor 0x1fa1fd00): the VEX values are listed but the proto member names are not pulled from the serialized FileDescriptor.
  • The host-side embedding lowering that emits the SetSegmentPattern + SegmentReduce pair (how an HLO segment-sum's offsets become the segment-id pattern source).

Cross-References

  • VPU (Vector-ALU) Slot — the VALU slot that carries the vpack/vunpack opcode immediates and the convert family
  • EUP / Transcendental Slot — the AluEp UnpackFOp/PackFOp and SupportsBf16AluInstructions lane-width model these LLO ops lower from
  • XLU Op Roster — the Vsetperm/Vsetspr set-pattern factories and the permute-vs-segment distinction
  • Bias-Add & Quant/Dequant Helpers — the fp8/int8 pack formats (12-25) and the quant pack/unpack helpers that share set_pack_format_sublane
  • Bundle Model — the per-generation VLIW bundle the VEX-control segment ops schedule into