Memory Bandwidth & Latency Model

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped — every symbol below is a demangled C++ name. Other versions differ. All addresses are virtual addresses; for this binary .rodata VMA == file offset (0x84a0000), and the rodata doubles quoted here were decoded from .rodata at those VAs.

Abstract

This page documents the memory-side cost model: the half of the TPU cost model that prices a DMA by how many TensorCore cycles it costs to move N bytes through a particular memory tier. Where WindowDescription Byte-Cost turns a windowed access into a byte count, this page turns that byte count into cycles. The model has exactly two terms, summed:

dma_cycles = init_latency_cycles  +  transferred_bytes / bytes_per_cycle
             └─ fixed startup ─┘     └────── bandwidth-proportional ──────┘

The startup term is a flat per-(destination-tier) constant — the InitialDmaLatencyInNs virtual, scaled to cycles by DefaultHbmInitLatency. The bandwidth term is bytes / bytes_per_cycle, where bytes_per_cycle is derived once from the full-chip HBM (or CMEM) byte rate divided by the TensorCore clock and core count. The two terms are deposited into two independent ResourceVector lanes and the scheduler maxes the lanes: a transfer is latency-bound when the fixed startup dominates and bandwidth-bound when bytes / bytes_per_cycle dominates. This per-lane max is the model's roofline. On v5p and newer there is a second, separate two-lane computation inside WindowCycles itself — the bandwidth is priced twice, once for an HBM byte count and once for a VMEM byte count, and the slower of the two wins — but that block is opt-in (only the convolution classic-window path supplies both byte counts); the generic priced DMA leaves it dormant.

The familiar reference frame is a textbook roofline: time = max(latency, size / bandwidth). The divergences a reimplementer must respect are (1) the unit is TensorCore cycles, not seconds — both terms are pre-multiplied by TC frequency so they compose with the CycleTable throughput integers; (2) the startup latency is charged once per resource lane, not once per transfer, so a chain of DMAs into the same lane pays one startup; and (3) the per-tier rate constants are per-generation virtual overrides returning std::optional<double>, where an absent value means "this tier pair is not modelled" rather than "zero bandwidth."

For reimplementation, the contract is:

  • The per-(dst tier) InitialDmaLatencyInNs constants (HBM/VMEM/SMEM/CMEM), per generation, and their three distinct override shapes (constant / 2-entry table / VMEM-vs-other branch), plus the LogFatal base.
  • DefaultHbmInitLatency = InitialDmaLatencyInNs(elem, bytes) · (TC_MHz / 1000.0) — the ns→cycle conversion, and the MemUnitFromKiB(0) sentinel that short-circuits a no-transfer MemUnit.
  • The byte→cycle conversion: MemUnitFromKiB (KiB→granule-unit shift) and ChunkGranules ((ChunkCellCount·4) / ChunkGranuleBytes), the granule that windowing_util::Size rounds up to.
  • bytes_per_cycle = FullChipBytesPerSecond / (TC_MHz · 1e6) / CoresPerChip, from GetBytesPerCycle, including the kHbm/kCmem-only CHECK and the two env/backend-config overrides.
  • The roofline assembly in WindowCycles: the startup init add, the (hbm_b | vmem_b) opt-in gate around the per-direction HBM/VMEM max, and the inner v5p+ (Target+0x398 >= 5) gate.
  • The per-tier LocalDmaBandwidth<Src>To<Dst> GB/s matrix that feeds the copy-mode decision (ShouldUseAsyncLocalCopy), distinct from the cycle-cost bytes_per_cycle.

| Role | Symbol and address |
| --- | --- |
| Startup latency (cycles) | fusion_util::DefaultHbmInitLatency(MemUnit, WindowDescription, Target) @0x14552ca0 |
| Per-tier startup (ns) | Target::InitialDmaLatencyInNs(MemorySpace, double) @0x1d61c880 (base LogFatal) |
| Roofline assembly | fusion_util::WindowCycles(MemUnit, WindowDescription, Target, double, long, long, long) @0x14552660 |
| Bytes/cycle budget | fusion_util::GetBytesPerCycle(HloInstruction, Target, MemorySpace) @0x1454dd00 |
| KiB→granule conversion | Target::MemUnitFromKiB(long) @0x1d61bf20 |
| Granule | Target::ChunkGranules() @0x1d61a440 = (ChunkCellCount·4) / ChunkGranuleBytes |
| Full-chip HBM rate | Target::HbmFullChipBytesPerSecond() @0x1d6172a0 (= Target+0x4f0) |
| Cores per chip | Target::CoresPerChip(TpuCoreType) @0x1d615b40 |
| Per-tier GB/s matrix | Target::LocalDmaBandwidth(MemorySpace, MemorySpace) @0x1d6168e0 (dispatcher) |
| Copy-mode consumer | ShouldUseAsyncLocalCopy(HloInstruction, MemorySpace, MemorySpace, Target) @0x133eff40 |

The Two-Term Cost

Purpose

Every priced operand DMA — a conv kernel transfer, a reduce-window tile, any kCopy/kCopyStart priced by the cost model — deposits two cycle values into the ResourceVector: a latency value into the startup lane and a bandwidth value into the transfer lane. The two are deposited separately because the scheduler maxes lanes independently; the latency lane is the analogue of an LLVM WriteLatency's fixed component, the bandwidth lane the analogue of getMemoryOpCost's size-proportional term.

The arithmetic

// Per priced operand (input or output), inside RecordMemXferCyclesImpl:
granule        = Target.ChunkGranules()                            // @0x1d61a440
raw_bytes      = windowing_util::Size(wd.strides, MemUnit, granule)// align_up to granule
transfer_bytes = raw_bytes / (Compact2ndMinorRatio · ElementPackingFactor)

if rv[latency_lane] == 0.0:                                        // charged ONCE per lane
    rv.Acc(latency_lane, DefaultHbmInitLatency(MemUnit, wd, Target))   // @0x14552ca0
bytes_per_cycle = GetBytesPerCycle(hlo, Target, MemSpace)          // @0x1454dd00
rv.Acc(bw_lane, WindowCycles(MemUnit, wd, Target, bytes_per_cycle, 0, 0, 0))  // @0x14552660

The rv[latency_lane] == 0.0 guard is the once-per-lane rule: the DMA startup is billed only the first time a lane is touched in this op's cost, so N transfers into the same lane pay one startup, not N. The bandwidth deposit always accumulates. The input/output lane split (R9/R10 for input, R11/R12 for output) belongs to WindowDescription Byte-Cost, which prices the byte count this page consumes.

NOTE — the two terms are deposited into different lanes and are never combined into a single number until the bundle-level MaxResourceCycles reduction. A transfer whose startup latency (e.g. 2100 cycles on v6e HBM: 1200 ns at 1750 MHz) exceeds its bandwidth cycles is latency-bound: the latency lane dominates the bundle max. A large contiguous transfer is bandwidth-bound: the bytes / bytes_per_cycle lane dominates. This per-lane separation is the roofline.
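
The once-per-lane rule and the per-lane max can be illustrated with a minimal Python sketch. The `ResourceVector` class, the lane names `"lat"` and `"bw"`, and the `bytes_per_cycle` value of 64 are hypothetical stand-ins chosen for illustration; only the guard-then-accumulate structure and the 2100-cycle startup (1200 ns at 1750 MHz) come from this page.

```python
from collections import defaultdict


class ResourceVector:
    """Toy stand-in for the cost model's ResourceVector: named lanes of cycles."""

    def __init__(self):
        self.lanes = defaultdict(float)

    def acc(self, lane, cycles):
        # Accumulate cycles into one lane.
        self.lanes[lane] += cycles

    def max_resource_cycles(self):
        # Bundle-level reduction: the slowest lane decides the cost.
        return max(self.lanes.values(), default=0.0)


def record_xfer(rv, latency_lane, bw_lane, init_cycles, transfer_bytes, bytes_per_cycle):
    # Startup is billed only the first time the latency lane is touched.
    if rv.lanes[latency_lane] == 0.0:
        rv.acc(latency_lane, init_cycles)
    # The bandwidth deposit always accumulates.
    rv.acc(bw_lane, transfer_bytes / bytes_per_cycle)


rv = ResourceVector()
for nbytes in (4096, 4096, 4096):
    # bytes_per_cycle=64.0 is an assumed value, not read from the binary.
    record_xfer(rv, "lat", "bw", init_cycles=2100.0,
                transfer_bytes=nbytes, bytes_per_cycle=64.0)

print(rv.lanes["lat"], rv.lanes["bw"], rv.max_resource_cycles())
```

Three small transfers pay one startup (2100 cycles) and 192 bandwidth cycles in total, so under these assumed numbers the latency lane dominates the max.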


Per-Tier Startup Latency — InitialDmaLatencyInNs

Purpose

Target::InitialDmaLatencyInNs(MemorySpace dst, double burst_size) returns the fixed nanosecond startup cost of beginning a DMA whose destination is the given tier. The burst_size argument is accepted but ignored by every override — the latency is purely a per-(dst tier) constant, never a function of transfer size. This is the value DefaultHbmInitLatency scales to cycles.

The per-tier constants

Decoded from the vmovsd xmm0, [rodata] immediates inside each per-gen override (all .rodata doubles verified byte-exact):

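| dst MemorySpace | Jellyfish (v2/v3) | Pufferfish (v4) | Viperfish (v5p) | Ghostlite (v6e) | base |
| --- | --- | --- | --- | --- | --- |
| VMEM (MS=3) | 240 | 555 | 0 | 0 | LogFatal |
| CMEM (MS=4) | 240 | 50 | 1200 | 1200 | LogFatal |
| SMEM (MS=5) | 240 | 555 | 1200 | 1200 | LogFatal |
| HBM / any other | 240 | 555 | 1200 | 1200 | LogFatal |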

The three override shapes and the fatal base, each confirmed byte-exact in the decompile:

// JellyfishTarget @0x1d48f3a0 — unconditional constant, no MemorySpace branch
return 240.0;                              // vmovsd xmm0, [0xa2de6a8] = 240.0

// PufferfishTarget @0x1d493d00 — 2-entry table, index = (dst == kCmem)
idx = (dst == 4);                          // sete al
return ((double*)0xa2dcd40)[idx];          // [555.0, 50.0]  → CMEM=50, else=555

// ViperfishTarget @0x1d499ca0 / GhostliteTarget @0x1d496dc0 — VMEM-vs-other branch
return (dst == 3) ? 0.0 : 1200.0;          // vxorps→0; if dst!=VMEM, [0xa2df9f8] = 1200.0

// base Target @0x1d61c880 — LogMessageFatal("Unimplemented") at target.h:1235
LogFatal();                                // never reached for a configured target

Latency rodata constants (decoded from .rodata):

| VA | double | role |
| --- | --- | --- |
| 0xa2de6a8 | 240.0 | Jellyfish, all tiers |
| 0xa2dcd40 | 555.0 | Pufferfish default (table index 0) |
| 0xa2dcd48 | 50.0 | Pufferfish CMEM (table index 1) |
| 0xa2df9f8 | 1200.0 | Viperfish / Ghostlite non-VMEM |

GOTCHA — Pufferfish's CMEM destination is cheaper (50 ns) than every other tier (555 ns), the only tier where a non-HBM destination lowers startup cost. The 2-entry table at 0xa2dcd40 is indexed by the boolean dst == kCmem(4); a reimplementation that hardcodes a single PF latency will misprice every CMEM-targeted DMA by 11×. Conversely, on Viperfish/Ghostlite the only cheap tier is VMEM (0 ns) — every non-VMEM destination, including CMEM and SMEM, pays the full 1200 ns. The polarity of the cheap tier flips between v4 and v5p+.

QUIRK — the base Target::InitialDmaLatencyInNs is a LogMessageFatal("Unimplemented") at target.h:1235, not a default of zero. A MemorySpace/generation pair that reaches the base virtual aborts the compile. Every shipping generation overrides it, so this fatal is a guard against an unconfigured target, not a live code path.
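
A compact Python sketch of the dispatch shapes described above may help a reimplementer check the polarity flip. The function name, the string generation keys, and the `MemorySpace` enum are illustrative assumptions; the numeric constants and the enum values (HBM=1, VMEM=3, CMEM=4, SMEM=5) are the ones quoted on this page.

```python
from enum import IntEnum


class MemorySpace(IntEnum):
    HBM = 1
    VMEM = 3
    CMEM = 4
    SMEM = 5


def initial_dma_latency_ns(generation, dst):
    """Startup latency in ns for a DMA whose destination tier is `dst`."""
    if generation == "jellyfish":
        # Unconditional constant, no MemorySpace branch.
        return 240.0
    if generation == "pufferfish":
        # 2-entry table indexed by the boolean (dst == CMEM).
        return (555.0, 50.0)[dst == MemorySpace.CMEM]
    if generation in ("viperfish", "ghostlite"):
        # VMEM-vs-other branch.
        return 0.0 if dst == MemorySpace.VMEM else 1200.0
    # Mirrors the base LogMessageFatal("Unimplemented").
    raise NotImplementedError("Unimplemented")


for gen in ("jellyfish", "pufferfish", "viperfish", "ghostlite"):
    print(gen, [initial_dma_latency_ns(gen, ms) for ms in MemorySpace])
```

The sketch raises instead of returning zero for an unconfigured generation, matching the fatal base rather than a silent default.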


ns → Cycles — DefaultHbmInitLatency

Purpose

DefaultHbmInitLatency is the cost model's adapter that turns the per-tier nanosecond constant into TensorCore cycles, so the startup latency composes with the cycle-denominated bandwidth and CycleTable terms. It is the value deposited into the latency lane.

Algorithm

function DefaultHbmInitLatency(MemUnit mu, WindowDescription wd, Target t):  // @0x14552ca0
    sentinel = t.MemUnitFromKiB(0)                  // @0x1d61bf20 — the "no transfer" MemUnit
    if mu == sentinel: return 0.0                   // vxorpd — short-circuit, no DMA
    CHECK(mu.amount_ != INT64_MAX)                  // target.h:211 guard
    elem = (wd.shape != null) ? (WORD[wd.shape+0xb] >> 2) & 0x1f : 0   // operand element type
    init_ns = t.InitialDmaLatencyInNs(elem, window_bytes)             // vtable[+0x20]
    tc_ghz  = t.TensorCoreFrequencyInMegaHertz() / 1000.0             // /0xa2e0430 = /1000.0
    return init_ns * tc_ghz                                           // ns · GHz = cycles

The conversion is exactly ns · GHz: dividing TC_MHz by 1000.0 yields GHz, and nanoseconds · gigahertz is dimensionless cycles. The /1000.0 divisor was confirmed at .rodata 0xa2e0430 = 1000.0. For a non-VMEM destination on v6e (InitialDmaLatencyInNs = 1200 ns, Ghostlite TC = 1750 MHz): 1200 · 1.75 = 2100 cycles deposited once into the latency lane. (The 1200 ns is byte-exact; the TC frequency is not a .rodata constant but a per-codename chip_parts field — Target+0x90c — whose value is itself recoverable: v6e = 1750 MHz, v7x = 1900 MHz, both CONFIRMED against the kDeviceTypeInfo tensorcore_clk_khz table (+0x50, ×1000 → Hz) and the per-codename chip_parts Core[TC].frequency_mhz row.)
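
The ns · GHz conversion is small enough to check by hand. The following Python sketch reproduces it under the assumption that the startup constant and the TC frequency are passed in directly; the function name is hypothetical.

```python
def init_latency_cycles(init_ns, tc_mhz):
    """Convert a startup latency in ns to TensorCore cycles (ns * GHz)."""
    tc_ghz = tc_mhz / 1000.0
    return init_ns * tc_ghz


# v6e non-VMEM destination: 1200 ns at 1750 MHz.
print(init_latency_cycles(1200.0, 1750))  # 2100.0

# VMEM destination on v5p and newer: 0 ns, so zero cycles at any clock.
print(init_latency_cycles(0.0, 1750))     # 0.0
```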

GOTCHA — the MemorySpace argument to InitialDmaLatencyInNs here is not the literal tier enum — it is the operand's element-type field re-extracted from the shape pointer ((WORD[shape+0xb] >> 2) & 0x1f). For the per-tier latency this is benign because every override either ignores the argument (JF) or branches on a small set of values, but a reimplementation that passes the tier enum where the binary passes the element type will diverge on any future override that actually reads it.

The MemUnitFromKiB(0) sentinel

WindowCycles and DefaultHbmInitLatency both begin by comparing the incoming MemUnit against Target::MemUnitFromKiB(0). A MemUnit is a {amount_, granule_tag} pair; MemUnitFromKiB(0) is the canonical "zero bytes" MemUnit for the target's granule. When the priced access is this sentinel there is no transfer, and both functions return 0.0 immediately — the cost model's way of saying "this operand lives in a memory space that does not incur a DMA."


Byte → Cycle — MemUnitFromKiB and ChunkGranules

Purpose

The cost model counts bytes in granule units, not raw bytes: a transfer is always rounded up to a whole memory granule before it is divided by bandwidth. Two accessors define the granule and the KiB→granule conversion.

MemUnitFromKiB — @0x1d61bf20

function MemUnitFromKiB(this, long amount_kib):                 // @0x1d61bf20
    granule_bytes = this.vtable[+0x5c0]()                       // ChunkGranuleBytes (per gen)
    CHECK(granule_bytes <= 1024)                                // target.cc:3244
    CHECK(1024 % granule_bytes == 0)                            // target.cc:3245 — power-of-two divisor
    shift = log2(1024 / granule_bytes)                          // _BitScanReverse
    return amount_kib << shift                                  // KiB → granule-units

MemUnitFromKiB(kib) converts a kibibyte count into the target's native granule unit: kib · (1024 / granule_bytes), computed as a left shift because 1024 / granule_bytes is required to be a power of two (the two CHECKs enforce granule_bytes | 1024 and granule_bytes ≤ 1024). For amount_kib == 0 the result is 0 — the sentinel that DefaultHbmInitLatency/WindowCycles test against.
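
As a rough illustration of the shift, here is a Python sketch of the same arithmetic. The granule sizes used in the example calls (32 and 512 bytes) are hypothetical; the per-generation ChunkGranuleBytes values are not listed on this page.

```python
def mem_unit_from_kib(amount_kib, granule_bytes):
    """KiB -> granule units, as a left shift by log2(1024 / granule_bytes)."""
    # The two checks mirror the CHECKs at target.cc:3244 and 3245.
    if granule_bytes > 1024:
        raise ValueError("granule_bytes must be <= 1024")
    if 1024 % granule_bytes != 0:
        raise ValueError("granule_bytes must divide 1024")
    # bit_length() - 1 is the index of the highest set bit, like _BitScanReverse.
    shift = (1024 // granule_bytes).bit_length() - 1
    return amount_kib << shift


print(mem_unit_from_kib(4, 32))    # 4 KiB in 32-byte granules: 128
print(mem_unit_from_kib(4, 512))   # 4 KiB in 512-byte granules: 8
print(mem_unit_from_kib(0, 512))   # the zero-transfer sentinel: 0
```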

ChunkGranules — @0x1d61a440

function ChunkGranules(this):                                  // @0x1d61a440
    chunk_cell_bytes = 4 * topology[+0x1a8]                     // ChunkCellCount · 4 bytes/cell
    granule_bytes    = this.vtable[+0x5c0]()                    // ChunkGranuleBytes (per gen)
    return chunk_cell_bytes / granule_bytes                     // granules per chunk

ChunkGranules is the granule that windowing_util::Size rounds the transfer up to: the chunk-cell byte size (ChunkCellCount · 4) divided by the per-gen ChunkGranuleBytes. Size multiplies together the window's per-axis stride counts, ceil-divides the product by this granule, and multiplies back, so any partial granule is billed as a whole one. Both MemUnitFromKiB and ChunkGranules read the same per-gen ChunkGranuleBytes virtual at vtable[+0x5c0], so the byte-accounting unit is gen-consistent across the two accessors.

NOTE — the rounding granule is a chunk concept, not the DMA descriptor granule. Size's align_up(Product(strides), ChunkGranules) rounds the byte volume up to whole chunks; the DMA-fragment count (the {1.0, 1.6, 1.3, 1.1, 1.05} efficiency ratio) is a separate, orthogonal correction in WindowDescription Byte-Cost. A reimplementation that conflates the chunk granule with the descriptor count will double-correct.
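
The round-up can be sketched in a few lines of Python. This is a simplification under stated assumptions: the real windowing_util::Size also takes a MemUnit argument that is omitted here, and the ChunkCellCount of 1024 and granule of 512 bytes are hypothetical example values.

```python
from math import prod


def chunk_granules(chunk_cell_count, granule_bytes):
    """(ChunkCellCount * 4 bytes per cell) / ChunkGranuleBytes."""
    return (chunk_cell_count * 4) // granule_bytes


def window_size(strides, granule):
    """Product of the per-axis counts, rounded up to a multiple of `granule`."""
    volume = prod(strides)
    # Ceil-divide, then multiply back: a partial granule is billed whole.
    return -(-volume // granule) * granule


granule = chunk_granules(chunk_cell_count=1024, granule_bytes=512)
print(granule)                          # 8
print(window_size((3, 5, 7), granule))  # 105 rounds up to 112
print(window_size((4, 4), granule))     # 16 is already aligned
```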


Bytes/Cycle — GetBytesPerCycle

Purpose

bytes_per_cycle is the per-core, per-cycle byte budget that WindowCycles divides the transferred bytes by. It is the bandwidth denominator of the roofline, derived from the full-chip byte rate and the TC clock.

Algorithm

function GetBytesPerCycle(HloInstruction* hlo, Target t, MemorySpace ms):  // @0x1454dd00
    CHECK(ms == kHbm(1) || ms == kCmem(4))               // fusion_util.cc:2354
    cycles_per_sec = t.TensorCoreFrequencyInMegaHertz() * 1e6     // *0xa2e0208 = 1e6
    full_chip_bps  = (ms == kHbm) ? t.HbmFullChipBytesPerSecond() // @0x1d6172a0 (Target+0x4f0)
                                  : t.CmemFullChipBytesPerSecond()// @0x1d6172c0
    bpc = full_chip_bps / cycles_per_sec / t.CoresPerChip(0)      // @0x1d615b40
    if ms == kHbm:
        env = GetTpuCompEnv(hlo.GetModule())[+0x1040]             // env HBM bw override
        if env > 0: bpc = env                                    // vblendvpd select
        // per-instruction backend_config FlagConfig override (opcode-14 custom-call path):
        if hlo has FlagConfig_Value of type kDouble: bpc = that.double_value  // [+0x10]
    return bpc

The geometry is bytes_per_cycle = FullChipBytesPerSecond / (TC_MHz · 1e6) / CoresPerChip: the chip's aggregate memory rate, converted from bytes per second to bytes per cycle by dividing by the clock in Hz, then split across the chip's TensorCores. The 1e6 multiplier was confirmed at .rodata 0xa2e0208 = 1000000.0; HbmFullChipBytesPerSecond is a plain field read at Target+0x4f0 (populated from chip parts); CoresPerChip reads a per-TpuCoreType count from the topology.

| Source | Address | Meaning |
| --- | --- | --- |
| TensorCoreFrequencyInMegaHertz | 0x1d615b60 | TC clock (MHz); pure mov 0x90c(%rdi),%eax read of the Target+0x90C int32 field, populated from chip parts (per-codename value CONFIRMED: v6e 1750, v7x 1900) |
| HbmFullChipBytesPerSecond | 0x1d6172a0 | Target+0x4f0 field (full-chip HBM B/s) |
| CmemFullChipBytesPerSecond | 0x1d6172c0 | full-chip CMEM B/s (kCmem path) |
| CoresPerChip | 0x1d615b40 | per-TpuCoreType core count from topology |
| 1e6 Hz multiplier | 0xa2e0208 | MHz → Hz |

GOTCHA — GetBytesPerCycle is only valid for kHbm and kCmem; any other MemorySpace triggers LogFatal("m_space == MemorySpace::kHbm || m_space == MemorySpace::kCmem") at fusion_util.cc:2354. VMEM/SMEM transfers are not priced through this function — their cycle cost rides the HBM bytes_per_cycle denominator and the per-direction VMEM lane inside WindowCycles (below), not a VMEM-specific bytes_per_cycle. The two override hatches — the GetTpuCompEnv[+0x1040] chip-wide env value and the per-op FlagConfig double — let a caller pin bytes_per_cycle for benchmarking; a reimplementation that ignores them will diverge whenever the env or a custom-call backend config sets a non-zero HBM bandwidth.
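
The derivation and the two override hatches can be sketched in Python as follows. The function signature, the `env_override` and `flag_override` parameter names, and the full-chip rate of 1.4e12 B/s are assumptions made for illustration; the binary reads these from the Target, the compilation environment, and the instruction's backend config.

```python
HBM, CMEM = 1, 4  # MemorySpace values quoted on this page


def get_bytes_per_cycle(ms, tc_mhz, full_chip_bps, cores_per_chip,
                        env_override=0.0, flag_override=None):
    """Per-core bytes per TensorCore cycle for an HBM or CMEM transfer."""
    if ms not in (HBM, CMEM):
        # Mirrors the CHECK at fusion_util.cc:2354.
        raise ValueError("m_space must be kHbm or kCmem")
    cycles_per_sec = tc_mhz * 1e6
    bpc = full_chip_bps / cycles_per_sec / cores_per_chip
    if ms == HBM:
        # Chip-wide env override wins when positive.
        if env_override > 0:
            bpc = env_override
        # Per-instruction double-valued flag override wins last.
        if flag_override is not None:
            bpc = flag_override
    return bpc


# Hypothetical chip: 1.4e12 B/s full-chip HBM, 1750 MHz, one TensorCore.
print(get_bytes_per_cycle(HBM, 1750, 1.4e12, 1))
print(get_bytes_per_cycle(HBM, 1750, 1.4e12, 2, env_override=512.0))
```

Note that the overrides apply only on the HBM path in this sketch, matching the `if ms == kHbm` block in the pseudocode above.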


The Roofline — WindowCycles

Purpose

WindowCycles is the public entry that assembles the full cost: it adds the startup latency, divides bytes by bytes_per_cycle, and — on v5p and newer — prices the HBM and VMEM directions separately and takes the slower one. Its return is the bandwidth-lane cycle value. The combination of the per-direction max and the startup add is the roofline.

Algorithm

function WindowCycles(MemUnit mu, wd, Target t, double bpc, long c, long hbm_b, long vmem_b):  // @0x14552660
    // source order is (mu, wd, t, bpc, c, hbm_b, vmem_b); bpc rides xmm0, the three longs r8/r9/stack
    if mu == t.MemUnitFromKiB(0): return 0.0           // sentinel — no transfer

    // startup-latency term (cycles)
    init = (c == -1) ? DefaultHbmInitLatency(mu, wd, t)             // @0x14552ca0
                     : (t.TensorCoreFrequencyInMegaHertz() / 1000.0) * c   // freq-scaled override

    count_desc = (t.vtable[+0x590]() == 1)             // per-gen "count descriptors" predicate
    base = WindowCyclesGenericTargetAgnostic(mu, wd, count_desc, bpc)  // @0x14552180 — the bytes/bpc divide
    cycles = base + init                               // VLOG-6 "original:" (fusion_util.cc:3435)

    if (hbm_b | vmem_b) == 0:                          // roofline OFF — the memory-xfer path (c=0)
        return cycles                                  // pure startup + bandwidth, single lane

    // roofline ON — only the conv classic-window path passes non-zero hbm_b/vmem_b
    CHECK(mu.amount_ != INT64_MAX)                     // target.h:211
    {hbm_lat, vmem_lat} = {t.InitialDmaLatencyInNs(kHbm=1,..), t.InitialDmaLatencyInNs(kVmem=3,..)}  // vtable[+0x20]
    if t[+0x398] < 5:                                  // < v5p (TpuVersion at Target+0x398)
        return (vmem_lat_scaled + hbm_lat_scaled) + max(dir0, dir1)   // single-lane shuffle/max
    else:                                              // v5p+ two-lane HBM/VMEM roofline
        // {hbm_bytes, vmem_bytes} / 1000.0 (xmmword_A2CE650), * TC_MHz, gated by >0.01
        hbm_cyc  = (hbm_b  / 1000.0) * TC_MHz
        vmem_cyc = (vmem_b / 1000.0) * TC_MHz
        return (vmem_lat_scaled + hbm_lat_scaled) + max(hbm_cyc, vmem_cyc)

The core divide transferred_bytes / bytes_per_cycle happens inside WindowCyclesGenericTargetAgnostic (the vdivsd against the bytes_per_cycle passed in xmm0) — see WindowDescription Byte-Cost for the fragment-ratio detail. WindowCycles always adds the startup init on top; the two-lane roofline is an opt-in block gated by (hbm_b | vmem_b) != 0 — when both byte counts are zero (the memory-xfer cost path, which passes 0, 0), WindowCycles returns base + init directly and the roofline block is skipped entirely.

The two-lane block (confirmed at @0x1455282e) is the explicit roofline; on v5p+ (Target+0x398 >= 5) it computes each direction independently:

  • It loads an {hbm_bytes, vmem_bytes} pair and divides each by 1000.0 (the xmmword_A2CE650 pair, confirmed [1000.0, 1000.0]), then multiplies by TC frequency — converting each direction's byte count to its own cycle estimate.
  • A vcmpnlepd against xmmword_A2D8020 ([0.01, 0.01]) gates each direction by a 1% percentage floor, vandpd-selecting whether that direction's latency contributes.
  • A vmaxsd takes the per-direction maximum — the slower of HBM and VMEM. This is the "two-lane roofline": a transfer that is HBM-bound and a transfer that is VMEM-bound are priced independently, and the op pays the worse.

The VLOG-6 trace at fusion_util.cc:3474 names every intermediate, confirming the model is a two-lane max: "hbm_percentage", "vmem_percentage", "hbm_bytes", "vmem_bytes", "mem_bw", "vmem_byte_rate", "hbmbw", "vmembw", "hbmlatency", "vmemlatency", and "next_gen" (the v5p+ marker, fusion_util.cc:3481).

| rodata / vtable | value / offset | role |
| --- | --- | --- |
| 0xa2ce650 | [1000.0, 1000.0] | per-direction byte→rate divisor (v5p+) |
| 0xa2d8020 | [0.01, 0.01] | per-direction 1% percentage gate (vcmpnlepd) |
| vtable[+0x590] | == 1 | per-gen count-descriptors predicate |
| Target+0x398 | >= 5 | v5p+ two-lane gate (TpuVersion) |

NOTE — the c (5th) argument selects the startup source: c == -1 uses DefaultHbmInitLatency; any other c uses TC_GHz · c as an explicit per-call startup count (so c == 0 yields a zero startup inside WindowCycles). RecordMemXferCyclesImpl passes c == 0 — it deposits the per-tier DefaultHbmInitLatency into a separate latency lane itself (see §The arithmetic) and uses WindowCycles only for the bandwidth term. The callers that pass c == -1 are permutation_util::IterateThroughWindowConfigs @0x145350c0 and SelectAndScatterEmitter::EmitR3plus @0x10e71320, where the startup is folded into the single returned cycle value. The conv classic-window path (SpatialMajorConvolution::CalculateClassicWindowCost @0x131626c0) is the only caller that passes non-zero hbm_b/vmem_b and so the only one that reaches the two-lane roofline block.
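
The startup selection and the opt-in gate can be summarized in a simplified Python sketch. It is not a faithful port: the sentinel check, the pre-v5p branch, and the 1% per-direction gate are omitted, and `base_cycles` stands in for the WindowCyclesGenericTargetAgnostic result. All parameter names are hypothetical.

```python
def window_cycles(base_cycles, tc_mhz, c, default_init_cycles,
                  hbm_b=0, vmem_b=0, hbm_lat_cycles=0.0, vmem_lat_cycles=0.0):
    """Simplified v5p+ WindowCycles: startup add plus the opt-in two-lane max."""
    # c == -1 selects the per-tier default startup; any other c is TC_GHz * c.
    init = default_init_cycles if c == -1 else (tc_mhz / 1000.0) * c
    cycles = base_cycles + init
    # Roofline OFF: both byte counts are zero (the memory-xfer path).
    if (hbm_b | vmem_b) == 0:
        return cycles
    # Roofline ON: price each direction and take the slower one.
    hbm_cyc = (hbm_b / 1000.0) * tc_mhz
    vmem_cyc = (vmem_b / 1000.0) * tc_mhz
    return (vmem_lat_cycles + hbm_lat_cycles) + max(hbm_cyc, vmem_cyc)


# Memory-xfer path: c == 0, so only the bandwidth term is returned.
print(window_cycles(192.0, 1750, 0, 2100.0))    # 192.0
# A caller passing c == -1 folds the default startup into the result.
print(window_cycles(192.0, 1750, -1, 2100.0))   # 2292.0
```

With c == 0 the function returns the bandwidth term alone, which is why RecordMemXferCyclesImpl has to deposit the startup into its own lane.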


Per-Tier Bandwidth Matrix — LocalDmaBandwidth

Purpose

Distinct from the bytes_per_cycle cycle-cost denominator, the per-tier LocalDmaBandwidth<Src>To<Dst> virtuals return a raw GB/s for each ordered (source tier, destination tier) pair. These feed the copy-mode decision — whether a copy is issued as an async local DMA or kept synchronous — not the per-byte cycle cost. The dispatcher Target::LocalDmaBandwidth(src, dst) @0x1d6168e0 maps a (src, dst) pair to one of the 15 reachable per-tier virtuals; each returns std::optional<double> (the IEEE-754 bits in rax, the present flag in dl).

The matrix (GB/s)

Decoded from the return <ieee754> immediates in each per-gen override; = no override (base returns nullopt = not modelled). Viperfish columns show default / lite where the variant_name == "lite" gate applies (all immediates verified byte-exact):

Pair (Src→Dst)DragonfishPufferfishViperfish (def/lite)Ghostlite
HBM → HBM48072 / 30864
HBM → VMEM4234811198 / 8221285
HBM → SMEM3455 / 5655
HBM → CMEM(via VmemToVmem slot)(via VmemToVmem)(via VmemToVmem)(via VmemToVmem)
VMEM → HBM42311111224 / 8281432
VMEM → VMEM54472 / 82764
VMEM → CMEM1121
VMEM → SMEM3455 / 5655
CMEM → HBM1080
CMEM → VMEM2339
CMEM → CMEM1193
CMEM → SMEM34
SMEM → HBM3455 / 5655
SMEM → VMEM3455 / 5655
SMEM → CMEM34
SMEM → SMEM172828
SPMEM → HBM587.4588

Structural facts a reimplementer must respect:

  • No LocalDmaBandwidthHbmToCmem virtual exists. The dispatcher routes (Hbm, Cmem) into the VmemToVmem slot (vtable+0x1b8): HBM↔CMEM transfers are cost-modelled as VMEM-staged, so the HBM→CMEM leg is priced at the VmemToVmem rate.
  • The base Target returns 0 (nullopt) for every pair — confirmed at Target::LocalDmaBandwidthHbmToVmem @0x1d492580 returning 0. An absent value means "this tier pair is not modelled," and ShouldUseAsyncLocalCopy treats it as "do NOT route this as a local DMA," keeping the copy synchronous.
  • Dragonfish overrides only HBM↔VMEM (both 423, 0x407A700000000000); Jellyfish defines no LocalDmaBandwidth override of its own and inherits.
  • Viperfish gates on "lite" via an SSO-aware variant_name.size() == 4 && variant_name == "lite" compare (the 4-byte literal 0x6574696C, 'l','i','t','e'), confirmed at LocalDmaBandwidthHbmToVmem @0x1d49a380: default → 1198, lite → 822. Single-valued VF pairs (SmemToSmem = 28, SpmemToHbm = 587.4) do not gate.

Sample byte-exact confirmations:

DragonfishTarget::LocalDmaBandwidthHbmToVmem @0x1d48fa20 → 0x407A700000000000 = 423.0
GhostliteTarget::LocalDmaBandwidthHbmToVmem  @0x1d4973e0 → 0x4094140000000000 = 1285.0
ViperfishTarget::LocalDmaBandwidthHbmToVmem  @0x1d49a380 → "lite"? 822.0 : 1198.0
Target::LocalDmaBandwidthHbmToVmem (base)    @0x1d492580 → 0   (nullopt)
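
The optional-return convention and the Viperfish "lite" gate can be sketched in Python using only the HBM→VMEM values confirmed above. The function names and the string generation keys are hypothetical, the table is deliberately limited to the sample confirmations, and the real ShouldUseAsyncLocalCopy applies further checks once a value is present.

```python
from typing import Optional

# Only the HBM -> VMEM entries listed in the sample confirmations.
_CONFIRMED_HBM_TO_VMEM = {
    "dragonfish": 423.0,
    "ghostlite": 1285.0,
    "base": None,  # nullopt: this tier pair is not modelled
}


def hbm_to_vmem_gbps(generation: str, variant_name: str = "") -> Optional[float]:
    """GB/s for HBM -> VMEM, or None when the pair is not modelled."""
    if generation == "viperfish":
        # Gate on the exact variant name "lite".
        return 822.0 if variant_name == "lite" else 1198.0
    # KeyError for generations this sketch does not cover.
    return _CONFIRMED_HBM_TO_VMEM[generation]


def is_local_dma_candidate(bandwidth: Optional[float]) -> bool:
    # An absent value keeps the copy synchronous; None is not "zero bandwidth".
    return bandwidth is not None


for gen, variant in (("dragonfish", ""), ("viperfish", ""),
                     ("viperfish", "lite"), ("ghostlite", ""), ("base", "")):
    bw = hbm_to_vmem_gbps(gen, variant)
    print(gen, variant or "-", bw, is_local_dma_candidate(bw))
```

Using `None` rather than `0.0` for the unmodelled case keeps the "absent" and "zero" meanings distinct, which is the property the copy-mode consumer relies on.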

QUIRK — the GB/s matrix and the bytes_per_cycle budget are two different bandwidth numbers serving two different decisions. bytes_per_cycle (from GetBytesPerCycle, HBM/CMEM only) prices the cycle cost of a transfer once the cost model has decided to emit it. The LocalDmaBandwidth<Src>To<Dst> matrix decides whether a copy is emitted async at all, and feeds the collective-vs-local bisection-bandwidth threshold. A reimplementation that uses the GB/s matrix as the cycle-cost denominator (or vice versa) will produce a self-consistent but wrong cost model. The full dispatcher map and the copy-mode/bisection consumers are documented in Local DMA Bandwidth.


How the Scheduler Consumes the Latency/Bandwidth

The latency and bandwidth cycle values land in the ResourceVector and are reduced into the op's cost by the bundle-level MaxResourceCycles. The CostModelLatencyEstimator (the LatencyHidingScheduler client) then converts cycles to wall-clock seconds via cycles · trip_count / (TC_MHz · 1e6) — the same TC-frequency geometry GetBytesPerCycle uses, so the bandwidth term is dimensionally consistent end to end.
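
As a consistency check on the units, the cycles-to-seconds conversion can be written out in Python. The function name is hypothetical; the formula is the one quoted above, and the example round-trips the v6e startup of 2100 cycles back to roughly 1200 ns.

```python
def cycles_to_seconds(cycles, trip_count, tc_mhz):
    """Wall-clock seconds for `cycles` TensorCore cycles repeated trip_count times."""
    return cycles * trip_count / (tc_mhz * 1e6)


# 2100 cycles at 1750 MHz is about 1.2 microseconds, i.e. the 1200 ns startup.
seconds = cycles_to_seconds(2100.0, 1, 1750)
print(f"{seconds * 1e9:.1f} ns")
```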

The copy-mode decision sits before this. ShouldUseAsyncLocalCopy @0x133eff40 reads the LocalDmaBandwidth(src, dst) GB/s matrix: if the optional is absent the copy stays synchronous; if present, for a multi-device collective it compares the local rate (scaled by replicas² and a ×4 factor) against the cross-chip bisection bandwidth to choose local-async vs collective-routed. That gate decides which DMAs exist; this page's latency/bandwidth cycle model prices the DMAs that survive it. The generic XLA memory_space_assignment::CostAnalysis::GetAsyncCopyElapsed uses a single flat config bandwidth (DefaultMemBandwidthBytesPerSecond @0x1dceb320), not this per-tier matrix — the per-tier model is TPU-specific and runs ahead of MSA's kAlternate/kDefault choice.


| Name | Relationship |
| --- | --- |
| WindowDescription Byte-Cost | builds the byte count this model divides by bytes_per_cycle; owns the R9–R12 lane routing |
| Local DMA Bandwidth | the full LocalDmaBandwidth(src,dst) dispatcher map + copy-mode/bisection consumers |
| Cost Model Overview | the three-family cost model and the cycles → seconds TC-frequency conversion |
| CycleTable Family | the throughput integers the bandwidth/latency cycles compose with in the bundle max |
| Resource Enum (23-slot) | the ResourceVector lanes the latency and bandwidth values deposit into |

Cross-References

  • WindowDescription Byte-Cost — windowing_util::Size, the fragment ratio, and the R9/R10/R11/R12 lane split that consumes this bytes_per_cycle and DefaultHbmInitLatency
  • Local DMA Bandwidth — the per-(src,dst) GB/s dispatcher @0x1d6168e0, the HBM→CMEM-via-VmemToVmem routing, and ShouldUseAsyncLocalCopy
  • Cost Model Overview — the LatencyHidingScheduler cost client and the TensorCoreFrequencyInMegaHertz wiring shared with GetBytesPerCycle
  • CycleTable Family — the per-gen throughput backend that the memory-transfer cycles are maxed against
  • Resource Enum (23-slot) — the MaxResourceCycles reduction over the latency and bandwidth lanes
  • Reduce-Window / Pooling Cost — a consumer that prices operand-window DMAs through this latency/bandwidth model
  • ConvolutionCostState — RecordConvKernelCycles, which deposits the operand DMA latency/bandwidth alongside the conv compute