Intra-Chip DMA Descriptor
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d). The image is not stripped; demangled C++ symbol names are quoted verbatim. .text VMA equals file offset (.text base 0xe63c000); .data.rel.ro carries a 0x200000 VMA→file delta. Other versions will differ.
Abstract
Every on-chip DMA a TensorCore sequencer issues (VMEM↔HBM staging, scalar memsets, the local leg of a collective, host infeed/outfeed) is described by one record: OciDescriptorCommonIssuedFromTcs ("OCI descriptor common, issued from the TensorCore Sequencer"). It is a 17-field, bit-packed structure whose two memory endpoints are each named by a (mem_id, core_id, opcode) triple, whose transfer class is a dma_type tag, and whose size is a (length, length_granule) pair. The same descriptor family is instantiated per silicon generation: pxc (Pufferfish/BarnaCore), vfc/glc/gfc (SparseCore), and vlc (no SparseCore). The field layout, the opcode enums, and the length_granule enum are identical across all of them; the per-generation classes differ in the composite names attached to the memory-space enum values and in the dma_type value set (four values on pxc, two on the later generations; see §6).
This descriptor is the intra-chip counterpart of the cross-chip ICI wire descriptor: where the ICI builder (JellyfishDmaDescriptorState / PufferfishDmaDescriptorState) stages an 8- or 24-word array in SMEM and stamps a remote chip id and a remote-sync-flag address into it, the intra-chip descriptor never leaves the chip — both endpoints resolve to local tiers (HBM, VMEM, SMEM, CMEM, IMEM, BMEM, SPMEM). The reader who already understands a DMA engine's "source descriptor / destination descriptor / length / completion-flag" quartet will recognize the shape immediately; the two surprises this page documents are (1) the memory endpoint is polymorphic — a single 2-bit mem_id value resolves to a different physical tier depending on the companion 3-bit core_id — and (2) there are two unrelated DmaType enumerations in the binary, one carried by the LLO/runtime layer and one in the profiler descriptor, with different value counts and orderings.
This page owns three things: the SrcMem/DstMem memory-space enums (MemMemId + MemCoreId) and how a (mem_id, core_id) pair renders to a physical tier; the Src/Dst Opcode enums and the string-valued opcode attributes the LLO DMA ops actually carry; and the descriptor field layout (offsets +0x18..+0x5c) that the LLO DMA-start ops fill. The Simple-vs-Strided DmaParameters selector is on DmaParameters Selector; the rolled/strided/general transfer-body emitters are on Rolled / Strided / General Emitters; the tile-index→flat-offset algebra that produces the address operands is on Tile-Index Expansion.
For reimplementation, the contract is:
- The two memory-space enums: the 4-value MemMemId and the 8-value MemCoreId, and the segment-selection rule that turns a (mem_id, core_id) pair into a named physical tier.
- The two opcode enums: SrcOpcode {READ, …} and DstOpcode {WRITE, …}, and the string-valued dst_opcode attribute (write_4b / read_and_add / atomic_add) the LLO DmaSimpleStartOp / DmaGeneralStartOp carry, plus the memory-space gates that restrict each opcode.
- The endpoint rendering: MemorySpaceToDriverResource(MemorySpace) (the LLO MemorySpace → driver-resource id used to stamp the descriptor address word) and how (mem_id, core_id, opcode, addr) compose into a source / destination endpoint.
- The field layout: the 17 fields at +0x18..+0x5c, the (length, length_granule) size pair, and which fields the LLO DMA-start ops write vs. which the profiler descriptor parses.
| Descriptor (profiler/wire) | asic_sw::driver::deepsea::<gen>::profiler::OciDescriptorCommonIssuedFromTcs |
| pxc ctor | …pxc::profiler::OciDescriptorCommonIssuedFromTcs::OciDescriptorCommonIssuedFromTcs @ 0x1cf1b620 |
| Per-gen decode | …<gen>::profiler::DecodeOciDescriptorCommonIssuedFromTcs (pxc @ 0xf5bace0, vlc @ 0xf5e29a0, vfc @ 0xf607b20, glc @ 0xf63a140, gfc @ 0xf66fba0) |
| Per-gen encode | …<gen>::profiler::EncodeOciDescriptorCommonIssuedFromTcs (pxc @ 0xf5cc700) |
| Endpoint render | xla::jellyfish::MemorySpaceToDriverResource(MemorySpace) @ 0x1d6223e0 |
| LLO dst-opcode reader | xla::tpu::sparse_core::GetDstOpcode<DmaSimpleStartOp> @ 0x135aaa60, <DmaGeneralStartOp> @ 0x135b2be0 |
| Runtime DmaType enum | xla::jellyfish::DmaType, 3 values (<< operator @ 0x1d5ae080) |
| Field span | +0x18 (trace_id_header) … +0x5c (length_granule); size word at +0x58/+0x5c |
| Evidence grade | Reimplementation-grade / byte-confirmed against IDA decompile + the FDP descriptor pool |
1. The Two Memory-Space Enums (SrcMem / DstMem)
Purpose
A DMA endpoint is not named by a single memory-space integer. The descriptor splits each endpoint's space into two fields — a 2-bit mem_id (src_mem_mem_id @ +0x24, dst_mem_mem_id @ +0x30) and a 3-bit core_id (src_mem_core_id @ +0x28, dst_mem_core_id @ +0x34). The pair is polymorphic: the same mem_id value resolves to a different physical tier depending on which core class the companion core_id selects. This is how four 2-bit codes cover the full HBM / VMEM / SMEM / IMEM / CMEM / BMEM / SPMEM tier space without a wider field.
Encoding
The two enums, byte-verified from the FDP descriptor pool (the enum_type blocks of the carved OciDescriptorCommonIssuedFromTcs DescriptorProto), are identical between source (SRC_*) and destination (DST_*):
| MemMemId (2-bit) | pxc composite name (SRC_MEM_MEM_ID_*) |
|---|---|
| 0 | HBM_TCVMEM_BCBMEM |
| 1 | RSVD_TCSMEM_BCSMEM |
| 2 | CMEM_TCIMEM_BCBIMEM |
| 3 | RSVD_RSVD_BCVIMEM |
| MemCoreId (3-bit) | name (SRC_MEM_CORE_ID_*) | selects segment |
|---|---|---|
| 0 | RESERVED | — |
| 1 | NONCORE | 1st (HBM / CMEM / …) |
| 2 | TC0 | 2nd (TCVMEM / TCSMEM / TCIMEM) |
| 3 | TC1 | 2nd |
| 4 | BC0 | 3rd (BCBMEM / BCSMEM / BCBIMEM) |
| 5 | BC1 | 3rd |
| 6 | BC2 | 3rd |
| 7 | BC3 | 3rd |
The composite name is a _-joined triple <noncore>_<tensorcore>_<thirdcore>. To resolve a (mem_id, core_id) pair to a physical tier, pick the mem_id row, then select the segment named by the core_id class:
mem_id = 0, core_id = NONCORE -> HBM (HBM_TCVMEM_BCBMEM, 1st segment)
mem_id = 0, core_id = TC0/TC1 -> TCVMEM (2nd segment)
mem_id = 0, core_id = BC0..3 -> BCBMEM (3rd segment)
mem_id = 2, core_id = NONCORE -> CMEM
mem_id = 2, core_id = TC0 -> TC IMEM
mem_id = 1, core_id = TC0 -> TC SMEM
NOTE: the segment-selection rule (NONCORE→1st, TC*→2nd, BC*/SC*→3rd) is INFERRED from the composite-name structure, not byte-proven from a renderer: no MemMemId × MemCoreId → tier symbolizer is linked into libtpu.so. The decode functions (DecodeOciDescriptorCommonIssuedFromTcs) parse mem_id / core_id into the proto, but the only intra-chip DMA-timeline pass that consumes the descriptor (ConvertDmaTransfersToXPlane @ 0xf254bc0) drops both endpoints (see §5). The mem_id / core_id value→name bindings are CONFIRMED; the tier resolution they imply is INFERRED.
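The inferred segment-selection rule can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not a recovered routine: the helper name resolve_tier and both lookup tables are hypothetical, and only the pxc composite names are covered.

```python
# Sketch of the INFERRED segment-selection rule, pxc composite names only.
# resolve_tier and the two tables are hypothetical; no such symbolizer exists in libtpu.so.
PXC_MEM_ID_NAMES = {
    0: "HBM_TCVMEM_BCBMEM",
    1: "RSVD_TCSMEM_BCSMEM",
    2: "CMEM_TCIMEM_BCBIMEM",
    3: "RSVD_RSVD_BCVIMEM",
}

# core_id value -> index of the composite-name segment it selects.
# core_id 0 (RESERVED) selects no segment and is deliberately absent.
CORE_ID_SEGMENT = {
    1: 0,                    # NONCORE  -> 1st segment
    2: 1, 3: 1,              # TC0, TC1 -> 2nd segment
    4: 2, 5: 2, 6: 2, 7: 2,  # BC0..BC3 -> 3rd segment
}


def resolve_tier(mem_id: int, core_id: int) -> str:
    """Return the composite-name segment that core_id picks out of the mem_id row."""
    if core_id not in CORE_ID_SEGMENT:
        raise ValueError(f"core_id {core_id} is reserved or out of range")
    segments = PXC_MEM_ID_NAMES[mem_id].split("_")
    return segments[CORE_ID_SEGMENT[core_id]]
```

For example, resolve_tier(0, 1) yields HBM and resolve_tier(1, 2) yields TCSMEM, matching the worked pairs listed above. A segment that comes back as RSVD marks a (mem_id, core_id) combination with no named tier.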
The cross-generation rename
The MemMemId value set is fixed at four; only the composite names change per generation, tracking each gen's third core class:
| Gen / family | mem_id=0 | mem_id=1 | mem_id=2 | mem_id=3 |
|---|---|---|---|---|
| pxc (BarnaCore) | HBM_TCVMEM_BCBMEM | RSVD_TCSMEM_BCSMEM | CMEM_TCIMEM_BCBIMEM | RSVD_RSVD_BCVIMEM |
| vfc/glc/gfc (SparseCore) | HBM_TCVMEM_SCSPMEM | HOST_TCSMEM_SCSMEM | VMEMALL_TCIMEM_SCSIMEM | NONCORERESERVEDMEM0_TCRESERVEDMEM_SCTIMEM |
| vlc (2-name, no SparseCore) | HBM_TCVMEM | HOST_TCSMEM | NONCORERESERVEDMEM0_TCIMEM | NONCORERESERVEDMEM0_TCRESERVEDMEM |
vlc names are 2-segment (no third core), so its core_id only meaningfully selects NONCORE vs TC*. All five gens declare the OciDescriptorCommonIssuedFromTcs class with the same field layout and the same 4-value mem_id enum — confirmed by the five per-gen Decode… / class-data symbols (…vfc…OciDescriptorCommonIssuedFromTcs::GetClassData @ 0x1cf23e40, …vlc… @ 0x1cf2cf80, etc.).
The LLO-side MemorySpace
The descriptor's (mem_id, core_id) pair is the wire/profiler view of the endpoint. Inside the compiler, an LLO DMA operand carries an xla::jellyfish::MemorySpace — the 17-value runtime enum documented on MemorySpace Enum. The descriptor builder turns that MemorySpace into a hardware endpoint via MemorySpaceToDriverResource (§3); the profiler descriptor's mem_id/core_id is the post-hardware-encoding readout of the same endpoint. A reimplementer carries MemorySpace end to end and maps to (mem_id, core_id) only at the descriptor boundary; do not confuse the 17-value LLO enum with the 4-value mem_id.
2. The Opcode Enums (SrcOpcode / DstOpcode)
Purpose
Each endpoint carries a 2-bit opcode (src_opcode @ +0x2c, dst_opcode @ +0x38) that selects what the DMA engine does at that endpoint, beyond plain move. The source opcode chooses between a normal read and the two memset modes; the destination opcode chooses between a plain write and the two special-write / atomic modes.
Encoding
Both enums are 2-bit, byte-verified from the FDP enum_type blocks:
| SrcOpcode (+0x2c) | name | meaning |
|---|---|---|
| 0 | READ | normal source read |
| 1 | RESERVED | — |
| 2 | INSTRUCTIONMEMSET | fill IMEM (no source read) |
| 3 | DATAMEMSET | fill data memory (no source read) |
| DstOpcode (+0x38) | name | meaning |
|---|---|---|
| 0 | WRITE | normal destination write |
| 1 | RESERVED | — |
| 2 | WRITESPECIAL0 | special write mode 0 |
| 3 | WRITESPECIAL1 | special write mode 1 |
The LLO-side string opcode
The LLO SparseCore DMA-start ops carry the destination opcode as a string-valued MLIR attribute, not the 2-bit code; the lowering maps the string to the enum. GetDstOpcode<DmaSimpleStartOp> @ 0x135aaa60 (and the DmaGeneralStartOp twin @ 0x135b2be0, byte-identical) compare the attribute's bytes against fixed little-endian constants:
// xla::tpu::sparse_core::GetDstOpcode<DmaSimpleStartOp>   sub_135AAA60
function GetDstOpcode(op, ms):
    s = op.getDstOpcode()                            // string attr @ this+0x48 (sub_145B9700)
    atomic_add = false
    if len(s)==8 && s == "write_4b":                 // 0x62345F6574697277 LE -> "write_4b"
        code = 1
    elif len(s)==16 && s == <16-char write attr>:    // vptest vs xmmword_A2D00C0
        code = 2
    elif len(s)==12 && s == "read_and_add":          // 0x646E615F64616572 | 0x6464615F
        code = 3
    elif len(s)==10 && s == "atomic_add":            // 0x615F63696D6F7461 | 0x6464
        atomic_add = true
        code = atomic-add path                       // specialized by element type, see below
    else:
        code = 0                                     // default: plain "write"
    // memory-space gates (emitError on violation):
    if code in {1, 2, 3} && !atomic_add && ms != Smem:   // the three non-atomic special opcodes
        error("dst_opcode is only supported for Smem.")
    if atomic_add && ms != Spmem:
        error("Atomic add dst_opcode is only supported for Spmem.")
    return code
The string→code table the LLO ops emit:
| LLO dst_opcode string | Length | Maps to DstOpcode | MS gate |
|---|---|---|---|
| (absent) | 0 | WRITE (0) | none |
| write_4b | 8 | WRITESPECIAL0/1 | Smem only |
| (16-char write attr) | 16 | special write | Smem only |
| read_and_add | 12 | special / atomic | Smem only |
| atomic_add | 10 | atomic-add (element-typed) | Spmem only |
GOTCHA: the opcode is a string at the LLO/MLIR layer and a 2-bit enum on the wire. A reimplementer that drives the descriptor straight off the 2-bit enum misses the gating: the non-default opcodes are rejected unless the destination memory space is Smem (or Spmem for atomic_add). The getDstOpcode accessor reads the attribute at this+0x48 with a bit-test on the property word ((*(this+44) >> 19) & 0x10), so the attribute is optional; absent means the plain WRITE opcode. The 16-char variant's exact spelling lives in xmmword_A2D00C0 and was not string-decoded (HIGH, not CONFIRMED).
The atomic_add branch additionally inspects the destination memref's element type (isF32 → code 1, isBF16 → 2, Float8E4M3FN → 3, else "Unsupported element type for atomic add."), so the atomic-add opcode is further specialized by element width inside the Spmem path.
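The gating and the element-type specialization can be sketched together as a small runnable check. Everything here is an assumption-laden illustration: the function name classify_dst_opcode, the lowercase memory-space and element-type labels, and the order in which the checks run are hypothetical. The 16-character variant is left out because its spelling was not decoded.

```python
# Hypothetical model of the dst_opcode gates; labels and check order are assumptions.
SMEM_ONLY_CODES = {"write_4b": 1, "read_and_add": 3}  # 16-char variant (code 2) omitted
ATOMIC_ELEMENT_CODES = {"f32": 1, "bf16": 2, "f8e4m3fn": 3}


def classify_dst_opcode(attr, memory_space, element_type=None):
    """Return (kind, code) for a dst_opcode attribute, or raise on a gate violation."""
    if attr == "atomic_add":
        if memory_space != "spmem":
            raise ValueError("Atomic add dst_opcode is only supported for Spmem.")
        if element_type not in ATOMIC_ELEMENT_CODES:
            raise ValueError("Unsupported element type for atomic add.")
        return ("atomic_add", ATOMIC_ELEMENT_CODES[element_type])
    # An absent attribute (None) or an unrecognized string falls through to plain write.
    code = SMEM_ONLY_CODES.get(attr, 0)
    if code != 0 and memory_space != "smem":
        raise ValueError("dst_opcode is only supported for Smem.")
    return ("write" if code == 0 else attr, code)
```

Under these assumptions, classify_dst_opcode(None, "hbm") is accepted as a plain write, while classify_dst_opcode("write_4b", "hbm") is rejected with the Smem error string quoted above.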
3. Endpoint Rendering
Purpose
"Rendering" an endpoint means turning an LLO (MemorySpace, byte-offset) operand into the descriptor's address word(s). The memory space becomes a driver resource id placed in the address word's high field; the offset is scaled to the address granule and OR'd in. The same render path serves source and destination; the descriptor's mem_id/core_id fields are the hardware's readback of this encoding.
Algorithm
The space→resource mapping is MemorySpaceToDriverResource, a dense switch over the LLO MemorySpace enum (the 17-value runtime enum, not the proto). It is the single byte-anchored ground truth for which tiers are DMA-addressable and what resource id each gets:
// xla::jellyfish::MemorySpaceToDriverResource(MemorySpace ms) sub_1D6223E0
function MemorySpaceToDriverResource(ms):
switch ms: // ms = the 17-value LLO MemorySpace enum
case 0 (<no memory space>): return 10
case 1 (hbm): return 2
case 2 (hib): return 3
case 3 (vmem): return 4
case 4 (cmem): FATAL("Unsupported memory space") // memory_space.cc:31
case 5 (smem): return 6
case 6 (sflag): return 0
case 7 (imem): return 5
case 8 (barna_core_bmem): return 7
case 9 (barna_core_smem): return 9
case 10 (barna_core_sflag): return 1
case 11 (barna_core_imem): return 8
case 12..16: FATAL("Unsupported memory space") // memory_space.cc:49
The returned resource id is stamped into the descriptor's address word at bit 40 (SetSourceAddress(resource << 0x28) in the ICI builder ctor; the same << 0x28 shift renders the intra-chip endpoint), and the within-tier offset is OR'd into the low bits after granule scaling (EncodeDmaAddressForGranule @ 0x1d5402c0, which additionally OR's bit 31 = 0x80000000 for HBM/external-resource operands as the external-address marker).
Note: the driver-resource id this function returns is not the LLO MemorySpace enum value. The map is a non-identity bijection from the 11 supported spaces onto ids 0..10: hbm (enum 1) renders to resource id 2, vmem (enum 3) to 4, smem (enum 5) to 6, sflag (enum 6) to 0, and cmem (enum 4) is a hard FATAL, never DMA-addressable through this path. Read the resource id off the switch arm for the specific space, not off the enum ordinal; the only row where the two happen to coincide is barna_core_smem (enum 9 → resource id 9).
The resource-id table
| LLO MemorySpace | Enum# | Driver resource id |
|---|---|---|
| <no memory space> | 0 | 10 |
| hbm | 1 | 2 |
| hib | 2 | 3 |
| vmem | 3 | 4 |
| cmem | 4 | none (FATAL) |
| smem | 5 | 6 |
| sflag | 6 | 0 |
| imem | 7 | 5 |
| barna_core_bmem | 8 | 7 |
| barna_core_smem | 9 | 9 |
| barna_core_sflag | 10 | 1 |
| barna_core_imem | 11 | 8 |
| sparse_core_* (12..16) | 12..16 | none (FATAL) |
QUIRK: the resource-id assignment is not the identity of the MemorySpace enum, and it is not monotone: sflag→0, barna_core_sflag→1, hbm→2, hib→3, vmem→4, imem→5, smem→6, barna_core_bmem→7, barna_core_imem→8, barna_core_smem→9, NONE→10. A reimplementer must use the explicit table; deriving the resource id from the space integer is wrong for every supported row except barna_core_smem (9→9). The cmem and sparse_core_* cases trap at LogMessageFatal: those spaces are addressable by LLO loads/stores but not as a DMA endpoint on this gen's resource model, so a DMA targeting them is a compile-time fatal, not a silent miss.
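A minimal Python sketch of the explicit table and the bit-40 stamp follows. The dictionary mirrors the switch arms listed in this section; the function names and the lowercase space labels are hypothetical, the offset is assumed to be already granule-scaled by the caller, and the bit-31 external-address marker applied by EncodeDmaAddressForGranule is not modeled.

```python
# Explicit space -> driver-resource table, copied from the switch arms in this section.
# cmem and the sparse_core_* spaces are deliberately absent: they FATAL in the original.
DRIVER_RESOURCE = {
    "<no memory space>": 10,
    "hbm": 2, "hib": 3, "vmem": 4, "smem": 6, "sflag": 0, "imem": 5,
    "barna_core_bmem": 7, "barna_core_smem": 9,
    "barna_core_sflag": 1, "barna_core_imem": 8,
}

RESOURCE_SHIFT = 0x28  # the resource id lands at bit 40 of the address word


def memory_space_to_driver_resource(space: str) -> int:
    """Look up the resource id; unsupported spaces raise instead of aborting."""
    if space not in DRIVER_RESOURCE:
        raise ValueError("Unsupported memory space")
    return DRIVER_RESOURCE[space]


def stamp_address_word(space: str, scaled_offset: int) -> int:
    """Compose an address word from a space and an already granule-scaled offset."""
    if not 0 <= scaled_offset < (1 << RESOURCE_SHIFT):
        raise ValueError("offset does not fit below the resource field")
    return (memory_space_to_driver_resource(space) << RESOURCE_SHIFT) | scaled_offset
```

Keeping the table as data rather than arithmetic on the enum ordinal is the point of the sketch: there is no formula that reproduces the assignment.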
4. The Descriptor Field Layout
Purpose
OciDescriptorCommonIssuedFromTcs is a 17-field, bit-packed record. The fields the LLO DMA-start ops fill, and the offsets the per-gen Decode… functions read back, are fixed across all generations.
Layout
Field offsets are byte-verified from the carved DescriptorProto (the FDP message body) and the pxc producer store. The structure is bit-packed — these are the parsed-into-proto member offsets, not raw bit positions:
| # | Field | Offset | Width / enum | Filled by |
|---|---|---|---|---|
| 1 | trace_id_header | +0x18 | u64 | DMA-id producer (pairing key) |
| 2 | dma_type | +0x20 | DmaTypeValues (4) | DMA lowering |
| 3 | src_mem_mem_id | +0x24 | MemMemId (2-bit) | source endpoint render |
| 4 | src_mem_core_id | +0x28 | MemCoreId (3-bit) | source endpoint render |
| 5 | src_opcode | +0x2c | SrcOpcode (2-bit) | source opcode |
| 6 | dst_mem_mem_id | +0x30 | MemMemId (2-bit) | dest endpoint render |
| 7 | dst_mem_core_id | +0x34 | MemCoreId (3-bit) | dest endpoint render |
| 8 | dst_opcode | +0x38 | DstOpcode (2-bit) | GetDstOpcode<> |
| 9 | src_sync_flag_id | +0x3c | u32 | sync-flag binding |
| 10 | src_sync_flag_core_id | +0x40 | SyncFlagCoreId (8) | sync-flag binding |
| 11 | dst_sync_flag_0_id | +0x44 | u32 | completion flag 0 |
| 12 | dst_sync_flag_0_core_id | +0x48 | SyncFlagCoreId (8) | completion flag 0 |
| 13 | dst_sync_flag_1_id | +0x4c | u32 | completion flag 1 |
| 14 | dst_sync_flag_1_core_id | +0x50 | SyncFlagCoreId (8) | completion flag 1 |
| 15 | program_counter | +0x54 | u32 | trace/PC |
| 16 | length | +0x58 | u32 | size operand |
| 17 | length_granule | +0x5c | LengthGranule (1-bit) | size operand |
The three SyncFlagCoreId enums (+0x40, +0x48, +0x50) share the same 8-value set as MemCoreId (RESERVED/NONCORE/TC0/TC1/BC0..BC3) — the sync-flag target core mirrors the memory-target core. The dual dst_sync_flag_{0,1} pair is the dual-channel completion that the V2 ICI descriptor also exposes (via set_dst_sync_flag_mem_offset(idx 0..1)); on the intra-chip descriptor both are present in the layout.
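The member offsets in the table are regular enough to read with a single struct format: one 64-bit value at +0x18 followed by sixteen 4-byte members from +0x20 through +0x5c. The sketch below assumes each member is stored as a little-endian 32-bit integer in the parsed-proto object (plausible on an x86-64 host, but not verified here); the names OciDescriptor and read_descriptor are hypothetical.

```python
import struct
from collections import namedtuple

# Field order follows the layout table; offsets are relative to the object base.
FIELDS = [
    "trace_id_header",            # +0x18, u64
    "dma_type",                   # +0x20
    "src_mem_mem_id", "src_mem_core_id", "src_opcode",   # +0x24, +0x28, +0x2c
    "dst_mem_mem_id", "dst_mem_core_id", "dst_opcode",   # +0x30, +0x34, +0x38
    "src_sync_flag_id", "src_sync_flag_core_id",         # +0x3c, +0x40
    "dst_sync_flag_0_id", "dst_sync_flag_0_core_id",     # +0x44, +0x48
    "dst_sync_flag_1_id", "dst_sync_flag_1_core_id",     # +0x4c, +0x50
    "program_counter",            # +0x54
    "length", "length_granule",   # +0x58, +0x5c
]
OciDescriptor = namedtuple("OciDescriptor", FIELDS)

BASE = 0x18
# ASSUMPTION: little-endian, no padding, every member after the u64 is 32 bits wide.
LAYOUT = struct.Struct("<Q16I")


def read_descriptor(buf: bytes) -> OciDescriptor:
    """Read the 17 parsed members out of a raw object image (hypothetical helper)."""
    return OciDescriptor(*LAYOUT.unpack_from(buf, BASE))
```

The format covers exactly 0x48 bytes, so the last member ends at +0x60, which agrees with length_granule sitting at +0x5c.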
The size pair (length × length_granule)
The transfer byte count is length << granule_shift, where the shift is chosen by the 1-bit length_granule:
| LengthGranule (+0x5c) | name | shift | bytes per unit |
|---|---|---|---|
| 0 | 512B | << 9 | 512 |
| 1 | 4B | << 2 | 4 |
The pxc DMA-timeline producer (the ConvertTpuTraceToXPlane<pxc> lambda, store block @ 0xf26c865) is the byte-exact proof of the size decode and of which fields survive into a rendered span:
// producer id-91 store @0xf26c865
span.begin_gtc = begin_gtc; // [r15+0x8]
span.begin_present = 1; // [r15+0x10]
shift = (descr.length_granule == 0) ? 9 : 2; // cmp [r14+0x5c],0 ; mov ecx,2 ; mov eax,9 ; cmove
bytes = descr.length << shift; // eax = [r14+0x58] ; shl rax, cl
span.byte_count = bytes; // [r15+0x28]
span.kind_tag = 3; // [r15+0x40]
// NO read of descr[+0x24 .. +0x50] -> src/dst endpoints + sync flags are NOT copied
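The size decode in the store block reduces to one shift. The Python sketch below reproduces it and adds a hypothetical inverse; the encoder's policy (prefer the 512-byte granule whenever the byte count is a multiple of 512) is an assumption for illustration, since this page only establishes the decode direction.

```python
# length_granule value -> left shift applied to length (0 = 512B units, 1 = 4B units).
GRANULE_SHIFT = {0: 9, 1: 2}


def decode_byte_count(length: int, length_granule: int) -> int:
    """Mirror of the producer's decode: bytes = length << shift."""
    return length << GRANULE_SHIFT[length_granule]


def encode_size(byte_count: int) -> tuple:
    """Hypothetical inverse; the granule-selection policy here is an assumption."""
    if byte_count % 512 == 0:
        return (byte_count >> 9, 0)   # 512B granule
    if byte_count % 4 == 0:
        return (byte_count >> 2, 1)   # 4B granule
    raise ValueError("byte count is not a multiple of 4")
```

With this pairing, decode_byte_count(*encode_size(n)) returns n for any n that is a multiple of 4.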
How LLO ops fill it
The descriptor fields are populated from an LLO Dma*StartOp op's operands and attributes (Tile-Index Expansion produces the address operands; DmaParameters Selector chooses the Simple vs Strided op shape):
- src_mem_* / dst_mem_* ← the source / destination operand's MemorySpace, rendered via MemorySpaceToDriverResource (§3).
- dst_opcode ← the op's string dst_opcode attribute via GetDstOpcode<> (§2); src_opcode defaults to READ unless a memset op selects INSTRUCTIONMEMSET / DATAMEMSET.
- length / length_granule ← the size operand, with the granule chosen so the byte count fits the size field (the ICI builder's WriteSize enforces granule_bytes <= 1024, i.e. a 512-B granule caps a single descriptor's size word at the 10-bit-equivalent field).
- dst_sync_flag_* ← the completion sync flag the DMA bumps when the last byte lands (the on-chip analogue of the ICI receive-side auto-increment).
- enable_trace ← the op's getEnableTrace attribute (DmaSimpleStartOp::getEnableTrace @ 0x145b9620), which toggles whether the descriptor emits a profiler trace entry.
5. What the Renderer Keeps vs Drops
The single intra-chip DMA-timeline pass, xprof::tpu::ConvertDmaTransfersToXPlane @ 0xf254bc0, renders one paired DMA into one device XEvent. It is the only consumer that reads the descriptor for display, and it decodes the endpoints into the proto but never copies them into the rendered span. The span's begin/duration are the XEvent's own offset/duration fields, stamped directly by TpuXLineBuilder::AddEvent(event, begin, end−begin) from the pre-decoded DmaTransfer timespan — they are not XStats and carry no StatType id. On top of that the pass attaches six XStats per span, all via GetStatTypeStr(N) (numeric) or a literal stat-metadata name (string):
| Span field / XStat | StatType | Source |
|---|---|---|
| event offset (begin) | — (XEvent field) | AddEvent(begin) — DmaTransfer begin timespan |
| event duration | — (XEvent field) | AddEvent(end−begin) — DmaTransfer end−begin |
| bytes_transferred | 78 | DmaTransfer.byte_count (producer-decoded length << granule_shift) |
| queue | 79 | string from DmaTransfer; empty on pxc |
| details | (string "details") | string from DmaTransfer; empty on pxc |
| _a | 42 | constant 1 (per-DMA aggregate marker) |
| flow | 56 | XFlow::next_flow_id_ via 4·(id & 0xff_ffff_ffff_ffff)+3 (begin↔end arrow) |
| bandwidth | (string "bandwidth") | FormatPack(unit, bytes/(dur/1e12)), 5-rung B/KB/MB/GB/TB ladder |
GOTCHA: the src_mem_* / dst_mem_* / *_opcode / *_sync_flag_* fields (descriptor +0x24..+0x50) are parsed by DecodeOciDescriptorCommonIssuedFromTcs, but the only descriptor fields that survive into the DmaTransfer the producer hands this pass are length (+0x58), length_granule (+0x5c, decoded to the byte count), and the trace_id_header (+0x18, for begin/end pairing). The queue and details string stats are attached on every span but their backing strings are empty on the pxc producer, so a captured pxc trace shows them blank (the pass skips the SetStatValue call when the DmaTransfer string length is zero). An nm -C scan finds no MemMemId/MemCoreId → tier or endpoint→stat symbolizer in libtpu.so. So the endpoint enums of §1 / §2 are fully decodable but have no display renderer inside this unit; a downstream xprof/TensorBoard symbolizer, if any, would have to re-read the proto fields the pass discards.
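Two of the derived stats in the table are simple arithmetic and can be sketched directly. The flow value follows the expression quoted in the table. The bandwidth ladder is only loosely specified there, so the sketch assumes a duration in picoseconds (implied by the 1e12 divisor) and a step of 1000 between rungs; the step, the function names, and the return shape are all assumptions.

```python
FLOW_ID_MASK = 0xFF_FFFF_FFFF_FFFF  # low 56 bits of the flow id
UNITS = ["B", "KB", "MB", "GB", "TB"]


def flow_stat_value(flow_id: int) -> int:
    """Value of the flow stat as given in the table: 4*(id & mask) + 3."""
    return 4 * (flow_id & FLOW_ID_MASK) + 3


def bandwidth(byte_count: int, duration_ps: int, step: float = 1000.0):
    """Bytes per second scaled onto the 5-rung ladder.

    ASSUMPTION: rungs are `step` apart; the binary's actual factor was not recovered here.
    """
    rate = byte_count / (duration_ps / 1e12)
    rung = 0
    while rate >= step and rung < len(UNITS) - 1:
        rate /= step
        rung += 1
    return rate, UNITS[rung]
```

A zero-duration span would divide by zero in this sketch; how the original pass handles that case is not covered on this page.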
6. The DmaType Enums — Two Unrelated Enumerations
The transfer class appears under the name DmaType in two places with different value sets, and conflating them is a trap.
The runtime / LLO enum xla::jellyfish::DmaType is recovered from its absl log operator (operator<<<DmaType> @ 0x1d5ae080) — a dense 3-value switch:
| xla::jellyfish::DmaType | Value |
|---|---|
| DMA_TYPE_CHIP_TO_HOST | 0 |
| DMA_TYPE_LOCAL_OR_HOST | 1 |
| DMA_TYPE_REMOTE_WRITE_UNICAST | 2 |
The profiler descriptor enum DmaTypeValues (dma_type @ +0x20) is a 4-value set with a different ordering:
| DmaTypeValues (pxc) | Value |
|---|---|
| DMA_TYPE_LOCAL | 0 |
| DMA_TYPE_CHIP2HOST | 1 |
| DMA_TYPE_REMOTEUNICAST | 2 |
| DMA_TYPE_REMOTEMULTICAST | 3 |
On the later gens (vfc/vlc/glc/gfc, of which vlc has no SparseCore) the descriptor DmaTypeValues collapses to two values {LOCALORHOST=0, REMOTEUNICAST=1}. The runtime layer additionally names a wider set of .rodata strings (DMA_TYPE_LOCAL, DMA_TYPE_LOCAL_OR_HOST, DMA_TYPE_CHIP_TO_HOST, DMA_TYPE_REMOTE_UNICAST, DMA_TYPE_REMOTE_WRITE_UNICAST, DMA_TYPE_REMOTE_MULTICAST) selected by BuildDmaOverrides(srcMS, dstMS, isRemote, DmaType, …).
QUIRK: DMA_TYPE_LOCAL is the value 0 of the profiler enum (the intra-chip case) but is not a value of the 3-value runtime xla::jellyfish::DmaType operator (which starts at DMA_TYPE_CHIP_TO_HOST=0). The two enums share spelling fragments but disagree on count, order, and the value of 0. A reimplementer must keep them in separate namespaces: the LLO/runtime DmaType drives the BuildDmaOverrides registry that picks the control-word override; the descriptor DmaTypeValues is the wire/profiler tag the hardware emits. For the intra-chip descriptor this page owns, the relevant dma_type is the profiler DMA_TYPE_LOCAL=0.
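One way to keep the two enumerations apart in a reimplementation is to model them as distinct types. The Python sketch below uses the value tables from this section; the class names RuntimeDmaType and PxcDescriptorDmaType and the helper describe_wire_tag are hypothetical. Plain Enum is chosen over IntEnum on purpose: IntEnum members compare equal to any integer of the same value, which would let a runtime value pass for a descriptor tag.

```python
from enum import Enum


class RuntimeDmaType(Enum):
    """xla::jellyfish::DmaType as recovered from its log operator (3 values)."""
    DMA_TYPE_CHIP_TO_HOST = 0
    DMA_TYPE_LOCAL_OR_HOST = 1
    DMA_TYPE_REMOTE_WRITE_UNICAST = 2


class PxcDescriptorDmaType(Enum):
    """Profiler descriptor DmaTypeValues on pxc (4 values, dma_type @ +0x20)."""
    DMA_TYPE_LOCAL = 0
    DMA_TYPE_CHIP2HOST = 1
    DMA_TYPE_REMOTEUNICAST = 2
    DMA_TYPE_REMOTEMULTICAST = 3


def describe_wire_tag(raw: int) -> str:
    """Name a raw dma_type read from a pxc descriptor (hypothetical helper)."""
    return PxcDescriptorDmaType(raw).name
```

With this split, the raw value 0 names DMA_TYPE_LOCAL on the descriptor side and DMA_TYPE_CHIP_TO_HOST on the runtime side, and the two members never compare equal.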
Cross-References
- DmaParameters Selector: the Simple-vs-SingleStrided op-shape selector and dim-coalescing that decides which LLO DMA-start op fills this descriptor
- Rolled / Strided / General Emitters: the transfer-body emitters that issue the descriptor once its fields are filled
- Tile-Index Expansion: ExpandTiledMemRefs / expandTiledIndices, which produces the address operands rendered into the source/destination endpoints
- MemorySpace Enum: the 17-value LLO MemorySpace enum whose values MemorySpaceToDriverResource maps to driver resource ids
- Memory-Load Slot / Memory-Store Slot: the VLIW slots that encode a MemorySpace tag for the operands a DMA endpoint resolves
- Host↔Device DMA: the DeriveHostDmaTransfers / tag-6/7 host path (DMA_TYPE_CHIP_TO_HOST / DMA_TYPE_LOCAL_OR_HOST)
- OCI Command DMA-ID: the trace_id_header (+0x18) DMA-id that pairs a descriptor's begin/end trace points
- The net_router Emitter Pipeline: the collective router whose local leg issues intra-chip descriptors; the cross-chip remote-endpoint encoding is its counterpart
- Address-Space IDs: the SparseCore mlir::sparse_core::MemorySpace AS-ID enum, the unrelated sibling numbering at the SC boundary