Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Wire-Format Constants

A reimplementation of tileiras that aims for byte-for-byte parity with a shipped binary must reproduce a small set of magic numbers, tag namespaces, opcode tables, and obfuscation ciphers exactly. Every constant in this page is fingerprintable from a stripped 88 MB tileiras ELF and verified against the cross-referencing dispatchers documented in MLIR Bytecode Format, LLVM Fingerprint Table, and ISelDAG and MatcherTable. The constants are not configuration. Changing any one of them produces an artifact that fails to interoperate with the shipped reader or fails to bind against the AsmWriter's post-decryption string pool.

The page is organized strictly by layer of the wire format, walking from the outermost envelope down to the innermost emitter. Each section lists the constants the layer defines, the exact byte offsets and lengths where they live in the binary, and the authoritative cross-reference for the dispatch site that reads them. Where a constant is interesting on its own — a typo preserved across builds, an unused mid-slot in a bit-mask, a numbering divergence from upstream — the rationale is captured inline rather than buried in a footnote.

Layer 1 — TileIR Bytecode Envelope

The bytecode container's framing prefix is the single most reproduced constant in this binary. Stock LLVM 18-21 MLIR-bytecode files share the first three bytes; the private TileIR dialect tag occupies bytes 3-7 and the trailing terminator byte at offset 7 separates TileIR from upstream MLIR at the magic-byte level.

OffsetByteSymbolic nameMeaning
00x06MAGIC_LEN_HIMLIR-bytecode framing prefix (shared with upstream)
10x03MAGIC_LEN_LOMLIR-bytecode framing prefix (shared with upstream)
20x80MAGIC_FLAGSMLIR-bytecode framing prefix (shared with upstream)
30x54dialect byte 1'T'
40x69dialect byte 2'i'
50x6Cdialect byte 3'l'
60x65dialect byte 4'e'
70x00tileiras terminatorUpstream writes '\n' (start of "\nMLIR") here

The literal lives at rodata 0x45EBF08 and is compared byte-for-byte by sub_5838A0 against the input buffer; mismatch surfaces a three-fragment diagnostic ("invalid magic number at position " / ", got " / " expected ").

The version block follows immediately after the magic and is a sequence of three unsigned-LEB128 VarInts: major, minor, optional patch. The accepted range table at rodata 0x45EBF10 is verbatim:

static const TileVersion supported_versions[] = {
    /*min:*/ { .major = 13, .minor = 1, .patch = 0          }, // inclusive
    /*max:*/ { .major = 13, .minor = 1, .patch = UINT32_MAX }, // inclusive (only 13.1.x)
};

Any major or minor other than 13.1 is rejected; the patch field is read for forward compatibility but never gated on.

The section ID space is dense in [0x00, 0x06] and the 0x00 slot doubles as the end-of-bytecode marker:

IDSectionRequiredReference width
0x00EndOfBytecoderequired (last)none
0x01Stringrequiredu32 offsets
0x02Funcrequiredsequential
0x03Debugoptionalu32 and u64 offsets
0x04Constantoptionalu64 offsets
0x05Typeoptionalu32 offsets
0x06Globaloptionalsequential

Section header padding is 0xCF. The on-disk section order is the producer's choice, but the walker order is fixed: STRING → TYPE → CONSTANT → IR → optional RESOURCE/DEBUG. See MLIR Bytecode Format for the dependency-ordered dispatch.

QUIRK — terminator byte 7 is the file-format split A bytecode container with the first seven bytes identical to upstream MLIR and byte 7 set to 0x00 is TileIR; a container with byte 7 set to '\n' (0x0A) and bytes 8-11 spelling "MLIR" is upstream MLIR. The two file formats share enough framing that a magic-number sniff that only checks bytes 0-2 will mis-classify both as "some MLIR bytecode dialect." A reimplementation that wants to refuse upstream MLIR inputs early must compare all eight bytes — anything less lets stock MLIR bytecode bind to the TileIR header parser and produce mangled tag-table errors several sections in.

Layer 2 — TypeTag Namespace (sub_59C710)

The Type section's per-record tag is a one-byte slot at offset 0 of the payload, followed by a tag-specific operand list. The dense numbering 0..18 is independent of upstream MLIR's BytecodeTypeOpcodes.td:

TagTypeOperands (VarInt count)
0..4i1, i8, i16, i32, i640
5..11f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M20
12Pointer (element type)1
13Tile (element + i64 shape)2 + dim_count
14TensorView (element + shape + strides)3 + dim_count + stride_count
15PartitionView (element + shape + dim-map + mode byte)4 + dim_count + map_count
16Function (input list + result list)2 + input_count + result_count
17Token0
18f8E8M0FNU (extension)0

The trailing f8E8M0FNU extension is an element type — like tags 5..11 it carries no payload of its own. Tag 18 is reachable only as a leaf inside a tile-family aggregate (TileType, TensorViewType, PartitionViewType), so the operand-zero contract holds whether the tag is decoded standalone or through one of the aggregate-type arms.

Layer 3 — AttrTag Numbering (sub_59F100)

The most consequential single constant table in the file. The shipped tileiras AttrTag numbering is wire-format-breaking versus upstream MLIR's mlir/Bytecode/BytecodeEnums.h::AttributeTag. Both tables are reproduced side by side so the divergence is unambiguous:

AttrTagUpstream MLIRTileiras sub_59F100
0(reserved / sentinel)(default-arm; emits "unsupported AttributeTag")
1IntegerAttrStringAttr
2FloatAttrFloatAttr
3BoolAttrTypeAttr
4TypeAttrDenseElementsAttr (int/float)
5StringAttrDenseElementsAttr (string)
6ArrayAttrDivByAttr
7DenseElementsDenseI64ArrayAttr (variant A)
8DivByAttrDenseI64ArrayAttr (variant B)
9SameElementsAttrSameElementsAttr
10DictionaryBoundedAttr (variant 0)
11OptimizationHintsBoundedAttr (variant 1)
12BoundedAttrBoundedAttr (variant 2)
13(no upstream slot)AssumePredicateAttr

Only tag 2 (FloatAttr) matches upstream by coincidence. Every other tag in the 1..13 range disagrees: tag 1 is StringAttr here versus upstream IntegerAttr; tag 4 lands on DenseElementsAttr instead of TypeAttr; tag 5 lands on DenseElementsAttr<string> instead of StringAttr; tag 6 lands on DivByAttr instead of ArrayAttr. Going the other direction, an AssumePredicateAttr emitted by tileiras at tag 13 has no destination in upstream's table at all. Any external tool that needs to round-trip MLIR bytecode through both implementations must freeze the tileiras numbering above; the upstream header is reserved for future stock cuda_tile builds.

The parallel DebugTag namespace at sub_589B90 is private to the Debug section and uses a dense [0..6] range. Tag 0 is the failure sentinel; tags 1-6 cover DICompileUnit, DIFile, DILexicalBlock, DILoc, DISubprogram, CallSite respectively. No upstream LLVM debug-info tag table participates in this dispatcher.

Layer 4 — cuda_tile Opcode Space (sub_5B13D0)

The 110-row cuda_tile opcode table is dense in [0..109] with two reserved holes the dispatcher leaves on the default arm:

RangeStatus
0..24Assigned (absf through exp2)
25..36Reserved hole — emits "unknown or unimplemented opcode: "
37..51Assigned (exti through int_to_ptr)
52..57Reserved hole — emits "unknown or unimplemented opcode: "
58..109Assigned (iota through yield)

Opcode 0x6E (atan2 in the public 13.2 namespace) is absent from this binary. The dispatcher has no case for it and embeds no cuda_tile.atan2 mnemonic string; encoding the op lands on the default arm. This places the binary at a 13.1-vintage opcode-table snapshot.

The full per-opcode mnemonic / handler-address table lives in MLIR Bytecode Format — Operation Opcode Dispatch.

The location-index slot is signed zig-zag LEB128: the value 0x7F after zig-zag decode is -1, which the dispatcher resolves to UnknownLoc (typical of a --lineinfo-less compile).

Layer 5 — NVPTX MatcherTable Pools (XOR-3 Cipher)

The NVPTX AsmWriter ships two .data mnemonic pools obfuscated by a walking XOR cipher. Byte i is XORed with (3 * i) mod 256, decoded once at startup, and cached behind a pointer at qword_5B4F4D0.

void xor3_decode(uint8_t *begin, uint8_t *end) {
    uint8_t key = 0;
    for (uint8_t *p = begin; p != end; ++p) {
        *p ^= key;
        key = (uint8_t)(key + 3);
    }
}

The two pools and their decoder entry points are:

PoolRangeLengthDecoderCached pointer
Opcode mnemonic0x5A4C080 .. 0x5A656F0~105 KBsub_1BD1810qword_5B4F4D0
Physical-register-name0x5A4BE20 .. 0x5A4C06A586 Bsub_1BD1830(post-decode cached)

The cipher is not a security boundary. Its only effect is to prevent a naive strings(1) sweep from surfacing every PTX mnemonic. A reimplementation that does not need binary-for-binary .data parity can store the same strings plainly.

The shape of the decoded pool matches LLVM 21's NVPTXGenAsmWriter output; the pattern-name strings paired with each OPC_* row of the MatcherTable ("setmaxregister", "cp.async.bulk.tensor.group.shared.cluster", "wgmma.mma_async.sync.aligned", "wgmma.fence.sync.aligned", "tcgen05.mma.sync", "tcgen05.mma.ws.sync", "mma.block_scaled.sync.aligned", "mma.sp.sync.aligned.m8n8k16") sit unencrypted in .rodata since they are TableGen pattern records rather than printer-side mnemonic literals.

Layer 6 — NVPTX ProxyReg Whitelist

The post-ISel NVPTXProxyRegErasure peephole uses a contiguous opcode range rather than a named whitelist. The TableGen-side consolidation that landed in LLVM 21 trunk just before the 21.0 cut replaced the older per-type ProxyRegInst<*> template with a four-way emit that produces adjacent indices:

MI opcodeType classTableGen name
3156i16ProxyRegI16
3157i32ProxyRegI32
3158i64ProxyRegI64
3159f32 / f64ProxyRegF

The check at 0x1AE5086 is sub eax, 0xC54 ; cmp eax, 3 — a contiguous range test that costs two x86 instructions. Stock LLVM 18 used a 5-6-element named whitelist, so the contiguous numbering is itself a fingerprint for the LLVM 21 NVPTX backend. Reimplementations cannot pick arbitrary opcode numbers for the typed ProxyReg family without breaking the peephole's hot-path test.

Layer 7 — FTZ-Path Constants in SelectIntrinsic_W_Chain Case 0x66

The per-call FTZ override in case 0x66 of sub_1A854E0 carries two MI opcode literals and one SDNode flag bit that must reproduce exactly:

ConstantValueMeaning
FTZ-path FMA opcode0x65FMA_FTZ; emitted when probe selects FTZ
Non-FTZ-path wrapper opcode0xF7FMA_NON_FTZ; emitted when probe selects IEEE
FTZ-authorization flag bit0x40NoFPExcept reinterpreted as per-node FTZ-authorize signal
Inner FMAD opcode0x63Set with NoFPExcept (0x200) on the FTZ four-instruction chain
INST_WRAPPER opcode (non-FTZ)0xD2Holds chain through ADDRESSOF wrap
CopyToReg opcode0x11Standard LLVM SDNode opcode
MUL_ADD_f32 / MUL_ADD_f64207 / 208MVT-keyed select after the wrapper chain

QUIRK — NoFPExcept flag bit 0x40 repurposed as FTZ-authorization Upstream LLVM treats SDNode flag bit 0x40 (NoFPExcept) as a pure FP-exception -safety advisory: it tells later passes that no FP exception can be raised. In case 0x66 of sub_1A854E0, tileiras reads the same bit before the "unsafe-fp-math" function attribute and treats it as a per-node "authorize FTZ substitution" signal. A combine that legitimately sets NoFPExcept on a single FMA in an otherwise IEEE-denormal function therefore silently switches that one FMA to fma.rn.ftz.f32 (opcode 0x65) instead of the FMA_NON_FTZ wrapper (0xF7). A reimplementation that imports upstream flag semantics will produce different PTX for the same SDAG.

Layer 8 — cvt_packfloat Validator Constants (sub_1A84900)

The four-gate cvt_packfloat validator carries five subtarget-level constants:

ConstantValueGate
SM major floor0x384 (sm_90)Gate 1
PTX version floor0x4D (PTX 7.7)Gate 1
sm_100a SM major0xA0Gates 2 and 3 (UE8M0x2, fp6x2/fp4x2)
sm_100f SM minor0xFGate 4 (family-conditional)
tmem feature byteoffset 80 in subtarget feature array at unk_5BEBD51tcgen05 128-bit atomic guard at sub_1A80A40

QUIRK — atleast typo and mismatched PTX number in gate-one diagnostic Gate one's diagnostic string is "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78": the missing space in atleast is preserved byte-for-byte, and the message advertises PTX >= 78 even though the actual compare is against 0x4D (PTX 7.7, not 7.8). The discrepancy stems from an internal NVIDIA test-suite log scraper that keys on the verbatim string. A reimplementer who "fixes" either the spelling or the number desyncs that scraper without changing behavior.

Layer 9 — LLVM 21 NVPTX Data-Layout Stamp

Every NVPTX module emitted by tileiras carries one verbatim data-layout string, unconditionally stamped before bitcode serialization:

e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-
v128:128:128-n16:32:64

Length: 154 bytes (0x9A). Rodata location: 0x4D079D0. Sole xref: sub_1A4E5C0 at 0x1A4E5D1. Address space 3 (p3:32:32:32) marks NVPTX shared memory as 32-bit-pointer. The string is byte-identical to stock LLVM 21 NVPTX64, and is one of the ten independent fingerprints that pin the LLVM base version in LLVM Fingerprint Table.

Layer 10 — LLVM Bitcode Producer Strings

Two .rodata strings stamp the LLVM base version into every emitted module:

SlotRodata addressLengthVerbatim string
IDENTIFICATION_CODE_STRING0x4F882C413 BLLVM21.0.0git
NVPTX AsmPrinter emitHeader line 3(inside sub_1A56540)variesBased on LLVM 21.0.0git
libNVVM module name (when libNVVM path is taken)(compile-time literal)10 Bmlir-input

The producer string is emitted as the bitcode-writer's IDENTIFICATION subblock record at sub_3935490 (the EnterSubblock(IDENTIFICATION, 5) site). The AsmPrinter header comment block is written at every PTX-emit invocation; the third line of four is the verbatim Based on LLVM 21.0.0git literal, not a runtime-formatted template.

Cross-Layer Constant Index

For a reimplementation walking the wire format top-down, the constants converge on a small handful of source-of-truth dispatchers. The index below maps each constant back to the page that documents its dispatch site at reimplementation depth.

LayerConstant classAuthority page
1Magic bytes, version range, section IDsMLIR Bytecode Format
2TypeTag 0..18MLIR Bytecode Format — Type Tag Dispatch
3AttrTag 0..13, DebugTag 0..6MLIR Bytecode Format — Self-Contained Attribute Dispatch
4cuda_tile opcodes 0..109, reserved holesMLIR Bytecode Format — Operation Opcode Dispatch
5XOR-3 cipher, pool rangesISelDAG and MatcherTable — AsmWriter String Tables
6ProxyReg whitelist [3156, 3159]LLVM Fingerprint Table — Fingerprint 8
7FMA opcodes 0x65 / 0xF7, flag bit 0x40ISelDAG and MatcherTable — NVIDIA-Specific ISel Patches
8cvt_packfloat SM/PTX floors, tmem feature byteISelDAG and MatcherTable — NVIDIA-Specific ISel Patches
9NVPTX64 data-layout stringLLVM Fingerprint Table — Fingerprint 1
10LLVM21.0.0git, Based on LLVM 21.0.0git, mlir-inputLLVM Fingerprint Table — Fingerprints 2, 3

Reimplementation Contract

Three rules summarize the constraint these constants impose on a clean-room reimplementation:

  1. Magic, AttrTag numbering, and cuda_tile opcode table are wire-format invariants. A reimplementation that picks any other byte for offset 7, any other tag-to-attribute-kind mapping in sub_59F100's switch, or any other opcode-to-mnemonic assignment in sub_5B13D0's switch produces bytecode that the shipped reader either rejects or silently mis-decodes.
  2. NVPTX MatcherTable pool ranges, ProxyReg numbering, and FMA opcode numbers are emitter invariants. A reimplementation that ships different bytes here still produces valid PTX, but the binary-for-binary .data and MIR cross-checks NVIDIA's internal regression suite runs against tileiras output will fail.
  3. All diagnostic strings — including the atleast typo, the PTX >= 78 off-by-one, the FileLineColLoc debug-attr naming inheritance — are contract surface. Test-suite log scrapers key on verbatim spelling. "Fixing" any of them is a behavioral change as far as downstream tools are concerned, even though the fix is locally correct.

The shared property across all three rules is that no constant in this page is configuration. Each is either a header-stamped invariant frozen at build time, a table TableGen emitted into the binary at LLVM 21 cut-time, or a literal NVIDIA chose for hand-rolled validator code. A reimplementation that wants compatibility must freeze every one of them.