Wire-Format Constants
A reimplementation of tileiras that aims for byte-for-byte parity with a shipped binary
must reproduce a small set of magic numbers, tag namespaces, opcode tables, and obfuscation
ciphers exactly. Every constant in this page is fingerprintable from a stripped 88 MB
tileiras ELF and verified against the cross-referencing dispatchers documented in
MLIR Bytecode Format,
LLVM Fingerprint Table, and
ISelDAG and MatcherTable. The constants are
not configuration. Changing any one of them produces an artifact that fails to
interoperate with the shipped reader or fails to bind against the AsmWriter's
post-decryption string pool.
The page is organized strictly by layer of the wire format, walking from the outermost envelope down to the innermost emitter. Each section lists the constants the layer defines, the exact byte offsets and lengths where they live in the binary, and the authoritative cross-reference for the dispatch site that reads them. Where a constant is interesting on its own — a typo preserved across builds, an unused mid-slot in a bit-mask, a numbering divergence from upstream — the rationale is captured inline rather than buried in a footnote.
Layer 1 — TileIR Bytecode Envelope
The bytecode container's framing prefix is the single most reproduced constant in this binary. Stock LLVM 18-21 MLIR-bytecode files share the first three bytes; the private TileIR dialect tag occupies bytes 3-6, and the terminator byte at offset 7 separates TileIR from upstream MLIR at the magic-byte level.
| Offset | Byte | Symbolic name | Meaning |
|---|---|---|---|
| 0 | 0x06 | MAGIC_LEN_HI | MLIR-bytecode framing prefix (shared with upstream) |
| 1 | 0x03 | MAGIC_LEN_LO | MLIR-bytecode framing prefix (shared with upstream) |
| 2 | 0x80 | MAGIC_FLAGS | MLIR-bytecode framing prefix (shared with upstream) |
| 3 | 0x54 | dialect byte 1 | 'T' |
| 4 | 0x69 | dialect byte 2 | 'i' |
| 5 | 0x6C | dialect byte 3 | 'l' |
| 6 | 0x65 | dialect byte 4 | 'e' |
| 7 | 0x00 | tileiras terminator | Upstream writes '\n' (start of "\nMLIR") here |
The literal lives at rodata 0x45EBF08 and is compared byte-for-byte by sub_5838A0
against the input buffer; mismatch surfaces a three-fragment diagnostic
("invalid magic number at position " / ", got " / " expected ").
The version block follows immediately after the magic and is a sequence of three
unsigned-LEB128 VarInts: major, minor, optional patch. The accepted range
table at rodata 0x45EBF10 is verbatim:
static const TileVersion supported_versions[] = {
    /*min:*/ { .major = 13, .minor = 1, .patch = 0 },          // inclusive
    /*max:*/ { .major = 13, .minor = 1, .patch = UINT32_MAX }, // inclusive (only 13.1.x)
};
Any major or minor other than 13.1 is rejected; the patch field is read for forward
compatibility but never gated on.
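A minimal sketch of the version gate described above. The helper names `uleb128_read` and `tile_version_accepted` are hypothetical, and the sketch assumes all three VarInts are present; how an absent patch field is signalled is not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one unsigned LEB128 value into a 32-bit slot.
 * Returns the number of bytes consumed, or 0 on truncated or over-long
 * input. Bits beyond 32 are discarded in this sketch. */
static size_t uleb128_read(const uint8_t *p, size_t len, uint32_t *out) {
    uint32_t value = 0;
    for (size_t i = 0; i < len && i < 5; ++i) {
        value |= (uint32_t)(p[i] & 0x7F) << (7 * i);
        if ((p[i] & 0x80) == 0) {   /* continuation bit clear: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Returns 1 if the version block is 13.1.x, 0 otherwise.
 * Assumption: major, minor, and patch are all present. */
static int tile_version_accepted(const uint8_t *p, size_t len) {
    uint32_t major = 0, minor = 0, patch = 0;
    size_t a = uleb128_read(p, len, &major);
    if (a == 0) return 0;
    size_t b = uleb128_read(p + a, len - a, &minor);
    if (b == 0) return 0;
    size_t c = uleb128_read(p + a + b, len - a - b, &patch);
    if (c == 0) return 0;
    (void)patch;                    /* read, but never gated on */
    return major == 13 && minor == 1;
}
```

Any patch value, including the five-byte encoding of `UINT32_MAX`, passes as long as major and minor are 13 and 1.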
The section ID space is dense in [0x00, 0x06] and the 0x00 slot doubles as the
end-of-bytecode marker:
| ID | Section | Required | Reference width |
|---|---|---|---|
| 0x00 | EndOfBytecode | required (last) | none |
| 0x01 | String | required | u32 offsets |
| 0x02 | Func | required | sequential |
| 0x03 | Debug | optional | u32 and u64 offsets |
| 0x04 | Constant | optional | u64 offsets |
| 0x05 | Type | optional | u32 offsets |
| 0x06 | Global | optional | sequential |
Section header padding is 0xCF. The on-disk section order is the producer's
choice, but the walker order is fixed: STRING → TYPE → CONSTANT → IR → optional
RESOURCE/DEBUG. See MLIR Bytecode Format
for the dependency-ordered dispatch.
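The section ID table can be captured as an enum with two small predicates. This is a sketch only: the enumerator and function names are hypothetical, and it assumes the ID has already been extracted from the section header.

```c
#include <stdint.h>

/* Section IDs as listed in the table above. */
enum tile_section_id {
    SEC_END_OF_BYTECODE = 0x00,
    SEC_STRING          = 0x01,
    SEC_FUNC            = 0x02,
    SEC_DEBUG           = 0x03,
    SEC_CONSTANT        = 0x04,
    SEC_TYPE            = 0x05,
    SEC_GLOBAL          = 0x06
};

/* The ID space is dense in [0x00, 0x06]. */
static int tile_section_id_valid(uint8_t id) {
    return id <= SEC_GLOBAL;
}

/* EndOfBytecode, String, and Func are the rows marked required. */
static int tile_section_required(uint8_t id) {
    return id <= SEC_FUNC;
}
```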
⚡ QUIRK — terminator byte 7 is the file-format split
A bytecode container with the first seven bytes identical to upstream MLIR and byte 7 set to 0x00 is TileIR; a container with byte 7 set to '\n' (0x0A) and bytes 8-11 spelling "MLIR" is upstream MLIR. The two file formats share enough framing that a magic-number sniff that only checks bytes 0-2 will mis-classify both as "some MLIR bytecode dialect." A reimplementation that wants to refuse upstream MLIR inputs early must compare all eight bytes; anything less lets stock MLIR bytecode bind to the TileIR header parser and produce mangled tag-table errors several sections in.
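A sketch of an eight-byte sniff that follows the quirk text as written. The enum and function names are hypothetical, and the upstream branch simply encodes the description above (seven shared bytes, '\n' at offset 7, "MLIR" at offsets 8-11) rather than an independently verified upstream layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum magic_kind { MAGIC_UNKNOWN = 0, MAGIC_TILEIR, MAGIC_UPSTREAM_MLIR };

/* The eight magic bytes from the Layer 1 table. */
static const uint8_t k_tileir_magic[8] = {
    0x06, 0x03, 0x80, 'T', 'i', 'l', 'e', 0x00
};

static enum magic_kind classify_magic(const uint8_t *buf, size_t len) {
    /* Fewer than eight bytes can never be classified safely. */
    if (len < 8 || memcmp(buf, k_tileir_magic, 7) != 0)
        return MAGIC_UNKNOWN;
    if (buf[7] == 0x00)
        return MAGIC_TILEIR;
    /* Upstream, per the quirk text: '\n' then "MLIR". */
    if (buf[7] == '\n' && len >= 12 && memcmp(buf + 8, "MLIR", 4) == 0)
        return MAGIC_UPSTREAM_MLIR;
    return MAGIC_UNKNOWN;
}
```

A reader would call `classify_magic` before touching the version block and refuse anything other than `MAGIC_TILEIR`.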
Layer 2 — TypeTag Namespace (sub_59C710)
The Type section's per-record tag is a one-byte slot at offset 0 of the payload,
followed by a tag-specific operand list. The dense numbering 0..18 is independent
of upstream MLIR's BytecodeTypeOpcodes.td:
| Tag | Type | Operands (VarInt count) |
|---|---|---|
| 0..4 | i1, i8, i16, i32, i64 | 0 |
| 5..11 | f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2 | 0 |
| 12 | Pointer (element type) | 1 |
| 13 | Tile (element + i64 shape) | 2 + dim_count |
| 14 | TensorView (element + shape + strides) | 3 + dim_count + stride_count |
| 15 | PartitionView (element + shape + dim-map + mode byte) | 4 + dim_count + map_count |
| 16 | Function (input list + result list) | 2 + input_count + result_count |
| 17 | Token | 0 |
| 18 | f8E8M0FNU (extension) | 0 |
The trailing f8E8M0FNU extension is an element type — like tags 5..11 it carries
no payload of its own. Tag 18 is reachable only as a leaf inside a tile-family
aggregate (TileType, TensorViewType, PartitionViewType), so the operand-zero
contract holds whether the tag is decoded standalone or through one of the
aggregate-type arms.
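The operand-zero contract can be expressed as a small predicate over the tag byte. This is an illustrative sketch; `type_tag_is_leaf` is a hypothetical name, not a function recovered from the binary.

```c
#include <stdint.h>

/* Returns 1 when the tag carries no operands (tags 0..11, 17, 18),
 * 0 when it carries an operand list (tags 12..16),
 * and -1 when it falls outside the dense 0..18 range. */
static int type_tag_is_leaf(uint8_t tag) {
    if (tag > 18)
        return -1;
    return tag <= 11 || tag == 17 || tag == 18;
}
```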
Layer 3 — AttrTag Numbering (sub_59F100)
The most consequential single constant table in the file. The shipped tileiras
AttrTag numbering is wire-format-breaking versus upstream MLIR's
mlir/Bytecode/BytecodeEnums.h::AttributeTag. Both tables are reproduced side by
side so the divergence is unambiguous:
| AttrTag | Upstream MLIR | Tileiras sub_59F100 |
|---|---|---|
| 0 | (reserved / sentinel) | (default-arm; emits "unsupported AttributeTag") |
| 1 | IntegerAttr | StringAttr |
| 2 | FloatAttr | FloatAttr |
| 3 | BoolAttr | TypeAttr |
| 4 | TypeAttr | DenseElementsAttr (int/float) |
| 5 | StringAttr | DenseElementsAttr (string) |
| 6 | ArrayAttr | DivByAttr |
| 7 | DenseElements | DenseI64ArrayAttr (variant A) |
| 8 | DivByAttr | DenseI64ArrayAttr (variant B) |
| 9 | SameElementsAttr | SameElementsAttr |
| 10 | Dictionary | BoundedAttr (variant 0) |
| 11 | OptimizationHints | BoundedAttr (variant 1) |
| 12 | BoundedAttr | BoundedAttr (variant 2) |
| 13 | (no upstream slot) | AssumePredicateAttr |
Only tags 2 (FloatAttr) and 9 (SameElementsAttr) match upstream, and only by coincidence. Every other tag in the
1..13 range disagrees: tag 1 is StringAttr here versus upstream IntegerAttr;
tag 4 lands on DenseElementsAttr instead of TypeAttr; tag 5 lands on
DenseElementsAttr<string> instead of StringAttr; tag 6 lands on DivByAttr
instead of ArrayAttr. Going the other direction, an AssumePredicateAttr emitted
by tileiras at tag 13 has no destination in upstream's table at all. Any external
tool that needs to round-trip MLIR bytecode through both implementations must
freeze the tileiras numbering above; the upstream header is reserved for future
stock cuda_tile builds.
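One way to freeze the tileiras column of the table is a lookup array indexed by tag. The array and function names below are hypothetical; the strings are copied from the table, and tag 0 maps to NULL to stand in for the default arm.

```c
#include <stddef.h>
#include <stdint.h>

/* Tileiras AttrTag numbering, indexed by tag value. */
static const char *const k_tileiras_attr_names[14] = {
    NULL,                             /* 0: default arm */
    "StringAttr",                     /* 1 */
    "FloatAttr",                      /* 2 */
    "TypeAttr",                       /* 3 */
    "DenseElementsAttr (int/float)",  /* 4 */
    "DenseElementsAttr (string)",     /* 5 */
    "DivByAttr",                      /* 6 */
    "DenseI64ArrayAttr (variant A)",  /* 7 */
    "DenseI64ArrayAttr (variant B)",  /* 8 */
    "SameElementsAttr",               /* 9 */
    "BoundedAttr (variant 0)",        /* 10 */
    "BoundedAttr (variant 1)",        /* 11 */
    "BoundedAttr (variant 2)",        /* 12 */
    "AssumePredicateAttr"             /* 13 */
};

/* Returns the attribute kind for a tag, or NULL for tag 0 and any tag
 * above 13 (the "unsupported AttributeTag" cases). */
static const char *tileiras_attr_name(uint8_t tag) {
    return tag < 14 ? k_tileiras_attr_names[tag] : NULL;
}
```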
The parallel DebugTag namespace at sub_589B90 is private to the Debug section and
uses a dense [0..6] range. Tag 0 is the failure sentinel; tags 1-6 cover
DICompileUnit, DIFile, DILexicalBlock, DILoc, DISubprogram,
CallSite respectively. No upstream LLVM debug-info tag table participates in
this dispatcher.
Layer 4 — cuda_tile Opcode Space (sub_5B13D0)
The 110-row cuda_tile opcode table is dense in [0..109] with two reserved
holes the dispatcher leaves on the default arm:
| Range | Status |
|---|---|
| 0..24 | Assigned (absf through exp2) |
| 25..36 | Reserved hole — emits "unknown or unimplemented opcode: " |
| 37..51 | Assigned (exti through int_to_ptr) |
| 52..57 | Reserved hole — emits "unknown or unimplemented opcode: " |
| 58..109 | Assigned (iota through yield) |
Opcode 0x6E (atan2 in the public 13.2 namespace) is absent from this binary.
The dispatcher has no case for it and embeds no cuda_tile.atan2 mnemonic string;
encoding the op lands on the default arm. This places the binary at a 13.1-vintage
opcode-table snapshot.
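The range table reduces to a predicate that a writer can use to refuse opcodes the shipped dispatcher would send to its default arm. A sketch, with `cuda_tile_opcode_assigned` as a hypothetical name:

```c
#include <stdint.h>

/* Returns 1 for opcodes in the assigned ranges 0..24, 37..51, 58..109;
 * 0 for the two reserved holes and for anything above 109 (including
 * 0x6E, which this binary does not implement). */
static int cuda_tile_opcode_assigned(uint32_t op) {
    if (op > 109)
        return 0;
    if (op >= 25 && op <= 36)
        return 0;
    if (op >= 52 && op <= 57)
        return 0;
    return 1;
}
```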
The full per-opcode mnemonic / handler-address table lives in MLIR Bytecode Format — Operation Opcode Dispatch.
The location-index slot is a signed LEB128 value. The single byte 0x7F decodes to -1 by
sign extension (a zig-zag mapping would encode -1 as 0x01 instead), and the dispatcher
resolves -1 to UnknownLoc (typical of a --lineinfo-less compile).
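The sketch below decodes a signed LEB128 value with sign extension, which is the reading under which the single byte 0x7F yields -1. The function name `sleb128_read` is hypothetical, and treating the slot as sign-extending is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one signed LEB128 value with sign extension.
 * Returns the number of bytes consumed, or 0 on truncated or
 * over-long input. */
static size_t sleb128_read(const uint8_t *p, size_t len, int64_t *out) {
    uint64_t value = 0;
    unsigned shift = 0;
    for (size_t i = 0; i < len && i < 10; ++i) {
        uint8_t byte = p[i];
        value |= (uint64_t)(byte & 0x7F) << shift;
        shift += 7;
        if ((byte & 0x80) == 0) {
            /* Bit 6 of the final byte is the sign bit. */
            if (shift < 64 && (byte & 0x40))
                value |= ~(uint64_t)0 << shift;
            *out = (int64_t)value;
            return i + 1;
        }
    }
    return 0;
}
```

A caller would compare the decoded value against -1 and substitute UnknownLoc in that case.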
Layer 5 — NVPTX MatcherTable Pools (XOR-3 Cipher)
The NVPTX AsmWriter ships two .data mnemonic pools obfuscated by a walking
XOR cipher. Byte i is XORed with (3 * i) mod 256, decoded once at startup,
and cached behind a pointer at qword_5B4F4D0.
void xor3_decode(uint8_t *begin, uint8_t *end) {
    uint8_t key = 0;                 /* key for byte i is (3 * i) mod 256 */
    for (uint8_t *p = begin; p != end; ++p) {
        *p ^= key;                   /* decode in place */
        key = (uint8_t)(key + 3);    /* wraps modulo 256 */
    }
}
The two pools and their decoder entry points are:
| Pool | Range | Length | Decoder | Cached pointer |
|---|---|---|---|---|
| Opcode mnemonic | 0x5A4C080 .. 0x5A656F0 | 104,048 B (~102 KiB) | sub_1BD1810 | qword_5B4F4D0 |
| Physical-register-name | 0x5A4BE20 .. 0x5A4C06A | 586 B | sub_1BD1830 | (post-decode cached) |
The cipher is not a security boundary. Its only effect is to prevent a naive
strings(1) sweep from surfacing every PTX mnemonic. A reimplementation that
does not need binary-for-binary .data parity can store the same strings plainly.
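Because the key depends only on byte position, the same routine both encodes and decodes. The sketch below shows that round trip for a reimplementation that does want obfuscated pools; `xor3_apply` and `xor3_round_trips` are hypothetical names, and the 64-byte scratch buffer is an arbitrary choice for the example.

```c
#include <stdint.h>
#include <string.h>

/* Walking XOR: byte i is XORed with (3 * i) mod 256.
 * Applying it twice restores the original bytes. */
static void xor3_apply(uint8_t *begin, uint8_t *end) {
    uint8_t key = 0;
    for (uint8_t *p = begin; p != end; ++p) {
        *p ^= key;
        key = (uint8_t)(key + 3);
    }
}

/* Returns 1 if encoding then decoding restores the input text. */
static int xor3_round_trips(const char *text) {
    uint8_t buf[64];
    size_t n = strlen(text);
    if (n > sizeof buf)
        return 0;
    memcpy(buf, text, n);
    xor3_apply(buf, buf + n);   /* encode */
    xor3_apply(buf, buf + n);   /* decode */
    return memcmp(buf, text, n) == 0;
}
```

Note that byte 0 is always stored unchanged, since its key is zero.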
The shape of the decoded pool matches LLVM 21's NVPTXGenAsmWriter output; the
pattern-name strings paired with each OPC_* row of the MatcherTable
("setmaxregister", "cp.async.bulk.tensor.group.shared.cluster",
"wgmma.mma_async.sync.aligned", "wgmma.fence.sync.aligned",
"tcgen05.mma.sync", "tcgen05.mma.ws.sync", "mma.block_scaled.sync.aligned",
"mma.sp.sync.aligned.m8n8k16") sit unencrypted in .rodata since they are
TableGen pattern records rather than printer-side mnemonic literals.
Layer 6 — NVPTX ProxyReg Whitelist
The post-ISel NVPTXProxyRegErasure peephole uses a contiguous opcode range
rather than a named whitelist. The TableGen-side consolidation that landed
in LLVM 21 trunk just before the 21.0 cut replaced the older per-type
ProxyRegInst<*> template with a four-way emit that produces adjacent
indices:
| MI opcode | Type class | TableGen name |
|---|---|---|
| 3156 | i16 | ProxyRegI16 |
| 3157 | i32 | ProxyRegI32 |
| 3158 | i64 | ProxyRegI64 |
| 3159 | f32 / f64 | ProxyRegF |
The check at 0x1AE5086 is sub eax, 0xC54 ; cmp eax, 3 — a contiguous range
test that costs two x86 instructions. Stock LLVM 18 used a 5-6-element named
whitelist, so the contiguous numbering is itself a fingerprint for the LLVM 21
NVPTX backend. Reimplementations cannot pick arbitrary opcode numbers for the
typed ProxyReg family without breaking the peephole's hot-path test.
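The two-instruction test corresponds to the usual unsigned-subtract range idiom. A sketch in C, with hypothetical names:

```c
#include <stdint.h>

enum {
    PROXYREG_FIRST = 3156,  /* 0xC54, ProxyRegI16 */
    PROXYREG_SPAN  = 3      /* last opcode is 3159, ProxyRegF */
};

/* Mirrors `sub eax, 0xC54 ; cmp eax, 3`: the unsigned subtraction wraps
 * for opcodes below 3156, so a single compare covers both bounds. */
static int is_proxy_reg_opcode(uint32_t mi_opcode) {
    return (uint32_t)(mi_opcode - PROXYREG_FIRST) <= PROXYREG_SPAN;
}
```

The idiom only works because the four opcodes are adjacent, which is why the numbering cannot be chosen freely.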
Layer 7 — FTZ-Path Constants in SelectIntrinsic_W_Chain Case 0x66
The per-call FTZ override in case 0x66 of sub_1A854E0 carries two MI opcode
literals and one SDNode flag bit that must reproduce exactly, listed here with the supporting opcodes of the surrounding chain:
| Constant | Value | Meaning |
|---|---|---|
| FTZ-path FMA opcode | 0x65 | FMA_FTZ; emitted when probe selects FTZ |
| Non-FTZ-path wrapper opcode | 0xF7 | FMA_NON_FTZ; emitted when probe selects IEEE |
| FTZ-authorization flag bit | 0x40 | NoFPExcept reinterpreted as per-node FTZ-authorize signal |
| Inner FMAD opcode | 0x63 | Set with NoFPExcept (0x200) on the FTZ four-instruction chain |
| INST_WRAPPER opcode (non-FTZ) | 0xD2 | Holds chain through ADDRESSOF wrap |
| CopyToReg opcode | 0x11 | Standard LLVM SDNode opcode |
| MUL_ADD_f32 / MUL_ADD_f64 | 207 / 208 | MVT-keyed select after the wrapper chain |
⚡ QUIRK — NoFPExcept flag bit 0x40 repurposed as FTZ-authorization
Upstream LLVM treats SDNode flag bit 0x40 (NoFPExcept) as a pure FP-exception-safety advisory: it tells later passes that no FP exception can be raised. In case 0x66 of sub_1A854E0, tileiras reads the same bit before the "unsafe-fp-math" function attribute and treats it as a per-node "authorize FTZ substitution" signal. A combine that legitimately sets NoFPExcept on a single FMA in an otherwise IEEE-denormal function therefore silently switches that one FMA to fma.rn.ftz.f32 (opcode 0x65) instead of the FMA_NON_FTZ wrapper (0xF7). A reimplementation that imports upstream flag semantics will produce different PTX for the same SDAG.
Layer 8 — cvt_packfloat Validator Constants (sub_1A84900)
The four-gate cvt_packfloat validator carries five subtarget-level constants:
| Constant | Value | Gate |
|---|---|---|
| SM major floor | 0x384 (sm_90) | Gate 1 |
| PTX version floor | 0x4D (PTX 7.7) | Gate 1 |
| sm_100a SM major | 0xA0 | Gates 2 and 3 (UE8M0x2, fp6x2/fp4x2) |
| sm_100f SM minor | 0xF | Gate 4 (family-conditional) |
| tmem feature byte | offset 80 in subtarget feature array at unk_5BEBD51 | tcgen05 128-bit atomic guard at sub_1A80A40 |
⚡ QUIRK — atleast typo and mismatched PTX number in gate-one diagnostic
Gate one's diagnostic string is "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78": the missing space in atleast is preserved byte-for-byte, and the message advertises PTX >= 78 even though the actual compare is against 0x4D (PTX 7.7, not 7.8). The discrepancy stems from an internal NVIDIA test-suite log scraper that keys on the verbatim string. A reimplementer who "fixes" either the spelling or the number desyncs that scraper without changing behavior.
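A sketch of gate one that keeps the compare values and the diagnostic text separate, so neither can drift. The names are hypothetical, and combining the two floors with a logical AND is an assumption drawn from the wording of the diagnostic.

```c
#include <stddef.h>
#include <stdint.h>

#define CVT_PACKFLOAT_SM_FLOOR  0x384u  /* sm_90 */
#define CVT_PACKFLOAT_PTX_FLOOR 0x4Du   /* PTX 7.7 */

/* Verbatim diagnostic: keep "atleast" and "78" exactly as shipped. */
static const char k_gate1_diag[] =
    "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78";

/* Returns NULL when gate one passes, otherwise the diagnostic string.
 * Assumption: both floors must hold for the gate to pass. */
static const char *cvt_packfloat_gate1(uint32_t sm, uint32_t ptx) {
    if (sm >= CVT_PACKFLOAT_SM_FLOOR && ptx >= CVT_PACKFLOAT_PTX_FLOOR)
        return NULL;
    return k_gate1_diag;
}
```

With these values, PTX 7.7 (0x4D) passes the gate even though the message text says 78.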
Layer 9 — LLVM 21 NVPTX Data-Layout Stamp
Every NVPTX module emitted by tileiras carries one verbatim data-layout string, unconditionally stamped before bitcode serialization:
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-
v128:128:128-n16:32:64
Length: 154 bytes (0x9A). Rodata location: 0x4D079D0. Sole xref:
sub_1A4E5C0 at 0x1A4E5D1. Address space 3 (p3:32:32:32) marks
NVPTX shared memory as 32-bit-pointer. The string is byte-identical to
stock LLVM 21 NVPTX64, and is one of the ten independent fingerprints
that pin the LLVM base version in LLVM Fingerprint Table.
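The stated length can be checked mechanically by holding the string as one literal. In this sketch the identifier names are hypothetical; the literal is the three lines above joined without separators.

```c
#include <string.h>

/* Data-layout string, split across source lines only for readability. */
static const char k_nvptx_data_layout[] =
    "e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-"
    "i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-"
    "v128:128:128-n16:32:64";

/* Length in bytes, excluding the terminating NUL. */
static size_t nvptx_data_layout_length(void) {
    return strlen(k_nvptx_data_layout);
}
```

The function returns 154 (0x9A), matching the length quoted for the rodata literal.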
Layer 10 — LLVM Bitcode Producer Strings
Two .rodata strings stamp the LLVM base version into every emitted module; the table also lists the module-name literal used on the libNVVM path:
| Slot | Rodata address | Length | Verbatim string |
|---|---|---|---|
| IDENTIFICATION_CODE_STRING | 0x4F882C4 | 13 B | LLVM21.0.0git |
| NVPTX AsmPrinter emitHeader line 3 | (inside sub_1A56540) | varies | Based on LLVM 21.0.0git |
| libNVVM module name (when libNVVM path is taken) | (compile-time literal) | 10 B | mlir-input |
The producer string is emitted as the bitcode-writer's IDENTIFICATION subblock
record at sub_3935490 (the EnterSubblock(IDENTIFICATION, 5) site). The
AsmPrinter header comment block is written at every PTX-emit invocation; the
third line of four is the verbatim Based on LLVM 21.0.0git literal, not a
runtime-formatted template.
Cross-Layer Constant Index
For a reimplementation walking the wire format top-down, the constants converge on a small handful of source-of-truth dispatchers. The index below maps each constant back to the page that documents its dispatch site at reimplementation depth.
| Layer | Constant class | Authority page |
|---|---|---|
| 1 | Magic bytes, version range, section IDs | MLIR Bytecode Format |
| 2 | TypeTag 0..18 | MLIR Bytecode Format — Type Tag Dispatch |
| 3 | AttrTag 0..13, DebugTag 0..6 | MLIR Bytecode Format — Self-Contained Attribute Dispatch |
| 4 | cuda_tile opcodes 0..109, reserved holes | MLIR Bytecode Format — Operation Opcode Dispatch |
| 5 | XOR-3 cipher, pool ranges | ISelDAG and MatcherTable — AsmWriter String Tables |
| 6 | ProxyReg whitelist [3156, 3159] | LLVM Fingerprint Table — Fingerprint 8 |
| 7 | FMA opcodes 0x65 / 0xF7, flag bit 0x40 | ISelDAG and MatcherTable — NVIDIA-Specific ISel Patches |
| 8 | cvt_packfloat SM/PTX floors, tmem feature byte | ISelDAG and MatcherTable — NVIDIA-Specific ISel Patches |
| 9 | NVPTX64 data-layout string | LLVM Fingerprint Table — Fingerprint 1 |
| 10 | LLVM21.0.0git, Based on LLVM 21.0.0git, mlir-input | LLVM Fingerprint Table — Fingerprints 2, 3 |
Reimplementation Contract
Three rules summarize the constraint these constants impose on a clean-room reimplementation:
- Magic, AttrTag numbering, and cuda_tile opcode table are wire-format invariants. A reimplementation that picks any other byte for offset 7, any other tag-to-attribute-kind mapping in sub_59F100's switch, or any other opcode-to-mnemonic assignment in sub_5B13D0's switch produces bytecode that the shipped reader either rejects or silently mis-decodes.
- NVPTX MatcherTable pool ranges, ProxyReg numbering, and FMA opcode numbers are emitter invariants. A reimplementation that ships different bytes here still produces valid PTX, but the binary-for-binary .data and MIR cross-checks NVIDIA's internal regression suite runs against tileiras output will fail.
- All diagnostic strings, including the atleast typo, the PTX >= 78 off-by-one, and the FileLineColLoc debug-attr naming inheritance, are contract surface. Test-suite log scrapers key on verbatim spelling. "Fixing" any of them is a behavioral change as far as downstream tools are concerned, even though the fix is locally correct.
The shared property across all three rules is that no constant in this page is configuration. Each is either a header-stamped invariant frozen at build time, a table TableGen emitted into the binary at LLVM 21 cut-time, or a literal NVIDIA chose for hand-rolled validator code. A reimplementation that wants compatibility must freeze every one of them.