Driver Program Handle

Abstract

A single 104-byte program handle represents one TileIR translation unit in the tileiras driver. The public C-API surface exposes only an opaque pointer, but the recovered allocation is a fixed 0x68-byte block built by sub_57A480 (tileirasProgramCreate) and consumed by sub_57A8E0 (tileirasProgramCompile), sub_57A850 (tileirasProgramGetOutput), and sub_57A7C0 (tileirasProgramRelease). Every offset is reachable from the four public entry points, and the layout stays stable across the three create/compile/release call sites in the driver binary.

The handle stores the validated driver configuration, a small ownership flag, and an inline byte view that serves as a CUDA-root pointer during early lifetime and as the compiled-output byte span after compile. Storage at +0x48 has a two-phase life: it belongs to the resolved CUDA install root while the front end runs, then is repurposed to track the compiled output buffer once sub_57A8E0 finishes.

Public Error Codes

Every entry point in the driver's public C API returns a small integer status. Five non-zero codes are emitted across the four functions, and every diagnostic routes through sub_578D40 with a packed severity value. The severity values 259, 260, and 2563 follow the (class | (op_prefix << 8) | (trace << 9)) form documented in Diagnostic ABI and Helpers: 259 and 260 are fatal driver errors; 2563 is a user-input rejection.

| Code | Trigger | Verbatim diagnostic | Severity |
|---|---|---|---|
| 0 | success | (none) | |
| 1 | allocation failure (sub_44A8C20(0x68) returns NULL) | failed to allocate memory for program | 259 |
| 2 | null config (out_program ok) | configuration is null | 2563 |
| 2 | null inputBuffer | input buffer is null | 2563 |
| 2 | opt_level != 0 && device_debug == 1 | optimized debugging is not supported, change optimization level to 0 or disable full debug info | 2563 |
| 2 | invalid GPU id (not in {100, 103, 110, 120, 121}) | unsupported GPU target | 2563 |
| 2 | opt_level > 3 | invalid optimization level | 2563 |
| 2 | unsupported host arch (not in {0, 1, 2}) | unsupported host architecture | 2563 |
| 2 | unsupported host OS (not in {0, 1}) | unsupported host operating system | 2563 |
| 3 | parse failure on TileIR bytecode magic | input does not correspond to Tile IR bytecode | 260 |
| 3 | parse failure with MLIR fall-through | failed to parse IR bytecode (it looks like MLIR bytecode instead) | 260 |
| 4 | null program (Compile, GetOutput, Release) | program is null | 2563 |
| 4 | null output pointer (GetOutput) | output pointer is null | 2563 |
| 4 | output requested before compile (GetOutput) | program has not been compiled | 2563 |
| 5 | compile failure (tileirasProgramCompile) | failed to compile Tile IR program | 259 |
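
The severity packing can be checked by hand with a small decoder. This is a minimal sketch that assumes the class occupies the low 8 bits, op_prefix is the single bit 8, and trace is everything from bit 9 upward, which is what the quoted formula implies. The type and function names are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field names; only the shift positions come from the formula
   class | (op_prefix << 8) | (trace << 9). */
typedef struct {
    uint32_t cls;        /* bits 0..7 (assumed width) */
    uint32_t op_prefix;  /* bit 8 */
    uint32_t trace;      /* bits 9 and above */
} SeverityFields;

/* Packs the three fields into one severity value. */
uint32_t severity_pack(uint32_t cls, uint32_t op_prefix, uint32_t trace) {
    assert(cls <= 0xFFu && op_prefix <= 1u);
    return cls | (op_prefix << 8) | (trace << 9);
}

/* Splits a packed severity value back into its fields. */
SeverityFields severity_unpack(uint32_t packed) {
    SeverityFields f;
    f.cls = packed & 0xFFu;
    f.op_prefix = (packed >> 8) & 1u;
    f.trace = packed >> 9;
    return f;
}
```

Under these assumptions 259 decodes to class 3 with the op_prefix bit set, 260 to class 4 with the op_prefix bit set, and 2563 to class 3 with trace 5 and no op_prefix.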

Code 1 is reserved for the single allocation site at sub_44A8C20(0x68). Code 2 covers every configuration-level rejection. Code 3 is reserved for bytecode-parse failures and is the only code with a conditional suffix appended by a magic-tail heuristic. Code 4 is the shared null-handle / not-compiled rejection: Compile, GetOutput, and Release all emit it, and the tileirasProgramCreate validation listing returns it as well when out_program is null. Code 5 is the compile-time failure code emitted by sub_57A8E0.

The MLIR fall-through on code 3 is a small heuristic inside sub_57A480. When the bytecode magic check fails, the function scans the input for the 8-byte ASCII tail e IR byt — the suffix of MLIR bytecode minus the leading ML — and on a match appends (it looks like MLIR bytecode instead) to the base code-3 diagnostic so the user can route their bytecode to the right tool.
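
The scan described above amounts to a fixed-width substring search. The following is a minimal sketch of that idea, assuming a plain linear scan over the whole input; the function name has_mlir_tail is hypothetical, and the real routine may restrict the search window or compare differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The 8-byte ASCII tail quoted in the text. */
static const char kTail[8] = {'e', ' ', 'I', 'R', ' ', 'b', 'y', 't'};
_Static_assert(sizeof kTail == 8, "tail must be exactly 8 bytes");

/* Returns 1 if the tail occurs anywhere in buf[0..len), else 0.
   Hypothetical helper: a linear scan is assumed, not recovered. */
int has_mlir_tail(const unsigned char *buf, size_t len) {
    if (buf == NULL || len < sizeof kTail) {
        return 0;
    }
    for (size_t i = 0; i + sizeof kTail <= len; ++i) {
        if (memcmp(buf + i, kTail, sizeof kTail) == 0) {
            return 1;
        }
    }
    return 0;
}
```

A caller would run this only after the magic check has already failed, and use the result solely to decide whether the suffix is appended; the return code stays 3 either way.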

tileirasProgramCreate Validation Order

sub_57A480 is an 820-byte routine that funnels every diagnostic through sub_578D40. Validation order is fixed and observable from the call sites — a caller can rely on each check happening before the next, because every early return is a separate diagnostic with a separate severity field.

int tileirasProgramCreate(TileirasProgram **out_program, const TileirasConfig *config) {
    if (out_program == NULL)                                    return 4;  // "program is null"
    if (config == NULL)                                         return 2;  // "configuration is null"
    if (config->input_buffer_data == NULL)                      return 2;  // "input buffer is null"
    if (!sub_57FF40(config->input_buffer_data,
                    config->input_buffer_size))                 return 3;  // parse probe; MLIR tail check appends suffix
    if (!is_supported_gpu(config->gpu_id))                      return 2;  // "unsupported GPU target"
    if ((uint32_t)config->opt_level > 3)                        return 2;  // "invalid optimization level"
    if (config->opt_level != 0 && config->device_debug == 1)    return 2;  // "optimized debugging is not supported, ..."
    if ((uint32_t)config->host_arch > 2)                        return 2;  // "unsupported host architecture"
    if ((uint32_t)config->host_os > 1)                          return 2;  // "unsupported host operating system"

    TileirasProgram *p = (TileirasProgram *)sub_44A8C20(0x68);
    if (p == NULL)                                              return 1;  // "failed to allocate memory for program"

    copy_config_into_handle(p, config);
    p->owning_flag = 0;
    *out_program = p;
    return 0;
}

The nine predicate gates are pure functions of the caller's argument tuple. Only after every predicate passes does sub_57A480 request the 104-byte block from the allocator. Allocation is the last possible failure point, so a successful return from tileirasProgramCreate guarantees the handle is initialized end-to-end.
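
Because the order is fixed, a configuration that violates several rules always reports the earliest one. The sketch below models only the option checks (GPU id onward) in a standalone form so the ordering can be exercised. OptionSketch, OptionCheck, and first_failing_option are hypothetical names; the whitelist and bounds are the ones listed in the error-code table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view holding only the validated option fields. */
typedef struct {
    int32_t gpu_id;
    int32_t opt_level;
    int32_t device_debug;
    int32_t host_arch;
    int32_t host_os;
} OptionSketch;

typedef enum {
    OPT_OK = 0,
    OPT_BAD_GPU,         /* "unsupported GPU target" */
    OPT_BAD_LEVEL,       /* "invalid optimization level" */
    OPT_DEBUG_WITH_OPT,  /* "optimized debugging is not supported, ..." */
    OPT_BAD_ARCH,        /* "unsupported host architecture" */
    OPT_BAD_OS           /* "unsupported host operating system" */
} OptionCheck;

/* Whitelist from the error-code table: {100, 103, 110, 120, 121}. */
int is_supported_gpu(int32_t gpu_id) {
    static const int32_t allowed[] = {100, 103, 110, 120, 121};
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; ++i) {
        if (allowed[i] == gpu_id) {
            return 1;
        }
    }
    return 0;
}

/* Returns the first failing check, in the documented order. */
OptionCheck first_failing_option(const OptionSketch *o) {
    assert(o != NULL);
    if (!is_supported_gpu(o->gpu_id))                return OPT_BAD_GPU;
    /* The unsigned cast also rejects negative values. */
    if ((uint32_t)o->opt_level > 3)                  return OPT_BAD_LEVEL;
    if (o->opt_level != 0 && o->device_debug == 1)   return OPT_DEBUG_WITH_OPT;
    if ((uint32_t)o->host_arch > 2)                  return OPT_BAD_ARCH;
    if ((uint32_t)o->host_os > 1)                    return OPT_BAD_OS;
    return OPT_OK;
}
```

For example, a configuration with opt_level 4 and device_debug 1 reports the optimization-level error rather than the optimized-debugging one, because the range check runs first.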

tileirasConfig Layout

The configuration is a 16-byte aligned block whose first 80 bytes mirror the front of the program handle. sub_57A480 reads it through five _mm_loadu_si128 loads plus one scalar slot, pinning the layout to exactly five 16-byte rows.

typedef struct TileirasConfig {
    /*+0x00*/ const void *input_buffer_data;     // bytecode bytes
    /*+0x08*/ size_t      input_buffer_size;     // bytes in buffer
    /*+0x10*/ int32_t     gpu_id;                // 100/103/110/120/121 (sub_57A450 whitelist)
    /*+0x14*/ int32_t     opt_level;             // 0..3
    /*+0x18*/ int32_t     device_debug;          // 0 or 1
    /*+0x1C*/ int32_t     lineinfo;              // 0 or 1
    /*+0x20*/ int32_t     host_arch;             // 0=x86_64, 1=aarch64, 2=arm64ec
    /*+0x24*/ int32_t     host_os;               // 0=linux, 1=windows
    /*+0x28*/ int32_t     sanitize;              // 0=off, 1=memcheck
    /*+0x2C*/ uint32_t    pad_2C;
    /*+0x30*/ /* std::string SSO */              // output file name (driver side)
    /*+0x40*/ const char *cuda_root_ptr;         // resolved by sub_5773C0
    /*+0x48*/ size_t      cuda_root_len;
} TileirasConfig;

The fields at +0x10..+0x28 are the validated driver options. The CUDA root string at +0x40 is the resolution of the CUDA_ROOT / CUDA_HOME / CUDA_PATH environment chain (with /proc/self/exe fallback) performed by sub_5773C0; tileiras carries it into the program handle because the compile dispatch needs an installation path to invoke nvdisasm.
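
The environment chain can be pictured as a first-match lookup. This is a sketch only: it assumes the variables are consulted in the order they are listed above and that an empty value counts as unset, neither of which is confirmed by the recovered code. The /proc/self/exe fallback is not modeled, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Returns the first candidate that is set and non-empty, or NULL.
   Priority order (root, home, path) is an assumption. */
const char *pick_cuda_root(const char *cuda_root,
                           const char *cuda_home,
                           const char *cuda_path) {
    const char *candidates[3] = {cuda_root, cuda_home, cuda_path};
    for (size_t i = 0; i < 3; ++i) {
        if (candidates[i] != NULL && candidates[i][0] != '\0') {
            return candidates[i];
        }
    }
    return NULL;  /* a real driver would try the executable-path fallback here */
}

/* Convenience wrapper that reads the process environment. */
const char *resolve_cuda_root_from_env(void) {
    return pick_cuda_root(getenv("CUDA_ROOT"),
                          getenv("CUDA_HOME"),
                          getenv("CUDA_PATH"));
}
```

Splitting the selection from the getenv calls keeps the priority rule testable without touching the process environment.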

TileirasProgram Layout

The 104-byte program handle reuses the first 80 bytes of the configuration almost verbatim, with one deliberate field reorder: gpu_id moves to +0x28 so the validated 32-bit fields at +0x10..+0x28 form a contiguous scalar block that the compile dispatch reads via aligned 32-bit loads.

typedef struct TileirasProgram {
    /*+0x00*/ const void *input_buffer_data;     // ptr to bytecode bytes
    /*+0x08*/ size_t      input_buffer_size;     // bytes in buffer
    /*+0x10*/ int32_t     opt_level;             // 0..3
    /*+0x14*/ int32_t     device_debug;          // 0 or 1
    /*+0x18*/ int32_t     lineinfo;              // 0 or 1
    /*+0x1C*/ int32_t     host_arch;             // 0=x86_64, 1=aarch64, 2=arm64ec
    /*+0x20*/ int32_t     host_os;               // 0=linux, 1=windows
    /*+0x24*/ int32_t     sanitize;              // 0 or 1 (memcheck)
    /*+0x28*/ int32_t     gpu_id;                // 100/103/110/120/121
    /*+0x2C*/ uint32_t    pad_2C;
    /*+0x30*/ /* 16 bytes, cleared at create */
    /*+0x40*/ const char *cuda_root_ptr;         // resolved by sub_5773C0, copied from the config
    /*+0x48*/ size_t      cuda_root_len;         // ── overlapping slot ──
    /*+0x48*/ uint8_t    *output_data;           // SSO: same 16 bytes, second life
    /*+0x50*/ uint64_t    output_capacity;
    /*+0x58*/ size_t      output_length;
    /*+0x60*/ uint32_t    owning_flag;           // 1 = handle owns output_data
} TileirasProgram;

Five unaligned SSE copies from the configuration populate the block. The first copy moves input_buffer_data and input_buffer_size; the next two move the eight 32-bit option fields; the fourth clears the slot at +0x30; the fifth installs the CUDA-root SSO. owning_flag at +0x60 clears to zero before sub_57A480 returns, so the handle starts life with no output buffer to free.
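
A reimplementation can pin these offsets at compile time. The sketch below restates the layout as compilable C, using a union for the shared word at +0x48 and a byte array for the cleared slot at +0x30. It assumes an LP64 target (8-byte pointers and size_t); HandleSketch, slot_30, and slot_48 are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct HandleSketch {
    const void *input_buffer_data;   /* +0x00 */
    size_t      input_buffer_size;   /* +0x08 */
    int32_t     opt_level;           /* +0x10 */
    int32_t     device_debug;        /* +0x14 */
    int32_t     lineinfo;            /* +0x18 */
    int32_t     host_arch;           /* +0x1C */
    int32_t     host_os;             /* +0x20 */
    int32_t     sanitize;            /* +0x24 */
    int32_t     gpu_id;              /* +0x28 */
    uint32_t    pad_2C;              /* +0x2C */
    uint8_t     slot_30[16];         /* +0x30, cleared at create */
    const char *cuda_root_ptr;       /* +0x40 */
    union {                          /* +0x48, two sequential lifetimes */
        size_t   cuda_root_len;      /* before compile */
        uint8_t *output_data;        /* after compile */
    } slot_48;
    uint64_t    output_capacity;     /* +0x50 */
    size_t      output_length;       /* +0x58 */
    uint32_t    owning_flag;         /* +0x60, followed by 4 bytes of tail padding */
} HandleSketch;

/* Compile-time checks against the documented offsets (LP64 assumed). */
_Static_assert(offsetof(HandleSketch, opt_level) == 0x10, "opt_level at +0x10");
_Static_assert(offsetof(HandleSketch, gpu_id) == 0x28, "gpu_id at +0x28");
_Static_assert(offsetof(HandleSketch, slot_30) == 0x30, "cleared slot at +0x30");
_Static_assert(offsetof(HandleSketch, cuda_root_ptr) == 0x40, "cuda_root_ptr at +0x40");
_Static_assert(offsetof(HandleSketch, slot_48) == 0x48, "shared word at +0x48");
_Static_assert(offsetof(HandleSketch, output_capacity) == 0x50, "output_capacity at +0x50");
_Static_assert(offsetof(HandleSketch, output_length) == 0x58, "output_length at +0x58");
_Static_assert(offsetof(HandleSketch, owning_flag) == 0x60, "owning_flag at +0x60");
_Static_assert(sizeof(HandleSketch) == 0x68, "handle is 104 bytes");
```

If any offset drifts, the build fails before a mismatched handle can reach the compile or release paths.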

SSO Overlap at +0x48

The slot at +0x48 is the only address in the handle that hosts two different fields across the program lifetime. While the front end runs, +0x40..+0x4F carries a (cuda_root_ptr, cuda_root_len) pair pointing at the resolved CUDA install path. The compile dispatcher reads the pair once, uses it to locate the nvdisasm binary, and never touches it again. The storage from +0x48 onward is then overwritten to hold the compiled-output buffer descriptor: output_data at +0x48 (the word that previously held cuda_root_len), output_capacity at +0x50, and output_length at +0x58.

The offsets used by the consumers make this observable. sub_57A8E0 writes output_data at +72 (0x48), output_capacity at +80 (0x50), and output_length at +88 (0x58). sub_57A850 returns {ptr=+0x48, length=+0x58} to the caller. The lifetimes never overlap: by the time sub_57A8E0 installs the output bytes, the CUDA-root string has already been consumed and is no longer needed by any subsequent stage.

A reimplementation should keep this overlap as an internal storage trick and never expose it to callers. The public contract is simply that the program-output getter is invalid until compile has succeeded.

Ownership and Release

A single flag at +0x60, owning_flag, controls release behavior. It starts at 0 immediately after tileirasProgramCreate. sub_57A8E0 sets it to 1 once the compiled byte buffer has landed at +0x48..+0x58. sub_57A7C0 (tileirasProgramRelease) reads the flag before tearing down: 1 makes the release path call the buffer's deleter on output_data; 0 leaves the slot alone. Either way, the 104-byte handle itself is freed via the matching sub_4580C60 deallocator after the conditional output free.

The ownership rule means calling tileirasProgramRelease on a never-compiled handle is safe and touches no output memory. It also means the only legal way to retain compiled bytes after release is to copy them out via tileirasProgramGetOutput first.
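
The release rule can be sketched with a reduced handle. This is an illustration under assumptions, not the recovered code: MiniHandle keeps only the fields release cares about, malloc and free stand in for sub_44A8C20 and sub_4580C60, mini_install_output stands in for the tail of a successful compile, and a return value of 0 on successful release is assumed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reduced handle: only the fields the release path inspects. */
typedef struct {
    uint8_t *output_data;    /* valid only when owning_flag == 1 */
    size_t   output_length;
    uint32_t owning_flag;    /* 1 = handle owns output_data */
} MiniHandle;

/* Stand-in for create: the handle starts with no output to free. */
MiniHandle *mini_create(void) {
    MiniHandle *h = malloc(sizeof *h);
    if (h != NULL) {
        h->output_data = NULL;
        h->output_length = 0;
        h->owning_flag = 0;
    }
    return h;
}

/* Stand-in for the end of a successful compile: install a buffer, then
   set the flag. Returns 5 on failure, mirroring the compile error code. */
int mini_install_output(MiniHandle *h, size_t length) {
    uint8_t *bytes = malloc(length ? length : 1);
    if (bytes == NULL) {
        return 5;
    }
    h->output_data = bytes;
    h->output_length = length;
    h->owning_flag = 1;
    return 0;
}

/* Release: free the output only when the flag says the handle owns it,
   then free the handle itself. */
int mini_release(MiniHandle *h) {
    if (h == NULL) {
        return 4;  /* "program is null" */
    }
    if (h->owning_flag == 1) {
        free(h->output_data);
    }
    free(h);
    return 0;
}
```

The flag check matters more in the real layout than in this sketch: before compile the word at +0x48 holds cuda_root_len rather than a heap pointer, so freeing it unconditionally would be a bug.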

Public-API Surface

Four C-API entry points operate on the handle. Each one is a small wrapper that validates its arguments, walks a fixed offset path, and routes diagnostics through sub_578D40.

| Symbol | Identity | Role |
|---|---|---|
| sub_57A480 | tileirasProgramCreate | Validates (out_program, config, inputBuffer), allocates 0x68 bytes via sub_44A8C20, copies the configuration body, clears owning_flag. |
| sub_57A8E0 | tileirasProgramCompile | Reads option fields at +0x10..+0x28, runs the compile dispatcher, writes the output buffer at +0x48..+0x58, sets owning_flag at +0x60 to 1. |
| sub_57A850 | tileirasProgramGetOutput | Returns `{ptr = *(uint8_t**)(handle+0x48), length = *(size_t*)(handle+0x58)}`; emits program has not been compiled (code 4) if owning_flag is 0. |
| sub_57A7C0 | tileirasProgramRelease | If owning_flag is 1, frees output_data; then frees the 104-byte handle via sub_4580C60. |

All four entry points share the same C-API error space. The diagnostic emitter at sub_578D40 is the single sink for every public message, which is why codes overlap across functions: code 4 covers program is null, output pointer is null, and program has not been compiled across the compile, getter, and release sites, while code 5 is reserved for the compile-time failure path inside sub_57A8E0.

Lifecycle

The normal lifetime runs create → compile → get output → release, and the driver's main follows that sequence exactly. The handle is opaque at the public surface; the offset layout above is an implementation detail that reimplementers should reproduce for binary compatibility with the recovered driver but should never surface to consumers.

TileirasProgram *program = NULL;
int err = tileirasProgramCreate(&program, &config);
if (err == 0) {
    err = tileirasProgramCompile(program);
}
if (err == 0) {
    TileirasByteView out;
    err = tileirasProgramGetOutput(program, &out);
    if (err == 0) {
        write_output_file(opts.output_path, out.data, out.length);
    }
}
tileirasProgramRelease(program);

Compile is allowed to fail after a successful create. When it does, sub_57A8E0 returns code 5 (failed to compile Tile IR program), owning_flag stays 0, the output slot still holds the CUDA-root pair, and release frees only the handle. The getter rejects subsequent calls with code 4 (program has not been compiled), preserving the invariant that a returned byte view stays valid until the next mutating call on the handle.
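
The getter's gating on owning_flag can be sketched as follows. This is a minimal model, assuming the three code-4 checks run in the order null handle, null output pointer, not compiled; that order is a guess. StateSketch, ByteViewSketch, and get_output_sketch are hypothetical names, with the view's data and length fields borrowed from the lifecycle listing above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte view with the two fields used in the lifecycle listing. */
typedef struct {
    const uint8_t *data;
    size_t         length;
} ByteViewSketch;

/* Hypothetical reduced handle: just what the getter reads. */
typedef struct {
    uint8_t *output_data;    /* +0x48 in the real handle */
    size_t   output_length;  /* +0x58 */
    uint32_t owning_flag;    /* +0x60 */
} StateSketch;

/* Returns 0 and fills *out once compile has succeeded, otherwise 4. */
int get_output_sketch(const StateSketch *h, ByteViewSketch *out) {
    if (h == NULL) {
        return 4;  /* "program is null" */
    }
    if (out == NULL) {
        return 4;  /* "output pointer is null" */
    }
    if (h->owning_flag == 0) {
        return 4;  /* "program has not been compiled" */
    }
    out->data = h->output_data;
    out->length = h->output_length;
    return 0;
}
```

After a failed compile the flag is still 0, so the getter never hands out the word at +0x48 while it still holds CUDA-root data.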

Reimplementation Invariants

The handle is 104 bytes. The option fields at +0x10..+0x2F are eight contiguous 32-bit integers in the order (opt_level, device_debug, lineinfo, host_arch, host_os, sanitize, gpu_id, pad). The slot at +0x48 has two sequential lifetimes: CUDA-root storage during the front end, then the start of the output-buffer descriptor (+0x48, +0x50, +0x58) afterwards. The flag at +0x60 is the ownership flag, the only field tileirasProgramRelease inspects when deciding whether to free the output buffer. Allocation goes through sub_44A8C20(0x68) and deallocation through sub_4580C60; no intermediate reallocation occurs on the create or compile paths.

Cross-References

Driver main() Entry documents how the handle is threaded through the four-phase driver and how the configuration is built from cl::opt storage. The compile dispatcher that mutates the handle is described in the TileIR Pipeline Overview.