Host Launch ABI + ptxas Knobs

Abstract

tileiras is assembler-side. It never calls cuLaunchKernel, cuLaunchKernelEx, or cuKernelSetAttribute directly — instead it emits kernel launch metadata into IR attributes and PTX directives, and ptxas lifts that information into the cubin metadata consumed by the CUDA runtime or driver.

The host-visible launch ABI splits across three channels:

PTX directives in each kernel .entry header.
MLIR nvvm.* attributes on lowered LLVM functions.
gpu.launch_func and nv_tileaa.launch_func properties that carry dynamic launch operands through the lowering pipeline.

The ptxas --knobs-file=<path> path is separate. tileiras forwards the argument only when the environment gate is enabled; ptxas owns the file grammar and every diagnostic.

Host-side launch ABI

Because tileiras never synthesizes CUDA driver launch calls, the compiled cubin carries static launch metadata and leaves dynamic launch assembly to the consumer. Static metadata flows from nvvm.* attributes and PTX directives; dynamic metadata rides on launch-operation properties and segment-size arrays during MLIR lowering.

The split that matters:

Channel	Carrier	Purpose
Static thread shape	`nvvm.maxntid`, `nvvm.reqntid`, `.maxntid`, `.reqntid`	Communicates block shape constraints.
Static cluster shape	`nvvm.cluster_dim`, `.reqnctapercluster`, `.maxclusterrank`	Communicates SM90+ cluster launch constraints.
Static register budget	`nvvm.maxnreg`, `.maxnreg`	Communicates register budget to ptxas.
Static CTA residency hint	`nvvm.minctasm`, `.minnctapersm`	Communicates minimum CTAs per SM.
Dynamic operands	`operandSegmentSizes`	Preserves launch operand partitioning through lowering.
Dynamic shared memory	launch operand segment	Eventually drives `%dynamic_smem_size` in PTX/SASS.

Cluster directives are gated to SM90 and newer. On older targets the compiler suppresses .blocksareclusters, .explicitcluster, .reqnctapercluster, and .maxclusterrank even when cluster-shaped metadata is present upstream.

gpu.launch_func carries kernelFunc, kernelModule, and operandSegmentSizes. The setter also accepts the older operand_segment_sizes spelling for compatibility with MLIR v17-era IR. By the nv_tileaa.launch_func stage the kernel reference flattens into a single kernel property alongside the same operand segment sizing.
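The role of operandSegmentSizes can be illustrated with a minimal sketch. It assumes a flat operand list and a segment-size array whose entries sum to the operand count. The `Operand` alias and the `split_operands` helper are hypothetical names used only for illustration, not part of tileiras or MLIR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in: plain integers play the role of SSA operand values.
using Operand = int;

// Split a flat operand list into groups whose lengths come from
// segment_sizes. Returns nullopt when the sizes do not cover the operand
// list exactly, which is the consistency a segment-size array must keep.
std::optional<std::vector<std::vector<Operand>>>
split_operands(const std::vector<Operand> &operands,
               const std::vector<int32_t> &segment_sizes) {
    std::vector<std::vector<Operand>> groups;
    std::size_t cursor = 0;
    for (int32_t size : segment_sizes) {
        if (size < 0)
            return std::nullopt;
        const std::size_t count = static_cast<std::size_t>(size);
        if (cursor + count > operands.size())
            return std::nullopt;
        const auto first = operands.begin() + static_cast<std::ptrdiff_t>(cursor);
        groups.emplace_back(first, first + static_cast<std::ptrdiff_t>(count));
        cursor += count;
    }
    if (cursor != operands.size())
        return std::nullopt;  // leftover operands belong to no segment
    return groups;
}
```

A zero-length segment is legal in this model and stands for an optional operand group that is absent, which is why the array has to survive every lowering step unchanged in length.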

nvvm.* Annotations and PTX Directives

The nvvm.* attribute family is the canonical in-IR carrier of launch metadata. Legacy !nvvm.annotations tuples still parse and can be transplanted into attribute form; an internal marker prevents repeated legacy scans after the transplant.

The verifier enforces the shape rules that matter:

Dimensional attributes contain one to three i32 values, except cluster dimensions, which require three values.
Scalar resource attributes are integer attributes.
nvvm.blocksareclusters requires both nvvm.reqntid and nvvm.cluster_dim on the same function.
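
The three rules above can be sketched as a small checker. This is a hedged illustration rather than the actual verifier: `FuncAttrs` and `verify_launch_attrs` are hypothetical names, and the attribute dictionary is modeled with standard containers instead of MLIR attribute classes.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of the launch attributes attached to one function.
struct FuncAttrs {
    std::map<std::string, std::vector<int32_t>> dims;  // dimensional attributes
    std::map<std::string, int64_t> ints;               // scalar resource attributes
    std::set<std::string> units;                       // UnitAttr markers
};

// Returns a message for the first violated rule, or nullopt when all pass.
// The scalar rule is enforced by the type of `ints` in this model.
std::optional<std::string> verify_launch_attrs(const FuncAttrs &fn) {
    for (const auto &[name, values] : fn.dims) {
        if (name == "nvvm.cluster_dim") {
            if (values.size() != 3)
                return name + " requires exactly 3 values";
        } else if (values.empty() || values.size() > 3) {
            return name + " requires 1 to 3 values";
        }
    }
    const bool has_marker = fn.units.count("nvvm.blocksareclusters") > 0;
    const bool has_deps = fn.dims.count("nvvm.reqntid") > 0 &&
                          fn.dims.count("nvvm.cluster_dim") > 0;
    if (has_marker && !has_deps)
        return std::string("nvvm.blocksareclusters requires nvvm.reqntid "
                           "and nvvm.cluster_dim");
    return std::nullopt;
}
```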

Kind	Attribute name	Shape	PTX projection	Target gate
kernel	`nvvm.kernel`	UnitAttr	`.entry` instead of `.func`	all SMs
maxntid	`nvvm.maxntid`	1..3 `i32` values	`.maxntid`	all SMs
reqntid	`nvvm.reqntid`	1..3 `i32` values	`.reqntid`	all SMs
cluster_dim	`nvvm.cluster_dim`	exactly 3 `i32` values	`.explicitcluster`, `.reqnctapercluster`	SM90+
minctasm	`nvvm.minctasm`	integer	`.minnctapersm`	all SMs
maxnreg	`nvvm.maxnreg`	integer	`.maxnreg`	all SMs
maxclusterrank	`nvvm.maxclusterrank`	integer	`.maxclusterrank`	SM90+
blocksareclusters	`nvvm.blocksareclusters`	UnitAttr	`.blocksareclusters`	SM90+
grid_constant	`nvvm.grid_constant`	1-based argument index list	Drives constant-argument layout	all SMs
annotations_transplanted	`nvvm.annotations_transplanted`	UnitAttr	Internal marker only	all SMs

Several invariants matter to reimplementers. nvvm.maxclusterrank is stored as an integer-valued function attribute, unlike the string-shaped legacy forms used by some older launch metadata. local_maxnreg has no new nvvm.* mirror; it stays legacy-only and is never printed as a PTX directive by this stage. When updating dimensional attributes, write every axis back together so the new attribute form stays coherent even when the legacy source used split per-axis tuples.
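
A minimal sketch of the transplant for one dimensional attribute follows. It assumes the legacy source uses the per-axis keys maxntidx, maxntidy, and maxntidz, and that a missing axis defaults to 1. That default is an assumption of this sketch, not a documented tileiras rule. `Func` and `transplant_maxntid` are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a function carrying legacy tuples and new attributes.
struct Func {
    std::vector<std::pair<std::string, int32_t>> legacy;  // (key, value) tuples
    std::map<std::string, std::array<int32_t, 3>> dims;   // nvvm.* dimensional attrs
    std::set<std::string> units;                          // UnitAttr markers
};

void transplant_maxntid(Func &fn) {
    // The marker prevents a second legacy scan after the first transplant.
    if (fn.units.count("nvvm.annotations_transplanted"))
        return;

    std::array<int32_t, 3> shape{1, 1, 1};  // assumption: absent axes become 1
    bool seen = false;
    for (const auto &[key, value] : fn.legacy) {
        if (key == "maxntidx")      { shape[0] = value; seen = true; }
        else if (key == "maxntidy") { shape[1] = value; seen = true; }
        else if (key == "maxntidz") { shape[2] = value; seen = true; }
    }
    // All three axes are written back in one step so the attribute never
    // holds a partially updated shape.
    if (seen)
        fn.dims["nvvm.maxntid"] = shape;
    fn.units.insert("nvvm.annotations_transplanted");
}
```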

PTX emission walks the verified attribute set in a fixed order so that related directives stay adjacent in the kernel header. The thread-shape group (.maxntid, .reqntid) emits first, followed by the residency hints (.minnctapersm, .maxnreg), and finally the cluster group (.blocksareclusters, .explicitcluster, .reqnctapercluster, .maxclusterrank) when the target supports clusters. Both .maxntid and .reqntid may appear on the same kernel — the PTX semantics make them complementary: .reqntid declares an exact block shape the kernel relies on, .maxntid declares an upper bound for register-pressure budgeting. The verifier checks shape consistency but does not collapse or override either directive, and the emitter prints both as written when both are set.

// Pseudocode: get_dim_attr, get_int_attr, target_supports_clusters, and
// PTXWriter are illustrative helpers, not a published API.
void emit_launch_directives(LLVMFuncOp fn, Target target, PTXWriter &out) {
    // Thread-shape group: both directives print as written when both are set.
    if (auto dims = get_dim_attr(fn, "nvvm.maxntid"))
        out.directive(".maxntid", *dims);
    if (auto dims = get_dim_attr(fn, "nvvm.reqntid"))
        out.directive(".reqntid", *dims);

    // Residency and register-budget hints.
    if (auto n = get_int_attr(fn, "nvvm.minctasm"))
        out.directive(".minnctapersm", *n);
    if (auto n = get_int_attr(fn, "nvvm.maxnreg"))
        out.directive(".maxnreg", *n);

    if (!target_supports_clusters(target))
        return;                              // suppress all cluster directives pre-SM90

    // Cluster group, SM90+ only.
    if (fn->hasAttr("nvvm.blocksareclusters"))
        out.directive(".blocksareclusters");
    if (auto dims = get_dim_attr(fn, "nvvm.cluster_dim")) {
        out.directive(".explicitcluster");   // implied by nvvm.cluster_dim
        out.directive(".reqnctapercluster", *dims);
    }
    if (auto n = get_int_attr(fn, "nvvm.maxclusterrank"))
        out.directive(".maxclusterrank", *n);
}

Two structural invariants keep this routine simple. nvvm.blocksareclusters is verified to require both nvvm.reqntid and nvvm.cluster_dim on the same function, so by the time emission runs, .blocksareclusters, .reqntid, and .reqnctapercluster are guaranteed to form a coherent triple. Cluster directives are suppressed wholesale on pre-SM90 targets; the verifier permits the attributes upstream so a single IR module can lower for multiple targets, but the per-target emitter refuses to print them when ptxas would reject the result.

How tileiras chooses each directive

Verifying an attribute is well-formed is not the same as choosing its value. The well-formedness rules above guard against malformed PTX; the choice of value is what determines whether the kernel runs at all and how fast it runs when it does. Each directive has its own input channel — the kernel-spec attribute the upstream lowering attaches, a user-supplied annotation that survives the front-end, or a constraint imposed by an instruction the compiler emitted later. The table below walks the policy for each directive.

Directive	Primary input	Policy
`.entry kernel_name`	`nvvm.kernel` marker on the LLVM function	always emit when the marker is present; the function name is the symbol the cubin exposes
`.maxntid X, Y, Z`	upper-bound hint from kernel-spec or DSL annotation	emitted when the bound is not also a hard contract; lets ptxas size the register fragment without pinning the launch shape
`.reqntid X, Y, Z`	hard contract from kernel-spec — warp-specialized split, warp-group requirement, or named-warp partition	emitted when the lowering depends on an exact thread count (every WGMMA or tcgen05 user, every kernel with named producer/consumer warps)
`.minnctapersm N`	occupancy floor from kernel-spec	emitted when the user requested a minimum residency, usually for kernels whose throughput is sensitive to warp-scheduler latency hiding
`.maxnreg N`	per-thread register budget from kernel-spec	emitted to let ptxas trade registers for occupancy — typical values come from a kernel-specific computation of `accumulator_regs + working_regs + slack`
`.explicitcluster`	implied by `nvvm.cluster_dim` presence	always emitted with `.reqnctapercluster` when the kernel is cluster-shaped on SM90+
`.reqnctapercluster X, Y, Z`	cluster shape from kernel-spec	emitted on SM90+ when `nvvm.cluster_dim` is present; suppressed wholesale on older targets
`.maxclusterrank N`	portability cap from kernel-spec	emitted on SM90+ when the user wants a portable launch shape, capping cluster size below the device-specific maximum
`.blocksareclusters`	only legal when `nvvm.reqntid` and `nvvm.cluster_dim` are also present	emitted on SM90+ for kernels that opt into the single-CTA-cluster convention; lets cluster-aware code paths execute on a degenerate cluster shape

The .maxntid versus .reqntid distinction is the policy decision that affects the most kernels. .maxntid is an upper bound the launch must respect but does not have to saturate; the driver accepts any launch shape whose X, Y, and Z components are no larger than the declared maxima. .reqntid is a hard contract: the driver rejects any launch whose block shape does not match the declared values exactly. tileiras emits .reqntid whenever the lowering has already baked in a specific thread count: any kernel that emits WGMMA needs its CTA built from whole 128-thread warp groups (four warps form one warp group), any kernel with warp-specialized producer/consumer splits needs the exact named warp count, and any kernel with named-warp NamedBarrier slots needs the exact thread count the slot binding assumed. For kernels that adapt to launch shape (elementwise kernels, kernels that use only synchronous mma.sync forms, kernels with no warp specialization), tileiras emits only .maxntid so the same cubin works for a range of launch shapes.
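
The policy and the two acceptance rules can be sketched as follows. `KernelTraits`, `choose_thread_shape_directive`, and `launch_is_accepted` are hypothetical names. The real decision is driven by the kernel-spec, and this sketch only models the outcome described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class ThreadShapeDirective { MaxNTid, ReqNTid };

// Hypothetical summary of what the lowering has baked into the kernel.
struct KernelTraits {
    bool uses_wgmma = false;
    bool uses_tcgen05 = false;
    bool warp_specialized = false;
    bool uses_named_barriers = false;
};

// Any feature that fixes the thread count turns the bound into a contract.
ThreadShapeDirective choose_thread_shape_directive(const KernelTraits &k) {
    const bool exact_count_needed = k.uses_wgmma || k.uses_tcgen05 ||
                                    k.warp_specialized || k.uses_named_barriers;
    return exact_count_needed ? ThreadShapeDirective::ReqNTid
                              : ThreadShapeDirective::MaxNTid;
}

using Dim3 = std::array<uint32_t, 3>;

// .maxntid: every launch axis must stay at or below the declared bound.
// .reqntid: every launch axis must equal the declared value.
bool launch_is_accepted(ThreadShapeDirective directive, const Dim3 &declared,
                        const Dim3 &launch) {
    for (std::size_t axis = 0; axis < 3; ++axis) {
        if (directive == ThreadShapeDirective::ReqNTid) {
            if (launch[axis] != declared[axis])
                return false;
        } else if (launch[axis] > declared[axis]) {
            return false;
        }
    }
    return true;
}
```

Under this model, a kernel declared with a bound of 256, 1, 1 accepts a 128, 1, 1 launch when the directive is .maxntid and rejects it when the directive is .reqntid.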

The .maxnreg choice is similarly central to performance. A WGMMA-using kernel must leave room for the accumulator fragment: an m64n256k16 WGMMA with an FP32 accumulator holds 64 × 256 = 16384 values spread across the 128 threads of a warp group, so it needs 128 FP32 registers per thread just for the accumulator, plus the working set for descriptors, loop indices, and any other live values. Setting .maxnreg too low forces ptxas to spill the accumulator to local memory, which can silently regress throughput by an order of magnitude. Setting it too high reduces occupancy and hurts latency hiding. The kernel-spec carries the result of a per-kernel computation that balances both, usually accumulator_regs + working_regs + slack, with slack calibrated to the SM's register file size and the desired CTAs per SM.
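
The budget arithmetic can be sketched as below. The sketch assumes the accumulator tile is spread evenly over the 128 threads of one warp group, a 255-register per-thread cap, and a 65536-entry register file per SM; it ignores register allocation granularity and every non-register occupancy limit. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint32_t kWarpGroupThreads = 128;  // four warps of 32 threads
constexpr uint32_t kMaxRegsPerThread = 255;  // assumed per-thread cap

// 32-bit registers per thread needed to hold an m x n accumulator tile.
constexpr uint32_t accumulator_regs(uint32_t m, uint32_t n,
                                    uint32_t bytes_per_element = 4) {
    return (m * n * bytes_per_element) / (kWarpGroupThreads * 4);
}

// accumulator_regs + working_regs + slack, clamped to the per-thread cap.
constexpr uint32_t choose_maxnreg(uint32_t accumulator, uint32_t working,
                                  uint32_t slack) {
    return std::min(accumulator + working + slack, kMaxRegsPerThread);
}

// Upper bound on resident CTAs per SM when only registers are limiting.
constexpr uint32_t ctas_per_sm_by_registers(uint32_t maxnreg,
                                            uint32_t threads_per_cta,
                                            uint32_t register_file = 65536) {
    return register_file / (maxnreg * threads_per_cta);
}
```

For an m64n128 FP32 tile, `accumulator_regs(64, 128)` is 64. With 24 working registers and 8 of slack the budget is 96, and a 128-thread CTA at that budget allows at most 5 resident CTAs per SM in this simplified model.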

Cluster-shape directives are an all-or-nothing group. When the kernel-spec carries nvvm.cluster_dim, the lowering emits .explicitcluster and .reqnctapercluster, plus .maxclusterrank and .blocksareclusters when the corresponding nvvm.maxclusterrank and nvvm.blocksareclusters attributes are set; when the spec is silent, no cluster directive is emitted. The verifier rule that nvvm.blocksareclusters requires both nvvm.reqntid and nvvm.cluster_dim means the .blocksareclusters, .reqntid, and .reqnctapercluster triple is always coherent by the time the emitter sees it.

GPU Execution Model is the canonical reference for how the five tiers (thread, warp, CTA, cluster, grid) consume these directives at runtime, with a worked example that traces the directive emission from kernel-spec to PTX header for a Hopper GEMM.

ptxas Knobs File Format

When both MLIR_ENABLE_EVO and PTX_KNOBS_PATH are set, tileiras forwards --knobs-file=<path> to ptxas. It does not parse or validate the file — the grammar belongs to ptxas.
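
A minimal sketch of the forwarding gate follows. It assumes that "set" means the variable is present in the environment, whatever its value. The `EnvLookup` alias and both function names are hypothetical, and the lookup is injected so the logic can be exercised without touching the real process environment.

```cpp
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>
#include <vector>

using EnvLookup =
    std::function<std::optional<std::string>(const std::string &)>;

// Adapter over the real process environment.
std::optional<std::string> process_env(const std::string &name) {
    const char *value = std::getenv(name.c_str());
    if (value == nullptr)
        return std::nullopt;
    return std::string(value);
}

// Appends --knobs-file=<path> only when both variables are present.
// The path is forwarded verbatim: no existence check, no parsing.
void maybe_forward_knobs_file(const EnvLookup &env,
                              std::vector<std::string> &ptxas_args) {
    const auto gate = env("MLIR_ENABLE_EVO");
    const auto path = env("PTX_KNOBS_PATH");
    if (gate && path)
        ptxas_args.push_back("--knobs-file=" + *path);
}
```

In a real driver the call would be `maybe_forward_knobs_file(process_env, args)`.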

The file format is:

arbitrary preamble
[knobs]
command command command

The [knobs] sentinel is case-sensitive; text before it is ignored. After the sentinel, whitespace, ~, and ;; separate commands. The command stream has no quoting, no escaping, no comment syntax.
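
The separator rules can be sketched as a small splitter. This is an illustration of the grammar as described here, not the ptxas parser: it does not model WHEN clauses, INJECTSTRING bodies, or diagnostics, and `split_knob_commands` is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns the raw command tokens that follow the [knobs] sentinel.
// An absent sentinel yields an empty list in this sketch.
std::vector<std::string> split_knob_commands(const std::string &file_text) {
    const std::string sentinel = "[knobs]";
    std::vector<std::string> commands;
    std::size_t pos = file_text.find(sentinel);  // case-sensitive search
    if (pos == std::string::npos)
        return commands;
    pos += sentinel.size();  // everything before this point is ignored

    std::string current;
    auto flush = [&] {
        if (!current.empty()) {
            commands.push_back(current);
            current.clear();
        }
    };
    while (pos < file_text.size()) {
        const char c = file_text[pos];
        const bool double_semicolon = c == ';' && pos + 1 < file_text.size() &&
                                      file_text[pos + 1] == ';';
        if (double_semicolon) {
            flush();
            pos += 2;
        } else if (c == '~' || std::isspace(static_cast<unsigned char>(c))) {
            flush();
            ++pos;
        } else {
            current.push_back(c);
            ++pos;
        }
    }
    flush();
    return commands;
}
```

With placeholder knob names, the text `[knobs] Foo=1~Bar=2;;Baz=3` splits into the three commands `Foo=1`, `Bar=2`, and `Baz=3`, while a file that spells the sentinel `[KNOBS]` yields nothing.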

Commands have three forms:

Form	Meaning
`identifier=value`	Assign a knob value. The `=` is accepted but not always required by ptxas.
`WHEN ...`	Parse a conditional knob clause.
`INJECTSTRING ... ;;`	Parse an internal SASS-splice string terminated by `;;`.

Values parse per the knob descriptor type. The recovered parser accepts signed and unsigned integers, integer ranges, integer lists, 32-bit and 64-bit floats, strings, pointers, opcode lists, opcode-pair lists, and WHEN clauses. Integer parsing is decimal — a string like 0x10 parses as zero, with the trailing text ignored by the numeric conversion path.
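
The decimal-only behavior matches what a base-10 `std::strtol` call does, which makes it easy to reproduce. The sketch below assumes a strtol-like conversion; whether ptxas uses that exact routine is not established here, and `parse_decimal_knob` is a hypothetical name.

```cpp
#include <cstdlib>

// Base-10 conversion: parsing stops at the first character that is not a
// decimal digit (after an optional sign), and the rest is ignored.
long parse_decimal_knob(const char *text) {
    return std::strtol(text, nullptr, 10);
}
```

Under this model `parse_decimal_knob("0x10")` yields 0 because conversion stops at the `x`, so hexadecimal values have to be written in decimal (16) to take effect.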

Malformed knob files are fatal to the ptxas child process. Duplicate assignments follow a last-wins policy: the later command overwrites the earlier runtime value. Identifier matching is case-insensitive from the user's point of view.
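
The last-wins and case-insensitive rules can be modeled with a map keyed on a case-folded identifier. This is a sketch of the observable behavior only; `fold_case` and `apply_assignments` are hypothetical names and the knob names in the usage note are placeholders.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Lowercase an identifier so lookups ignore the user's capitalization.
std::string fold_case(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return text;
}

// Applies assignments in file order; a later entry overwrites an earlier one.
std::map<std::string, std::string> apply_assignments(
    const std::vector<std::pair<std::string, std::string>> &commands) {
    std::map<std::string, std::string> knobs;
    for (const auto &[name, value] : commands)
        knobs[fold_case(name)] = value;
    return knobs;
}
```

Applying `Foo=1` followed by `FOO=2` leaves a single entry with the value 2.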

tileiras runs no preflight check that the path exists, contains [knobs], or uses valid identifiers. Every knob-file diagnostic comes from ptxas and surfaces through the normal subprocess diagnostic buffer.

Driver Overview covers how the produced kernel directives travel into the relocatable object; Driver CLI Options catalogues the user-visible flags that map into pipeline options; ptxas Handoff Protocol documents the ptxas-side knob-file grammar in detail; Attribute System and Lowering documents the full lifecycle of each launch-shape attribute from frontend hint through nvvm.* directive carrier, including which transitions silently drop the fact and produce a degraded kernel.
