NVPTX Subtarget and Feature Matrix
Abstract
Two stock LLVM subtarget tables identify an SM target: one lists every
accepted CPU string, one lists every individual feature bit. Each CPU row
carries a feature mask describing what that CPU implies. The runtime
NVPTXSubtarget copies the selected CPU mask, ORs in explicit -mattr
features, and answers hasFeature(idx) queries from the final bitset.
The reimplementation contract is four-fold: keep the 40 CPU strings sorted
lexicographically so std::lower_bound works; keep the 81 feature indices
stable so bit positions do not drift; use one generic scheduling model for
every CPU; and expose the tmem feature as the gate for tensor-memory and
tcgen05 paths.
The Two TableGen Tables
NVPTXSubtarget is built from three arrays: the CPU table, the feature table, and
a parallel CPU-name StringRef array used for early validation of -mcpu.
struct SubtargetSubTypeKV {
const char *cpu_key;
uint64_t implies[6];
uint64_t tune_implies[6];
const MCSchedModel *sched_model;
};
struct SubtargetFeatureKV {
const char *key;
const char *description;
uint64_t value;
uint64_t implies[6];
};
Both tables are sorted: CPU rows by ASCII-lexicographic compare of the CPUKey string, feature rows by Key. That makes std::lower_bound against either array the canonical lookup path at runtime. Lexicographic CPU order produces one initially confusing artifact: "sm_100" < "sm_100a" < "sm_100f" < "sm_101" < ... < "sm_121f" < "sm_20" < "sm_21" < ... < "sm_90" < "sm_90a". The Blackwell sm_1NN family appears before the legacy sm_2N/sm_3N/.../sm_9N family for the simple reason that '1' < '2' in ASCII; rows are sorted by string, not by silicon generation.
The third array mirrors the CPU table as (pointer, length) pairs. Its only
job is early -mcpu validation before a full subtarget object exists.
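A minimal sketch of that early validation, assuming a hypothetical `kCpuNames` excerpt (eight of the 40 names) and a hypothetical helper `find_cpu_row`. It illustrates why std::lower_bound needs an explicit equality check: on a miss it returns an insertion point, not a failure value. Row indices in this excerpt do not match the real table.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string_view>

// Hypothetical excerpt of the CPU-name array; the real table has 40 entries.
// Entries must stay in ASCII-lexicographic order for std::lower_bound.
constexpr std::array<std::string_view, 8> kCpuNames = {
    "sm_100", "sm_100a", "sm_100f", "sm_121f",
    "sm_20",  "sm_52",   "sm_90",   "sm_90a",
};

// Returns the index of an exact match in the excerpt, or -1 when unknown.
int find_cpu_row(std::string_view cpu) {
    auto it = std::lower_bound(kCpuNames.begin(), kCpuNames.end(), cpu);
    if (it == kCpuNames.end() || *it != cpu) {
        return -1;  // lower_bound found only an insertion point
    }
    return static_cast<int>(it - kCpuNames.begin());
}
```

With this excerpt, `find_cpu_row("sm_100")` lands on the first entry even though sm_100 is the newest silicon, and `find_cpu_row("sm_87")` reports a miss because the neighbouring entry is "sm_90".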
The 81 Feature Indices
The full feature index table follows. Indices are stable across builds: a TableGen renumber would change bit positions and break every cubin produced against this drop. Each row's Implies bitset is zero, so apart from explicit -mattr flags the only way a CPU acquires a feature bit is through the SubTypeKV row's Implies mask.
idx Feature Description
0 fma-level=0 FP fused-multiply-add fusion disabled
1 fma-level=1 FMA fusion for FP32 only
2 fma-level=2 FMA fusion everywhere (cicc default)
3 ptx32 Use PTX version 32
4 ptx40 Use PTX version 40
5 ptx41 Use PTX version 41
6 ptx42 Use PTX version 42
7 ptx43 Use PTX version 43
8 ptx50 Use PTX version 50
9 ptx60 Use PTX version 60
10 ptx61 Use PTX version 61
11 ptx62 Use PTX version 62
12 ptx63 Use PTX version 63
13 ptx64 Use PTX version 64
14 ptx65 Use PTX version 65
15 ptx70 Use PTX version 70
16 ptx71 Use PTX version 71
17 ptx72 Use PTX version 72
18 ptx73 Use PTX version 73
19 ptx74 Use PTX version 74
20 ptx75 Use PTX version 75
21 ptx76 Use PTX version 76
22 ptx77 Use PTX version 77
23 ptx78 Use PTX version 78
24 ptx80 Use PTX version 80
25 ptx81 Use PTX version 81
26 ptx82 Use PTX version 82
27 ptx83 Use PTX version 83
28 ptx84 Use PTX version 84
29 ptx85 Use PTX version 85
30 ptx86 Use PTX version 86
31 ptx87 Use PTX version 87
32 ptx88 Use PTX version 88
33 prec-divf32=0 See definition in NVPTXISelLowering.cpp
34 prec-divf32=1 See definition in NVPTXISelLowering.cpp
35 prec-divf32=2 See definition in NVPTXISelLowering.cpp
36 prec-divf32=3 See definition in NVPTXISelLowering.cpp
37 prec-sqrtf32=0 See definition in NVPTXISelLowering.cpp
38 prec-sqrtf32=1 See definition in NVPTXISelLowering.cpp
39 sm_20 Target SM 20
40 sm_21 Target SM 21
41 sm_30 Target SM 30
42 sm_32 Target SM 32
43 sm_35 Target SM 35
44 sm_37 Target SM 37
45 sm_50 Target SM 50
46 sm_52 Target SM 52
47 sm_53 Target SM 53
48 sm_60 Target SM 60
49 sm_61 Target SM 61
50 sm_62 Target SM 62
51 sm_70 Target SM 70
52 sm_72 Target SM 72
53 sm_73 Target SM 73
54 sm_75 Target SM 75
55 sm_80 Target SM 80
56 sm_82 Target SM 82
57 sm_86 Target SM 86
58 sm_89 Target SM 89
59 sm_90 Target SM 90
60 sm_90a Accelerated Target SM 90
61 sm_100 Target SM 100
62 sm_100a Accelerated Target SM 100
63 sm_100f Family Conditional Target SM 100
64 sm_101 Target SM 101
65 sm_101a Accelerated Target SM 101
66 sm_101f Family Conditional Target SM 101
67 sm_103 Target SM 103
68 sm_103a Accelerated Target SM 103
69 sm_103f Family Conditional Target SM 103
70 sm_110 Target SM 110
71 sm_110a Accelerated Target SM 110
72 sm_110f Family Conditional Target SM 110
73 sm_120 Target SM 120
74 sm_120a Accelerated Target SM 120
75 sm_120f Family Conditional Target SM 120
76 sm_121 Target SM 121
77 sm_121a Accelerated Target SM 121
78 sm_121f Family Conditional Target SM 121
79 sharedmem32bitptr Use 32 bit ptrs for Shared Memory
80 tmem Has support for Tensor Memory
Indices 0..38 cluster the orthogonal compiler-knob features: three FMA-fusion levels, thirty PTX-version selectors, four FP32-division precision settings, and two FP32-sqrt precision settings. The driver layer (cicc / nvcc) sets these through -mattr=+ptxNN and -mattr=+fma-level=N flags alongside -mcpu=sm_NN; tileiras itself never propagates a PTX-version bit from any CPU row. Indices 39..78 cover the 40 SM-target feature bits, one per CPU row, in ascending numeric SM order (sm_20 first, sm_121f last) rather than the lexicographic order of the CPU table. Index 79 is the Fermi-legacy sharedmem32bitptr toggle. Index 80 is the only NVIDIA-proprietary feature in the entire build: tmem, "Has support for Tensor Memory", absent from upstream LLVM 18.1.4 / 19 NVPTX, and the cross-feature implication that distinguishes datacenter Blackwell from consumer Blackwell.
The PTX-version selector range stops at ptx88: two versions (ptx87, ptx88) past upstream LLVM 19 (capped at ptx86) and six past LLVM 18.1.4 (capped at ptx82). ptx88 aligns with the CUDA 13.1 toolchain vintage that produced this binary.
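The index ranges above can be captured in a small classifier. This is a sketch only: the `FeatureGroup` enum and `feature_group` function are hypothetical names introduced here for illustration and are not part of the binary.

```cpp
#include <cassert>

// Hypothetical group labels for the 81 feature indices in the table.
enum class FeatureGroup {
    FmaLevel, PtxVersion, PrecDivF32, PrecSqrtF32,
    SmTarget, SharedMem32BitPtr, TMem, Invalid
};

// Maps a feature index to its group using the ranges from the index table.
constexpr FeatureGroup feature_group(unsigned idx) {
    if (idx <= 2)  return FeatureGroup::FmaLevel;      // 0..2
    if (idx <= 32) return FeatureGroup::PtxVersion;    // 3..32 (ptx32..ptx88)
    if (idx <= 36) return FeatureGroup::PrecDivF32;    // 33..36
    if (idx <= 38) return FeatureGroup::PrecSqrtF32;   // 37..38
    if (idx <= 78) return FeatureGroup::SmTarget;      // 39..78
    if (idx == 79) return FeatureGroup::SharedMem32BitPtr;
    if (idx == 80) return FeatureGroup::TMem;
    return FeatureGroup::Invalid;                      // outside the 81 indices
}

// The compiler-knob block covers 39 indices: 3 + 30 + 4 + 2.
static_assert(3 + 30 + 4 + 2 == 39, "indices 0..38");
```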
The 40 CPU Rows
The 40 CPU rows appear below grouped by silicon generation for readability;
the Row column gives each CPU's index in the lexicographically sorted table.
Each entry lists the row index, the feature bit for the CPU itself, the known
ELF target byte when the cubin reader recognizes one, whether the CPU implies
tmem, the variant suffix, and the product family.
Row CPU FeatKV ELF byte TMem Variant Family
--- -------- ------ ---------- ---- -------- -------------------------------------------
18 sm_20 39 0x14 no base Fermi GF1xx
19 sm_21 40 0x15 no base Fermi GF11x
20 sm_30 41 0x1E no base Kepler GK10x
21 sm_32 42 0x20 no base Kepler (Tegra K1 / Logan)
22 sm_35 43 0x23 no base Kepler GK110 / GK11x
23 sm_37 44 0x25 no base Kepler GK210
24 sm_50 45 0x32 no base Maxwell GM10x
25 sm_52 46 0x34 no base Maxwell GM20x -- DEFAULT CPU
26 sm_53 47 0x35 no base Maxwell (Tegra X1 / Erista)
27 sm_60 48 0x3C no base Pascal GP100 (datacenter)
28 sm_61 49 0x3D no base Pascal GP10x (consumer)
29 sm_62 50 0x3E no base Pascal Tegra X2 / Parker / Drive-PX2
30 sm_70 51 0x46 no base Volta GV100
31 sm_72 52 0x48 no base Volta (Xavier)
32 sm_73 53 (gap) no base placeholder; no known HW product
33 sm_75 54 0x4B no base Turing TU10x
34 sm_80 55 0x50 no base Ampere GA100 (datacenter)
35 sm_82 56 (gap) no base placeholder; no known HW product
36 sm_86 57 0x56 no base Ampere GA10x (consumer)
37 sm_89 58 0x59 no base Ada Lovelace AD10x
38 sm_90 59 0x5A no base Hopper GH100
39 sm_90a 60 0x5A+0x800 no a Hopper GH100 + WGMMA/TMA arch-cond
0 sm_100 61 0x64 no base Blackwell datacenter GB100/GB200/B100/B200
1 sm_100a 62 (gap) YES a Blackwell datacenter + tcgen05 arch-cond
2 sm_100f 63 (gap) YES f Blackwell datacenter + tcgen05 family-cond
3 sm_101 64 (gap) no base Blackwell datacenter (reserved variant)
4 sm_101a 65 (gap) YES a Blackwell datacenter + tcgen05 arch-cond
5 sm_101f 66 (gap) YES f Blackwell datacenter + tcgen05 family-cond
6 sm_103 67 (gap) no base Blackwell Ultra GB300 (datacenter)
7 sm_103a 68 (gap) YES a Blackwell Ultra GB300 + tcgen05 arch-cond
8 sm_103f 69 (gap) YES f Blackwell Ultra GB300 + tcgen05 family-cond
9 sm_110 70 (gap) no base Jetson Thor (embedded Blackwell-class)
10 sm_110a 71 (gap) YES a Jetson Thor + tcgen05 arch-cond
11 sm_110f 72 (gap) YES f Jetson Thor + tcgen05 family-cond
12 sm_120 73 0x78 no base Blackwell consumer RTX 50** / Pro enterprise
13 sm_120a 74 (gap) NO a Consumer RTX 50** arch-cond (no tcgen05)
14 sm_120f 75 (gap) NO f Consumer RTX 50** family-cond (no tcgen05)
15 sm_121 76 (gap) no base DGX Spark / B40-class
16 sm_121a 77 (gap) NO a DGX Spark arch-cond (no tcgen05)
17 sm_121f 78 (gap) NO f DGX Spark family-cond (no tcgen05)
Two architecturally important findings live in this table.
The first is that exactly eight CPUs imply tmem, and they are exactly the datacenter Blackwell and Jetson Thor arch-conditional and family-conditional variants: sm_100a, sm_100f, sm_101a, sm_101f, sm_103a, sm_103f, sm_110a, sm_110f. Their Implies bitsets each carry two bits, the self-bit plus bit 80, while every other CPU row has only its single self-bit. In this table, Tensor Memory and the tcgen05.mma instruction family it gates are confined to the datacenter Blackwell and Jetson Thor parts. The base CPUs sm_100 / sm_101 / sm_103 / sm_110 deliberately omit the bit so that plain .target sm_100 PTX cannot reach tcgen05; the programmer must opt into .target sm_100a or .target sm_100f.
The second is that consumer Blackwell (sm_120 and variants) and DGX Spark (sm_121 and variants) never imply tmem, even in their arch-conditional or family-conditional forms. This is not build drift — sm_121a is alphabetically reachable through std::lower_bound, so a missing bit is a deliberate choice. Physical silicon for consumer Blackwell and Spark lacks Tensor Memory; consumer Blackwell gets mma.sync.aligned.*.block_scale as a weaker substitute, dispatched through a separate two-opcode MachineInstr path (5468 dense, 5469 sparse) rather than through TMEM-resident tcgen05 atoms.
Hopper's sm_90a is the only pre-Blackwell arch-conditional CPU, and like sm_120a and sm_121a it does not imply tmem.
Tensor Memory was introduced with Blackwell; Hopper's arch-conditional surface
covers WGMMA, TMA, and setmaxnreg instead. The plain sm_100 row also lacks
tmem, so programmers must opt into sm_100a or sm_100f to reach tensor
memory.
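The implication rule can be sketched as code. The real tables store each mask as data; the name-based predicate `cpu_implies_tmem` and the builder `make_implies_mask` below are hypothetical helpers that merely reproduce the eight tmem rows for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string_view>

using FeatureMask = std::array<std::uint64_t, 6>;  // six 64-bit words, as in the tables

constexpr unsigned kTMemBit = 80;

// Hypothetical predicate reproducing the eight tmem rows:
// sm_100, sm_101, sm_103 and sm_110 with an 'a' or 'f' suffix.
bool cpu_implies_tmem(std::string_view cpu) {
    if (cpu.size() != 7) return false;  // "sm_NNN" plus one suffix letter
    const char suffix = cpu.back();
    if (suffix != 'a' && suffix != 'f') return false;
    const std::string_view base = cpu.substr(0, 6);
    return base == "sm_100" || base == "sm_101" ||
           base == "sm_103" || base == "sm_110";
}

// Builds a CPU row's Implies mask: the self-bit, plus bit 80 when tmem applies.
FeatureMask make_implies_mask(std::string_view cpu, unsigned self_bit) {
    FeatureMask mask{};
    mask[self_bit / 64] |= std::uint64_t{1} << (self_bit % 64);
    if (cpu_implies_tmem(cpu)) {
        mask[kTMemBit / 64] |= std::uint64_t{1} << (kTMemBit % 64);
    }
    return mask;
}
```

For sm_100a (self-bit 62) the mask has one bit in word 0 and bit 16 in word 1; for sm_120a (self-bit 74) only bit 10 of word 1 is set.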
Two CPU rows, sm_73 and sm_82, behave like compatibility placeholders
with no known physical silicon. Conversely, the cubin reader recognizes
sm_87, but this subtarget table has no sm_87 CPU row. A correct
reimplementation wires CPU selection and cubin classification symmetrically
so any recognized target is also selectable by -mcpu.
Runtime Feature State
The runtime subtarget stores the target triple, selected CPU, feature string, references to the CPU and feature tables, a generic scheduling model, the populated feature bitset, and parsed numeric SM/PTX versions. Only this compact state is needed for codegen legality checks.
struct NvptxFeatureState {
string triple;
string cpu;
string tune_cpu;
string feature_string;
ArrayRef<SubtargetSubTypeKV> cpu_rows;
ArrayRef<SubtargetFeatureKV> feature_rows;
ArrayRef<StringRef> cpu_names;
const MCSchedModel *sched_model;
uint64_t feature_bits[6];
uint32_t sm_version_times_ten;
uint32_t ptx_version_times_ten;
uint32_t sm_minor;
bool has_tmem_cache; /* cached hasFeature(80); written by parse_nvptx_subtarget */
};
feature_bits is the runtime bitset that hasFeature queries. It starts
empty; the selected CPU row's implies mask gets ORed in, then any
explicit -mattr=+feature flags. Masks and runtime bitset share the same
six-word shape, so population is a word-wise OR, not a per-bit loop.
static bool nvptx_has_feature(const NVPTXSubtarget *st, unsigned idx) {
return (st->FeatureBits[idx >> 6] >> (idx & 63)) & 1;
}
/* Concrete probes for the four interesting bits: */
/* HasSM90 = hasFeature(59) = (FeatureBits[0] >> 59) & 1 */
/* HasSM100 = hasFeature(61) = (FeatureBits[0] >> 61) & 1 */
/* HasSM100a = hasFeature(62) = (FeatureBits[0] >> 62) & 1 */
/* HasTMem = hasFeature(80) = (FeatureBits[1] >> 16) & 1 -- the lone NVIDIA-only bit */
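A self-contained sketch of the population and probe steps, assuming a bare six-word array in place of the full subtarget object. The helper names `or_feature_bits`, `set_feature`, and `has_feature` are illustrative, not symbols recovered from the binary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using FeatureBits = std::array<std::uint64_t, 6>;

// ORs a CPU row's implies mask into the runtime bitset, one word at a time.
void or_feature_bits(FeatureBits &dst, const FeatureBits &src) {
    for (std::size_t w = 0; w < dst.size(); ++w) {
        dst[w] |= src[w];
    }
}

// Sets one feature bit, as an explicit -mattr=+feature flag would.
void set_feature(FeatureBits &bits, unsigned idx) {
    bits[idx >> 6] |= std::uint64_t{1} << (idx & 63);
}

// Same probe as the hasFeature expression: word idx / 64, bit idx % 64.
bool has_feature(const FeatureBits &bits, unsigned idx) {
    return (bits[idx >> 6] >> (idx & 63)) & 1;
}
```

Worked through for index 80: 80 >> 6 is 1 and 80 & 63 is 16, so tmem lives at bit 16 of the second word, matching the HasTMem probe.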
Every TuneImplies slot in every SubTypeKV row is zero. Upstream LLVM uses this field to separate tuning advice (latency model, scheduling hints) from architectural feature implication; the NVPTX fork in this build leaves it empty. A faithful reimplementation leaves the TuneFeatures = [...] clause off the TableGen records. Every Implies slot in every FeatureKV row is zero too — features never transitively pull in other features in this binary. CPU rows carry the full implied set directly.
Cached Tensor-Memory Predicate
Hot instruction-selection paths use a cached boolean equivalent to
hasFeature(80). Semantically this is has_tmem: the target supports Tensor
Memory and can select tcgen05 instructions. The cache is an optimization.
Correctness still comes from the feature bitset.
static bool nvptx_has_feature(const NvptxFeatureState *state, unsigned idx) {
return (state->feature_bits[idx >> 6] >> (idx & 63)) & 1;
}
static bool nvptx_has_tmem(const NvptxFeatureState *state) {
return nvptx_has_feature(state, 80);
}
Reimplementations may cache has_tmem after CPU/feature parsing, but the cached
value must be derived from the same feature bitset that services normal
hasFeature queries.
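One way to honor that rule is to refresh the cache inside every mutation of the bitset. The sketch below assumes a hypothetical reduced state, `TMemState`, holding only the bitset and the cached flag; it is an illustration of the invariant, not the layout used by the binary.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical reduced state: only the bitset and the cached predicate.
struct TMemState {
    std::array<std::uint64_t, 6> feature_bits{};
    bool has_tmem_cache = false;
};

bool has_feature(const TMemState &s, unsigned idx) {
    return (s.feature_bits[idx >> 6] >> (idx & 63)) & 1;
}

// Sets or clears a bit, then re-derives the cache from the bitset so the
// two can never disagree.
void set_feature(TMemState &s, unsigned idx, bool enabled) {
    const std::uint64_t bit = std::uint64_t{1} << (idx & 63);
    if (enabled) {
        s.feature_bits[idx >> 6] |= bit;
    } else {
        s.feature_bits[idx >> 6] &= ~bit;
    }
    s.has_tmem_cache = has_feature(s, 80);
}
```

Refreshing on every write costs one extra load and shift per mutation, which is negligible because features are parsed once per subtarget.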
Lookup at Runtime
The full -mcpu resolution path takes the user-supplied CPU string, runs
std::lower_bound against the alphabetically sorted CPU table, and on an
exact hit ORs that CPU row's implies mask into feature_bits. On a miss,
tileiras falls back to the default CPU, sm_52. Any -mattr=+feature flags parsed in the
same call set their respective bits directly, bypassing the CPU table.
After CPU parsing, SMVersionTimesTen derives from the numeric part of the
CPU name. sm_90a records 90, not 901, because the suffix is a variant
marker rather than a new major version. PTXVersionTimesTen only populates
when a +ptxNN feature is supplied.
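void parse_nvptx_subtarget(NvptxFeatureState *state, string cpu, FeatureList attrs) {
/* lower_bound_cpu yields NULL unless the row's key matches cpu exactly. */
const SubtargetSubTypeKV *row = lower_bound_cpu(state->cpu_rows, cpu);
if (row == NULL) {
/* Unknown CPU: fall back to the default row. */
row = lower_bound_cpu(state->cpu_rows, "sm_52");
}
/* CPU-implied bits first, then explicit -mattr flags. */
or_feature_bits(state->feature_bits, row->implies);
for (FeatureAttr attr : attrs) {
set_feature_by_name(state, attr.name, attr.enabled);
}
/* Numeric versions: the a/f suffix does not change the SM number. */
state->sm_version_times_ten = parse_sm_major_times_ten(row->cpu_key);
state->ptx_version_times_ten = parse_ptx_version(attrs);
/* Derive the cache last, from the final bitset. */
state->has_tmem_cache = nvptx_has_tmem(state);
}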
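The two version parsers are left abstract in the listing above. A minimal sketch follows, assuming hypothetical helpers `parse_number_after`, `sm_version_from_cpu`, and `ptx_version_from_attr`, and assuming attributes arrive as strings such as "+ptx88"; the actual parsing code in the binary may differ.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>

// Parses the decimal digits after a prefix, stopping at the first non-digit.
// Returns 0 when the prefix does not match or no digits follow.
unsigned parse_number_after(std::string_view text, std::string_view prefix) {
    if (text.substr(0, prefix.size()) != prefix) return 0;
    unsigned value = 0;
    for (char c : text.substr(prefix.size())) {
        if (!std::isdigit(static_cast<unsigned char>(c))) break;
        value = value * 10 + static_cast<unsigned>(c - '0');
    }
    return value;
}

// "sm_90a" -> 90: the a/f suffix is a variant marker and is ignored.
unsigned sm_version_from_cpu(std::string_view cpu) {
    return parse_number_after(cpu, "sm_");
}

// "+ptx88" -> 88; any other attribute yields 0 (no PTX version supplied).
unsigned ptx_version_from_attr(std::string_view attr) {
    return parse_number_after(attr, "+ptx");
}
```

Stopping at the first non-digit is what keeps sm_90a at 90 rather than some suffix-dependent value, and returning 0 for non-ptx attributes mirrors the rule that the PTX version only populates when a +ptxNN feature is supplied.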
Cross-References
Per-SM Emission Templates — Capability Matrix walks the PTX templates
each CPU's implied feature set unlocks, including the consumer-Blackwell
mma.sync.aligned.*.block_scale substitute used when sm_120 and sm_121 lack
tmem.
PTX Version and Target Selection explains how a CPU row plus a +ptxNN
feature bit drive the .version / .target header projection, and which
arch-conditional instruction families each a / f suffix unlocks.
NVPTX Bring-up — Target Registration Chain covers the surrounding target
setup.
tcgen05 / WGMMA / mbarrier / Cluster Emission covers the codegen paths
guarded by the tensor-memory predicate.