libdevice `__nv_*` Symbol Catalog

Abstract

The libdevice bitcode shipped with CUDA exposes roughly 350 device-side math entry points behind the __nv_ prefix. They are the implementation surface that MLIR math.* / arith.* lowering and CUDA-C front ends target by name; every call site appears in the LLVM module as declare <type> @__nv_<name>(<args>) until Linker::linkModules pulls in the bitcode body and the always-inliner folds it into the caller. This page catalogues those symbols by family, names the reflection keys their bodies query, identifies the NVPTX hardware intrinsic each body decays into after NVVMReflectPass folds the configuration constants, and pins down the rounding-mode and FTZ matrix the symbols collectively cover.

The Intrinsic ID Switch and Name Table page documents how the LLVM constant folder classifies surviving call sites by name; the Math Pass Pipeline and Crosswalk page documents the MLIR-side rewrite from math.<op> to __nv_<name>. This page is the inventory in between — the names themselves, the bodies they unwrap to, and the reasons the body chooses one PTX form over another.

Naming convention

Every libdevice symbol decomposes into prefix, base name, type suffix, and optional rounding-mode suffix:

__nv_  <base>  [<rounding-mode>]  [<type-suffix>]

Component	Form	Examples	Notes
Prefix	`__nv_`	every entry	identifies device math; trips libdevice linker pattern
Base name	C99 / IEEE-754 root	`sin`, `cos`, `exp`, `log`, `sqrt`, `fma`, `pow`, `rint`	shared with libm; semantics match unless reflection keys override
Rounding mode	`_rn`, `_rz`, `_ru`, `_rd`	`__nv_dadd_rn`, `__nv_fdiv_ru`	optional; absent forms imply round-to-nearest-even
Type suffix	`f`, `d` (or none)	`__nv_sinf`, `__nv_sin`, `__nv_fabs` (default f64)	`f` = `float`, `d` or bare = `double`, `h`/`bf16` absent

The full grammar admits four orthogonal axes: input domain (f32/f64/i32/i64/u32/u64), rounding mode, FTZ behaviour, and approximation policy. A name like __nv_dadd_rn reads as "double add, round-to-nearest-even, full precision"; __nv_fast_powf reads as "float pow, fast path approximation, may flush denormals". Half-precision (f16, bf16) is intentionally absent — MLIR OpToFuncCallLowering promotes to f32 before the libdevice call and demotes via arith.truncf after, so libdevice never sees the narrow type.

Family inventory

The catalogue groups symbols by the IEEE-754 / C99 root family they belong to. Counts are the entries reachable from Linker::Flags::OnlyNeeded against a kernel that touches every published math intrinsic; bitcode versions with optional families may not ship a body for every entry in the table.

Trigonometric — circular

Symbol family	f32	f64	Fast path	Reflection key	Decay (when applicable)
Sine	`__nv_sinf`	`__nv_sin`	`__nv_fast_sinf`	`__CUDA_FTZ`, `__CUDA_ARCH`	`sin.approx.f32` (FTZ); Payne–Hanek otherwise
Cosine	`__nv_cosf`	`__nv_cos`	`__nv_fast_cosf`	`__CUDA_FTZ`	`cos.approx.f32` (FTZ); Payne–Hanek otherwise
Tangent	`__nv_tanf`	`__nv_tan`	`__nv_fast_tanf`	`__CUDA_FTZ`	`sin.approx`/`cos.approx` quotient on FTZ paths
Sine + cosine	`__nv_sincosf`	`__nv_sincos`	`__nv_fast_sincosf`	`__CUDA_FTZ`	fuses both PTX approximations; returns by pointer outs
Sine of π·x	`__nv_sinpif`	`__nv_sinpi`	—	—	scaled Payne–Hanek; argument is in half-cycles
Cosine of π·x	`__nv_cospif`	`__nv_cospi`	—	—	scaled Payne–Hanek; argument is in half-cycles
Arc sine	`__nv_asinf`	`__nv_asin`	—	—	libdevice-only; polynomial in `1 - x*x`
Arc cosine	`__nv_acosf`	`__nv_acos`	—	—	uses `asin` then subtracts from `π/2`
Arc tangent	`__nv_atanf`	`__nv_atan`	—	—	range-reduced rational approximation
Two-arg arc tan	`__nv_atan2f`	`__nv_atan2`	—	—	quadrant fixup on top of `atan`; matches C `atan2`

The __nv_fast_* aliases bind directly to the PTX approximate intrinsic (sin.approx.f32, cos.approx.f32) and skip Payne–Hanek range reduction; they are reachable through the fast-math math path or by name, never through MLIR math.* lowering on default settings.

Trigonometric — hyperbolic and inverse hyperbolic

Symbol family	f32	f64	Reflection key	Decay
Hyperbolic sine	`__nv_sinhf`	`__nv_sinh`	—	`(__nv_exp(x) - __nv_exp(-x)) * 0.5` with overflow guard
Hyperbolic cosine	`__nv_coshf`	`__nv_cosh`	—	`(__nv_exp(x) + __nv_exp(-x)) * 0.5` with overflow guard
Hyperbolic tangent	`__nv_tanhf`	`__nv_tanh`	—	rational approximation; sm_75+ uses `tanh.approx.f32` when present
Inverse hyperbolic sine	`__nv_asinhf`	`__nv_asinh`	—	`log(x + sqrt(x*x + 1))` with cancellation fix-up
Inverse hyperbolic cosine	`__nv_acoshf`	`__nv_acosh`	—	`log(x + sqrt(x*x - 1))`
Inverse hyperbolic tangent	`__nv_atanhf`	`__nv_atanh`	—	`0.5 * log1p(2x/(1-x))`

Exponential family

Symbol family	f32	f64	Fast path	Reflection key	Decay
Base-e exp	`__nv_expf`	`__nv_exp`	`__nv_fast_expf`	`__CUDA_FTZ`	`ex2.approx.f32` (`exp(x) = ex2(x * 1.4426950408)`)
Base-2 exp	`__nv_exp2f`	`__nv_exp2`	—	`__CUDA_FTZ`	`ex2.approx.f32` directly
Base-10 exp	`__nv_exp10f`	`__nv_exp10`	`__nv_fast_exp10f`	`__CUDA_FTZ`	`ex2.approx` after `* log2(10)`
`exp(x) - 1`	`__nv_expm1f`	`__nv_expm1`	—	—	libdevice-only; Estrin-form polynomial near 0
Natural log	`__nv_logf`	`__nv_log`	`__nv_fast_logf`	`__CUDA_FTZ`	`lg2.approx.f32` then `* 0.6931471806`
Base-2 log	`__nv_log2f`	`__nv_log2`	—	`__CUDA_FTZ`, `nvptx-approx-log2f32`	`lg2.approx.f32` directly
Base-10 log	`__nv_log10f`	`__nv_log10`	`__nv_fast_log10f`	`__CUDA_FTZ`	`lg2.approx` then `* 0.30102999566`
`log(1 + x)`	`__nv_log1pf`	`__nv_log1p`	—	—	libdevice-only; minimax polynomial
Power	`__nv_powf`	`__nv_pow`	`__nv_fast_powf`	`__CUDA_FTZ`	`lg2.approx` + `ex2.approx` composition
Integer power	`__nv_powif`	`__nv_powi`	—	—	repeated-squaring; integer exponent
`pow(x, n)` for `int n`	`__nv_fast_powf` (alias)	—	—	—	uses `lg2`/`ex2` regardless of integer-ness

The fast-path aliases are the entry points the fast-math pragma routes math ops through; they short-circuit the precision-checking guard arms and emit the bare ex2.approx.f32 / lg2.approx.f32 pair without finite-input cleanup.

Power-of-2 and integer-shift helpers

Symbol family	f32	f64	Notes
`ldexp(x, n)`	`__nv_ldexpf`	`__nv_ldexp`	integer scale `n` is `i32`; result is `x * 2^n`
`frexp(x, *n)`	`__nv_frexpf`	`__nv_frexp`	mantissa returned, exponent written through pointer
`scalbn(x, n)`	`__nv_scalbnf`	`__nv_scalbn`	identical to `ldexp` on IEEE-754 binary radix
`scalbln(x, l)`	`__nv_scalblnf`	`__nv_scalbln`	`long` exponent; libdevice clamps before scaling
`logb(x)`	`__nv_logbf`	`__nv_logb`	floor(log2(
`ilogb(x)`	`__nv_ilogbf`	`__nv_ilogb`	`int` exponent; raises domain error inline
`nextafter(x, y)`	`__nv_nextafterf`	`__nv_nextafter`	bitwise next representable; respects denormal direction

Rounding and sign manipulation

Symbol	Type	Decay
`__nv_floorf` / `__nv_floor`	round toward -∞	`cvt.rmi.f32.f32` (f32); libdevice body (f64)
`__nv_ceilf` / `__nv_ceil`	round toward +∞	`cvt.rpi.f32.f32` (f32); libdevice body (f64)
`__nv_truncf` / `__nv_trunc`	round toward 0	`cvt.rzi.f32.f32` (f32); libdevice body (f64)
`__nv_roundf` / `__nv_round`	round half-away-from-zero	libdevice-only — PTX has no matching mode
`__nv_rintf` / `__nv_rint`	round to nearest (current rounding mode)	`cvt.rni.f32.f32` (default IEEE)
`__nv_nearbyintf` / `__nv_nearbyint`	rint without inexact flag	same as `rint`; libdevice flag handling differs
`__nv_lroundf` / `__nv_lround`	round to `long`	`cvt.rni.s32.f32` after range check
`__nv_llroundf` / `__nv_llround`	round to `long long`	`cvt.rni.s64.f64` after range check
`__nv_lrintf` / `__nv_lrint`	rint to `long`	`cvt.rni.s32.f32`
`__nv_llrintf` / `__nv_llrint`	rint to `long long`	`cvt.rni.s64.f64`
`__nv_copysignf` / `__nv_copysign`	sign transfer	bit op; folds to `llvm.copysign.*`
`__nv_fabsf` / `__nv_fabs`	absolute value	bit-AND mask; folds to `llvm.fabs.*` or `abs.f32`
`__nv_signbitf` / `__nv_signbitd`	sign-bit test	shift-right of bit pattern

Min/max and classification

Symbol	Semantics	Decay
`__nv_fminf` / `__nv_fmin`	IEEE-754 minNum	`min.f32`/`min.f64` on sm_80+; libdevice body otherwise
`__nv_fmaxf` / `__nv_fmax`	IEEE-754 maxNum	`max.f32`/`max.f64` on sm_80+; libdevice body otherwise
`__nv_fminimumf` / `__nv_fminimum`	IEEE-754-2019 minimum (NaN-propagating)	bit ops + NaN check
`__nv_fmaximumf` / `__nv_fmaximum`	IEEE-754-2019 maximum (NaN-propagating)	bit ops + NaN check
`__nv_isfinitef` / `__nv_isfinited`	finite predicate	bit arithmetic on exponent field
`__nv_isinff` / `__nv_isinfd`	infinite predicate	bit arithmetic on exponent + mantissa
`__nv_isnanf` / `__nv_isnand`	NaN predicate	bit arithmetic; matches IEEE-754 quiet/sign-NaN definition
`__nv_finitef` / `__nv_finite`	legacy `isfinite` alias	aliased to `__nv_isfinitef`/`__nv_isfinited`

The min/max divergence is the most observable one. fmin/fmax follow IEEE-754-2008's "minNum" rule that returns the non-NaN operand when exactly one operand is NaN; fminimum/fmaximum follow IEEE-754-2019's "minimum" rule that returns NaN whenever any operand is NaN. The MLIR arith.minnumf and arith.maxnumf ops route to fmin/fmax; there are no MLIR ops covering fminimum/fmaximum, only direct front-end calls.

Roots, reciprocals, divides — the precision-keyed family

Symbol	f32	f64	Reflection key	Decay at key=0	Decay at key=1
Square root	`__nv_sqrtf`	`__nv_sqrt`	`__CUDA_PREC_SQRT`	`sqrt.approx.f32`	`sqrt.rn.f32`
Reciprocal sqrt	`__nv_rsqrtf`	`__nv_rsqrt`	—	`rsqrt.approx.f32`	(same — no precise form)
Division	`__nv_fdividef`	`__nv_fdivide`	`__CUDA_PREC_DIV`	`div.approx.f32`	`div.rn.f32`
Reciprocal	`__nv_frcp_rn` etc.	`__nv_drcp_rn` etc.	—	`rcp.approx.f32`	`rcp.rn.f32`
Cube root	`__nv_cbrtf`	`__nv_cbrt`	—	libdevice-only — polynomial + Newton refinement	(same)
Reciprocal cbrt	`__nv_rcbrtf`	`__nv_rcbrt`	—	libdevice-only — `1 / cbrt(x)` with sign fix	(same)
Hypot	`__nv_hypotf`	`__nv_hypot`	—	`sqrt(xx + yy)` with overflow guard	(same)
Reciprocal hypot	`__nv_rhypotf`	`__nv_rhypot`	—	`1 / hypot(x, y)`	(same)
3-argument hypot	`__nv_norm3df`	`__nv_norm3d`	—	`sqrt(xx + yy + z*z)`	(same)
4-argument hypot	`__nv_norm4df`	`__nv_norm4d`	—	same with one more term	(same)
n-argument hypot	`__nv_normf`	`__nv_norm`	—	loop; pointer + length args	(same)

__CUDA_PREC_SQRT and __CUDA_PREC_DIV are the two reflection keys with the most observable impact on libdevice output. Their 0 settings trip the approximate hardware path that the SASS engine schedules in a single cycle; their 1 settings replace the call with a software Newton-Raphson refinement on top of the approximate result, costing roughly five additional FMAs per call. The MLIR lowering path picks the key value from module-level !nvvm.reflection metadata seeded by the driver CLI options — tileiras defaults to __CUDA_PREC_DIV=1, __CUDA_PREC_SQRT=1 matching nvcc's default of full IEEE precision.

Integer arithmetic helpers

Symbol family	Width	Decay
`__nv_abs`	`i32` → `i32`	`(x ^ (x >> 31)) - (x >> 31)` — fully inlined
`__nv_llabs`	`i64` → `i64`	same idiom on 64-bit shift
`__nv_min` / `__nv_max`	`i32`	`min.s32` / `max.s32`
`__nv_umin` / `__nv_umax`	`u32`	`min.u32` / `max.u32`
`__nv_llmin` / `__nv_llmax`	`i64`	`min.s64` / `max.s64`
`__nv_ullmin` / `__nv_ullmax`	`u64`	`min.u64` / `max.u64`
`__nv_mul24`	`i32 × i32 → i32`	`mul24.s32` (24-bit truncated multiply)
`__nv_umul24`	`u32 × u32 → u32`	`mul24.u32`
`__nv_mul64hi`	`i64 × i64 → i64` (hi half)	`mul.hi.s64`
`__nv_umul64hi`	`u64 × u64 → u64` (hi half)	`mul.hi.u64`
`__nv_mulhi`	`i32 × i32 → i32` (hi half)	`mul.hi.s32`
`__nv_umulhi`	`u32 × u32 → u32` (hi half)	`mul.hi.u32`
`__nv_popc`	`u32` → `i32`	`popc.b32`
`__nv_popcll`	`u64` → `i32`	`popc.b64`
`__nv_clz` / `__nv_clzll`	leading zeros	`clz.b32` / `clz.b64`
`__nv_ffs` / `__nv_ffsll`	bit position of LSB	`bfind` family
`__nv_brev` / `__nv_brevll`	bit reverse	`brev.b32` / `brev.b64`
`__nv_sad` / `__nv_usad`	sum of absolute differences	`sad.s32` / `sad.u32`
`__nv_byte_perm`	byte permutation	`prmt.b32`
`__nv_funnelshift_l`/`_lc`/`_r`/`_rc`	64-bit funnel shifts	`shf.l/r.wrap/clamp.b32`

The mul24 family is the most architecture-dependent: pre-Volta hardware ran mul24.s32 as a single-issue instruction; sm_70+ runs the full 32-bit mul.lo.s32 at the same throughput, and the libdevice body simply forwards the call. Old CUDA-C code that explicitly calls __mul24 therefore retains the API surface but loses the historical performance benefit.

Mixed-mode conversions and float decoders

Symbol family	Direction	Decay
`__nv_int2float_{rn,rz,ru,rd}`	`i32` → `f32`	`cvt.<rnd>.f32.s32`
`__nv_uint2float_{rn,rz,ru,rd}`	`u32` → `f32`	`cvt.<rnd>.f32.u32`
`__nv_ll2float_{rn,rz,ru,rd}`	`i64` → `f32`	`cvt.<rnd>.f32.s64`
`__nv_ull2float_{rn,rz,ru,rd}`	`u64` → `f32`	`cvt.<rnd>.f32.u64`
`__nv_int2double_rn`	`i32` → `f64`	`cvt.f64.s32` (only `rn` is exact)
`__nv_double2int_{rn,rz,ru,rd}`	`f64` → `i32`	`cvt.<rnd>.s32.f64`
`__nv_float2int_{rn,rz,ru,rd}`	`f32` → `i32`	`cvt.<rnd>.s32.f32`
`__nv_double2float_{rn,rz,ru,rd}`	`f64` → `f32`	`cvt.<rnd>.f32.f64`
`__nv_float2half_{rn,rz}`	`f32` → `f16`	`cvt.<rnd>.f16.f32`
`__nv_half2float`	`f16` → `f32`	`cvt.f32.f16`
`__nv_float_as_int`	bit reinterpret	`mov.b32` (lossless)
`__nv_int_as_float`	bit reinterpret	`mov.b32` (lossless)
`__nv_longlong_as_double`	bit reinterpret	`mov.b64`
`__nv_double_as_longlong`	bit reinterpret	`mov.b64`
`__nv_double2hiint` / `_loint`	f64 → upper/lower 32 bits	`cvt.u32.u64` after `mov.b64`
`__nv_hiloint2double`	reassemble f64 from two i32	`mov.b64` of packed result

The *_as_* family is intentionally a no-op at the LLVM level; libdevice ships a body anyway so that the symbol exists and the bitcode linker has something to resolve. The body is a single bitcast followed by ret, which the always-inliner reduces to a register rename in the caller.

Error and gamma functions

Symbol	f32	f64	Notes
Error function	`__nv_erff`	`__nv_erf`	libdevice-only; rational approximation, double-double internals
Complementary erf	`__nv_erfcf`	`__nv_erfc`	libdevice-only; scaled `exp(-x*x)` path for large `
Inverse erf	`__nv_erfinvf`	`__nv_erfinv`	libdevice-only; iterative
Inverse erfc	`__nv_erfcinvf`	`__nv_erfcinv`	libdevice-only; iterative
Scaled erfc	`__nv_erfcxf`	`__nv_erfcx`	`exp(xx) erfc(x)`; large-x stable form
Gamma	`__nv_tgammaf`	`__nv_tgamma`	Stirling for large `x`, reflection for small `x`
Log-gamma	`__nv_lgammaf`	`__nv_lgamma`	log of
Norm CDF	`__nv_normcdff`	`__nv_normcdf`	`0.5 * erfc(-x/sqrt(2))`
Inverse norm CDF	`__nv_normcdfinvf`	`__nv_normcdfinv`	iterative on `erfinv`
Bessel J0 / J1	`__nv_j0f` / `__nv_j1f`	`__nv_j0` / `__nv_j1`	libdevice-only; minimax for small `x`, asymptotic for large
Bessel Y0 / Y1	`__nv_y0f` / `__nv_y1f`	`__nv_y0` / `__nv_y1`	libdevice-only; same shape
Bessel Jn / Yn	`__nv_jnf` / `__nv_ynf`	`__nv_jn` / `__nv_yn`	recurrence on the J0/J1, Y0/Y1 pair

Rounding-mode-qualified arithmetic

These are the "primitive" forms the MLIR lowering does not use directly, but which front-end code can call to force a specific rounding mode on a single op:

Op	f32 family	f64 family	Decay
Add	`__nv_fadd_rn` / `_rz` / `_ru` / `_rd`	`__nv_dadd_rn` / `_rz` / `_ru` / `_rd`	`add.<rnd>.f32` / `add.<rnd>.f64`
Subtract	`__nv_fsub_rn` etc.	`__nv_dsub_rn` etc.	`sub.<rnd>.f32` / `sub.<rnd>.f64`
Multiply	`__nv_fmul_rn` etc.	`__nv_dmul_rn` etc.	`mul.<rnd>.f32` / `mul.<rnd>.f64`
Divide	`__nv_fdiv_rn` etc.	`__nv_ddiv_rn` etc.	`div.<rnd>.f32` / `div.<rnd>.f64`; `_rn` is the only IEEE-correct form
FMA	`__nv_fmaf_rn` etc.	`__nv_fma_rn` etc.	`fma.<rnd>.f32` / `fma.<rnd>.f64`
Reciprocal	`__nv_frcp_rn` etc.	`__nv_drcp_rn` etc.	`rcp.<rnd>.f32` / `rcp.<rnd>.f64`
Square root	`__nv_fsqrt_rn` etc.	`__nv_dsqrt_rn` etc.	`sqrt.<rnd>.f32` / `sqrt.<rnd>.f64`

The MLIR pipeline never emits these names directly; they are reachable only through CUDA-C intrinsic shims (__fadd_rn etc. without the __nv_ prefix) and pass through the libdevice linker unchanged.

Reflection-key cross-reference

The reflection keys consumed by libdevice bodies fall into four orthogonal axes:

Key	Type	Values	Effect on bodies that read it
`__CUDA_FTZ`	bool	`0` (preserve), `1` (flush)	Selects FTZ vs non-FTZ approximate-intrinsic variant in `sin`, `cos`, `tan`, `exp`, `log`, `pow`, etc. Bodies typically have `if (__nvvm_reflect("__CUDA_FTZ"))` arms wrapping the `sin.approx.ftz.f32` / `sin.approx.f32` selection.
`__CUDA_PREC_DIV`	bool	`0` (approx), `1` (IEEE)	`__nv_fdividef` and `__nv_fdivide` choose `div.approx.f32` vs `div.rn.f32` + Newton refinement. `nvcc` default is `1`; `--use_fast_math` flips to `0`.
`__CUDA_PREC_SQRT`	bool	`0` (approx), `1` (IEEE)	`__nv_sqrtf` and `__nv_sqrt` choose `sqrt.approx.f32` vs `sqrt.rn.f32`. Default and flip behaviour mirror `__CUDA_PREC_DIV`.
`__CUDA_FAST_INT_DIV`	bool	`0`, `1`	Integer division and modulo libdevice helpers (`__nv_idiv`, `__nv_imod`, etc., if present in the bitcode) choose between the reference 32-bit algorithm and the truncated approximation.
`__CUDA_ARCH`	int	`700`, `750`, `800`, `860`, `890`, `900`, `1000`, `1030`, `1200`, …	Selects per-SM intrinsic availability inside bodies that fall back to legacy paths on older hardware.

Bodies that do not query any reflection key are non-configurable; they emit the same NVPTX intrinsic regardless of target options. The libdevice overview pipeline folds the reflection keys before the always-inliner runs, so reflection-driven branches are dead by the time the inliner copies the body into the caller.

SM-floor inventory

A handful of __nv_* symbols decay into instructions whose lowest PTX support level is later than the rest of libdevice. Calls to these symbols from a kernel compiled for an older SM produce libdevice fall-back bodies rather than the named instruction.

Symbol family	Decay floor	Older-SM fallback
`__nv_fminf` / `__nv_fmaxf`	sm_80 `min.f32`/`max.f32`	branch-and-select bit logic
`__nv_fmin` / `__nv_fmax`	sm_80 `min.f64`/`max.f64`	branch-and-select
`__nv_tanhf`	sm_75 `tanh.approx.f32`	rational approximation in software
Block-scaled `__nv_cvt_*` (FP8 / FP4)	sm_89 / sm_100a `cvt.packfloat.*`	not provided — undefined behaviour on older SMs
`__nv_fma_relu_*`	sm_75 (f16) / sm_90a (f8)	not provided — softmax-style ReLU+FMA fused intrinsic is sm-gated
Tensor-memory casts	sm_100a tcgen05 path	not in libdevice — these live in `nvvm`

The libdevice "fall-back" body is the same body the reflection-folded reference path uses; the only difference is that the always-inliner cannot collapse the body into a single PTX instruction because the PTX form does not exist yet.

Linker behaviour and dead-call elimination

Libdevice bitcode is linked with Linker::Flags::OnlyNeeded. The linker walks the user module's declaration set, copies in the matching definitions, and recursively pulls in any further __nv_* declarations the freshly-imported bodies reference. The __CUDA_FTZ / __CUDA_PREC_* reflection arms typically reference both the FTZ and the non-FTZ helper symbols, so a library body that ultimately resolves to a single arm still drags the unused arm's helpers into the user module. The post-inline GlobalDCEPass cleans them up:

1. Linker pulls in __nv_sinf body, which references __nv_sin_kernel_ftz, __nv_sin_kernel_nonftz.
2. NVVMReflectPass folds the FTZ arm to the chosen path.
3. AlwaysInlinerPass inlines __nv_sinf into the caller.
4. SimplifyCFG + SCCP eliminate the dead arm and its helper call.
5. GlobalDCEPass removes the orphaned __nv_sin_kernel_<other> from the module.

Steps 4 and 5 are why the libdevice bitcode appears tiny in the final PTX even though the bitcode blob is several megabytes. The pre-DCE module size can be 5–10× the final size; the dead-arm elimination is the single largest IR shrink in the libdevice integration path.

Verification invariants

Three invariants hold across libdevice integration. Violations are caught by NVVMIRVerifier before the NVPTX backend runs.

Every __nv_* declaration is resolved before code generation. A surviving declaration is a backend error.
Every __nvvm_reflect("KEY") call is folded into a ConstantInt before always-inlining. A surviving reflect call is a configuration bug.
No __nv_* body retains a __nvvm_reflect call after the four-pass integration; the post-link nvvm-reflect-pp cleanup folds the constant branches and removes any dangling intrinsic call sites.

QUIRK: Unknown reflection keys silently fold to zero

NVVMReflectPass::populateVarMap defaults missing keys to 0 and records the zero in the resolved map so that every later call site folds to the same value. A typo in __nvvm_reflect("__CUDA_FFZ") (with double-F) is therefore not a diagnostic — it is a silent reset to the FTZ-off behaviour, applied consistently. The only way to notice is to inspect the post-reflect IR and check that the key the body queries is the key the configuration set. Reimplementations that diverge from this — for example by warning on unknown keys, or by returning -1 to indicate "unknown" — break libdevice bodies that rely on the recorded-zero behaviour for legacy options that the bitcode references but the current configuration system does not know about.

QUIRK: `_rn` is the only IEEE-correct division and square root

__nv_fdiv_rn and __nv_fsqrt_rn decay to div.rn.f32 and sqrt.rn.f32 — the only PTX divide and square-root variants that the IEEE-754 standard certifies as correctly rounded. The _rz, _ru, and _rd variants are valid hardware instructions but do not satisfy IEEE-754 single-step correctness for division and square root: they round the approximate result rather than the mathematically exact one. Libdevice does not paper over this — code that calls __nv_fdiv_ru(a, b) gets the directed-rounded approximation, not a Newton-refined directed-rounded result. The MLIR arith dialect has no rounding-mode parameter on arith.divf, so this asymmetry is only reachable through CUDA-C intrinsics; MLIR-fronted code always sees the round-to-nearest path.

QUIRK: `__nv_fast_*` are libdevice symbols, not preprocessor macros

__nv_fast_sinf, __nv_fast_cosf, __nv_fast_powf, etc. exist as separate bitcode symbols, not as #define-style rewrites of __nv_sinf and friends. They have distinct bodies — typically a single sin.approx.ftz.f32 call — and their existence is what allows --use_fast_math to substitute the symbol name during MLIR OpToFuncCallLowering selection without recompiling the libdevice bitcode. A reimplementation that treats __nv_fast_sinf as a macro alias of __nv_sinf will lose the FTZ behaviour the fast-path body enforces unconditionally; the slow-path body is FTZ-conditional on __CUDA_FTZ, and a fast-math build with __CUDA_FTZ=0 (the IEEE-clean default) would then silently preserve denormals where CUDA's bitcode would flush them.

Cross-references

The four-pass integration sequence that turns these declarations into concrete bodies is documented in libdevice Overview — Pipeline. The reflection keys that gate body selection are documented in NVVMReflect Mechanism — Three var-map sources. The MLIR-side rewriter that emits the __nv_* call sites these symbols define is documented in Math Pass Pipeline and Crosswalk — Full math-op crosswalk. The LLVM constant folder that classifies any surviving by-name call sites is documented in Intrinsic ID Switch and Name Table — libdevice suffix name table. The fast-math pragma that selects the __nv_fast_* family over the precision-keyed family is discussed in Fast Math and Numerical Precision.

Keyboard shortcuts

Tileiras Internals