libdevice __nv_* Symbol Catalog

Abstract

The libdevice bitcode shipped with CUDA exposes roughly 350 device-side math entry points behind the __nv_ prefix. They are the implementation surface that MLIR math.* / arith.* lowering and CUDA-C front ends target by name; every call site appears in the LLVM module as declare <type> @__nv_<name>(<args>) until Linker::linkModules pulls in the bitcode body and the always-inliner folds it into the caller. This page catalogues those symbols by family, names the reflection keys their bodies query, identifies the NVPTX hardware intrinsic each body decays into after NVVMReflectPass folds the configuration constants, and pins down the rounding-mode and FTZ matrix the symbols collectively cover.

The Intrinsic ID Switch and Name Table page documents how the LLVM constant folder classifies surviving call sites by name; the Math Pass Pipeline and Crosswalk page documents the MLIR-side rewrite from math.<op> to __nv_<name>. This page is the inventory in between — the names themselves, the bodies they unwrap to, and the reasons the body chooses one PTX form over another.

Naming convention

Every libdevice symbol decomposes into prefix, base name, type suffix, and optional rounding-mode suffix:

__nv_  <base>  [<rounding-mode>]  [<type-suffix>]
| Component | Form | Examples | Notes |
| --- | --- | --- | --- |
| Prefix | __nv_ | every entry | identifies device math; trips libdevice linker pattern |
| Base name | C99 / IEEE-754 root | sin, cos, exp, log, sqrt, fma, pow, rint | shared with libm; semantics match unless reflection keys override |
| Rounding mode | _rn, _rz, _ru, _rd | __nv_dadd_rn, __nv_fdiv_ru | optional; absent forms imply round-to-nearest-even |
| Type suffix | f, d (or none) | __nv_sinf, __nv_sin, __nv_fabs (default f64) | f = float, d or bare = double, h/bf16 absent |

The full grammar admits four orthogonal axes: input domain (f32/f64/i32/i64/u32/u64), rounding mode, FTZ behaviour, and approximation policy. A name like __nv_dadd_rn reads as "double add, round-to-nearest-even, full precision"; __nv_fast_powf reads as "float pow, fast path approximation, may flush denormals". Half-precision (f16, bf16) is intentionally absent — MLIR OpToFuncCallLowering promotes to f32 before the libdevice call and demotes via arith.truncf after, so libdevice never sees the narrow type.
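The decomposition can be sketched as a small parser. This is a minimal illustration, not part of any toolchain: `parse_nv_name` and `KNOWN_ROOTS` are hypothetical names, and the root set is a tiny sample rather than the full inventory. A root list is needed because a trailing `f` alone is ambiguous (`__nv_erf` is the f64 entry, `__nv_erff` the f32 one).

```python
import re

# Hypothetical sample of base names; the real inventory is far larger.
KNOWN_ROOTS = {"sin", "cos", "exp", "log", "sqrt", "pow", "erf", "fabs"}

def parse_nv_name(symbol):
    """Split a libdevice-style symbol into the components described above."""
    if not symbol.startswith("__nv_"):
        raise ValueError(f"not a libdevice symbol: {symbol}")
    stem = symbol[len("__nv_"):]

    # Approximation policy: the fast_ marker sits right after the prefix.
    fast = stem.startswith("fast_")
    if fast:
        stem = stem[len("fast_"):]

    # Rounding mode: an optional trailing _rn/_rz/_ru/_rd.
    rounding = None
    match = re.fullmatch(r"(.+)_(rn|rz|ru|rd)", stem)
    if match:
        stem, rounding = match.group(1), match.group(2)

    # Type suffix: a bare known root is f64, root + "f" is f32.
    if stem in KNOWN_ROOTS:
        base, ty = stem, "f64"
    elif stem.endswith("f") and stem[:-1] in KNOWN_ROOTS:
        base, ty = stem[:-1], "f32"
    else:
        # Names such as dadd or int2float spell their types inside the stem.
        base, ty = stem, None

    return {"base": base, "type": ty, "rounding": rounding, "fast": fast}
```

For example, `parse_nv_name("__nv_fast_powf")` reports base `pow`, type `f32`, and `fast=True`, while `parse_nv_name("__nv_dadd_rn")` reports rounding mode `rn` and leaves the stem `dadd` undecoded.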

Family inventory

The catalogue groups symbols by the IEEE-754 / C99 root family they belong to. Counts are the entries reachable from Linker::Flags::OnlyNeeded against a kernel that touches every published math intrinsic; bitcode versions with optional families may not ship a body for every entry in the table.

Trigonometric — circular

| Symbol family | f32 | f64 | Fast path | Reflection key | Decay (when applicable) |
| --- | --- | --- | --- | --- | --- |
| Sine | __nv_sinf | __nv_sin | __nv_fast_sinf | __CUDA_FTZ, __CUDA_ARCH | sin.approx.f32 (FTZ); Payne–Hanek otherwise |
| Cosine | __nv_cosf | __nv_cos | __nv_fast_cosf | __CUDA_FTZ | cos.approx.f32 (FTZ); Payne–Hanek otherwise |
| Tangent | __nv_tanf | __nv_tan | __nv_fast_tanf | __CUDA_FTZ | sin.approx/cos.approx quotient on FTZ paths |
| Sine + cosine | __nv_sincosf | __nv_sincos | __nv_fast_sincosf | __CUDA_FTZ | fuses both PTX approximations; returns by pointer outs |
| Sine of π·x | __nv_sinpif | __nv_sinpi | | | scaled Payne–Hanek; argument is in half-cycles |
| Cosine of π·x | __nv_cospif | __nv_cospi | | | scaled Payne–Hanek; argument is in half-cycles |
| Arc sine | __nv_asinf | __nv_asin | | | libdevice-only; polynomial in 1 - x*x |
| Arc cosine | __nv_acosf | __nv_acos | | | uses asin then subtracts from π/2 |
| Arc tangent | __nv_atanf | __nv_atan | | | range-reduced rational approximation |
| Two-arg arc tan | __nv_atan2f | __nv_atan2 | | | quadrant fixup on top of atan; matches C atan2 |

The __nv_fast_* aliases bind directly to the PTX approximate intrinsic (sin.approx.f32, cos.approx.f32) and skip Payne–Hanek range reduction. They are reachable through the fast-math path or by name, never through MLIR math.* lowering on default settings.

Trigonometric — hyperbolic and inverse hyperbolic

| Symbol family | f32 | f64 | Reflection key | Decay |
| --- | --- | --- | --- | --- |
| Hyperbolic sine | __nv_sinhf | __nv_sinh | | (__nv_exp(x) - __nv_exp(-x)) * 0.5 with overflow guard |
| Hyperbolic cosine | __nv_coshf | __nv_cosh | | (__nv_exp(x) + __nv_exp(-x)) * 0.5 with overflow guard |
| Hyperbolic tangent | __nv_tanhf | __nv_tanh | | rational approximation; sm_75+ uses tanh.approx.f32 when present |
| Inverse hyperbolic sine | __nv_asinhf | __nv_asinh | | log(x + sqrt(x*x + 1)) with cancellation fix-up |
| Inverse hyperbolic cosine | __nv_acoshf | __nv_acosh | | log(x + sqrt(x*x - 1)) |
| Inverse hyperbolic tangent | __nv_atanhf | __nv_atanh | | 0.5 * log1p(2x/(1-x)) |
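The closed forms in the Decay column can be checked numerically against a reference libm. The sketch below uses Python's `math` module as that reference; the `*_formula` names are hypothetical, and the code omits the overflow guards and cancellation fix-ups the table mentions, so it only illustrates the identities on well-behaved inputs.

```python
import math

def asinh_formula(x):
    # log(x + sqrt(x*x + 1)); loses accuracy for tiny or large negative x,
    # which is what a cancellation fix-up would address.
    return math.log(x + math.sqrt(x * x + 1.0))

def acosh_formula(x):
    # Valid for x >= 1.
    return math.log(x + math.sqrt(x * x - 1.0))

def atanh_formula(x):
    # (1 + x) / (1 - x) == 1 + 2x / (1 - x), hence the log1p form.
    return 0.5 * math.log1p(2.0 * x / (1.0 - x))

for x in (0.25, 0.5, 0.9):
    assert math.isclose(atanh_formula(x), math.atanh(x), rel_tol=1e-12)
```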

Exponential family

| Symbol family | f32 | f64 | Fast path | Reflection key | Decay |
| --- | --- | --- | --- | --- | --- |
| Base-e exp | __nv_expf | __nv_exp | __nv_fast_expf | __CUDA_FTZ | ex2.approx.f32 (exp(x) = ex2(x * 1.4426950408)) |
| Base-2 exp | __nv_exp2f | __nv_exp2 | | __CUDA_FTZ | ex2.approx.f32 directly |
| Base-10 exp | __nv_exp10f | __nv_exp10 | __nv_fast_exp10f | __CUDA_FTZ | ex2.approx after * log2(10) |
| exp(x) - 1 | __nv_expm1f | __nv_expm1 | | | libdevice-only; Estrin-form polynomial near 0 |
| Natural log | __nv_logf | __nv_log | __nv_fast_logf | __CUDA_FTZ | lg2.approx.f32 then * 0.6931471806 |
| Base-2 log | __nv_log2f | __nv_log2 | | __CUDA_FTZ, nvptx-approx-log2f32 | lg2.approx.f32 directly |
| Base-10 log | __nv_log10f | __nv_log10 | __nv_fast_log10f | __CUDA_FTZ | lg2.approx then * 0.30102999566 |
| log(1 + x) | __nv_log1pf | __nv_log1p | | | libdevice-only; minimax polynomial |
| Power | __nv_powf | __nv_pow | __nv_fast_powf | __CUDA_FTZ | lg2.approx + ex2.approx composition |
| Integer power | __nv_powif | __nv_powi | | | repeated-squaring; integer exponent |
| pow(x, n) for int n | | | __nv_fast_powf (alias) | | uses lg2/ex2 regardless of integer-ness |

The fast-path aliases are the entry points the fast-math pragma routes math ops through; they short-circuit the precision-checking guard arms and emit the bare ex2.approx.f32 / lg2.approx.f32 pair without finite-input cleanup.
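The base-2 rewrites in the table rest on ordinary change-of-base identities. The sketch below checks them in double precision with Python's `math` module standing in for the hardware `ex2`/`lg2` units; the function names are hypothetical, and the constants are the full-precision versions of the truncated values shown in the table.

```python
import math

LOG2_E = 1.4426950408889634    # log2(e)
LOG2_10 = 3.321928094887362    # log2(10)
LN_2 = 0.6931471805599453      # ln(2)
LOG10_2 = 0.3010299956639812   # log10(2)

def exp_via_ex2(x):
    return 2.0 ** (x * LOG2_E)           # exp(x) = 2^(x * log2 e)

def exp10_via_ex2(x):
    return 2.0 ** (x * LOG2_10)          # 10^x = 2^(x * log2 10)

def log_via_lg2(x):
    return math.log2(x) * LN_2           # ln x = log2 x * ln 2

def log10_via_lg2(x):
    return math.log2(x) * LOG10_2        # log10 x = log2 x * log10 2

def pow_via_lg2_ex2(x, y):
    return 2.0 ** (y * math.log2(x))     # x^y = 2^(y * log2 x), x > 0 only
```

The last helper also shows why the composed form is indifferent to whether the exponent is an integer: it never inspects `y`, and it is undefined for negative `x` even when `y` is a whole number.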

Power-of-2 and integer-shift helpers

| Symbol family | f32 | f64 | Notes |
| --- | --- | --- | --- |
| ldexp(x, n) | __nv_ldexpf | __nv_ldexp | integer scale n is i32; result is x * 2^n |
| frexp(x, *n) | __nv_frexpf | __nv_frexp | mantissa returned, exponent written through pointer |
| scalbn(x, n) | __nv_scalbnf | __nv_scalbn | identical to ldexp on IEEE-754 binary radix |
| scalbln(x, l) | __nv_scalblnf | __nv_scalbln | long exponent; libdevice clamps before scaling |
| logb(x) | __nv_logbf | __nv_logb | floor(log2(fabs(x))) |
| ilogb(x) | __nv_ilogbf | __nv_ilogb | int exponent; raises domain error inline |
| nextafter(x, y) | __nv_nextafterf | __nv_nextafter | bitwise next representable; respects denormal direction |
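The relationship between these helpers is easiest to see on a concrete value. The sketch below uses Python's `math.frexp` and `math.ldexp`, which follow the C99 definitions; `logb_model` is a hypothetical helper, and Python returns the exponent as a tuple element where the libdevice entry writes it through a pointer.

```python
import math

def logb_model(x):
    # frexp returns a mantissa in [0.5, 1), so the unbiased exponent
    # floor(log2(|x|)) is one less than frexp's exponent (finite, nonzero x).
    return float(math.frexp(abs(x))[1] - 1)

x = 6.0
mantissa, exponent = math.frexp(x)          # 6.0 == 0.75 * 2**3
assert (mantissa, exponent) == (0.75, 3)
assert math.ldexp(mantissa, exponent) == x  # ldexp undoes frexp exactly
assert logb_model(x) == 2.0                 # 4 <= 6 < 8
```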

Rounding and sign manipulation

| Symbol | Type | Decay |
| --- | --- | --- |
| __nv_floorf / __nv_floor | round toward -∞ | cvt.rmi.f32.f32 (f32); libdevice body (f64) |
| __nv_ceilf / __nv_ceil | round toward +∞ | cvt.rpi.f32.f32 (f32); libdevice body (f64) |
| __nv_truncf / __nv_trunc | round toward 0 | cvt.rzi.f32.f32 (f32); libdevice body (f64) |
| __nv_roundf / __nv_round | round half-away-from-zero | libdevice-only; PTX has no matching mode |
| __nv_rintf / __nv_rint | round to nearest (current rounding mode) | cvt.rni.f32.f32 (default IEEE) |
| __nv_nearbyintf / __nv_nearbyint | rint without inexact flag | same as rint; libdevice flag handling differs |
| __nv_lroundf / __nv_lround | round to long | cvt.rni.s32.f32 after range check |
| __nv_llroundf / __nv_llround | round to long long | cvt.rni.s64.f64 after range check |
| __nv_lrintf / __nv_lrint | rint to long | cvt.rni.s32.f32 |
| __nv_llrintf / __nv_llrint | rint to long long | cvt.rni.s64.f64 |
| __nv_copysignf / __nv_copysign | sign transfer | bit op; folds to llvm.copysign.* |
| __nv_fabsf / __nv_fabs | absolute value | bit-AND mask; folds to llvm.fabs.* or abs.f32 |
| __nv_signbitf / __nv_signbitd | sign-bit test | shift-right of bit pattern |
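The rounding modes in the table differ only on ties and on the direction for negative inputs. A minimal sketch of the C99 semantics, using hypothetical helper names and Python's built-ins as the reference, makes the difference between `round` and `rint` concrete:

```python
import math

def round_half_away(x):
    """C round(): ties go away from zero (finite x)."""
    t = float(math.trunc(x))
    if abs(x - t) >= 0.5:            # x - t is exact for doubles
        t += math.copysign(1.0, x)
    return t

def rint_half_even(x):
    """C rint() in the default mode: ties go to the even neighbour (finite x)."""
    return float(round(x))           # Python's round() is round-half-even

for value in (2.5, -2.5, 3.5):
    print(value, math.floor(value), math.ceil(value), math.trunc(value),
          round_half_away(value), rint_half_even(value))
```

On 2.5 the two nearest-rounding helpers disagree (3.0 versus 2.0), which is the case no single PTX cvt mode covers for `round`.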

Min/max and classification

| Symbol | Semantics | Decay |
| --- | --- | --- |
| __nv_fminf / __nv_fmin | IEEE-754 minNum | min.f32/min.f64 on sm_80+; libdevice body otherwise |
| __nv_fmaxf / __nv_fmax | IEEE-754 maxNum | max.f32/max.f64 on sm_80+; libdevice body otherwise |
| __nv_fminimumf / __nv_fminimum | IEEE-754-2019 minimum (NaN-propagating) | bit ops + NaN check |
| __nv_fmaximumf / __nv_fmaximum | IEEE-754-2019 maximum (NaN-propagating) | bit ops + NaN check |
| __nv_isfinitef / __nv_isfinited | finite predicate | bit arithmetic on exponent field |
| __nv_isinff / __nv_isinfd | infinite predicate | bit arithmetic on exponent + mantissa |
| __nv_isnanf / __nv_isnand | NaN predicate | bit arithmetic; matches IEEE-754 quiet/sign-NaN definition |
| __nv_finitef / __nv_finite | legacy isfinite alias | aliased to __nv_isfinitef/__nv_isfinited |

The min/max divergence is the most observable one. fmin/fmax follow IEEE-754-2008's "minNum" rule that returns the non-NaN operand when exactly one operand is NaN; fminimum/fmaximum follow IEEE-754-2019's "minimum" rule that returns NaN whenever any operand is NaN. The MLIR arith.minnumf and arith.maxnumf ops route to fmin/fmax; there are no MLIR ops covering fminimum/fmaximum, only direct front-end calls.
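The two NaN rules can be modelled in a few lines. This is a sketch of the semantics only, with hypothetical function names; it ignores the ordering of -0.0 and +0.0, which the real definitions also specify.

```python
import math

def fmin_minnum(a, b):
    """minNum rule: a single NaN operand is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)

def fminimum(a, b):
    """minimum rule: any NaN operand poisons the result."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)

print(fmin_minnum(math.nan, 1.0), fminimum(math.nan, 1.0))
```

A reduction over data containing one NaN therefore yields a number under the first rule and NaN under the second, which is why the choice of entry point is observable.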

Roots, reciprocals, divides — the precision-keyed family

| Symbol | f32 | f64 | Reflection key | Decay at key=0 | Decay at key=1 |
| --- | --- | --- | --- | --- | --- |
| Square root | __nv_sqrtf | __nv_sqrt | __CUDA_PREC_SQRT | sqrt.approx.f32 | sqrt.rn.f32 |
| Reciprocal sqrt | __nv_rsqrtf | __nv_rsqrt | | rsqrt.approx.f32 | (same; no precise form) |
| Division | __nv_fdividef | __nv_fdivide | __CUDA_PREC_DIV | div.approx.f32 | div.rn.f32 |
| Reciprocal | __nv_frcp_rn etc. | __nv_drcp_rn etc. | | rcp.approx.f32 | rcp.rn.f32 |
| Cube root | __nv_cbrtf | __nv_cbrt | | libdevice-only; polynomial + Newton refinement | (same) |
| Reciprocal cbrt | __nv_rcbrtf | __nv_rcbrt | | libdevice-only; 1 / cbrt(x) with sign fix | (same) |
| Hypot | __nv_hypotf | __nv_hypot | | sqrt(x*x + y*y) with overflow guard | (same) |
| Reciprocal hypot | __nv_rhypotf | __nv_rhypot | | 1 / hypot(x, y) | (same) |
| 3-argument hypot | __nv_norm3df | __nv_norm3d | | sqrt(x*x + y*y + z*z) | (same) |
| 4-argument hypot | __nv_norm4df | __nv_norm4d | | same with one more term | (same) |
| n-argument hypot | __nv_normf | __nv_norm | | loop; pointer + length args | (same) |

__CUDA_PREC_SQRT and __CUDA_PREC_DIV are the two reflection keys with the most observable impact on libdevice output. A value of 0 selects the approximate hardware path that the SASS engine schedules in a single cycle; a value of 1 replaces the call with a software Newton-Raphson refinement on top of the approximate result, costing roughly five additional FMAs per call. The MLIR lowering path picks the key value from module-level !nvvm.reflection metadata seeded by the driver CLI options; tileiras defaults to __CUDA_PREC_DIV=1 and __CUDA_PREC_SQRT=1, matching nvcc's default of full IEEE precision.
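To illustrate why a handful of FMAs is enough, here is a generic Newton-Raphson step for a reciprocal and for a reciprocal square root. This is the textbook iteration, not the instruction sequence libdevice actually emits; the function names and the coarse starting guesses are assumptions chosen to stand in for an approximate hardware result.

```python
def refine_reciprocal(d, r):
    """One Newton-Raphson step toward 1/d: r' = r * (2 - d*r)."""
    return r * (2.0 - d * r)

def refine_rsqrt(a, y):
    """One Newton-Raphson step toward 1/sqrt(a): y' = y * (1.5 - 0.5*a*y*y)."""
    return y * (1.5 - 0.5 * a * y * y)

d, r0 = 3.0, 0.33                 # r0: deliberately coarse guess for 1/3
r1 = refine_reciprocal(d, r0)
print(abs(r0 - 1 / d), abs(r1 - 1 / d))   # the error shrinks sharply

a, y0 = 2.0, 0.7                  # y0: deliberately coarse guess for 1/sqrt(2)
y1 = refine_rsqrt(a, y0)
print(abs(y0 - a ** -0.5), abs(y1 - a ** -0.5))
```

Each step roughly squares the relative error, so an approximation that is already good to a few bits short of full precision needs only one or two steps.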

Integer arithmetic helpers

| Symbol family | Width | Decay |
| --- | --- | --- |
| __nv_abs | i32 → i32 | (x ^ (x >> 31)) - (x >> 31); fully inlined |
| __nv_llabs | i64 → i64 | same idiom on 64-bit shift |
| __nv_min / __nv_max | i32 | min.s32 / max.s32 |
| __nv_umin / __nv_umax | u32 | min.u32 / max.u32 |
| __nv_llmin / __nv_llmax | i64 | min.s64 / max.s64 |
| __nv_ullmin / __nv_ullmax | u64 | min.u64 / max.u64 |
| __nv_mul24 | i32 × i32 → i32 | mul24.s32 (24-bit truncated multiply) |
| __nv_umul24 | u32 × u32 → u32 | mul24.u32 |
| __nv_mul64hi | i64 × i64 → i64 (hi half) | mul.hi.s64 |
| __nv_umul64hi | u64 × u64 → u64 (hi half) | mul.hi.u64 |
| __nv_mulhi | i32 × i32 → i32 (hi half) | mul.hi.s32 |
| __nv_umulhi | u32 × u32 → u32 (hi half) | mul.hi.u32 |
| __nv_popc | u32 → i32 | popc.b32 |
| __nv_popcll | u64 → i32 | popc.b64 |
| __nv_clz / __nv_clzll | leading zeros | clz.b32 / clz.b64 |
| __nv_ffs / __nv_ffsll | bit position of LSB | bfind family |
| __nv_brev / __nv_brevll | bit reverse | brev.b32 / brev.b64 |
| __nv_sad / __nv_usad | sum of absolute differences | sad.s32 / sad.u32 |
| __nv_byte_perm | byte permutation | prmt.b32 |
| __nv_funnelshift_l/_lc/_r/_rc | 64-bit funnel shifts | shf.l/r.wrap/clamp.b32 |

The mul24 family is the most architecture-dependent: pre-Volta hardware ran mul24.s32 as a single-issue instruction; sm_70+ runs the full 32-bit mul.lo.s32 at the same throughput, and the libdevice body simply forwards the call. Old CUDA-C code that explicitly calls __mul24 therefore retains the API surface but loses the historical performance benefit.
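A few of the integer idioms from the table can be modelled with Python integers, masking where a fixed width matters. The helper names are hypothetical, and the models cover the unsigned variants only; `abs_i32` relies on Python's arithmetic right shift and does not wrap at the most negative value the way a 32-bit register would.

```python
MASK32 = 0xFFFFFFFF

def abs_i32(x):
    """Branch-free abs for x in [-(2**31 - 1), 2**31 - 1]."""
    s = x >> 31                # 0 for non-negative x, -1 for negative x
    return (x ^ s) - s

def umul24(a, b):
    """Multiply the low 24 bits of each operand; keep the low 32 bits."""
    return ((a & 0xFFFFFF) * (b & 0xFFFFFF)) & MASK32

def umulhi(a, b):
    """Upper 32 bits of the 64-bit product of two u32 values."""
    return ((a & MASK32) * (b & MASK32)) >> 32

def popc(x):
    """Number of set bits in the low 32 bits."""
    return bin(x & MASK32).count("1")
```

`umul24(0x1000001, 3)` returns 3 because bit 24 of the first operand is discarded before the multiply, which is the truncation that distinguishes the 24-bit form from a full 32-bit `mul.lo`.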

Mixed-mode conversions and float decoders

| Symbol family | Direction | Decay |
| --- | --- | --- |
| __nv_int2float_{rn,rz,ru,rd} | i32 → f32 | cvt.<rnd>.f32.s32 |
| __nv_uint2float_{rn,rz,ru,rd} | u32 → f32 | cvt.<rnd>.f32.u32 |
| __nv_ll2float_{rn,rz,ru,rd} | i64 → f32 | cvt.<rnd>.f32.s64 |
| __nv_ull2float_{rn,rz,ru,rd} | u64 → f32 | cvt.<rnd>.f32.u64 |
| __nv_int2double_rn | i32 → f64 | cvt.f64.s32 (only rn is exact) |
| __nv_double2int_{rn,rz,ru,rd} | f64 → i32 | cvt.<rnd>.s32.f64 |
| __nv_float2int_{rn,rz,ru,rd} | f32 → i32 | cvt.<rnd>.s32.f32 |
| __nv_double2float_{rn,rz,ru,rd} | f64 → f32 | cvt.<rnd>.f32.f64 |
| __nv_float2half_{rn,rz} | f32 → f16 | cvt.<rnd>.f16.f32 |
| __nv_half2float | f16 → f32 | cvt.f32.f16 |
| __nv_float_as_int | bit reinterpret | mov.b32 (lossless) |
| __nv_int_as_float | bit reinterpret | mov.b32 (lossless) |
| __nv_longlong_as_double | bit reinterpret | mov.b64 |
| __nv_double_as_longlong | bit reinterpret | mov.b64 |
| __nv_double2hiint / _loint | f64 → upper/lower 32 bits | cvt.u32.u64 after mov.b64 |
| __nv_hiloint2double | reassemble f64 from two i32 | mov.b64 of packed result |

The *_as_* family is intentionally a no-op at the LLVM level; libdevice ships a body anyway so that the symbol exists and the bitcode linker has something to resolve. The body is a single bitcast followed by ret, which the always-inliner reduces to a register rename in the caller.
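The bit-reinterpret semantics can be reproduced on the host with Python's `struct` module. The function names below mirror the libdevice entries without the `__nv_` prefix but are otherwise hypothetical host-side models; little-endian packing is used so that the low word comes first.

```python
import struct

def float_as_int(f):
    """Reinterpret the binary32 pattern of f as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", f))[0]

def int_as_float(i):
    """Reinterpret a signed 32-bit integer as a binary32 value."""
    return struct.unpack("<f", struct.pack("<i", i))[0]

def double2hiloint(d):
    """Split a binary64 value into its (hi, lo) signed 32-bit halves."""
    lo, hi = struct.unpack("<ii", struct.pack("<d", d))
    return hi, lo

def hiloint2double(hi, lo):
    """Reassemble a binary64 value from its two 32-bit halves."""
    return struct.unpack("<d", struct.pack("<ii", lo, hi))[0]
```

No value conversion takes place in either direction: `float_as_int(1.0)` is `0x3F800000`, the stored bit pattern, not the integer 1.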

Error and gamma functions

| Symbol | f32 | f64 | Notes |
| --- | --- | --- | --- |
| Error function | __nv_erff | __nv_erf | libdevice-only; rational approximation, double-double internals |
| Complementary erf | __nv_erfcf | __nv_erfc | libdevice-only; scaled exp(-x*x) path for large-magnitude x |
| Inverse erf | __nv_erfinvf | __nv_erfinv | libdevice-only; iterative |
| Inverse erfc | __nv_erfcinvf | __nv_erfcinv | libdevice-only; iterative |
| Scaled erfc | __nv_erfcxf | __nv_erfcx | exp(x*x) * erfc(x); large-x stable form |
| Gamma | __nv_tgammaf | __nv_tgamma | Stirling for large x, reflection for small x |
| Log-gamma | __nv_lgammaf | __nv_lgamma | log of the absolute value of gamma(x) |
| Norm CDF | __nv_normcdff | __nv_normcdf | 0.5 * erfc(-x/sqrt(2)) |
| Inverse norm CDF | __nv_normcdfinvf | __nv_normcdfinv | iterative on erfinv |
| Bessel J0 / J1 | __nv_j0f / __nv_j1f | __nv_j0 / __nv_j1 | libdevice-only; minimax for small x, asymptotic for large |
| Bessel Y0 / Y1 | __nv_y0f / __nv_y1f | __nv_y0 / __nv_y1 | libdevice-only; same shape |
| Bessel Jn / Yn | __nv_jnf / __nv_ynf | __nv_jn / __nv_yn | recurrence on the J0/J1, Y0/Y1 pair |
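Two of the closed forms in the Notes column can be written directly on top of a reference `erfc`. The sketch uses Python's `math.erfc`; the function names are hypothetical, and `erfcx_naive` is the literal product from the table, which overflows for large x and is therefore not the stable form a production body would need.

```python
import math

def normcdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def erfcx_naive(x):
    """exp(x*x) * erfc(x), evaluated literally (moderate x only)."""
    return math.exp(x * x) * math.erfc(x)

print(normcdf(0.0), normcdf(1.96), erfcx_naive(1.0))
```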

Rounding-mode-qualified arithmetic

These are the "primitive" forms the MLIR lowering does not use directly, but which front-end code can call to force a specific rounding mode on a single op:

| Op | f32 family | f64 family | Decay |
| --- | --- | --- | --- |
| Add | __nv_fadd_rn / _rz / _ru / _rd | __nv_dadd_rn / _rz / _ru / _rd | add.<rnd>.f32 / add.<rnd>.f64 |
| Subtract | __nv_fsub_rn etc. | __nv_dsub_rn etc. | sub.<rnd>.f32 / sub.<rnd>.f64 |
| Multiply | __nv_fmul_rn etc. | __nv_dmul_rn etc. | mul.<rnd>.f32 / mul.<rnd>.f64 |
| Divide | __nv_fdiv_rn etc. | __nv_ddiv_rn etc. | div.<rnd>.f32 / div.<rnd>.f64; _rn is the only IEEE-correct form |
| FMA | __nv_fmaf_rn etc. | __nv_fma_rn etc. | fma.<rnd>.f32 / fma.<rnd>.f64 |
| Reciprocal | __nv_frcp_rn etc. | __nv_drcp_rn etc. | rcp.<rnd>.f32 / rcp.<rnd>.f64 |
| Square root | __nv_fsqrt_rn etc. | __nv_dsqrt_rn etc. | sqrt.<rnd>.f32 / sqrt.<rnd>.f64 |

The MLIR pipeline never emits these names directly; they are reachable only through CUDA-C intrinsic shims (__fadd_rn etc. without the __nv_ prefix) and pass through the libdevice linker unchanged.
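What the four suffixes mean for a single operation can be shown with a decimal analogue. The sketch below uses Python's `decimal` module at three significant digits as a stand-in for binary floating point; `div_rounded` and the `MODES` mapping are hypothetical, and the point is only the direction each mode rounds in, not bit-exact PTX behaviour.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

# Suffix -> rounding direction, mirroring _rn / _rz / _ru / _rd.
MODES = {
    "rn": ROUND_HALF_EVEN,   # to nearest, ties to even
    "rz": ROUND_DOWN,        # toward zero
    "ru": ROUND_CEILING,     # toward +infinity
    "rd": ROUND_FLOOR,       # toward -infinity
}

def div_rounded(a, b, mode, digits=3):
    """Divide a by b, rounding the exact quotient once in the given mode."""
    ctx = Context(prec=digits, rounding=MODES[mode])
    return ctx.divide(Decimal(a), Decimal(b))

for mode in ("rn", "rz", "ru", "rd"):
    print(mode, div_rounded(2, 3, mode), div_rounded(-2, 3, mode))
```

For 2/3 the modes give 0.667, 0.666, 0.667, 0.666; for -2/3 the `ru` and `rd` results swap, which is the difference between rounding toward zero and rounding toward a fixed infinity.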

Reflection-key cross-reference

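The five reflection keys consumed by libdevice bodies span four orthogonal axes: flush-to-zero behaviour, floating-point precision (divide and square root), integer-division policy, and target architecture: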

| Key | Type | Values | Effect on bodies that read it |
| --- | --- | --- | --- |
| __CUDA_FTZ | bool | 0 (preserve), 1 (flush) | Selects FTZ vs non-FTZ approximate-intrinsic variant in sin, cos, tan, exp, log, pow, etc. Bodies typically have if (__nvvm_reflect("__CUDA_FTZ")) arms wrapping the sin.approx.ftz.f32 / sin.approx.f32 selection. |
| __CUDA_PREC_DIV | bool | 0 (approx), 1 (IEEE) | __nv_fdividef and __nv_fdivide choose div.approx.f32 vs div.rn.f32 + Newton refinement. nvcc default is 1; --use_fast_math flips to 0. |
| __CUDA_PREC_SQRT | bool | 0 (approx), 1 (IEEE) | __nv_sqrtf and __nv_sqrt choose sqrt.approx.f32 vs sqrt.rn.f32. Default and flip behaviour mirror __CUDA_PREC_DIV. |
| __CUDA_FAST_INT_DIV | bool | 0, 1 | Integer division and modulo libdevice helpers (__nv_idiv, __nv_imod, etc., if present in the bitcode) choose between the reference 32-bit algorithm and the truncated approximation. |
| __CUDA_ARCH | int | 700, 750, 800, 860, 890, 900, 1000, 1030, 1200, … | Selects per-SM intrinsic availability inside bodies that fall back to legacy paths on older hardware. |

Bodies that do not query any reflection key are non-configurable; they emit the same NVPTX intrinsic regardless of target options. The libdevice overview pipeline folds the reflection keys before the always-inliner runs, so reflection-driven branches are dead by the time the inliner copies the body into the caller.

SM-floor inventory

A handful of __nv_* symbols decay into instructions whose lowest PTX support level is later than the rest of libdevice. Calls to these symbols from a kernel compiled for an older SM produce libdevice fall-back bodies rather than the named instruction.

| Symbol family | Decay floor | Older-SM fallback |
| --- | --- | --- |
| __nv_fminf / __nv_fmaxf | sm_80 min.f32/max.f32 | branch-and-select bit logic |
| __nv_fmin / __nv_fmax | sm_80 min.f64/max.f64 | branch-and-select |
| __nv_tanhf | sm_75 tanh.approx.f32 | rational approximation in software |
| Block-scaled __nv_cvt_* (FP8 / FP4) | sm_89 / sm_100a cvt.packfloat.* | not provided; undefined behaviour on older SMs |
| __nv_fma_relu_* | sm_75 (f16) / sm_90a (f8) | not provided; softmax-style ReLU+FMA fused intrinsic is sm-gated |
| Tensor-memory casts | sm_100a tcgen05 path | not in libdevice; these live in nvvm |

The libdevice "fall-back" body is the same body the reflection-folded reference path uses; the only difference is that the always-inliner cannot collapse the body into a single PTX instruction because the PTX form does not exist yet.

Linker behaviour and dead-call elimination

Libdevice bitcode is linked with Linker::Flags::OnlyNeeded. The linker walks the user module's declaration set, copies in the matching definitions, and recursively pulls in any further __nv_* declarations the freshly-imported bodies reference. The __CUDA_FTZ / __CUDA_PREC_* reflection arms typically reference both the FTZ and the non-FTZ helper symbols, so a library body that ultimately resolves to a single arm still drags the unused arm's helpers into the user module. The post-inline GlobalDCEPass cleans them up:

1. Linker pulls in __nv_sinf body, which references __nv_sin_kernel_ftz, __nv_sin_kernel_nonftz.
2. NVVMReflectPass folds the FTZ arm to the chosen path.
3. AlwaysInlinerPass inlines __nv_sinf into the caller.
4. SimplifyCFG + SCCP eliminate the dead arm and its helper call.
5. GlobalDCEPass removes the orphaned __nv_sin_kernel_<other> from the module.

Steps 4 and 5 are why the libdevice bitcode appears tiny in the final PTX even though the bitcode blob is several megabytes. The pre-DCE module size can be 5–10× the final size; the dead-arm elimination is the single largest IR shrink in the libdevice integration path.
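The pull-in and clean-up steps amount to two reachability walks over a call graph. The sketch below models a module as a dictionary from symbol name to callee list; `link_only_needed`, `global_dce`, and the toy library contents (including `__nv_cosf` having no callees) are assumptions for illustration and do not reproduce the LLVM implementations.

```python
def link_only_needed(user_decls, library):
    """Copy in definitions for declared symbols, then their transitive callees."""
    linked, worklist = {}, list(user_decls)
    while worklist:
        name = worklist.pop()
        if name in linked or name not in library:
            continue
        linked[name] = list(library[name])
        worklist.extend(library[name])
    return linked

def global_dce(module, roots):
    """Keep only the definitions reachable from the given roots."""
    live, worklist = set(), list(roots)
    while worklist:
        name = worklist.pop()
        if name in live or name not in module:
            continue
        live.add(name)
        worklist.extend(module[name])
    return {name: callees for name, callees in module.items() if name in live}

# Toy library: callee lists keyed by symbol name.
library = {
    "__nv_sinf": ["__nv_sin_kernel_ftz", "__nv_sin_kernel_nonftz"],
    "__nv_sin_kernel_ftz": [],
    "__nv_sin_kernel_nonftz": [],
    "__nv_cosf": [],
}

module = link_only_needed(["__nv_sinf"], library)   # step 1: both helpers arrive
# Steps 2-4, modelled by hand: the FTZ arm is chosen and __nv_sinf is inlined,
# so the kernel now calls only the surviving helper.
module["kernel"] = ["__nv_sin_kernel_ftz"]
final = global_dce(module, roots=["kernel"])        # step 5
print(sorted(final))
```

After the second walk only the kernel and the one live helper remain; the inlined `__nv_sinf` shell and the non-FTZ helper are both unreachable and dropped.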

Verification invariants

Three invariants hold across libdevice integration. Violations are caught by NVVMIRVerifier before the NVPTX backend runs.

  • Every __nv_* declaration is resolved before code generation. A surviving declaration is a backend error.
  • Every __nvvm_reflect("KEY") call is folded into a ConstantInt before always-inlining. A surviving reflect call is a configuration bug.
  • No __nv_* body retains a __nvvm_reflect call after the four-pass integration; the post-link nvvm-reflect-pp cleanup folds the constant branches and removes any dangling intrinsic call sites.
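A rough way to spot-check the first two invariants on a textual IR dump is a pair of regular expressions. This is a sketch only: `check_invariants` is a hypothetical helper that scans printed LLVM IR, and it says nothing about how NVVMIRVerifier is implemented.

```python
import re

# A surviving declaration looks like: declare float @__nv_sinf(float)
DECL_RE = re.compile(r"^declare\b.*@(__nv_\w+)\(", re.MULTILINE)
# A surviving reflect call looks like: call i32 @__nvvm_reflect(...)
REFLECT_RE = re.compile(r"\bcall\b.*@__nvvm_reflect\(")

def check_invariants(ir_text):
    """Return a list of human-readable violations found in the IR text."""
    errors = [f"unresolved libdevice declaration: {name}"
              for name in DECL_RE.findall(ir_text)]
    if REFLECT_RE.search(ir_text):
        errors.append("surviving __nvvm_reflect call")
    return errors

clean_ir = "define float @k(float %x) {\n  %r = fadd float %x, %x\n  ret float %r\n}\n"
print(check_invariants(clean_ir))   # an empty list means no violations found
```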

QUIRK: Unknown reflection keys silently fold to zero

NVVMReflectPass::populateVarMap defaults missing keys to 0 and records the zero in the resolved map so that every later call site folds to the same value. A typo in __nvvm_reflect("__CUDA_FFZ") (with double-F) is therefore not a diagnostic — it is a silent reset to the FTZ-off behaviour, applied consistently. The only way to notice is to inspect the post-reflect IR and check that the key the body queries is the key the configuration set. Reimplementations that diverge from this — for example by warning on unknown keys, or by returning -1 to indicate "unknown" — break libdevice bodies that rely on the recorded-zero behaviour for legacy options that the bitcode references but the current configuration system does not know about.
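The default-and-record behaviour is a one-line lookup, which is what makes the typo so easy to miss. The class below is a toy model under that description, not the pass's code; `ReflectMap`, `fold`, and `unknown_keys` are hypothetical names, with the last one showing the kind of audit the paragraph recommends.

```python
class ReflectMap:
    """Toy model of a reflection var-map that defaults unknown keys to 0."""

    def __init__(self, configured):
        self.resolved = dict(configured)

    def fold(self, key):
        # Unknown keys resolve to 0, and the 0 is recorded so that every
        # later call site folds to the same value.
        return self.resolved.setdefault(key, 0)

def unknown_keys(queried, configured):
    """Keys a body queries that the configuration never set."""
    return sorted(set(queried) - set(configured))

config = {"__CUDA_FTZ": 1}
var_map = ReflectMap(config)
print(var_map.fold("__CUDA_FTZ"), var_map.fold("__CUDA_FFZ"))
print(unknown_keys(["__CUDA_FTZ", "__CUDA_FFZ"], config))
```

The misspelled key folds to 0 without any diagnostic; only the explicit comparison of queried keys against configured keys surfaces it.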

QUIRK: _rn is the only IEEE-correct division and square root

__nv_fdiv_rn and __nv_fsqrt_rn decay to div.rn.f32 and sqrt.rn.f32 — the only PTX divide and square-root variants that the IEEE-754 standard certifies as correctly rounded. The _rz, _ru, and _rd variants are valid hardware instructions but do not satisfy IEEE-754 single-step correctness for division and square root: they round the approximate result rather than the mathematically exact one. Libdevice does not paper over this — code that calls __nv_fdiv_ru(a, b) gets the directed-rounded approximation, not a Newton-refined directed-rounded result. The MLIR arith dialect has no rounding-mode parameter on arith.divf, so this asymmetry is only reachable through CUDA-C intrinsics; MLIR-fronted code always sees the round-to-nearest path.

QUIRK: __nv_fast_* are libdevice symbols, not preprocessor macros

__nv_fast_sinf, __nv_fast_cosf, __nv_fast_powf, etc. exist as separate bitcode symbols, not as #define-style rewrites of __nv_sinf and friends. They have distinct bodies — typically a single sin.approx.ftz.f32 call — and their existence is what allows --use_fast_math to substitute the symbol name during MLIR OpToFuncCallLowering selection without recompiling the libdevice bitcode. A reimplementation that treats __nv_fast_sinf as a macro alias of __nv_sinf will lose the FTZ behaviour the fast-path body enforces unconditionally; the slow-path body is FTZ-conditional on __CUDA_FTZ, and a fast-math build with __CUDA_FTZ=0 (the IEEE-clean default) would then silently preserve denormals where CUDA's bitcode would flush them.
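The denormal difference can be modelled on the host. In the sketch below, `flush_f32_subnormal`, `sinf_model`, and `fast_sinf_model` are hypothetical stand-ins evaluated in double precision; they capture only the claim that the fast body flushes unconditionally while the precise body flushes when __CUDA_FTZ is set.

```python
import math

F32_MIN_NORMAL = 2.0 ** -126   # smallest positive normal binary32 value

def flush_f32_subnormal(x):
    """Replace a binary32-subnormal magnitude with a same-signed zero."""
    if x != 0.0 and abs(x) < F32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

def sinf_model(x, cuda_ftz):
    """Precise entry: flushes the input only when the reflection key is 1."""
    return math.sin(flush_f32_subnormal(x)) if cuda_ftz else math.sin(x)

def fast_sinf_model(x):
    """Fast entry: flushes the input regardless of the reflection key."""
    return math.sin(flush_f32_subnormal(x))

tiny = 1e-40                    # below the binary32 normal range
print(sinf_model(tiny, cuda_ftz=0), fast_sinf_model(tiny))
```

Aliasing the fast name to the precise one with __CUDA_FTZ=0 would return the first value where the separate fast body returns zero.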

Cross-references

The four-pass integration sequence that turns these declarations into concrete bodies is documented in libdevice Overview — Pipeline. The reflection keys that gate body selection are documented in NVVMReflect Mechanism — Three var-map sources. The MLIR-side rewriter that emits the __nv_* call sites these symbols define is documented in Math Pass Pipeline and Crosswalk — Full math-op crosswalk. The LLVM constant folder that classifies any surviving by-name call sites is documented in Intrinsic ID Switch and Name Table — libdevice suffix name table. The fast-math pragma that selects the __nv_fast_* family over the precision-keyed family is discussed in Fast Math and Numerical Precision.