Math Pass Pipeline + Crosswalk

Abstract

This page traces the end-to-end translation of MLIR math.* and arith.* floating-point operations through three name-binding layers. First, MLIR OpToFuncCallLowering patterns emit llvm.call @__nv_<op>[f]. Second, the embedded libdevice bitcode supplies the __nv_* bodies. Third, the post-libdevice LLVM constant folder matches surviving call sites by Intrinsic::ID or by callee name.

It also corrects an earlier misidentification: the 8-phase LLVM pass near the libdevice consumers is not the math-to-libdevice rewriter. Math lowering happens in MLIR before libdevice is linked. The 8-phase pass is a later per-function cleanup over LLVM IR.

Post-libdevice cleanup, not math-to-libdevice

The 8-phase pass was originally hypothesized as an in-tileiras "math to libdevice" pass. That is not its role. It allocates no RewritePatternSet, configures no ConversionTarget, references no math.* mnemonic strings or __nv_* symbols, and walks LLVM Function ranges rather than MLIR operation graphs. Its filters match memory-reading and memory-writing LLVM instruction classes: loads, stores, calls, atomics, fences, and memset-like operations. The pass is therefore an LLVM new-PM function pass that runs after MLIR lowering and libdevice linking, well after the __nv_* calls have been materialized and inlined.

What the pass shares with NVVMReflect is the underlying LLVM ADT vocabulary, not the algorithm. NVVMReflect is module-global and string-keyed; this pass is per-function and pointer-keyed. NVVMReflect folds __nvvm_reflect(...) calls, then nvvm-reflect-pp removes the now-constant branches in libdevice bodies. The cleanup pass runs downstream on the simplified IR and never introduces a new __nv_* call.

Full math-op crosswalk

For every math.* / arith.* op lowered here, the important public artifacts are the libdevice entry point and, where applicable, the post-link intrinsic or name recognized by the constant folder after libdevice inlining and nvvm-reflect-pp cleanup.

| math.<op> / arith.<op> | f32 symbol | f64 symbol | Constant-folder name(s) | by-name fold |
|---|---|---|---|---|
| arith.remf | __nv_fmodf | __nv_fmod | _Z4fmodff / _Z4fmoddd | yes |
| arith.minnumf | __nv_fminf | __nv_fmin | LLVM MinNum node | no |
| arith.maxnumf | __nv_fmaxf | __nv_fmax | LLVM MaxNum node | no |
| math.absi | __nv_abs | n/a | n/a | no |
| math.absf | __nv_fabsf | __nv_fabs | fabsd / llvm.nvvm.fabs.f | no |
| math.acosh | __nv_acoshf | __nv_acosh | libdevice-only | yes |
| math.asin | __nv_asinf | __nv_asin | asind | yes |
| math.atan | __nv_atanf | __nv_atan | atand | yes |
| math.acos | __nv_acosf | __nv_acos | acosd | yes |
| math.atan2 | __nv_atan2f | __nv_atan2 | _Z5atan2ff / _Z5atan2dd | yes |
| math.asinh | __nv_asinhf | __nv_asinh | libdevice-only | yes |
| math.atanh | __nv_atanhf | __nv_atanh | libdevice-only | yes |
| math.cbrt | __nv_cbrtf | __nv_cbrt | libdevice-only | yes |
| math.ceil | __nv_ceilf | __nv_ceil | ceild | no |
| math.copysign | __nv_copysignf | __nv_copysign | llvm.copysign.* | no |
| math.cos | __nv_cosf | __nv_cos | cosd / cosf | yes |
| math.cosh | __nv_coshf | __nv_cosh | coshd | yes |
| math.erf | __nv_erff | __nv_erf | libdevice-only | yes |
| math.erfc | __nv_erfcf | __nv_erfc | libdevice-only | yes |
| math.exp2 | __nv_exp2f | __nv_exp2 | exp2d | yes |
| math.exp | __nv_expf | __nv_exp | expd / expf | yes |
| math.expm1 | __nv_expm1f | __nv_expm1 | libdevice-only | yes |
| math.floor | __nv_floorf | __nv_floor | floord | yes |
| math.fma | __nv_fmaf | __nv_fma | llvm.nvvm.fma.rn.{f,d} | no |
| math.fpowi | __nv_powif | __nv_powi | libdevice-only | no |
| math.isfinite | __nv_finitef | __nv_isfinited | bit arithmetic | no |
| math.isinf | __nv_isinff | __nv_isinfd | bit arithmetic | no |
| math.isnan | __nv_isnanf | __nv_isnand | bit arithmetic | no |
| math.log10 | __nv_log10f | __nv_log10 | log10d | yes |
| math.log1p | __nv_log1pf | __nv_log1p | libdevice-only | yes |
| math.log2 | __nv_log2f | __nv_log2 | log2f | yes |
| math.log | __nv_logf | __nv_log | logd / logf | yes |
| math.powf | __nv_powf | __nv_pow | powff / powdd | yes |
| math.roundeven | __nv_rintf | __nv_rint | llvm.rint.f64 | no |
| math.round | __nv_roundf | __nv_round | libdevice-only | no |
| math.rsqrt | __nv_rsqrtf | __nv_rsqrt | nvvm.rsqrt.approx.{f,d} | no |
| math.sinh | __nv_sinhf | __nv_sinh | sinhd | yes |
| math.sin | __nv_sinf | __nv_sin | sind / sinf | yes |
| math.sqrt | __nv_sqrtf | __nv_sqrt | sqrtd | yes |
| math.tanh | __nv_tanhf | __nv_tanh | tanhd | yes |
| math.tan | __nv_tanf | __nv_tan | libdevice-only | yes |

Entries marked "libdevice-only" have no dedicated NVPTX backend intrinsic. After libdevice inlining and NVVMReflect cleanup, the body decays into a sequence of more primitive __nvvm_* intrinsics whose IDs the constant folder may recognize. The by-name folder applies only to calls whose arguments are all compile-time constants: it reads Function::getName(), matches the recognized libdevice or finite-math spelling, evaluates the operation with host math routines, and constructs a ConstantFP result. The libdevice body is not invoked for IR-time constant folding.
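The by-name fold described above can be modeled in a few lines. The following is a minimal sketch, not the folder's actual code: the table name `NAME_TABLE`, the function `try_fold_call`, the choice of `__nv_*` spellings as keys, and the use of a Python string such as `"%x"` to stand for a non-constant operand are all assumptions made for illustration. Only a handful of crosswalk rows are included.

```python
import math
import struct

# Hypothetical name table: recognized callee name -> (host routine, precision).
# Only a few rows of the crosswalk are modeled.
NAME_TABLE = {
    "__nv_sinf": (math.sin, "f32"),
    "__nv_sin": (math.sin, "f64"),
    "__nv_expf": (math.exp, "f32"),
    "__nv_exp": (math.exp, "f64"),
    "__nv_powf": (math.pow, "f32"),
    "__nv_pow": (math.pow, "f64"),
    "__nv_fmodf": (math.fmod, "f32"),
    "__nv_fmod": (math.fmod, "f64"),
}


def round_to_f32(value):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def try_fold_call(callee, args):
    """Return the folded constant, or None to leave the call in place."""
    entry = NAME_TABLE.get(callee)
    # Fold only recognized names whose operands are all constants.
    if entry is None or not all(isinstance(a, float) for a in args):
        return None
    routine, precision = entry
    try:
        # Evaluate with the host routine; the libdevice body is never run.
        result = routine(*args)
        return round_to_f32(result) if precision == "f32" else result
    except (ValueError, OverflowError):
        # Domain or range error on the host: decline to fold.
        return None
```

For example, `try_fold_call("__nv_pow", [2.0, 10.0])` yields `1024.0`, while `try_fold_call("__nv_sinf", ["%x"])` yields `None` because the operand is not a constant. Declining to fold on a host-side error is a design choice of this sketch; the source does not say how the real folder treats such inputs.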

FP32 vs FP64 — four axes of divergence

| Axis | f32 | f64 |
|---|---|---|
| Symbol pair | __nv_Xf (47 entries) | __nv_X / __nv_Xd (47 entries) |
| Libdevice body | Separate __nv_sinf body (Payne–Hanek f32 reduction, single-precision polynomial coefficients) | Separate __nv_sin body (Payne–Hanek f64 reduction, double-precision polynomial coefficients) |
| Backend intrinsic | TableGen suffix f: sinf, cosf, expf, logf, sqrtf, powff (pow.f.f), _Z5atan2ff (atan2(float,float)), _Z4fmodff | TableGen suffix d: sind, cosd, expd, logd, sqrtd, powdd (pow.d.d), _Z5atan2dd, _Z4fmoddd |
| HW asymmetry | nvptx-prec-divf32, nvptx-prec-sqrtf32, nvptx-approx-log2f32, nvptx-rsqrt-approx-opt; all are PTX-ISA-level f32 selectors with no f64 counterpart | f64 div is always div.rn.f64; f64 sqrt is always sqrt.rn.f64 or a libdevice fallback when HW lacks it on the target SM |

The f16 and bf16 slots of these lowerings are empty: no __nv_* half-precision libdevice symbol is used. The MLIR pipeline promotes f16/bf16 to f32 via arith.extf before the libdevice call and demotes via arith.truncf after. The fp128 family is independent and softfloat-emulated; it is not driven by these OpToFuncCallLowering patterns.
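The numeric effect of the promote, call, demote sequence can be illustrated on the host. This is a sketch under stated assumptions: `math.sin` stands in for the `__nv_sinf` body, the helper names `to_f16`, `to_f32`, and `sin_f16_via_f32` are hypothetical, and only f16 is shown because Python's `struct` module has no bf16 format.

```python
import math
import struct


def to_f16(value):
    """Round to the nearest IEEE binary16 value (models arith.truncf to f16)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_f32(value):
    """Round to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def sin_f16_via_f32(x_half):
    """Evaluate sin on an f16 input by widening, computing in f32, narrowing."""
    x32 = to_f32(x_half)          # arith.extf: exact, every f16 value is an f32 value
    y32 = to_f32(math.sin(x32))   # stand-in for the f32 libdevice call
    return to_f16(y32)            # arith.truncf: the only rounding to f16
```

The result is always representable in f16, and the intermediate computation carries f32 precision, which is why no separate half-precision libdevice body is needed for this scheme.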

Cases that skip libdevice entirely

A subset of math.* ops have libdevice bodies whose control flow is mostly __nvvm_reflect("__CUDA_PREC_*") or __nvvm_reflect("__CUDA_FTZ") tests guarding Intrinsic::nvvm_* arms. After NVVMReflect folds the reflect calls and nvvm-reflect-pp removes constant branches, the body can reduce to a single hardware intrinsic and the __nv_* call symbol disappears.
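The collapse from a reflect-guarded body to a single intrinsic can be sketched with a toy tree. The classes `Intrinsic` and `ReflectBranch`, the function `fold_reflect`, and the shape of `SQRTF_BODY` are hypothetical stand-ins, not the real libdevice IR; the two intrinsic names are the ones this page gives for math.sqrt on f32. Treating an unset key as 0 is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsic:
    name: str  # leaf: the hardware intrinsic an arm bottoms out in


@dataclass(frozen=True)
class ReflectBranch:
    key: str            # string passed to __nvvm_reflect
    if_nonzero: object  # arm taken when the reflect value is nonzero
    if_zero: object     # arm taken when the reflect value is zero


def fold_reflect(node, settings):
    """Substitute reflect values, then keep only the arm that is taken."""
    if isinstance(node, ReflectBranch):
        value = settings.get(node.key, 0)  # assumption: unset keys read as 0
        taken = node.if_nonzero if value != 0 else node.if_zero
        return fold_reflect(taken, settings)
    return node


# Toy model of the f32 sqrt body: one reflect test guarding two intrinsic arms.
SQRTF_BODY = ReflectBranch(
    "__CUDA_PREC_SQRT",
    Intrinsic("nvvm.sqrt.rn.f"),
    Intrinsic("nvvm.sqrt.approx.f"),
)
```

With `{"__CUDA_PREC_SQRT": 1}` the tree folds to `Intrinsic("nvvm.sqrt.rn.f")`; with the key set to 0 it folds to the approximate arm. In the real pipeline the substitution and the branch removal are separate steps; the sketch merges them for brevity.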

Examples:

  • math.sqrt %x : f32 with __CUDA_PREC_SQRT=0 reduces to nvvm.sqrt.approx.f; with __CUDA_PREC_SQRT=1 it reduces to nvvm.sqrt.rn.f.
  • math.rsqrt %x : f32 reduces to nvvm.rsqrt.approx.f.
  • math.sin / math.cos on f32 reduce to FTZ or non-FTZ approximate intrinsics depending on __CUDA_FTZ.
  • math.exp %x : f32 rewrites to exp2.approx.f applied to a multiply by log2(e), since exp(x) = exp2(x * log2(e)).
  • math.log2 %x : f32 rewrites to nvvm.lg2.approx.f when the approximate-log2 option is enabled.
  • math.absi inlines as (x ^ (x >> 31)) - (x >> 31).
  • math.{isnan,isinf,isfinite} reduce to bit arithmetic on the raw FP encoding.

Conversely, acosh, asinh, atanh, cbrt, erf, erfc, expm1, log1p, sinh, cosh, tanh, atan, atan2, asin, acos, tan, generic pow, remainder, fmod, and powi retain the libdevice body unless the input is a compile-time constant.
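Two of the bit-level reductions listed above can be checked on the host. The sketch below emulates 32-bit integer arithmetic in Python; the helper names are hypothetical. The absi formula is the one quoted on this page. The isnan/isinf/isfinite masks are one standard IEEE 754 formulation and are an assumption here, since the page does not give the exact instruction sequence.

```python
import struct


def wrap_i32(value):
    """Wrap an unbounded Python int to a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def absi32(x):
    """(x ^ (x >> 31)) - (x >> 31) with an arithmetic shift on an i32 input."""
    sign = x >> 31  # 0 for non-negative x, -1 for negative x
    return wrap_i32((x ^ sign) - sign)


def f32_bits(value):
    """Raw binary32 encoding of a float as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


EXP_MASK = 0x7F800000  # all exponent bits set, mantissa zero (the encoding of +inf)
ABS_MASK = 0x7FFFFFFF  # clears the sign bit


def isnan_f32(bits):
    return (bits & ABS_MASK) > EXP_MASK


def isinf_f32(bits):
    return (bits & ABS_MASK) == EXP_MASK


def isfinite_f32(bits):
    return (bits & ABS_MASK) < EXP_MASK
```

Note that `absi32(-2**31)` returns `-2**31`: the negation overflows and wraps, which is the usual behavior of this branch-free formula on the minimum 32-bit integer.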

Reimplementation Notes

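lower_math_op(op):
    # f16/bf16 have no libdevice symbol: widen every operand,
    # call the f32 body, then narrow the result
    if op.type is f16 or bf16:
        xs = [extf(v, f32) for v in op.operands]
        y = call_libdevice(f32_symbol(op.name), xs)
        return truncf(y, op.type)

    if op.type is f32:
        return call_libdevice(f32_symbol(op.name), op.operands)

    if op.type is f64:
        return call_libdevice(f64_symbol(op.name), op.operands)

    # other types (e.g. fp128) are not lowered by these patterns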
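A runnable rendering of the same dispatch is sketched below. It emits illustrative text lines rather than real MLIR, so the instruction syntax, the `SYMBOLS` table name, and the `%w0`/`%r32`/`%r` value names are assumptions; the symbol pairs are copied from the crosswalk. Every operand is widened in the f16/bf16 case so that multi-operand ops such as math.atan2 and math.fma are covered.

```python
# Partial symbol table: op name -> (f32 symbol, f64 symbol).
SYMBOLS = {
    "math.sin": ("__nv_sinf", "__nv_sin"),
    "math.atan2": ("__nv_atan2f", "__nv_atan2"),
    "math.fma": ("__nv_fmaf", "__nv_fma"),
    "arith.remf": ("__nv_fmodf", "__nv_fmod"),
}


def lower_math_op(op_name, operands, elem_type):
    """Return the illustrative instruction list that replaces one op."""
    f32_symbol, f64_symbol = SYMBOLS[op_name]
    insts = []
    if elem_type in ("f16", "bf16"):
        # No half-precision libdevice symbol: widen, call f32, narrow.
        wide = []
        for i, value in enumerate(operands):
            insts.append(f"%w{i} = arith.extf {value} : {elem_type} to f32")
            wide.append(f"%w{i}")
        insts.append(f"%r32 = llvm.call @{f32_symbol}({', '.join(wide)})")
        insts.append(f"%r = arith.truncf %r32 : f32 to {elem_type}")
    elif elem_type == "f32":
        insts.append(f"%r = llvm.call @{f32_symbol}({', '.join(operands)})")
    elif elem_type == "f64":
        insts.append(f"%r = llvm.call @{f64_symbol}({', '.join(operands)})")
    else:
        # fp128 and other types are outside these patterns.
        raise ValueError(f"no libdevice lowering for {elem_type}")
    return insts
```

For instance, `lower_math_op("math.atan2", ["%y", "%x"], "f16")` produces two extensions, one call to `@__nv_atan2f`, and one truncation.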

Constant folding is a separate LLVM-tier concern. Do not execute libdevice IR to fold constants; classify the call, evaluate the recognized math operation directly, and replace the call with a constant.

Cross-references

The four-pass integration sequence that materializes the __nv_* bodies this page lowers into is documented in libdevice Overview — Pipeline and libdevice Overview — Link, inline, simplify. The __nvvm_reflect("__CUDA_FTZ") / __CUDA_PREC_* mechanism whose folding collapses the per-arch arms is documented in NVVMReflect Mechanism — Three var-map sources. The constant-folder classifier that recognizes the post-libdevice call sites by Intrinsic::ID or by name is documented in Intrinsic ID Switch and Name Table — libdevice suffix name table. The NVPTX bring-up path that pulls libdevice into the LLVM module is documented in NVPTX Bring-up and Target Init.