Question about CUDA exp() function

The calculation result of the CUDA exp() function may differ between CUDA 10.0 and CUDA 11.3.
If the input is -4.75, 10.0 gives 0.00865169 and 11.3 gives 0.0086517.
Is this the result NVIDIA intended?
If I want to get the same result in 11.3 as in 10.0, please guide me on how to do it.
Thank you

I don’t happen to have CUDA 10.0 or CUDA 11.3 handy, but I do have CUDA 11.4, 10.2, and 9.2 installed. Based on what I see among those three, there is no difference in behavior:


$ cat t2171.cu
#include <iostream>
#include <iomanip>
__global__ void k(double val, double *res){

  *res = exp(val);
}

int main(){

  double val = -4.75;
  double *r;
  cudaMallocManaged(&r, sizeof(r[0]));
  k<<<1,1>>>(val, r);
  cudaDeviceSynchronize();
  std::cout << std::fixed;
  std::cout << std::setprecision(9) << *r  << std::endl;
}
$ /usr/local/cuda-9.2/bin/nvcc -arch=sm_70 -o t2171 t2171.cu
$ ./t2171
0.008651695
$ /usr/local/cuda-10.2/bin/nvcc -arch=sm_70 -o t2171 t2171.cu
$ ./t2171
0.008651695
$ /usr/local/cuda-11.4/bin/nvcc -arch=sm_70 -o t2171 t2171.cu
$ ./t2171
0.008651695
$

Since I haven’t tested 10.0, it’s of course possible that 10.0 would print something different. However, in that case the suggestion would be to use CUDA 10.0 if you have some reason to prefer that output. There isn’t any way that I know of to make the exp() function in CUDA 11.4 behave differently from what is shown above.

I suspect the argument to exp() may actually be of type float, and there does indeed seem to be a 1-ulp difference between CUDA 11.8 and CUDA 9.x (I don’t have CUDA 10 handy). The following program prints GPU: r = 0.0086516952142119 3c0dbfd7 with CUDA 11.8, but GPU: r = 0.008651694282894 3c0dbfd6 with CUDA 9.x.

From looking at the respective disassemblies, the function’s implementation appears to have changed; in particular, it was optimized for higher performance and improved accuracy in the newer CUDA version. It looks like the new version could be up to twice as fast; I did not attempt to measure it. A check across the full range of inputs indicates there is no regression in overall accuracy (the maximum error in the current version is 1.93099 ulps, versus a maximum error of 2.31917 ulps in the older version).

Whoever at NVIDIA noticed this optimization opportunity by taking full advantage of FMA: nice job! Apparently, I overlooked this when I did some clean-up in the CUDA math library after removal of support for compute capability 1.x.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (float a)
{
    float r = expf (a);
    printf ("GPU: r = %.16f %08x\n", r, __float_as_int(r));
}

int main (void)
{
    kernel<<<1,1>>>(-4.75f);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

Given that assumption, it seems the CUDA 10 result can be made to match the CUDA 11 result by using exp() rather than expf() and first casting the argument to double. I don’t know of a way to make the CUDA 11 result match the CUDA 10 result. A minimal sketch of that workaround follows.
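The kernel below is illustrative, not from the application in question; whether the double-precision path reproduces the CUDA 11 value bit-for-bit on CUDA 10 is an assumption that should be verified on the actual toolchain:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void kernel (float a)
{
    float r_single = expf (a);                 // version-dependent single-precision path
    float r_double = (float) exp ((double) a); // compute in double, then round to float
    printf ("expf:        %.16f %08x\n", r_single, __float_as_int (r_single));
    printf ("exp(double): %.16f %08x\n", r_double, __float_as_int (r_double));
}

int main (void)
{
    kernel<<<1,1>>>(-4.75f);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}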

Given that the implementation changed: Short of manually re-implementing the version used by CUDA 10 for use with CUDA 11, there is no way for the two CUDA versions to match bit-wise. I don’t know when the new expf() implementation was deployed.
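To illustrate why exploiting FMA shifts results at the ulp level: a fused multiply-add computes a*b+c with a single rounding at the end, rather than rounding the product and the sum separately, so an FMA-based evaluation generally produces slightly different (and usually more accurate) results than an unfused one. A minimal host-side sketch of that single-rounding property (the value 1.1f is an arbitrary illustrative input):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main (void)
{
    float x = 1.1f;
    float p = x * x;           // product rounded to single precision
    float e = fmaf (x, x, -p); // FMA evaluates x*x - p with a single rounding,
                               // recovering the rounding error of the product
    printf ("x*x = %a  rounding error recovered via FMA = %a\n", p, e);
    return EXIT_SUCCESS;
}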

The important thing for programmers to keep in mind is: there can be no expectation of bit-identical results across CUDA versions or GPU architectures for any math function that is not defined to deliver correctly-rounded results. Implementations can and will differ over time, and as long as documented error bounds are maintained, there is no bug. The same applies to the math library associated with any tool chain, such as the host compiler.

Generally speaking, in floating-point computation it is always advisable to operate with error bounds, rather than expecting bit-identical results, from unit tests to the application level.
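As a sketch of what such a comparison might look like in a test harness (the helper names and the 4-ulp tolerance are arbitrary illustrative choices, not a standard API):

#include <stdio.h>
#include <string.h>
#include <limits.h>

/* Map a float's bit pattern to a monotonically ordered integer, so the
   difference of two mapped values counts the representable floats between
   them (ulps). */
static long long float_to_ordered (float x)
{
    int i;
    memcpy (&i, &x, sizeof i);
    return (i < 0) ? ((long long)INT_MIN - i) : (long long)i;
}

/* Accept results that differ by at most 'tol' ulps instead of requiring
   bit-identical values; the tolerance is an application-specific choice. */
static int nearly_equal (float a, float b, long long tol)
{
    long long d = float_to_ordered (a) - float_to_ordered (b);
    if (d < 0) d = -d;
    return d <= tol;
}

int main (void)
{
    float r_cuda9  = 0.008651694f; // values quoted from the outputs earlier in this thread
    float r_cuda11 = 0.008651695f;
    printf ("match within 4 ulps: %d\n", nearly_equal (r_cuda9, r_cuda11, 4));
    return 0;
}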

For those interested in the algorithm for expf() used by the current CUDA version, I am showing two implementation variants below. These variants are ever so slightly more accurate than CUDA’s expf(), in that the count of incorrectly rounded results across all possible inputs is lower: 154M instead of 160M. I have not run any performance comparisons.

/*
  Copyright (c) 2023, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#define USE_SATURATION (1)

__forceinline__ __device__ float raw_ex2 (float a)
{
    float r;
    asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(a));
    return r;
}

__forceinline__ __device__ float fast_expf (float a)
{
    const float cvt_flt_int = 12582912.0f; // 0x1.8p23
    const float l2e_hi = 1.44269502e+0f; // 0x1.715476p+0  // log2(exp(1))_hi
    const float l2e_lo = 1.92594598e-8f; // 0x1.4ae000p-26 // log2(exp(1))_lo
    const float bound = 126.0f; // clamp exponent of scale factor to [-126, 126]
    const int expo_bias = 127;
    const int mant_bits = 23;
    float t, f, s;
    int i;

#if USE_SATURATION
    // max ulp error: 1.93099  number of incorrectly rounded results: 154128356
    t = __saturatef (__fmaf_rz (l2e_hi / (2 * bound), a, 0.5f));
    t = __fmaf_rz (2 * bound, t, cvt_flt_int - bound);
#else // USE_SATURATION
    // max ulp error: 1.93099  number of incorrectly rounded results: 154128165
    t = __fmaf_rz (l2e_hi, a, cvt_flt_int);
    t = fminf (cvt_flt_int + bound, fmaxf (cvt_flt_int - bound, t));
#endif // USE_SATURATION
    i = __float_as_int (t); // trunc (a * l2e) in least significant 8 bits
    f = fmaf (a, l2e_lo, fmaf (a, l2e_hi, cvt_flt_int - t));
    s = __int_as_float ((i << mant_bits) + (expo_bias << mant_bits));
    return raw_ex2 (f) * s;
}
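To exercise these variants, one could compile the listing above together with a small harness like the following (an illustrative scaffold, not part of the library), comparing against the built-in expf() at the input from the original question:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// assumes fast_expf() from the listing above is defined in the same file
__global__ void test_fast_expf (float a)
{
    float r   = fast_expf (a);
    float ref = expf (a);
    printf ("fast_expf: %.16f %08x\n", r,   __float_as_int (r));
    printf ("expf:      %.16f %08x\n", ref, __float_as_int (ref));
}

int main (void)
{
    test_fast_expf<<<1,1>>>(-4.75f);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}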

Our customer wanted an exact match between the CPU calculations and the output of our GPU-accelerated application. With CUDA 10.0 the results were consistent, but after moving to 11.3 a mismatch arises from these subtle differences, and I want to solve this problem.
The CPU used by our customer is not from Intel but from another company; please understand that it is difficult to disclose which.
If so, can we assume that it was a coincidence that the exp() results matched exactly with CUDA 10.0?

Unless you use a library that you yourself control on all supported platforms, or you deploy a library that guarantees correctly-rounded results, this is (at present) an unrealistic expectation across any two hardware platforms. If you can tolerate lower performance and only need a few math library functions, you might want to look into the open-source CRLIBM.

Correct, any bit-wise exact matches would be coincidental. As I stated previously, when dealing with floating-point computation it is best to start thinking early on in terms of error bounds on computation rather than bit-wise identical results.

@njuffa Thank you for your kind explanation.
