Possible Rounding/Precision Errors in CUDA Math APIs?

Hello!
We are currently using the CUDA Math Library to experiment with the numerical stability of its Math APIs. To verify correctness, we compare CUDA Math APIs with the corresponding C math functions. We have encountered some issues, particularly with rounding errors, where the C and CUDA results differ.
We understand that CUDA and C have different rounding mechanisms. However, we want to verify whether these are actually rounding errors and identify their root cause. Below are the functions and the inputs we used to identify the errors:
acos - input: 0.0001590810570633039
C result: 1.5706372261047363
CUDA result: 1.5706373453140259

acosh - input: 2326705117069312.0
C result: 36.076377868652344
CUDA result: 36.07637405395508

asinh - input: -4.003921508789062
C result: -2.09566330909729
CUDA result: -2.095663070678711

atan - input: 191.99949645996094
C result: 1.5655879974365234
CUDA result: 1.565588116645813

atanh - input: -0.9530639052391052
C result: -1.864183783531189
CUDA result: -1.8641839027404785

cbrt - input: -3831.995849609375
C result: -15.648582458496094
CUDA result: -15.64858341217041

cosh - input: 0.125
C result: 1.0078226327896118
CUDA result: 1.0078227519989014

erfc - input: -0.00012207029794808477
C result: 1.0001376867294312
CUDA result: 1.0001378059387207

exp10 - input: 0.007812499534338713
C result: 1.0181517601013184
CUDA result: 1.0181516408920288

exp2 - input: 0.06152203306555748
C result: 1.043566107749939
CUDA result: 1.0435662269592285

expm1 - input: 0.9999998211860657
C result: 1.7182813882827759
CUDA result: 1.7182812690734863

j0 - input: 0.008056640625
C result: 0.9999837875366211
CUDA result: 0.9999839067459106

j1 - input: -8192.01953125
C result: 0.00786489900201559
CUDA result: 0.00786471646279096

lgamma - input: 2097664.0
C result: 28436630.0
CUDA result: 28436628.0

log - input: 3276800.25
C result: 15.0023775100708
CUDA result: 15.002378463745117

log10 - input: 25026078.0
C result: 7.398392677307129
CUDA result: 7.398393154144287

log1p - input: 458363.46875
C result: 13.035419464111328
CUDA result: 13.035420417785645

tan - input: 0.9999999403953552
C result: 1.5574074983596802
CUDA result: 1.5574076175689697

tgamma - input: 0.0390625
C result: 25.060091018676758
CUDA result: 25.06009292602539

y0f - input: 0.008666995912790298
C result: -3.096553087234497
CUDA result: -3.096553325653076

y1 - input: 0.12500381469726562
C result: -5.199782848358154
CUDA result: -5.199781894683838
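Rather than eyeballing decimal digits, each C/CUDA pair above can be compared in units of float32 ULPs. Here is a small sketch of our own (not part of the test harness), assuming both printed values are exact float32 values printed at double precision:

```python
import struct

def ordered_bits(x):
    """Map a float32 value to an integer such that adjacent
    representable floats map to consecutive integers."""
    i = struct.unpack('<i', struct.pack('<f', x))[0]
    return i if i >= 0 else -(i & 0x7FFFFFFF)

def ulp_distance(a, b):
    """Number of float32 steps between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))

# the acos pair above differs by exactly one float32 step
print(ulp_distance(1.5706372261047363, 1.5706373453140259))  # 1
```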

Here is what I see for the first two cases:

# cat t244.cu
#include <cstdio>
#include <cmath>

__host__ __device__
void comp(){

  double val = 0.0001590810570633039;
  printf("acos: %.15f\n", acos(val));
  val =  2326705117069312.0;
  printf("acosh: %.15f\n", acosh(val));
}

__global__ void k(){

  comp();
}

int main(){

  printf("CPU:\n");
  comp();
  printf("GPU:\n");
  k<<<1,1>>>();
  cudaDeviceSynchronize();
}
# nvcc -o t244 t244.cu
# ./t244
CPU:
acos: 1.570637245737162
acosh: 36.076376729401275
GPU:
acos: 1.570637245737162
acosh: 36.076376729401275
#

CUDA 12.2, L4 GPU, g++ 11.4

My suggestion would be for you to provide a complete test case, just as I have done, to facilitate further inspection. My guess is that you are looking at float computations, not double, but I'm not sure about that. It's certainly possible for there to be differences between GPU and CPU, and the scope of the expected error on the GPU side (relative to the "correctly-rounded result", which is not necessarily a CPU reference) is given in the programming guide.
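To illustrate the float-vs-double guess: rounding the double-precision acos result to the nearest float32 lands exactly on the reported C float result, which puts the reported CUDA float result one float32 step away. A quick sketch using only the values quoted above:

```python
import math
import struct

x = 0.0001590810570633039
d = math.acos(x)   # double-precision result, approximately 1.570637245737162

# round the double result to the nearest float32, which is the best
# any float implementation could possibly return
f = struct.unpack('<f', struct.pack('<f', d))[0]
print(f)           # 1.5706372261047363, the reported C float result
```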

Hi Robert,
Thank you for your prompt reply. Below is the sample template we use to test all APIs. We are focusing on single-precision (float) at the moment.

We are in the process of developing a simple bug detector for floating point errors. However, we want to verify whether some of our preliminary results are correct.

The detailed implementation (source code) of our bug detector can be found here:

Your reply in this regard is highly appreciated.

CUDA 10.1, g++ 9.4.0, NVIDIA 1080 Ti GPU

API Testing Template

# cat acos.c

#include <cmath>
#include <fenv.h>

extern "C" float cpp_kernel_1(float x0, int *flag) {
    float result;

    // Clear any pending floating-point exception flags
    feclearexcept(FE_ALL_EXCEPT);
    result = acos(x0);
    int exceptions = fetestexcept(FE_ALL_EXCEPT);
    if (exceptions & FE_DIVBYZERO) {
        *flag = 1;
        return 1.0;
    }
    else if (exceptions & FE_OVERFLOW) {
        *flag = 1;
        return 2.0;
    }
    else if (exceptions & FE_INVALID) {
        *flag = 1;
        return 3.0;
    }
    else if (exceptions & FE_UNDERFLOW) {
        *flag = 1;
        return 4.0;
    }
    else {
        *flag = 0;
        return result;
    }
}

# cat acos.cu


#include <stdio.h>

__global__ void kernel_1(
  float x0,float *ret) {
   *ret = acos(x0);
}

extern "C" {
float kernel_wrapper_1(float x0) {
  float *dev_p;
  cudaMalloc(&dev_p, sizeof(float));
  kernel_1<<<1,1>>>(x0,dev_p);
  float res;
  cudaMemcpy(&res, dev_p, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dev_p);
  return res;
  }
 }
Compiling the code:

# CUDA:
nvcc -shared _tmp_csr-88307_10652/cuda_code_acos.cu -o _tmp_csr-88307_10652/cuda_code_acos.cu.so -Xcompiler -fPIC

# C code:
g++ -shared _tmp_csr-88307_10652/c_code_acos.c -o _tmp_csr-88307_10652/c_code_acos.c.so -fPIC

# cat runner.py

import os
import time
import ctypes

def call_GPU_kernel_1(x0, shared_lib):
  script_dir = os.path.abspath(os.path.dirname(__file__))
  lib_path = os.path.join(script_dir, shared_lib)
  E = ctypes.cdll.LoadLibrary(lib_path)
  E.kernel_wrapper_1.restype = ctypes.c_float
  res = E.kernel_wrapper_1(ctypes.c_float(x0))
  return res

def call_CPU_kernel_1(x0, shared_lib, flag):
  script_dir = os.path.abspath(os.path.dirname(__file__))
  lib_path = os.path.join(script_dir, shared_lib)
  E = ctypes.cdll.LoadLibrary(lib_path)
  E.cpp_kernel_1.argtypes = [ctypes.c_float, ctypes.POINTER(ctypes.c_int)]
  E.cpp_kernel_1.restype = ctypes.c_float
  res = E.cpp_kernel_1(ctypes.c_float(x0), flag)
  return res

x0 = 0.0001590810570633039   # change the input as needed
flag = ctypes.c_int(0)       # set by the CPU kernel when an FP exception occurs

gpu_start_time = time.time()
gpu_result = call_GPU_kernel_1(x0, g_shared_lib)
gpu_end_time = time.time()
gpu_elapsed = gpu_end_time - gpu_start_time

cpu_start_time = time.time()
cpu_result = call_CPU_kernel_1(x0, c_shared_lib, ctypes.byref(flag))
cpu_end_time= time.time()
cpu_elapsed = cpu_end_time - cpu_start_time

print(gpu_result ) # 1.5706373453140259
print(cpu_result)  # 1.5706372261047363

Referring to table 13 at the link I previously provided, a result from acos for float can differ from a correctly rounded result by up to 2 ULP (units in the last place). I'm not sure exactly what you mean by "rounding error", but I wouldn't call this a rounding error. It is error, i.e. a potential deviation from the correct result.

Since for my example both the CPU and GPU agree on the double variant result:

1.570637245737162

I will assume that is a good approximation to the "correctly rounded result". Is the GPU float result:

1.5706373453140259

within 2 (float) ULP of that?

It is:

# cat t244a.cpp
#include <cstdio>

int main(){

  float x = 1.5706373453140259;
  printf("x = %.15f\n", x);
  // reinterpret the bits of x as an unsigned int, subtract 1, and
  // reinterpret back: for a positive, normal float this yields the
  // next smaller representable float (1 ULP below x)
  unsigned uintxm1 = *(reinterpret_cast<unsigned *>(&x)) - 1;
  float xm1ulp = *(reinterpret_cast<float *>(&uintxm1));
  printf("x- = %.15f\n", xm1ulp);
}
# g++ t244a.cpp -o t244a
# ./t244a
x = 1.570637345314026
x- = 1.570637226104736
#

We see that one float bit pattern produces 1.5706372261… and the next higher float bit pattern produces 1.5706373453… The "correctly rounded" value of 1.570637245… falls between those two bit patterns. However, the value returned by the CUDA GPU float implementation is less than 1 ULP from the correctly rounded result, and is therefore within the published error bound of 2 ULP.
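The same sandwiching can be checked without pointer tricks. For instance, a sketch in Python using the three values quoted above:

```python
import struct

def to_f32(x):
    # round a Python double to the nearest float32 value
    return struct.unpack('<f', struct.pack('<f', x))[0]

cpu_f = 1.5706372261047363   # C float result (a float32 bit pattern)
gpu_f = 1.5706373453140259   # CUDA float result (the adjacent float32 bit pattern)
ref_d = 1.570637245737162    # double result that both CPU and GPU agree on

# the two float results round-trip through float32 unchanged,
# i.e. they are exact float32 values
print(to_f32(cpu_f) == cpu_f and to_f32(gpu_f) == gpu_f)   # True
# the double reference falls strictly between them
print(cpu_f < ref_d < gpu_f)                               # True
```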

CUDA is behaving as expected. If you want to call that a "rounding error", you are welcome to do so; to me it is just "error". In this particular case you probably wouldn't have complained if CUDA had rounded the other way, so I can see how one might call it a rounding error. But given the published error bound of 2 ULP, we can expect that there will be situations where the result is off from the correctly rounded result by more than what one would predict if the only issue were the direction of rounding to the nearest bit pattern.

To find out if CUDA is behaving as expected in the other cases requires a similar analysis of some kind. You’re welcome to do that if you wish.

Also, for future readers, I’m aware that the methodology I’ve used here is not universally applicable to every such problem. However, it is sufficient for this particular numerical value/case, as a method to find the two float representable values that are nearest to the correctly rounded result.

Thank you for your reply. As you said, we understand it is a deviation in the result. What we define as "rounding error" is the inconsistency in the results, to a certain extent of precision, after performing the computation. Perhaps we should change the term from "rounding error" to "precision error" or "approximation error"?

I’m not an expert on these things. As long as you use the word “error”, it seems to me you will be close to the mark. Someone else may be able to properly define “rounding error” (or any other term), in case I have not. It might be that “rounding error” is the correct usage for this, I don’t know.