Unacceptably low accuracy of calculations on the GPU

Good afternoon, everyone!
I’ll start, if you don’t mind, with some background.
More than a year ago I learned OpenGL (including compute shaders). Not long ago I needed to port a complex geodesic algorithm to the GPU, since it made sense to do so. I wrote a very large compute shader and then discovered that float precision is not sufficient in many scenarios, so I implemented my own double-precision sine, cosine and arctangent functions. This worked well in tests. However, when it came to actually using my implementation, it turned out that the driver killed my program whenever the compute shader’s runtime exceeded a certain threshold (about 1.8 seconds).
Realizing the hopelessness of the situation, I decided to switch to OpenCL. Having transferred my code to OpenCL, I found that the results produced by the GPU differ from those produced by the CPU in the last 10 decimal places (only the first 6 decimal places match). I spent about a month on various forums trying to figure out what the problem was. In short, the conclusion was that I had to switch to CUDA, because I could NOT control the kernel compilation process from the OpenCL API.
I learned enough CUDA to implement my program in CUDA C. Nevertheless, I ran into exactly the same behavior again: the CPU produces 16 correct decimal places, while the GPU produces 6 correct ones followed by 10 incorrect ones. This behavior does not change no matter what compile options I pass to nvcc.
What I’m trying to do: I have a function that checks whether a point lies on a pair of great-circle arcs on the surface of a spherical Earth. It does this by comparing the lengths of the sub-arcs into which the point divides the original arcs. Accordingly, there is a second function that computes the length of a great-circle arc. The code below should, I hope, be self-explanatory, so I won’t describe it in more detail.

#include <stdio.h>
#include <math.h>

__global__ void mainKernelProgram();
__device__ bool pointCheck(double2 start1, double2 end1, double2 start2, double2 end2, double2 point);
__device__ double distanceBetweenCoord(double2 point1, double2 point2);

__global__ void mainKernelProgram() {

    double2 start1 = {1.092964656946529223, 1.091676869248603632};
    double2 end1   = {0.397907894678253549, 0.070910999632570137};
    double2 start2 = {0.892362913180780204, 0.513485940430959964};
    double2 end2   = {0.891949846176460226, 0.513485940430959964};
    double2 p1v2   = {0.891949793450346418, 0.513485940430929766};

    bool pointON = pointCheck(start1, end1, start2, end2, p1v2);
    printf("point ON: %s\n", pointON ? "true" : "false");

}

__device__ bool pointCheck(double2 start1, double2 end1, double2 start2, double2 end2, double2 point)
{
    double totalDist1 = distanceBetweenCoord(start1, end1);
    double distBefore1 = distanceBetweenCoord(start1, point);
    double distAfter1 = distanceBetweenCoord(point, end1);

    double totalDist2 = distanceBetweenCoord(start2, end2);
    double distBefore2 = distanceBetweenCoord(start2, point);
    double distAfter2 = distanceBetweenCoord(point, end2);

    printf("NV CUDA totalDist1  = %.30e\n", totalDist1);
    printf("NV CUDA distBefore1 = %.18e\n", distBefore1);
    printf("NV CUDA distAfter1  = %.18e\n", distAfter1);
    printf("NV CUDA totalDist2  = %.18e\n", totalDist2);
    printf("NV CUDA distBefore2 = %.18e\n", distBefore2);
    printf("NV CUDA distAfter2  = %.18e\n", distAfter2);

    double a = totalDist1 - distBefore1 - distAfter1;
    double b = totalDist2 - distBefore2 - distAfter2;
    double c = fabs(a - b);

    if (fabs(a) > 0.001)
        return false;

    if (fabs(b) > 0.001)
        return false;

    if (c < 0.001)
        return true;
    else
        return false;
}

// Great-circle distance on a sphere (Vincenty formula); inputs are
// (phi = latitude, lambda = longitude) in radians, result is in meters.
__device__ double distanceBetweenCoord(double2 point1, double2 point2)
{
    double R = 6371110.0; // Earth radius in meters

    double phi1 = point1.x;
    double lambda1 = point1.y;

    double phi2 = point2.x;
    double lambda2 = point2.y;

    double delta_lambda = lambda2 - lambda1;

    double cosphi1 = cos(phi1);
    double cosphi2 = cos(phi2);
    double sinphi1 = sin(phi1);
    double sinphi2 = sin(phi2);
    double cosdeltalambda = cos(delta_lambda);

    double p1 = sin(delta_lambda) * cosphi2;
    double p2 = cosphi1 * sinphi2 - sinphi1 * cosphi2 * cosdeltalambda;
    // double p2_h1 = cosphi1 * sinphi2;
    // double p2_h2 = sinphi1 * cosphi2 * cosdeltalambda;
    // double p2 = p2_h1 - p2_h2;

    double q = sinphi1 * sinphi2 + cosphi1 * cosphi2 * cosdeltalambda;
    // double q_h1 = sinphi1 * sinphi2;
    // double q_h2 = cosphi1 * cosphi2 * cosdeltalambda;
    // double q = q_h1 + q_h2;

    double res = fabs(atan2(sqrt(p1 * p1 + p2 * p2), q) * R);
    // double res_spqr = sqrt(p1 * p1 + p2 * p2);
    // double res_atan2 = atan2(res_spqr, q);
    // double res_atan2_R = res_atan2  * R;
    // double res = fabs(res_atan2_R);

    return res;
}


int main(void)
{
    mainKernelProgram<<<1, 1>>>();
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed

    return 0;
}

Here is what the output of the CPU program looks like:

CPU totalDist1  = 6.177021452302754857e+06
CPU distBefore1 = 2.346662786588795017e+06
CPU distAfter1  = 3.830358665713958908e+06
CPU totalDist2  = 2.631695321893054370e+03
CPU distBefore2 = 2.632031245838992618e+03
CPU distAfter2  = 3.359239459231472269e-01
point ON: false

And here is what the output of the GPU program looks like:

NV CUDA totalDist1  = 6.177021452302754857e+06
NV CUDA distBefore1 = 2.346662786588686518e+06
NV CUDA distAfter1  = 3.830358665714067407e+06
NV CUDA totalDist2  = 2.631695321893054370e+03
NV CUDA distBefore2 = 2.632031245764014784e+03
NV CUDA distAfter2  = 3.359238709455678040e-01
point ON: false

You can see that the values of the first variable (totalDist1) match completely, which is great because it highlights my problem perfectly. The values of the second variable (distBefore1) match to only 13 decimal places, which is still acceptable for the precision I need. However, the last variable (distAfter2) matches to only 6 decimal places.
In this particular example those six decimal places are not critical, but the values calculated by the distanceBetweenCoord() function are also used in other places in the program, where the accuracy degrades even further, and six correct decimal places will not be enough there.
What have I tried? Lots of things. I have split the compound expressions into parts (the commented-out lines in the code) and tried various nvcc flags, but none of it has any effect.
Example:

nvcc -x cu main.cpp
NV CUDA distAfter2  = 3.359238708155382058e-01
nvcc -x cu main.cpp --fmad=false --gpu-architecture=compute_89 -ftz=false -prec-div=true -prec-sqrt=true
NV CUDA distAfter2  = 3.359238709455678040e-01

The second value is better than the one produced without any flags, but overall it is still just as wrong.
On the OpenCL forums I was told that the GPU physically cannot compute as accurately as the CPU because it lacks a math coprocessor. However, that does not match what I observe. As I mentioned, the values of the first variable match completely, down to the bitwise representation (I printed many decimal places via printf() and they all matched, which I think means the bit patterns were identical). So the GPU CAN calculate as accurately as the CPU. It just won’t.
So here is my question: how do I make the program, first, perform all calculations with the same accuracy (instead of 16/13/…/6 matching digits), and, second, make those calculations match the CPU version?
Thank you so much for any help you can give!
P.S. I’m using CUDA 12.4 with an RTX 4080. Realizing that this is a gaming GPU, I also tested the program on our NVIDIA A100 running CUDA 12.1. There is no difference, so it is clearly not a matter of gaming versus compute-optimized drivers.
P.P.S. I’d also like to point out that, for the problem I’m describing, the compute shader (with my own trigonometry) behaves much better: not only does the accuracy not “jump” there, the values are also 5-10 times more accurate.

I’m going to use “bitwise identical” as the definition of “match” below, which I’m sure is how you intend it.

One thing to be aware of when using trig functions like cos(), sin(), and atan2() is that these functions are provided by a library and are not implemented by a single machine instruction. The library implementation has documented error bounds for the various functions (there is a table of these bounds in the CUDA documentation).

From that table we can see that sin, cos, and atan2 all have documented error bounds of 2 ULP, which means the result provided by the library may not be the correctly rounded result. If you are comparing that to a library routine (on the CPU), then differences are possible (see below). In that case, expecting a perfect match is not likely to be fruitful across a range of calculations.
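
If you want to quantify how far apart two results are in terms that relate to those error bounds, it can help to count how many representable doubles lie between them rather than counting matching decimal digits. A small host-side sketch (the helper name ulp_distance is mine, not an existing API; intended for comparing nearby finite values):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Count the number of representable binary64 values between a and b.
// Assumes finite inputs; negative values are remapped so that integer
// ordering matches floating-point ordering.
static int64_t ulp_distance (double a, double b)
{
    int64_t ia, ib;
    memcpy (&ia, &a, sizeof ia);   // reinterpret the IEEE-754 bit patterns
    memcpy (&ib, &b, sizeof ib);
    if (ia < 0) ia = INT64_MIN - ia;
    if (ib < 0) ib = INT64_MIN - ib;
    return ia > ib ? ia - ib : ib - ia;
}

int main (void)
{
    // the CPU and GPU distAfter2 values quoted above
    printf ("%lld ULPs apart\n",
            (long long) ulp_distance (3.359239459231472269e-01,
                                      3.359238709455678040e-01));
    return 0;
}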

It is possible, of course, for the GPU math library to provide a correctly rounded result, in some cases, even when the specified error bounds are greater than 0 ULP. So the fact that you have one result that appears to precisely match between GPU and CPU is consistent with this idea, and does not necessarily mean that every other comparison should precisely match.

Based on the above, there are two possibilities for the CPU library that is being used for comparison. Let’s take a specific case like cos(): either it provides a correctly rounded (0 ULP) result, or it provides a result with a wider error bound than 0 ULP.

If it provides a 0 ULP result, and the GPU implementation does not, then they may not match.

If it provides a result with an error bound greater than 0 ULP, then again, there is no guarantee of a match between CPU and GPU results.

You have compound calculations. By that I mean that some calculations depend on previous calculations; for example, your atan2() computation depends on previous results from sin() and cos(). It might be self-evident that when you have compound calculations, errors can compound as well. So compound calculations using library functions that each have a 2 ULP error bound can result in a growing error bound, the more compound results you compute.
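
A rough way to quantify the compounding is standard first-order error propagation (nothing CUDA-specific, just the usual estimate): if an input x already carries a relative error \(\delta_x\) and the library evaluates f with its own relative error \(\delta_f\), then

\[
\frac{|\tilde f(\tilde x) - f(x)|}{|f(x)|} \;\lesssim\; |\delta_f| + \kappa_f(x)\,|\delta_x|,
\qquad
\kappa_f(x) = \left|\frac{x\,f'(x)}{f(x)}\right| ,
\]

so each stage adds its own error bound and can also magnify whatever error was already present, depending on the condition number \(\kappa_f\) of that stage.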

Does my description so far explain all your results fully? It does not. It’s quite possible that with careful analysis the errors could be “improved”. Another discovery people sometimes make (although I’m not suggesting it applies here) is that the GPU-calculated result is sometimes more accurate than the CPU-calculated result. People tend to assume that if there is a difference, the GPU must be in error, but this is not always the case. Let me repeat: I’m not saying it’s applicable here.

In order to sort this out, from my perspective (njuffa may know shortcuts) it would be necessary to track each calculation comparatively, do the same calculation on a “known good” extended-precision calculator (such as the Windows calculator tool, in my experience), and see which result is more accurate at each step. Then look for the source of the error in each step. In doing so, you might discover ways to make things more accurate.
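
One low-tech way to do that per-step comparison is to print every intermediate both with %.17g (enough digits to round-trip a binary64 value) and with %a (the exact bit pattern as a hexadecimal floating-point literal), so each value can be transferred into the extended-precision tool without any decimal rounding ambiguity. A minimal sketch (the dump() helper is just an illustration, not code from your program):

#include <stdio.h>

// %.17g round-trips a double; %a prints its exact bit pattern
static void dump (const char *label, double v)
{
    printf ("%-16s = %.17g  (%a)\n", label, v, v);
}

int main (void)
{
    // the CPU and GPU distAfter2 values quoted above
    dump ("distAfter2_cpu", 3.359239459231472269e-01);
    dump ("distAfter2_gpu", 3.359238709455678040e-01);
    return 0;
}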

But the initial recital still applies: some of your goals, as stated, are not achievable.

FWIW, if you want to continue this discussion, my suggestion would be to provide the CPU version of the code that produced the results you posted, along with the compilation specifics. When I did that with the code you provided, I got more than 6 matching digits on the result with the largest difference between CPU and GPU.

Here is my test case so far:

# cat t187.cu
#include <stdio.h>
#include <math.h>

#ifdef USE_CPU
#define __G__
#else
#define __G__ __global__
#endif

__G__ void mainKernelProgram();
__host__ __device__ bool pointCheck(double2 start1, double2 end1, double2 start2, double2 end2, double2 point);
__host__ __device__ double distanceBetweenCoord(double2 point1, double2 point2);

__G__ void mainKernelProgram() {

    double2 start1 = {1.092964656946529223, 1.091676869248603632};
    double2 end1   = {0.397907894678253549, 0.070910999632570137};
    double2 start2 = {0.892362913180780204, 0.513485940430959964};
    double2 end2   = {0.891949846176460226, 0.513485940430959964};
    double2 p1v2   = {0.891949793450346418, 0.513485940430929766};

    bool pointON = pointCheck(start1, end1, start2, end2, p1v2);
    printf("point ON: %s\n", pointON ? "true" : "false");

}

__host__ __device__ bool pointCheck(double2 start1, double2 end1, double2 start2, double2 end2, double2 point)
{
    double totalDist1 = distanceBetweenCoord(start1, end1);
    double distBefore1 = distanceBetweenCoord(start1, point);
    double distAfter1 = distanceBetweenCoord(point, end1);

    double totalDist2 = distanceBetweenCoord(start2, end2);
    double distBefore2 = distanceBetweenCoord(start2, point);
    double distAfter2 = distanceBetweenCoord(point, end2);
#ifdef USE_CPU
    const char t[] = "CPU\0";
#else
    const char t[] = "GPU\0";
#endif
    printf("%s totalDist1  = %.30e\n",t, totalDist1);
    printf("%s distBefore1 = %.18e\n",t, distBefore1);
    printf("%s distAfter1  = %.18e\n",t, distAfter1);
    printf("%s totalDist2  = %.18e\n",t, totalDist2);
    printf("%s distBefore2 = %.18e\n",t, distBefore2);
    printf("%s distAfter2  = %.18e\n",t, distAfter2);

    double a = totalDist1 - distBefore1 - distAfter1;
    double b = totalDist2 - distBefore2 - distAfter2;
    double c = fabs(a - b);

    if (fabs(a) > 0.001)
        return false;

    if (fabs(b) > 0.001)
        return false;

    if (c < 0.001)
        return true;
    else
        return false;
}

__host__ __device__ double distanceBetweenCoord(double2 point1, double2 point2)
{
    double R = 6371110.0;

    double phi1 = point1.x;
    double lambda1 = point1.y;

    double phi2 = point2.x;
    double lambda2 = point2.y;

    double delta_lambda = lambda2 - lambda1;

    double cosphi1 = cos(phi1);
    double cosphi2 = cos(phi2);
    double sinphi1 = sin(phi1);
    double sinphi2 = sin(phi2);
    double cosdeltalambda = cos(delta_lambda);

    double p1 = sin(delta_lambda) * cosphi2;
    double p2 = cosphi1 * sinphi2 - sinphi1 * cosphi2 * cosdeltalambda;
    // double p2_h1 = cosphi1 * sinphi2;
    // double p2_h2 = sinphi1 * cosphi2 * cosdeltalambda;
    // double p2 = p2_h1 - p2_h2;

    double q = sinphi1 * sinphi2 + cosphi1 * cosphi2 * cosdeltalambda;
    // double q_h1 = sinphi1 * sinphi2;
    // double q_h2 = cosphi1 * cosphi2 * cosdeltalambda;
    // double q = q_h1 + q_h2;

    double res = fabs(atan2(sqrt(p1 * p1 + p2 * p2), q) * R);
    // double res_spqr = sqrt(p1 * p1 + p2 * p2);
    // double res_atan2 = atan2(res_spqr, q);
    // double res_atan2_R = res_atan2  * R;
    // double res = fabs(res_atan2_R);

    return res;
}


int main(void)
{
    mainKernelProgram
#ifndef USE_CPU
            <<<1, 1>>>
#endif
            ();
    cudaDeviceSynchronize();
    return 0;
}
# nvcc -o t187 t187.cu -DUSE_CPU
# ./t187
CPU totalDist1  = 6.177021452302754856646060943604e+06
CPU distBefore1 = 2.346662786588686518e+06
CPU distAfter1  = 3.830358665714067873e+06
CPU totalDist2  = 2.631695321893054370e+03
CPU distBefore2 = 2.632031245764014784e+03
CPU distAfter2  = 3.359238709455677485e-01
point ON: false
# nvcc -o t187 t187.cu
# ./t187
GPU totalDist1  = 6.177021452302754856646060943604e+06
GPU distBefore1 = 2.346662786588686518e+06
GPU distAfter1  = 3.830358665714067407e+06
GPU totalDist2  = 2.631695321892939319e+03
GPU distBefore2 = 2.632031245763981133e+03
GPU distAfter2  = 3.359238708155382058e-01
point ON: false
#

It looks to me like there are either 8 or 9 matching digits in the “distAfter2” result.

When I compile the above code (GPU version) with -fmad=false I get considerably more matching digits:

# nvcc -o t187 t187.cu -fmad=false
# ./t187
GPU totalDist1  = 6.177021452302754856646060943604e+06
GPU distBefore1 = 2.346662786588686518e+06
GPU distAfter1  = 3.830358665714067407e+06
GPU totalDist2  = 2.631695321893054370e+03
GPU distBefore2 = 2.632031245764014784e+03
GPU distAfter2  = 3.359238709455678040e-01
point ON: false
#

This is not necessarily an indication that -fmad=false is better; usually the opposite is the case. But I couldn’t be sure which without doing the experiment I outlined earlier. And my CPU results don’t “match” yours, so it would still be necessary, as I suggested earlier, for you to describe how you obtained your CPU results.
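
As an aside, if it turns out that a specific expression should not be contracted while the rest of the code keeps FMA enabled, the per-operation double-precision intrinsics can be used instead of the global -fmad=false switch; operations written with __dmul_rn()/__dadd_rn() are not merged into FMA by the compiler. A hedged sketch for the p2 term (the helper name is mine, not code from this thread):

__device__ double p2_no_contract (double cosphi1, double sinphi2,
                                  double sinphi1, double cosphi2,
                                  double cosdeltalambda)
{
    // each multiply/add is rounded individually and never contracted into an FMA
    double t = __dmul_rn (__dmul_rn (sinphi1, cosphi2), cosdeltalambda);
    return __dadd_rn (__dmul_rn (cosphi1, sinphi2), -t);
}

Dropping something like this into the .cu file keeps -fmad=true for everything else.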


(0) Is this code based on Vincenty’s formula for great circle distance on a sphere? It does not seem to be the haversine formula, but it also does not seem to match my implementation of Vincenty’s formula (which assumes the points are passed by latitude / longitude, which does not seem to be the case here).

(1) Assuming that any host computation delivers “golden” (correct) results that ought to be matched is a common fallacy and poor science / engineering. One practical way to assess actual accuracy properly is to compute a reference at higher precision. I would suggest at least three times the target precision, but there can be situations (most of them contrived, however) where even that is insufficient. In general, this requires using a multi-precision or arbitrary-precision math library as the reference (a minimal sketch of such a reference follows after these points).

(2) The fact that turning off FMA contraction for the GPU implementation (-fmad=false) delivers results that are closer to those of the CPU implementation suggests that there are issues with subtractive cancellation in this code that are not being addressed. Generally speaking, allowing the use of FMA partially protects against subtractive cancellation and reduces overall rounding error, so we would want to use it to help with accuracy. It may be necessary to place FMAs explicitly by hand, by calling fma(), based on numerical analysis.

(3) The code shown does not avail itself of well-known accuracy-improving primitives. For example, sqrt(p1 * p1 + p2 * p2) should be hypot (p1, p2).

(4) I would have to study the underlying computation more closely, but it seems worth examining whether the trigonometric computation can be partially replaced by algebraic computation or transformed into a numerically more advantageous arrangement. Finite-precision computation via trig and inverse trig functions can get numerically iffy due to very shallow or very steep slopes, e.g. near angles of 90 degrees.

(5) As Robert Crovella already pointed out, implementations of trigonometric functions differ between programming platforms and do not deliver bit-wise identical results across platforms. However, at this stage I am not convinced that this is the most significant numerical issue in this code. A useful, thorough, independent, comparative assessment of math function accuracy that also covers CUDA is provided by the following publication:

Brian Gladman, Vincenzo Innocente, John Mather, and Paul Zimmermann, “Accuracy of Mathematical Functions in Single, Double, Double Extended, and Quadruple Precision”, HAL preprint hal-03141101, Feb. 2024 (online)
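
To make point (1) concrete, such a reference for distanceBetweenCoord() could look roughly like the following sketch, assuming an arbitrary-precision library such as GNU MPFR is available (link with -lmpfr -lgmp; the helper name distance_ref is arbitrary, and any comparable library would do):

#include <stdio.h>
#include <mpfr.h>

// Reference distance at ~160 bits of precision; only the inputs and the
// final result are rounded to double.
static double distance_ref (double phi1d, double lam1d, double phi2d, double lam2d)
{
    const mpfr_prec_t prec = 160;   // roughly 3x the 53-bit target precision
    mpfr_t phi1, phi2, dlam, s1, c1, s2, c2, sd, cd, p1, p2, q, t, res;
    mpfr_inits2 (prec, phi1, phi2, dlam, s1, c1, s2, c2, sd, cd, p1, p2, q, t, res,
                 (mpfr_ptr) 0);

    mpfr_set_d (phi1, phi1d, MPFR_RNDN);
    mpfr_set_d (phi2, phi2d, MPFR_RNDN);
    mpfr_set_d (dlam, lam2d, MPFR_RNDN);
    mpfr_sub_d (dlam, dlam, lam1d, MPFR_RNDN);       // delta_lambda

    mpfr_sin_cos (s1, c1, phi1, MPFR_RNDN);
    mpfr_sin_cos (s2, c2, phi2, MPFR_RNDN);
    mpfr_sin_cos (sd, cd, dlam, MPFR_RNDN);

    mpfr_mul (p1, sd, c2, MPFR_RNDN);                // p1 = sin(dl) * cos(phi2)
    mpfr_mul (t,  cd, c2, MPFR_RNDN);                // t  = cos(dl) * cos(phi2)
    mpfr_mul (p2, c1, s2, MPFR_RNDN);
    mpfr_mul (res, s1, t, MPFR_RNDN);                // res used as scratch
    mpfr_sub (p2, p2, res, MPFR_RNDN);               // p2 = c1*s2 - s1*t
    mpfr_mul (q, s1, s2, MPFR_RNDN);
    mpfr_mul (res, c1, t, MPFR_RNDN);
    mpfr_add (q, q, res, MPFR_RNDN);                 // q = s1*s2 + c1*t

    mpfr_hypot (t, p1, p2, MPFR_RNDN);               // sqrt(p1*p1 + p2*p2)
    mpfr_atan2 (res, t, q, MPFR_RNDN);               // central angle
    mpfr_mul_d (res, res, 6371110.0, MPFR_RNDN);     // scale by Earth radius R

    double out = mpfr_get_d (res, MPFR_RNDN);
    mpfr_clears (phi1, phi2, dlam, s1, c1, s2, c2, sd, cd, p1, p2, q, t, res,
                 (mpfr_ptr) 0);
    return out;
}

int main (void)
{
    // the start2 / end2 pair from the code above
    printf ("ref totalDist2 = %.18e\n",
            distance_ref (0.892362913180780204, 0.513485940430959964,
                          0.891949846176460226, 0.513485940430959964));
    return 0;
}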

[later:]
For comparison purposes, here is a distance computation on a sphere based on the Vincenty formula that I coded up many years ago, using all best practices for numerically robust computation that I am aware of. I would have to spend some time re-acquainting myself with these computations to compare to the code posted above, and to set up a framework to assess the error properly.

/*
  diff_of_products() computes a*b-c*d with a maximum error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, 
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
float diff_of_productsf (float a, float b, float c, float d)
{
    float w = d * c;
    float e = fmaf (-d, c, w);
    float f = fmaf (a, b, -w);
    return f + e;
}

float sum_of_productsf (float a, float b, float c, float d)
{
    float w = c * d;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, w);
    return f - e;
}

/*
   lat1, lon1  latitude and longitude of first point, in degrees [-90,+90]
   lat2, lon2  latitude and longitude of second point, in degrees [-180,+180]
   r           radius of the earth in user-defined units, e.g. 6378.2 km or 
               3963.2 miles
   returns:    distance of the two points, in the same units as radius
*/
float distance (float lat1, float lon1, float lat2, float lon2, float r)
{
    float dlon, d1, d2, s1, s2, c1, c2, a, b, c, t;

    sincospif (lat1 / 180.0f, &s1, &c1);
    sincospif (lat2 / 180.0f, &s2, &c2);
    dlon = lon2 - lon1;
    sincospif (dlon / 180.0f, &d1, &d2);
    a = c2 * d1;
    t = c2 * d2;
    b = diff_of_productsf (c1, s2, s1, t); // c1 * s2 - s1 * t
    t = sum_of_productsf  (s1, s2, c1, t); // s1 * s2 + c1 * t
    c = atan2f (hypotf (a, b), t);
    return r * c;
}

[Even later:]

Doing a side-by-side comparison, distanceBetweenCoord() in the OP’s code and distance() in my sample code appear to be functionally equivalent, with the difference that the former apparently operates on inputs given in radians, while the latter expects inputs in degrees and therefore needs to divide by 180 first. In other words, both codes are implementations of Vincenty’s great-circle distance formula and arrange the computation in roughly the same way.


I made the following changes (same for CUDA but with __device__ attribute):

/*
  diff_of_products() computes a*b-c*d with a maximum error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, 
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
double diff_of_products (double a, double b, double c, double d)
{
    double w = d * c;
    double e = fma (-d, c, w);
    double f = fma (a, b, -w);
    return f + e;
}

double sum_of_products (double a, double b, double c, double d)
{
    double w = c * d;
    double e = fma (c, -d, w);
    double f = fma (a, b, w);
    return f - e;
}

double distanceBetweenCoord(double2 point1, double2 point2)
{
    double R = 6371110.0;

    double phi1 = point1.x;
    double lambda1 = point1.y;

    double phi2 = point2.x;
    double lambda2 = point2.y;

    double delta_lambda = lambda2 - lambda1;

    double cosphi1 = cos(phi1);
    double cosphi2 = cos(phi2);
    double sinphi1 = sin(phi1);
    double sinphi2 = sin(phi2);
    double cosdeltalambda = cos(delta_lambda);
    double sindeltalambda = sin(delta_lambda);

    double p1 = sindeltalambda * cosphi2;
    double t  = cosdeltalambda * cosphi2;
    double p2 = diff_of_products (cosphi1, sinphi2, sinphi1, t);
    double q  = sum_of_products  (sinphi1, sinphi2, cosphi1, t);
    double res = atan2 (hypot (p1, p2), q) * R;

    return res;
}

Building the C++ code standalone with the Intel compiler (Classic Version 2021.8.0, Build 20221119_000000) and its super-accurate math library and requesting strict IEEE-754 compliance (/fp:strict, -fp-model=strict), I get this when running on an Intel Xeon W-2133 (Skylake-W):

CPU totalDist1  = 6.177021452302753925323486328125e+06
CPU distBefore1 = 2.346662786588686518e+06
CPU distAfter1  = 3.830358665714067873e+06
CPU totalDist2  = 2.631695321893043911e+03
CPU distBefore2 = 2.632031245764015694e+03
CPU distAfter2  = 3.359238709723418315e-01
point ON: false

Building with CUDA 12.3 (release 12.3, V12.3.107) and running on a Quadro RTX 4000, I get:

NV CUDA totalDist1  = 6.177021452302753925323486328125e+06
NV CUDA distBefore1 = 2.346662786588686518e+06
NV CUDA distAfter1  = 3.830358665714067407e+06
NV CUDA totalDist2  = 2.631695321893043911e+03
NV CUDA distBefore2 = 2.632031245764015239e+03
NV CUDA distAfter2  = 3.359238709723418870e-01
point ON: false

Comparing line by line, we have:

CPU distBefore1 =     2.346662786588686518e+06
NV CUDA distBefore1 = 2.346662786588686518e+06
                                              ^ 
CPU distAfter1 =      3.830358665714067873e+06
NV CUDA distAfter1 =  3.830358665714067407e+06
                                       ^ 
CPU distBefore2 =     2.632031245764015694e+03
NV CUDA distBefore2 = 2.632031245764015239e+03
                                       ^
CPU distAfter2 =      3.359238709723418315e-01
NV CUDA distAfter2 =  3.359238709723418870e-01
                                       ^

The differences occur at the 17th digit, which is as good as one can hope for with double computation mapped to IEEE-754 binary64 format. If performance is important, you would want to replace:

    double cosphi1 = cos(phi1);
    double cosphi2 = cos(phi2);
    double sinphi1 = sin(phi1);
    double sinphi2 = sin(phi2);
    double cosdeltalambda = cos(delta_lambda);
    double sindeltalambda = sin(delta_lambda);

with

    double cosphi1, cosphi2, sinphi1, sinphi2, cosdeltalambda, sindeltalambda;
    sincos (phi1, &sinphi1, &cosphi1);
    sincos (phi2, &sinphi2, &cosphi2);
    sincos (delta_lambda, &sindeltalambda, &cosdeltalambda);

Note that sincos() is a non-standard extension to C++, but a common one. CUDA supports it, and the Intel compiler supports it when one uses header file mathimf.h instead of math.h.
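
If a host toolchain supports neither of these, a trivial wrapper is sufficient for this use case, at the cost of two separate calls (the name my_sincos is arbitrary):

#include <math.h>

// portable fallback: two separate calls instead of one fused sincos()
static inline void my_sincos (double a, double *s, double *c)
{
    *s = sin (a);
    *c = cos (a);
}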


A huge thank you to both of you!
I have carefully read everything you have written.
I don’t know why I didn’t check my results in a symbolic math system earlier (to get exact values), but here is what MATLAB R2020b says:

MatLab totalDist1  = 6.177021452302754857e+06
MatLab distBefore1 = 2.346662786588686518e+06
MatLab distAfter1  = 3.830358665714067407e+06
MatLab totalDist2  = 2.631695321893054370e+03
MatLab distBefore2 = 2.632031245764014784e+03
MatLab distAfter2  = 3.359238709455678040e-01

This came as a shock to me. MATLAB’s results completely matched my GPU’s results. I now have no doubt that the GPU is calculating correctly. Unlike the CPU.
The thing is, the CPU results I provided earlier were obtained from a program written with Qt5 and compiled with MinGW 11 64-bit. I rewrote the program without Qt and compiled it with cl.exe (64-bit). This is what I got:

R9_5900x CPU totalDist1  = 6.177021452302754857e+06
R9_5900x CPU distBefore1 = 2.346662786588686518e+06
R9_5900x CPU distAfter1  = 3.830358665714067873e+06
R9_5900x CPU totalDist2  = 2.631695321893054370e+03
R9_5900x CPU distBefore2 = 2.632031245764014784e+03
R9_5900x CPU distAfter2  = 3.359238709455677485e-01

I’ve been programming with Qt for over three years now (on both Windows and Ubuntu, with identical results on both), and until now I had no reason to suspect Qt of contributing any “quirks” of its own.
Why I didn’t suspect Qt but blamed the GPU: as I mentioned, I have a large program. It runs to completion correctly on the CPU (Qt), but on the GPU it “locks up” at some point, unable to satisfy the exit condition of a while loop inside the program. I found that place and used it as the example here.
In fact, this is getting more and more interesting, now that I know that the CPU computation terminates even though it does not calculate correctly, while the GPU computation does exactly the opposite (it calculates correctly but does not terminate).
In any case, thank you very much again for your help and for a number of recommendations, which I will certainly heed!

If we take the MATLAB results as “ground truth”, it’s surprising to me that the -fmad=false case gives a better result on the GPU (at least for the distAfter2 calculation).


I will note that this is indeed the case:

MatLab distAfter2        = 3.359238709455678040e-01
R9_5900x CPU distAfter2  = 3.359238709455677485e-01
> nvcc -x cu main.cpp
  NV CUDA distAfter2  = 3.359238708155382058e-01
> nvcc -x cu main.cpp --fmad=false
  NV CUDA distAfter2  = 3.359238709455678040e-01

For now, I trust MATLAB, so I suggest taking its results as the benchmark.

I do not know how MATLAB computes internally. My assumption is that it computes in IEEE-754 double precision by default. If so, that would seem to be just another instance of the issue I raised in point (2) of my first post in this thread. This is sometimes expressed humorously as “a person with one watch always knows what time it is; a person with two watches can never be sure”.

It is important to understand the numerical properties of a computation. When it is affected by numerical issues like subtractive cancellation, the effect is a bit like injecting random noise into the computation. The final results for any particular test vector may happen to match across platforms, but more likely than not they don’t. I can recommend the following publication as an aid to gaining a deeper understanding: Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM 2002.
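
To give a rough sense of the magnification at work in this particular code (a back-of-the-envelope estimate on my part, not a rigorous bound): in the distAfter2 case the two points are only about 0.34 m apart on a sphere of radius 6371110 m, so p2 = cos(phi1)*sin(phi2) - sin(phi1)*cos(phi2)*cos(dl) is about 5e-8, while each of the two products is about 0.5. With plain multiplies, each product carries an absolute rounding error of up to a few units in the last place, so

\[
\frac{|\delta p_2|}{|p_2|} \;\lesssim\; u\,\frac{|\cos\varphi_1\sin\varphi_2| + |\sin\varphi_1\cos\varphi_2\cos\Delta\lambda|}{|p_2|}
\;\approx\; 2^{-53}\cdot\frac{1}{5\times 10^{-8}} \;\approx\; 2\times 10^{-9},
\]

which leaves only about 8 to 9 trustworthy digits in distAfter2 and is consistent with the differences observed above. This is precisely the kind of cancellation that fma() / diff_of_products() mitigate.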

I would still strongly recommend setting up test scaffolding that checks the accuracy of the implementation against a high-precision reference computation with about 150 bits of precision, using a wide selection of test data, in particular also for edge cases like pairs of near-antipodal points, pairs of points near the poles, etc.

At 17 digits, we might be beyond MATLAB’s default precision. It might be interesting to try increasing MATLAB’s precision beyond the default.


My point is that we should not blindly trust any black box. I have been bitten before when using Wolfram Alpha, when further analysis strongly indicated that it had used double-precision computation where I thought I was getting high-precision computation.

While arbitrary-precision libraries are usually open source and therefore open to inspection, we might not want to trust a single one either where correct results are critical. After all, software bugs are a thing.


That’s probably true, but more interesting to me are the techniques for preserving the accuracy of the calculations.

Anyway, I did the calculation symbolically and only evaluated the results numerically at the very end, using the vpa() function with digits(16) and digits(32).
Here is what we have now:

digits(16)
  MatLab distAfter2  = 0.3359238709455676
digits(32)
  MatLab distAfter2  = 0.33592387094556759875989324099645
numerical computation (reminder)
  MatLab distAfter2  = 0.3359238709455678040

As we can see, the error of the numerical calculation is in the sixteenth significant digit, which is fully consistent with IEEE-754 and the theoretical maximum of ~15.9 significant decimal digits for a 64-bit double.
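(For reference, that figure comes from the 53-bit significand of IEEE-754 binary64: \(53 \cdot \log_{10} 2 \approx 15.95\) decimal digits.)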
In any case, I continue to think that we can trust MATLAB’s numerical calculations.

The guiding philosophy needs to be “trust, but verify”. The literature is full of claims that under close scrutiny are either not quite true (if not outright false) or lack important qualifiers.

The development of any numerical code certainly starts with high-level algorithm design, where one pays attention to potential sources of subtractive cancellation and error magnification, for example. This is followed by a “mapping” phase, where one selects the best low-level primitives for the task at hand, e.g. fma(), hypot(), sinpi() and so on. Importantly, these two phases can inform each other, e.g. we might choose an algorithm that more easily exploits accurate primitives available on the target platform. After implementation we always need a verification phase. That starts with a simple smoke test (this often involves throwing a bunch of random data at the code under test), and proceeds to white-box testing, coverage analysis and performance benchmarking.

Below I am showing a simple smoke test I applied to four different variants of the distance computation on a sphere: two variants of the haversine formula, a formula based on chords, and the Vincenty formula. The results indicate that for a sphere the size of the Earth, and when performing computation with double, the haversine and chord variants can compute the distance between any two points with accuracy on the order of 10 micrometers (worst case errors occur with near-antipodal points), while the Vincenty formula can compute the distance with an accuracy on the order of 10 nanometers. That seems like overkill to me, especially given the fact that the Earth is not a sphere but more closely resembles an ellipsoid, which makes accurate distance calculations much more challenging.

I built this code on a Windows machine with MSVC 2019 and CUDA 12.3. The arbitrary-precision library I used is R. P. Brent’s MP from 1978 (Fortran code; my glue code is not publicly available), configured for 250-bit precision; you would want to substitute the arbitrary-precision library of your choice. I built with

nvcc -Xcompiler "/W4 /O2 /arch:AVX512 /favor:INTEL64 /fp:strict" -arch=sm_75 -o sphere_distance.exe sphere_distance.cu mp.lib

Note the specification of /fp:strict to prevent the host compiler from re-arranging (re-associating) floating-point expressions for performance, which can hurt accuracy. As far as I am aware, with the exception of FMA-contraction (FMUL with dependent FADD contracted to FMA), which can be turned off, the CUDA compiler does not re-associate floating-point expressions.

#include <stdio.h>
#include <stdlib.h>   // for EXIT_SUCCESS
#include <stdint.h>
#include <math.h>
#include "mpglue.h"

#define HAVERSINE1 (1)
#define HAVERSINE2 (2)
#define CHORD      (3)
#define VINCENTY   (4)

#define VARIANT    (VINCENTY)
#define REL_ERR    (0)          // 0 = report abs. err; else report rel. err.
#define TESTS      (1000000)
#define USE_GPU    (1)

#define R  (6371110.0) // radius of the Earth in meters

/*
  diff_of_products() computes a*b-c*d with a maximum error < 1.5 ulp

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, 
  "Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants". Mathematics of Computation, Vol. 82, No. 284, 
  Oct. 2013, pp. 2245-2264
*/
__host__ __device__ double diff_of_products (double a, double b, double c, double d)
{
    double w = d * c;
    double e = fma (-d, c, w);
    double f = fma (a, b, -w);
    return f + e;
}

__host__ __device__ double sum_of_products (double a, double b, double c, double d)
{
    double w = c * d;
    double e = fma (c, -d, w);
    double f = fma (a, b, w);
    return f - e;
}

__host__ double norm3d (double a, double b, double c)
{
    return hypot (a, hypot (b, c));
}

__host__ __device__ double distance_chord (double lat1, double lon1, double lat2, double lon2, double radius)
{
    double s1, s2, s3, s4, c1, c2, c3, c4, a, b, c;

    sincos (lat1, &s1, &c1);
    sincos (lat2, &s2, &c2);
    sincos (lon1, &s3, &c3);
    sincos (lon2, &s4, &c4);
    a = diff_of_products (c2, c4, c1, c3);
    b = diff_of_products (c2, s4, c1, s3);
    c = s2 - s1;
    c = norm3d (a, b, c);
    c = 2.0 * asin (0.5 * c);
    return c * radius;
}

__host__ __device__ double distance_vincenty (double lat1, double lon1, double lat2, double lon2, double radius)
{
    double dlon, d1, d2, s1, s2, c1, c2, a, b, c, t;

    dlon = lon2 - lon1;
    sincos (lat1, &s1, &c1);
    sincos (lat2, &s2, &c2);
    sincos (dlon, &d1, &d2);
    a = c2 * d1;
    t = c2 * d2;
    b = diff_of_products (c1, s2, s1, t);
    t = sum_of_products  (s1, s2, c1, t);
    c = atan2 (hypot (a, b), t);
    return c * radius;
}

__host__ __device__ double distance_haversine1 (double lat1, double lon1, double lat2, double lon2, double radius)
{
    double dlat, dlon, c1, c2, s1, s2, a, c, t;

    dlat = lat2 - lat1;
    dlon = lon2 - lon1;
    c1 = cos (lat1);
    c2 = cos (lat2);
    s1 = sin (dlat / 2);
    s2 = sin (dlon / 2);
    t = s2 * s2 * c1 * c2;
    a = fma (s1, s1, t);
    c = 2.0f * asin (fmin (1.0, sqrt (a)));
    return c * radius;
}

__host__ __device__ double distance_haversine2 (double lat1, double lon1, double lat2, double lon2, double radius)
{
    double dlat, dlon, c1, c2, s1, s2, a, c, t;
    
    dlat = lat2 - lat1;
    dlon = lon2 - lon1;
    c1 = cos (lat1);
    c2 = cos (lat2);
    s1 = sin (dlat / 2);
    s2 = sin (dlon / 2);
    t = s2 * s2 * c1 * c2;
    a = fma (s1, s1, t);
    c = 2 * atan2 (sqrt (a), sqrt (1.0 - a));
    return c * radius;
}

void distance_mp (const struct mpNum *lat1, const struct mpNum *lon1,
                  const struct mpNum *lat2, const struct mpNum *lon2,
                  const struct mpNum *radius, struct mpNum *result)
{
    struct mpNum dlat, dlon, c1, c2, s1, s2, t, a, c;

    mpSub  (lat2, lat1, &dlat);
    mpSub  (lon2, lon1, &dlon);
    mpCos  (lat1, &c1);
    mpCos  (lat2, &c2);
    mpMul  (&dlat, mpHalf(), &dlat);
    mpMul  (&dlon, mpHalf(), &dlon);
    mpSin  (&dlat, &s1);
    mpSin  (&dlon, &s2);
    mpMul  (&s2, &s2, &t);
    mpMul  (&t, &c1, &t);
    mpMul  (&t, &c2, &t);
    mpMul  (&s1, &s1, &a);
    mpAdd  (&a, &t, &a);
    mpSub  (mpOne(), &a, &t);
    mpSqrt (&t, &t);
    mpSqrt (&a, &a);
    mpAtan2(&a, &t, &c);
    mpAdd  (&c, &c, &c);
    mpMul  (&c, radius, result);
}

#if 0 // in case sincos() is not available on the host
__host__ void sincos (double a, double *sptr, double *cptr)
{
    *sptr = sin (a);
    *cptr = cos (a);
}
#endif

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

__global__ void kernel (double lat1, double lon1, double lat2, double lon2, 
                        double radius, double *res)
{
#if (VARIANT == HAVERSINE1)
        *res = distance_haversine1 (lat1, lon1, lat2, lon2, radius);
#elif (VARIANT == HAVERSINE2)
        *res = distance_haversine2 (lat1, lon1, lat2, lon2, radius);
#elif (VARIANT == CHORD)
        *res = distance_chord (lat1, lon1, lat2, lon2, radius);
#elif (VARIANT == VINCENTY)
        *res = distance_vincenty (lat1, lon1, lat2, lon2, radius);
#endif
}

int main (void)
{
    mpInit();
    struct mpNum mpLat1, mpLat2, mpLon1, mpLon2, mpRadius, mpDist, mpRes, mpErr;
    const double pi = 3.14159265358979323;
    const double radius = R;
    double lat1, lon1, lat2, lon2;
    double res, ref, err, maxerr = 0;

    printf ("Distance calculation (%s variant) using %d random test vectors\n", 
#if (VARIANT == HAVERSINE1) 
            "haversine 1"
#elif (VARIANT == HAVERSINE2)           
            "haversine 2"
#elif (VARIANT == CHORD)           
            "chord"
#elif (VARIANT == VINCENTY)
            "Vincenty"
#endif
            , TESTS);
    printf ("measuring %s error\n", REL_ERR ? "relative" : "absolute");
    printf ("distance computation on %s\n", USE_GPU ? "GPU" : "CPU");

#if USE_GPU
    double *resd = 0;
    cudaMalloc ((void**)&resd, sizeof resd[0]);
#endif // USE_GPU
          
    mpDoubleToMp (radius, &mpRadius);

    for (int i = 0; i < TESTS; i++) {

        lat1 = ((int64_t)KISS64) * 0x1.0p-63 * pi / 2; // +/- 90 degrees
        lat2 = ((int64_t)KISS64) * 0x1.0p-63 * pi / 2; // +/- 90 degrees
        lon1 = ((int64_t)KISS64) * 0x1.0p-63 * pi;     // +/-180 degrees
        lon2 = ((int64_t)KISS64) * 0x1.0p-63 * pi;     // +/-180 degrees

        mpDoubleToMp (lat1, &mpLat1);
        mpDoubleToMp (lat2, &mpLat2);
        mpDoubleToMp (lon1, &mpLon1);
        mpDoubleToMp (lon2, &mpLon2);
        distance_mp (&mpLat1, &mpLon1, &mpLat2, &mpLon2, &mpRadius, &mpDist);
        mpMpToDouble (&mpDist, &ref);

#if USE_GPU
        kernel<<<1,1>>>(lat1, lon1, lat2, lon2, radius, resd);
        cudaMemcpy (&res, resd, sizeof res, cudaMemcpyDeviceToHost);
#else
#if (VARIANT == HAVERSINE1)
        res = distance_haversine1 (lat1, lon1, lat2, lon2, radius);
#elif (VARIANT == HAVERSINE2)
        res = distance_haversine2 (lat1, lon1, lat2, lon2, radius);
#elif (VARIANT == CHORD)
        res = distance_chord (lat1, lon1, lat2, lon2, radius);
#elif (VARIANT == VINCENTY)
        res = distance_vincenty (lat1, lon1, lat2, lon2, radius);
#else // VARIANT
#error unsupported VARIANT
#endif
#endif // USE_GPU

        mpDoubleToMp (res, &mpRes);
        mpSub (&mpRes, &mpDist, &mpErr);
#if REL_ERR
        mpDiv (&mpErr, &mpDist, &mpErr);
#endif // REL_ERR
        mpAbs (&mpErr, &mpErr);
        mpMpToDouble (&mpErr, &err);

        if (err > maxerr) {
            maxerr = err;
            printf ("err = %15.8e @ (% 23.16e, % 23.16e), (% 23.16e, % 23.16e) res=%23.16e ref=%23.16e\n",
                    err, lat1, lon1, lat2, lon2, res, ref);
        }
    }
    return EXIT_SUCCESS;
}
