Good afternoon, everyone!
I’ll start, if you don’t mind, with some background.
More than a year ago I learned OpenGL (together with compute shaders). Not long ago I needed to port a complex geodesic algorithm to the GPU, since it made sense to do so. I, of course, wrote a very large compute shader, after which I discovered that float precision is not enough in many scenarios. So I implemented my own sine, cosine and arctangent functions in double precision. This worked well in tests. However, when it came to actually using my implementation, it turned out that the drivers crashed my program whenever the compute shader's runtime exceeded a certain threshold (about 1.8 seconds).
Realizing the hopelessness of the situation, I decided to switch to OpenCL. Having ported my code to OpenCL, I found that the GPU results agreed with the CPU results only in the first 6 decimal places; the remaining 10 were different. I spent about a month on various forums trying to figure out what the problem was. In short, the conclusion was that I had to switch to CUDA, because I could NOT control the kernel compilation process through the OpenCL API.
I learned enough CUDA to implement my program in CUDA C. Nevertheless, I ran into exactly the same behavior: the CPU produces 16 correct decimal places, while the GPU produces 6 correct ones and 10 incorrect ones. This behavior does not change no matter which compile options I pass to nvcc.
What I’m trying to do: I have a function that checks whether a point lies on a pair of great-circle arcs on the surface of a spherical Earth. It does this by comparing each arc's length with the sum of the lengths of the two parts into which the point splits it. Accordingly, there is a second function that computes the great-circle arc length between two points. I hope the code below makes everything clear, so I won't describe it in more detail.
#include <stdio.h>
#include <math.h>

__global__ void mainKernelProgram();
__device__ bool pointCheck(double2 start1, double2 end1, double2 start2, double2 end2, double2 point);
__device__ double distanceBetweenCoord(double2 point1, double2 point2);

__global__ void mainKernelProgram() {
    // Test coordinates: .x = latitude (phi), .y = longitude (lambda), in radians.
    double2 start1 = {1.092964656946529223, 1.091676869248603632};
    double2 end1 = {0.397907894678253549, 0.070910999632570137};
    double2 start2 = {0.892362913180780204, 0.513485940430959964};
    double2 end2 = {0.891949846176460226, 0.513485940430959964};
    double2 p1v2 = {0.891949793450346418, 0.513485940430929766};
    bool pointON = pointCheck(start1, end1, start2, end2, p1v2);
    printf("point ON: %s\n", pointON ? "true" : "false");
}

// A point lies on both arcs if, for each arc, the total length equals the sum
// of the two sub-arcs into which the point splits it (within 1 mm).
__device__ bool pointCheck(double2 start1, double2 end1, double2 start2, double2 end2, double2 point)
{
    double totalDist1 = distanceBetweenCoord(start1, end1);
    double distBefore1 = distanceBetweenCoord(start1, point);
    double distAfter1 = distanceBetweenCoord(point, end1);
    double totalDist2 = distanceBetweenCoord(start2, end2);
    double distBefore2 = distanceBetweenCoord(start2, point);
    double distAfter2 = distanceBetweenCoord(point, end2);
    printf("NV CUDA totalDist1 = %.18e\n", totalDist1);
    printf("NV CUDA distBefore1 = %.18e\n", distBefore1);
    printf("NV CUDA distAfter1 = %.18e\n", distAfter1);
    printf("NV CUDA totalDist2 = %.18e\n", totalDist2);
    printf("NV CUDA distBefore2 = %.18e\n", distBefore2);
    printf("NV CUDA distAfter2 = %.18e\n", distAfter2);
    double a = totalDist1 - distBefore1 - distAfter1;
    double b = totalDist2 - distBefore2 - distAfter2;
    double c = fabs(a - b);
    if (fabs(a) > 0.001)
        return false;
    if (fabs(b) > 0.001)
        return false;
    if (c < 0.001)
        return true;
    else
        return false;
}

// Great-circle distance in meters between two (latitude, longitude) points
// given in radians, on a sphere of radius R.
__device__ double distanceBetweenCoord(double2 point1, double2 point2)
{
    double R = 6371110.0; // mean Earth radius, meters
    double phi1 = point1.x;
    double lambda1 = point1.y;
    double phi2 = point2.x;
    double lambda2 = point2.y;
    double delta_lambda = lambda2 - lambda1;
    double cosphi1 = cos(phi1);
    double cosphi2 = cos(phi2);
    double sinphi1 = sin(phi1);
    double sinphi2 = sin(phi2);
    double cosdeltalambda = cos(delta_lambda);
    double p1 = sin(delta_lambda) * cosphi2;
    double p2 = cosphi1 * sinphi2 - sinphi1 * cosphi2 * cosdeltalambda;
    // double p2_h1 = cosphi1 * sinphi2;
    // double p2_h2 = sinphi1 * cosphi2 * cosdeltalambda;
    // double p2 = p2_h1 - p2_h2;
    double q = sinphi1 * sinphi2 + cosphi1 * cosphi2 * cosdeltalambda;
    // double q_h1 = sinphi1 * sinphi2;
    // double q_h2 = cosphi1 * cosphi2 * cosdeltalambda;
    // double q = q_h1 + q_h2;
    double res = fabs(atan2(sqrt(p1 * p1 + p2 * p2), q) * R);
    // double res_spqr = sqrt(p1 * p1 + p2 * p2);
    // double res_atan2 = atan2(res_spqr, q);
    // double res_atan2_R = res_atan2 * R;
    // double res = fabs(res_atan2_R);
    return res;
}

int main(void)
{
    mainKernelProgram<<<1, 1>>>();
    // Wait for the kernel to finish, otherwise the process can exit before
    // the device-side printf output has been flushed.
    cudaDeviceSynchronize();
    return 0;
}
What the output of the CPU program looks like:
CPU totalDist1 = 6.177021452302754857e+06
CPU distBefore1 = 2.346662786588795017e+06
CPU distAfter1 = 3.830358665713958908e+06
CPU totalDist2 = 2.631695321893054370e+03
CPU distBefore2 = 2.632031245838992618e+03
CPU distAfter2 = 3.359239459231472269e-01
point ON: false
And now what the output of the GPU program looks like:
NV CUDA totalDist1 = 6.177021452302754857e+06
NV CUDA distBefore1 = 2.346662786588686518e+06
NV CUDA distAfter1 = 3.830358665714067407e+06
NV CUDA totalDist2 = 2.631695321893054370e+03
NV CUDA distBefore2 = 2.632031245764014784e+03
NV CUDA distAfter2 = 3.359238709455678040e-01
point ON: false
You can see that the values of the first variable (totalDist1) match completely, which is great because it highlights my problem perfectly. The values of the second variable (distBefore1) agree to only 13 decimal places, which is still acceptable for the required precision. The last variable (distAfter2), however, agrees to only 6 decimal places.
In this particular example, those six decimal places are not critical, but the values calculated by the distanceBetweenCoord() function are also used in other places in the program, where the accuracy degrades even further, and six correct decimal places will not be enough.
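To make "only N decimal places match" more quantitative, the gap between two doubles can also be expressed in ULPs (units in the last place). A minimal host-side sketch of such a check (not part of the program above; the two distAfter2 values from the listings are used purely as an illustration):
#include <cstdio>
#include <cstdint>
#include <cstring>

// Number of representable doubles (ULPs) separating two finite values of the
// same sign: for positive doubles the raw bit patterns are ordered, so their
// integer difference is exactly the ULP distance.
static uint64_t ulpDistance(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua > ub ? ua - ub : ub - ua;
}

int main()
{
    double cpu = 3.359239459231472269e-01; // CPU distAfter2 from the listing above
    double gpu = 3.359238709455678040e-01; // GPU distAfter2 from the listing above
    printf("distAfter2 differs by %llu ULPs\n", (unsigned long long)ulpDistance(cpu, gpu));
    return 0;
}
A bit-identical result, like totalDist1, gives a distance of 0.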
What have I tried? Lots of things. I split the complex expressions into separate operations (the commented-out lines in the code) and tried various nvcc flags, but none of it has any effect.
Example:
nvcc -x cu main.cpp
NV CUDA distAfter2 = 3.359238708155382058e-01
nvcc -x cu main.cpp --fmad=false --gpu-architecture=compute_89 -ftz=false -prec-div=true -prec-sqrt=true
NV CUDA distAfter2 = 3.359238709455678040e-01
The second value is closer than the one compiled without flags, but in the grand scheme of things it is just as wrong.
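For completeness: the expressions can also be pinned down at the source level, independently of --fmad, by routing every multiply and add through the double-precision round-to-nearest intrinsics so the compiler cannot contract them into FMAs. A sketch of such a variant (the function name is my own; if the remaining difference comes from the device sin/cos/atan2 implementations rather than from contraction, this will not change anything):
// Same formula as distanceBetweenCoord(), but with explicit round-to-nearest
// multiply/add intrinsics so no FMA contraction can occur in these expressions.
__device__ double distanceBetweenCoordStrict(double2 point1, double2 point2)
{
    double R = 6371110.0;
    double delta_lambda = point2.y - point1.y;
    double cosphi1 = cos(point1.x), sinphi1 = sin(point1.x);
    double cosphi2 = cos(point2.x), sinphi2 = sin(point2.x);
    double cosdl = cos(delta_lambda);
    double p1 = __dmul_rn(sin(delta_lambda), cosphi2);
    double p2 = __dadd_rn(__dmul_rn(cosphi1, sinphi2),
                          -__dmul_rn(__dmul_rn(sinphi1, cosphi2), cosdl));
    double q  = __dadd_rn(__dmul_rn(sinphi1, sinphi2),
                           __dmul_rn(__dmul_rn(cosphi1, cosphi2), cosdl));
    double s  = sqrt(__dadd_rn(__dmul_rn(p1, p1), __dmul_rn(p2, p2)));
    return fabs(atan2(s, q) * R);
}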
When I was on the OpenCL forums, I was told that the GPU physically cannot compute as accurately as the CPU because it lacks a math coprocessor. However, this does not match reality. As I mentioned, the values of the first variable match completely, apparently down to the bitwise representation (I printed many decimal places via printf() and they all matched, which I take to mean the bit patterns are identical). So the GPU CAN calculate as accurately as the CPU. But it won't.
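Rather than inferring bitwise equality from printed decimal digits, the raw bit patterns can be printed directly: on the device with the __double_as_longlong() intrinsic, on the host by copying the double into a 64-bit integer. A minimal sketch (these helpers are my addition, not part of the program above):
#include <cstdio>
#include <cstring>

// Device side: print a double together with its raw 64-bit pattern.
__device__ void printBitsDevice(const char *name, double v)
{
    printf("GPU %s = %.18e bits = 0x%016llx\n",
           name, v, (unsigned long long)__double_as_longlong(v));
}

// Host side: the same, reinterpreting the bits via memcpy.
void printBitsHost(const char *name, double v)
{
    unsigned long long bits;
    memcpy(&bits, &v, sizeof bits);
    printf("CPU %s = %.18e bits = 0x%016llx\n", name, v, bits);
}
If the hex patterns for, say, sinphi1 already differ between CPU and GPU, the divergence starts inside the library sin/cos rather than in the arithmetic that follows.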
So this is my actual question: how do I, first, get the program to carry out all calculations with uniform accuracy (instead of 16/13/…/6 matching digits), and, second, get the results to match the CPU version?
Thank you so much for any help you can give!
P.S. I’m using CUDA 12.4 with an RTX 4080. Realizing that this is a gaming GPU, I also tested the program on our NVIDIA A100 running CUDA 12.1. There is no difference, so this is clearly not a matter of gaming vs. compute-optimized drivers.
P.P.S. I’d also like to point out that, with respect to the problem I’m describing, the compute shader (with my own trigonometric functions) behaves much better: not only does its accuracy not “jump” around, its values are also 5-10 times more accurate.