Trigonometric functions: standard C vs. CUDA

Hi all… I’m trying to calculate the astronomical center of an image, and I realized that the trigonometric functions in CUDA’s math.h and in the standard C math library don’t have the same precision.

What can I do to use the C standard math library in CUDA?

Here is my code.

float lobs, mobs;
double raimage = ra * RPDEG;
double decimage = dec * RPDEG;
direccos(obsra, obsdec, raimage, decimage, &lobs, &mobs);
//printf("(l,m)=(%.30lf,%.30lf)\n", lobs, mobs);
//printf("deltax = %.30f\n", deltax);
global_xobs = (crpix1 - 1.0) - lobs/deltax;
global_yobs = (crpix2 - 1.0) + mobs/deltay;

And

__host__ void direccos(double ra, double dec, double ra0, double dec0, float* l, float* m)
{
    double delta_ra = ra - ra0;
    double cosdec = cos(dec);
    *l = cosdec * sin(delta_ra);
    *m = sin(dec) * cos(dec0) - cosdec * sin(dec0) * cos(delta_ra);
}

I’m using CUDA 7.0 on a Tesla K80, using only one of its two GPUs.

global_xobs should be something near 255.0 and global_yobs near 256.0.

But my code returns global_xobs = 252.570236 and global_yobs = 259.263550.

Makefile

CUFFTFLAG = -lcufft
CFLAGS = -c -w
INC_DIRS = -Iinclude
CFFLAG = -Llib -lcfitsio -lm -lsqlite3
LDFLAGS = -lcuda -lcudart
ARCHFLAG = -arch=sm_30

main: build/main.o build/functions.o build/directioncosines.o build/f1dim.o build/mnbrak.o build/brent.o build/linmin.o build/frprmn.o
	@ echo "Linking CUDAMEM"
	@ nvcc $(LDFLAGS) build/*.o -o bin/mem.run $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)
	@ echo "The compilation has been completed successfully"

build/main.o: src/main.cu
	@ echo "Building Main"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/main.cu -o build/main.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/functions.o: src/functions.cu
	@ echo "Building Functions"
	@ cd cfitsio; make; cp libcfitsio.a ../lib/.
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/functions.cu -o build/functions.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/directioncosines.o: src/directioncosines.cu
	@ echo "Building directioncosines"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/directioncosines.cu -o build/directioncosines.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/f1dim.o: src/f1dim.cu
	@ echo "Building f1dim"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/f1dim.cu -o build/f1dim.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/mnbrak.o: src/mnbrak.cu
	@ echo "Building mnbrak"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/mnbrak.cu -o build/mnbrak.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/brent.o: src/brent.cu
	@ echo "Building brent"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/brent.cu -o build/brent.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/linmin.o: src/linmin.cu
	@ echo "Building linmin"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/linmin.cu -o build/linmin.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

build/frprmn.o: src/frprmn.cu
	@ echo "Building frprmn"
	@ nvcc $(CFLAGS) $(INC_DIRS) $(LDFLAGS) src/frprmn.cu -o build/frprmn.o $(CFFLAG) $(CUFFTFLAG) $(ARCHFLAG)

From the incomplete snippets above it is impossible to tell what the results should be mathematically, and how far off from the mathematical result the CPU and the GPU results are, respectively. In general, double-precision sin() and cos() in CUDA have a maximum error < 1.5 ulps across the entire input domain, so no significant differences from host library implementations are expected. Your problem may be elsewhere.

I would be happy to look into the details for you if you could post a small, complete, self-contained, buildable code that demonstrates the issue. In addition, please post the exact nvcc command line you are using to compile your code, as well as information on your host platform.

Side remark: It looks like your computation may involve degree-to-radian conversion. If so, you can improve the performance and accuracy of your code by using the sinpi() and cospi() functions. Instead of calling sin (PI/180.0 * x) you would call sinpi (x / 180.0), for example.

For purposes of illustration, the code below shows the use of sinpif() and cospif() in a situation where one commonly encounters degree-based, rather than radian-based, data.

/* This function computes the great-circle distance of two points on earth 
   using the Haversine formula, assuming spherical shape of the planet. A 
   well-known numerical issue with the formula is reduced accuracy in the 
   case of near antipodal points.
 
   lat1, lon1  latitude and longitude of first point, in degrees [-90, +90]
   lat2, lon2  latitude and longitude of second point, in degrees [-180, +180]
   radius      radius of the earth in user-defined units, e.g. 6378.2 km or 
               3963.2 miles
 
   returns:    distance of the two points, in the same units as radius
 
   Reference: http://en.wikipedia.org/wiki/Great-circle_distance
*/
__device__ float haversine (float lat1, float lon1, float lat2, float lon2,
                            float radius)
{
    float dlat, dlon, c1, c2, d1, d2, a, c, t;
 
    c1 = cospif (lat1 / 180.0f);
    c2 = cospif (lat2 / 180.0f);
    dlat = lat2 - lat1;
    dlon = lon2 - lon1;
    d1 = sinpif (dlat / 360.0f);
    d2 = sinpif (dlon / 360.0f);
    t = d2 * d2 * c1 * c2;
    a = d1 * d1 + t;
    c = 2.0f * asinf (fminf (1.0f, sqrtf (a)));
    return radius * c;
}

I’ve updated the post with the Makefile; I will post buildable and runnable code in a few minutes.

Thanks!

If you could limit the reproducer to one single .cu file with no external dependencies like SQLite, which can be compiled with a single invocation of nvcc, that would be extremely helpful. Given the small size of the snippets you posted previously, it should be straightforward to cut down the code accordingly.

Not that it should matter here, but the K80 is CC 3.7, while you seem to be building for CC 3.0. Is there any particular reason for this?

Here is a little example with bad results:

directioncosines.cu (285 Bytes)
main.cu (662 Bytes)

directioncosines.cuh

#ifndef DIRECTIONCOSINES_CUH
#define DIRECTIONCOSINES_CUH

#include <math.h>
#include <stdio.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "math_constants.h"
#include <stdlib.h>

__host__ void direccos(double ra, double dec, double ra0, double dec0, double* l, double* m);

#endif

[1] The two .cu files you posted in #6 include directioncosines.cuh, which you did not provide
[2] The two .cu files contain host code only, which is passed to the host compiler by nvcc. There is no device code present, thus CUDA’s standard math library is not invoked.
[3] The code is missing #include <math.h>

Using two different host compilers, set up for strict IEEE-754 compliance, the results printed are:

xobs: 252.941360, yobs: 259.263550

Note that your computation suffers massively from subtractive cancellation in the computation of ‘m’:

m = -4.978186705788519e-1 - (-4.978189870203392e-1) = 3.16441487e-7

This means any rounding errors accumulated up to the computation of ‘m’ will be magnified by a factor of about one million. I don’t immediately recognize the computation being performed here, but there is probably an alternative formula that avoids the subtractive cancellation. In any event this is an issue you would want to address.
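
I don’t know whether this matches the conventions of your reference, but one algebraically equivalent rearrangement removes the subtraction of the two nearly equal products, using sin(dec)*cos(dec0) - cos(dec)*sin(dec0) = sin(dec - dec0) and 1 - cos(x) = 2*sin^2(x/2). The sketch below is untested and uses double outputs; the function name is mine:

#include <math.h>

/* Untested sketch: same quantities as direccos(), rearranged so that the two
   large, nearly equal products are never subtracted. The remaining
   subtractions (ra - ra0, dec - dec0) are exact for nearby doubles. */
__host__ void direccos_alt (double ra, double dec, double ra0, double dec0,
                            double *l, double *m)
{
    double delta_ra  = ra - ra0;
    double delta_dec = dec - dec0;
    double s = sin (0.5 * delta_ra);            /* sin(delta_ra / 2) */
    *l = cos (dec) * sin (delta_ra);
    /* sin(dec)*cos(dec0) - cos(dec)*sin(dec0)*cos(delta_ra)
       = sin(dec - dec0) + cos(dec)*sin(dec0) * (1 - cos(delta_ra))
       = sin(dec - dec0) + cos(dec)*sin(dec0) * 2*sin^2(delta_ra/2) */
    *m = sin (delta_dec) + cos (dec) * sin (dec0) * (2.0 * s * s);
}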

Note that if you compute ‘m’ on the device using CUDA, there could be numerical differences with the host computation since the device code will likely use an FMA (fused multiply-add) instruction when computing the difference of the two products. If you compile with -fmad=false, FMA generation is suppressed, but this may actually make the results less accurate.
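
As an aside, if the difference of the two products itself turns out to be the accuracy bottleneck, it can also be computed in a compensated way with fma(), which is available both in the C99/C++11 host math library and in CUDA device code. This is only a sketch; the helper name is mine:

#include <math.h>

/* Compute a*b - c*d with the rounding error of c*d recovered exactly via
   fma() (Kahan's difference-of-products trick). */
static double diff_of_products (double a, double b, double c, double d)
{
    double w = d * c;
    double e = fma (-d, c, w);   /* exact: w - d*c                  */
    double f = fma (a, b, -w);   /* a*b - w, with a single rounding */
    return f + e;
}

/* for 'm': diff_of_products (sin(dec), cos(dec0), cosdec, sin(dec0) * cos(delta_ra)) */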

Yeah, I know that all the code is host code. The thing is that I have another program, compiled with gcc, that calls the directioncosines function with the same parameters, and its results are:

xobs: something near 255.0 and yobs: something near 256.0

And sorry for the code error; the forum is giving me problems uploading the files correctly.

Also, this is only one of many problems that I think my code has, but I think it is good if we go slowly, function by function.

I’m not compiling with the -fmad option, so it has its default value, which is true, isn’t it?

I don’t have access to your “other code”; I am basing my analysis on the data and code you provided above. The CUDA math library is not being invoked by this code, therefore your issues must be in the host code. I used both MSVC and the Intel compiler (configured for maximum adherence to IEEE-754) as the host compiler, and both delivered the same results.

Indications so far are that your computation is numerically ill-conditioned. This means that small differences caused by the use of different host compilers, host libraries, and compiler settings can be amplified significantly, causing large differences in the final results. Note that your code also has subtractive cancellation in the computation:

delta_ra = ra - ra0

For what it is worth, I ran the computation through a high-precision library computing with 30 decimal places and got the following results:

l=1.9961106845038E-7
m=3.164414874119352E-7

The values of the variables computed by your posted code are l=1.99611068e-7 and m=3.16441487e-7, so they seem to be accurate to within the limitations of the ‘float’ data type.

It is possible that the host compiler run initiated by nvcc uses a host compiler setting that is different from your other compilation, potentially causing tiny differences upstream of the posted code that are then amplified. You may want to partition your code base so all host code is directly compiled by the host compiler, rather than via nvcc’s invocation of the host compiler.

Is your “other code” running on a 32-bit x86 Linux system by any chance? If so, the numerical differences you are observing could be due to the fact that such 32-bit x86 Linux systems, by default, use the x87 FPU for floating-point computation and set the FPU precision control to extended precision. This means that many intermediate operations are performed at higher than double precision (extended precision provides approximately three additional decimal places).

By contrast, modern x64 Linux platforms use the SSE units for floating-point computation, which causes all double-precision computation to be performed in actual double precision.
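
If you want to check which evaluation mode a particular host build actually uses, the C99 macro FLT_EVAL_METHOD from <float.h> reports it; a quick check like the following (my own snippet, not from your code) prints 0 for SSE-style evaluation and 2 for x87 extended-precision evaluation:

#include <stdio.h>
#include <float.h>

int main (void)
{
    /* 0: operations evaluated in the operand type's own precision (SSE)
       2: float/double operations evaluated in long double precision (x87) */
    printf ("FLT_EVAL_METHOD = %d\n", (int) FLT_EVAL_METHOD);
    return 0;
}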

Given a code base that seems to have multiple instances of substantial subtractive cancellation (the few lines posted above already contain two such instances), this could easily lead to fairly sizable differences in the final results due to error magnification stemming from cancellation. I do not know in what context these “direction cosines” are being computed, but you may want to look out for issues with almost collinear or coplanar entities, as those will easily lead to ill-conditioned computations.

Then I would have to change the extensions of directioncosines.cu and directioncosines.cuh to directioncosines.c and directioncosines.h, respectively. After doing that, would the rule to compile that file look like this?

build/directioncosines.o: src/directioncosines.c
	@ echo "Building directioncosines"
	@ gcc $(CFLAGS) $(INC_DIRS) src/directioncosines.c -o build/directioncosines.o

Am I correct?

The context:

  • Convert (ra, dec) to direction cosines (l, m) relative to the phase-tracking center (ra0, dec0). All in radians.
  • Reference: Synthesis Imaging in Radio Astronomy II, p. 388.

Both programs are running on x64 Linux platforms :/

You can use either .c or .cpp depending on your preference for C or C++. Since CUDA is basically a subset of C++, .cpp may be more appropriate.

If both your platforms are x64 and compilation is configured to use SSE (the default on x64), I do not know where the differences would originate. I would suggest looking at the two machines side by side and finding the earliest place in the code where there are numerical differences. The trigonometric functions used in the code you posted do not appear to have anything to do with the issue; as I stated earlier, I suspect minor differences come into existence upstream of the posted code and are merely amplified here due to substantial subtractive cancellation.
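
One practical way to do that side-by-side comparison is to print intermediate values with the %a format, which shows the exact bit pattern of a double, so the two builds agree only if the printed strings are identical. For example, inside direccos() on both machines (sketch only, using the variable names from your posted code):

/* %a prints hexadecimal floating point; any 1-ulp difference is visible */
printf ("ra - ra0       = %a\n", delta_ra);
printf ("cos(dec)       = %a\n", cosdec);
printf ("sin(delta_ra)  = %a\n", sin (delta_ra));
printf ("m              = %a\n", (double) *m);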

I am looking at the reference you cited right now, but this is really only tangential to my expertise, so I can’t promise to find any substantial numerical improvements to the algorithm. It is interesting to note that the paper specifically addresses “non-coplanar” arrays, while your sample test case seems to suggest near coplanarity. Any insights into that?

For a step-by-step comparison of intermediate results between the posted code and the “other program”, here are the intermediate results, computed with strict double-precision arithmetic and a high-quality math library:

inputs to direccos() are:
ra=-2.1088019999999998e+000
dec=-7.3867799999999995e-001
ra_image=-2.1088022699784292e+000
dec_image=-7.3867831644150550e-001

delta_ra=2.6997842939024963e-007
cosdec=7.3935932178436192e-001
sin(delta_ra)=2.6997842939024635e-007
sin(dec)=-6.7331106725685741e-001
cos(dec0)=7.3935910872075716e-001
sin(dec0)=-6.7331130122080063e-001
cos(delta_ra)=9.9999999999996358e-001

sin(dec)*cos(dec0)             = -4.9781867057885187e-001
cosdec*sin(dec0)*cos(delta_ra) = -4.9781898702033922e-001

l=1.99611068e-007 // float
m=3.16441487e-007 // float

xobs: 252.941360, yobs: 259.263550