Double Precision Help

I’m new to CUDA programming, but I can’t seem to get my compute capability 2.1 card (Quadro NVS 4200M) to support double precision math.

To compile I’ve been using: nvcc test.cpp -o test -arch=sm_13

I have compiled with both -arch=sm_13 and -arch=sm_20, but a simple program (below) shows a loss of precision in the CUDA kernel. I do not get any compile-time warning about demotion of doubles to floats.

What other compile flags are required to support double precision math?

Here is a simple example (a loop of 50 multiplications) where the result computed on the Linux (Ubuntu 10.04) host differs from the result of the CUDA test kernel I wrote.

Thanks so much for the help!

```
double a = 1.112321232123212223432;
double b = 1.234323334323343234323;

double c = 1.0;

// Host calculation
for(int i=0; i<50; ++i) c *= (a * b);

// CUDA kernel
__global__ void multiplyLoop(double a, double b, double* c)
{
    *c = 1.0;
    for(int i=0; i<50; ++i) *c *= (a * b);
}
```

You seem to be assuming that because the results don’t agree to whatever arbitrary tolerance you have chosen, the GPU must be using single precision arithmetic. I would hazard a guess that your assumption is incorrect – the compilation arguments you are using will generate double precision code on the GPU. The only way to determine the relative accuracy of a pair of double precision results is to compare both against a result computed with more bits of precision (128-bit floating point, for example). You might be surprised at the results.

Yes, you’re right, I am making that assumption.

Is there some set of nvcc flags I can use that will mimic a Linux host performing a double precision math operation? I assume my linux host is using the IEEE double precision floating point format (exponent and fraction bits, rounding, etc).

I think my question comes down to - why are calculations different, and what can I do to make them the same?

Thanks again!

Try compiling and running this code:

```
#include <assert.h>
#include <stdio.h>

__global__ void multiplyLoop(const double a, const double b, double* c)
{
    *c = 1.0;
    for(int i=0; i<50; ++i) *c *= (a * b);
}

int main(void)
{
    const double a = 1.112321232123212223432;
    const double b = 1.234323334323343234323;
    double c = 1.0;

    // Reference calculation on the host in long double precision
    const long double aref = (long double)a;
    const long double bref = (long double)b;
    long double cref = (long double)c;

    for(int i=0; i<50; ++i) {
        c *= (a * b);
        cref *= (aref * bref);
    }

    double *cgpu_, cgpu;
    assert( cudaMalloc((void **)&cgpu_, sizeof(double)) == cudaSuccess );

    multiplyLoop<<<1,1>>>(a, b, cgpu_);
    assert( cudaPeekAtLastError() == cudaSuccess );

    assert( cudaMemcpy(&cgpu, cgpu_, sizeof(double),
                       cudaMemcpyDeviceToHost) == cudaSuccess );

    fprintf(stdout, "%.32LE,\n%.16E,\n%.16E\n", cref, c, cgpu);
    return 0;
}
```

like this:

```
avidday@cuda:~$ nvcc -arch=sm_20 floating.cu -o floating
avidday@cuda:~$ ./floating
7.63762478400567422613676171749830E+06,
7.6376247840056578E+06,
7.6376247840056578E+06
```

and see what it does.

The assumption that “my linux host is using the IEEE double precision floating point format” actually may not apply. On 32-bit x86 Linux systems in particular, floating-point computations are typically performed via the x87 FPU, with the precision control set to extended precision. This means that all in-register computation uses a higher precision than double precision (64 vs 53 mantissa bits). Since the code here is very simple, it is likely that all intermediate results are kept in extended precision in that case.

You could try forcing intermediate results to be rounded back to double with gcc’s -ffloat-store flag, or compile with -mfpmath=sse so the host uses SSE arithmetic, which rounds every operation to 53 bits the same way the GPU does.

Also, as others have pointed out, different != worse.

Thanks guys - this information was super helpful! I still have much to learn.