I’m new to CUDA programming, but I can’t seem to get my compute capability 2.1 card (Quadro NVS 4200M) to support double precision math.
To compile I’ve been using: nvcc test.cpp -o test -arch=sm_13
I have compiled with both -arch=sm_13 and -arch=sm_20, but a simple program (below) shows a loss of precision in the CUDA kernel, and I get no compile-time warning about doubles being demoted to floats.
What other compile flags are required to support double precision math?
Here is a simple example (a loop of 50 multiplications) where the result of the calculation performed on the Linux (Ubuntu 10.04) host differs from the result of the CUDA test kernel I wrote.
Thanks so much for the help!
double a = 1.112321232123212223432;
double b = 1.234323334323343234323;
double c = 1.0;
// Host Calculation
for(int i=0; i<50; ++i) c *= (a * b);
// CUDA Calculation
__global__ void multiplyLoop(double a, double b, double* c)
{
    if(threadIdx.x==0 && blockIdx.x==0)
    {
        *c = 1.0;
        for(int i=0; i<50; ++i) *c *= (a * b);
    }
}
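For completeness, here is a minimal self-contained harness (my own sketch, assuming the file is saved as test.cu; error checking omitted for brevity) that launches the kernel and prints both results with enough digits to distinguish any two doubles:

#include <cstdio>

__global__ void multiplyLoop(double a, double b, double* c)
{
    if(threadIdx.x==0 && blockIdx.x==0)
    {
        *c = 1.0;
        for(int i=0; i<50; ++i) *c *= (a * b);
    }
}

int main()
{
    double a = 1.112321232123212223432;
    double b = 1.234323334323343234323;

    // Host calculation
    double c_host = 1.0;
    for(int i=0; i<50; ++i) c_host *= (a * b);

    // Device calculation
    double* d_c = 0;
    cudaMalloc((void**)&d_c, sizeof(double));
    multiplyLoop<<<1, 1>>>(a, b, d_c);
    double c_dev = 0.0;
    cudaMemcpy(&c_dev, d_c, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_c);

    // %.17g prints enough significant digits to distinguish any two doubles
    printf("host:   %.17g\n", c_host);
    printf("device: %.17g\n", c_dev);
    return 0;
}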
You seem to be assuming that because the results don’t agree to whatever arbitrary tolerance you have chosen, the GPU must be using single precision arithmetic. I would hazard a guess that your assumption is incorrect – the compilation arguments you are using will generate double precision code on the GPU. The only way to determine the relative accuracy of a pair of double precision results is to compare them against a reference computed with more bits of precision (128-bit floating point, for example). You might be surprised at the results.
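To illustrate that comparison (a sketch of my own, not part of the reply above): on x86 Linux, long double is the 80-bit extended format, which provides enough extra precision to serve as a reference for this particular 50-iteration loop, assuming the plain double loop is compiled with SSE math (the default on 64-bit Linux):

#include <cstdio>

int main()
{
    const double a = 1.112321232123212223432;
    const double b = 1.234323334323343234323;

    // Reference: the same loop carried out in 80-bit extended precision
    long double c_ref = 1.0L;
    for(int i=0; i<50; ++i) c_ref *= ((long double)a * (long double)b);

    // Pure double version, rounded to 53-bit precision at every step
    // (with SSE math)
    double c_dbl = 1.0;
    for(int i=0; i<50; ++i) c_dbl *= (a * b);

    printf("extended reference: %.21Lg\n", c_ref);
    printf("double result:      %.17g\n", c_dbl);
    return 0;
}

Comparing both the host and GPU results against the extended-precision reference shows which of the two is actually closer to the true value.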
Is there some set of nvcc flags I can use that will make the GPU match a Linux host performing a double precision math operation? I assume my Linux host is using the IEEE double precision floating point format (exponent and fraction bits, rounding, etc.).
I think my question comes down to this: why are the calculations different, and what can I do to make them the same?
The assumption that “my linux host is using the IEEE double precision floating point format” may not actually hold. On 32-bit x86 Linux systems in particular, floating-point computations are typically performed via the x87 FPU, with the precision control set to extended precision. This means that all in-register computation uses higher precision than double precision (64 vs. 53 mantissa bits). Since the code here is very simple, it is likely that all intermediate results are kept in extended precision in that case.
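One way to test this (my own suggestion, not part of the reply): force every intermediate result back through a 64-bit double, either by storing into volatile variables as below, or by rebuilding the 32-bit host code with gcc’s -ffloat-store or -mfpmath=sse -msse2 options, and see whether the host result moves toward the GPU result:

#include <cstdio>

int main()
{
    const double a = 1.112321232123212223432;
    const double b = 1.234323334323343234323;

    // Default loop: on 32-bit x86 the x87 FPU may carry c in
    // 80-bit extended precision across all 50 iterations
    double c = 1.0;
    for(int i=0; i<50; ++i) c *= (a * b);

    // Volatile stores round every intermediate value to a true
    // 64-bit double, mimicking the per-operation rounding the GPU
    // performs (up to rare double-rounding effects on x87)
    volatile double ab = a * b;      // product rounded to double once
    volatile double c_rounded = 1.0;
    for(int i=0; i<50; ++i) c_rounded = c_rounded * ab;

    printf("default loop: %.17g\n", c);
    printf("rounded loop: %.17g\n", (double)c_rounded);
    return 0;
}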