Are double-precision functions in the CUDA Math API just copy-paste versions of the single-precision functions?

Why can't I see any difference between them? Am I doing something wrong?

I tested them with the following code, comparing the results against the value produced by pow() from math.h in C.

__global__ void addKernel(double *dev_c)
{
    dev_c[0] = pow (2.70134219723423422342334134, 2.70134219723423422342334134);
    // powf() takes float arguments, so these double literals are rounded to float first
    dev_c[1] = powf(2.70134219723423422342334134, 2.70134219723423422342334134);
}

#include <stdio.h>

int main()
{
    double c[2] = {0.0, 0.0};

    double *dev_c;

    cudaMalloc((void**)&dev_c, 2 * sizeof(double));

    addKernel <<< 1, 1 >>> (dev_c);

    cudaMemcpy(c, dev_c, 2 * sizeof(double), cudaMemcpyDeviceToHost);

    printf("CUDA math double precision: %.24f \n", c[0]);
    printf("CUDA math single precision: %.24f \n", c[1]);

    getchar();
    return 0;
}

The output is

CUDA math double precision:  14.650218963623047000000000
CUDA math single precision:  14.650218963623047000000000

Compare that with pow() from math.h:

C math.h double precision:  14.650221542435155000000000

Or is there something I failed to do that keeps CUDA double precision from working?

Many thanks in advance.

What command line are you using to compile the code? If you compile for anything less than an sm_13 architecture, double will get demoted to float. If this is happening, the compiler will usually spit out a message to that effect.

When I compile like this:

nvcc -arch=sm_20 -o t449 t449.cu

and run this code on a cc2.0 device:

#include <stdio.h>
#include <math.h>

__global__ void addKernel(double *dev_c)
{
    dev_c[0] = pow (2.70134219723423422342334134, 2.70134219723423422342334134);
    dev_c[1] = powf(2.70134219723423422342334134, 2.70134219723423422342334134);
}

int main()
{
    double c[2] = {0.0, 0.0};

    double *dev_c;

    cudaMalloc((void**)&dev_c, 2 * sizeof(double));

    addKernel <<< 1, 1 >>> (dev_c);

    cudaMemcpy(c, dev_c, 2 * sizeof(double), cudaMemcpyDeviceToHost);

    printf("CUDA math double precision: %.24f \n", c[0]);
    printf("CUDA math single precision: %.24f \n", c[1]);
    printf("CPU                       : %.24f \n", pow(2.70134219723423422342334134, 2.70134219723423422342334134));
    return 0;
}

I get this:

CUDA math double precision: 14.650221542435154731265357
CUDA math single precision: 14.650218009948730468750000
CPU                       : 14.650221542435154731265357

If, OTOH, I compile the same code like this:

nvcc -o t449 t449.cu

I get output like this:

CUDA math double precision: 14.650218009948730468750000
CUDA math single precision: 14.650218009948730468750000
CPU                       : 14.650221542435154731265357

This still isn’t exactly what you have. I suspect the remaining difference is that you are running on a cc1.x GPU and I am running on a cc2.x GPU; in that case there are (I think) library differences in the single-precision functions (only). njuffa will come along at some point and straighten it all out. But it will help if you specify your compile command line and GPU.

There also appear to be some printf differences between yours and mine. I am running on linux, perhaps you are running on windows.

You can easily inspect the source code of the CUDA math library by looking at the header files math_functions.h and math_functions_dbl_ptx3.h that are part of your CUDA installation. You will readily observe that while there are often similarities in the design of the single-precision and double-precision versions of a given math function, they are not direct copies.

Differences in the results of single-precision math functions between sm_1x and later architectures, as noted by txbob, are primarily due to the availability of FMA (IEEE-754 compliant fused multiply-add) on the later architectures, while sm_1x only offers a similar, but numerically inferior, FMAD instruction. For some single-precision math functions, the availability of FMA has also prompted a redesign. You can readily observe this in the source code, where there are code sections #ifdef’ed based on __CUDA_ARCH__.

While the availability of FMA has made some single-precision math functions more accurate on sm_20 and later architectures, the error bounds in the CUDA documentation reflect the higher error bound on sm_1x in such cases; that is, the documentation states the worst-case errors observed across all supported platforms.

Thank you so much. Yes, that is the problem!

I recently moved to Visual Studio 2010 with Nsight 4.0 from GCC, but forgot that it defaults to “compute_10, sm_10”.

I develop code on my laptop (NVS 5400M) and run it on a K20.

Hi, njuffa,

My problem is already fixed with txbob’s help. Thanks for your suggestions. ^_^