Cube computing difference in GPU and CPU?

15396750637 · November 1, 2017, 9:46am

_arch=sm_20;

I write a simple cuda program for a test about cube computing.The result is strange and i don’t know why.
Here is my whole code:

#include<stdio.h>
#include<stdlib.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include<math.h>

 void __global__ test(int N,double *dev_a,double *dev_b)
{
 int i = blockIdx.x * blockDim.x + threadIdx.x;
	if(i>=0 && i<N)
	{
		dev_b[i] = dev_a[i] + dev_a[i]*dev_a[i]*dev_a[i];
	}
}

int main()
{

double *a=NULL,*b=NULL,*b1=NULL;
double *dev_a,*dev_b;
double xrand1,xrand2,tx1;
int N=124;
int i=0,j=0;
int nthreads;

a=(double*)malloc(N*sizeof(double));
b=(double*)malloc(N*sizeof(double));
b1=(double*)malloc(N*sizeof(double));

cudaMalloc((void**)&dev_a, N*sizeof(double));	
cudaMalloc((void**)&dev_b, N*sizeof(double));   

srand(time(0)); 
for(j=1;j<10000;j++)
{

for(i=0; i<N; i++)
{   
        xrand1 = (1.0e0*rand())/RAND_MAX;
        tx1 = xrand1 ;
        a[i]=tx1;
        b[i]=0.0e0;
        b1[i]=0.0e0;
}

cudaMemcpy(dev_a , a, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b , b, N*sizeof(double), cudaMemcpyHostToDevice);

  nthreads = 64;
  test<<<(N+nthreads-1)/nthreads, nthreads>>>(N,dev_a,dev_b);


for(i=0;i<N;i++)
{
	b[i] = a[i] + a[i]*a[i]*a[i];

}

cudaMemcpy(b1, dev_b, N*sizeof(double), cudaMemcpyDeviceToHost);

for(i=0; i<N; i++)
{
	if(fabs(b[i]-b1[i])>= 1.0e-16)
	{
		printf("%d,%e\n",i,b[i]-b1[i]);
	}
}
	
	printf("%d\n",j);
	getchar();

}

free(a);
free(b);
free(b1);
cudaFree(dev_a);
cudaFree(dev_b);


return 0;

}

The error always exists.But when i change

dev_b[i] = dev_a[i] + dev_a[i]*dev_a[i]*dev_a[i];

into

dev_b[i] = dev_a[i];
  dev_b[i] += dev_a[i]*dev_a[i]*dev_a[i];

the error will disapper.

Hope someone explain this to me.

Robert_Crovella · November 1, 2017, 10:13am

I assume the “error” you are referring to is a discrepancy between host and device calculations, and that when you make the code change, you get no discrepancy printout.

I’m unable to reproduce that observation (I get discrepancy printout in either case), but I’m not testing on a cc2.0 device and I am on linux, whereas it appears you are probably on windows.

My guess would be a difference in FMA contraction of operations between the two different device computation cases. Probably @njuffa would be able to give a better diagnosis.

15396750637 · November 1, 2017, 10:48am

Thanks for your reply!
I’m so sorry that i make a mistake about the change in my code.And it supposed to be

dev_b[i] = dev_a[i]*dev_a[i]*dev_a[i];
dev_b[i] += dev_a[i];

I am on linux too.When i add the option “-famd=false”,the error disappers.
But i do not know there are some others rules that may affect my result.I appreciate you can read my last topic,that problem is an example.

Robert_Crovella · November 1, 2017, 11:08am

I think you meant -fmad not -famd

Yes, with that code change I can reproduce your observation.

The -fmad=false observation pretty much confirms the discrepancy in this case is due to FMA contraction of operations.

Usually, FMA contraction results in the same or better accuracy, not reduced accuracy, so my guess is that the CPU calculation is actually less accurate, and when you turn off FMA contraction/usage, the GPU result also becomes identically less accurate.

njuffa · November 1, 2017, 3:05pm

While the CUDA toolchain treats floating-point computation conservatively in general (no frequent re-associations that occur by default in various host compilers), it does have FMA contraction turned on by default because the FMA (fused multiply-add) is the central building block of the computational units of the GPU.

As already discussed, FMA contraction can be turned off by adding -fmad=false to the nvcc command line, but this will in general have a negative impact on performance and accuracy. This NVIDIA blog post has a link to an NVIDIA white paper on floating-point topics, including the use of FMA:

[url]https://devblogs.nvidia.com/parallelforall/everything-you-ever-wanted-know-about-floating-point-were-afraid-ask/[/url]

Note that modern CPUs also include an FMA instruction, so if your host compiler supports FMA contraction, you might want to try turning that on. Generally speaking, one should not expect bit-wise matching floating-point results between any two platforms: results can differ for any number of reasons, including different compiler or library versions.

Topic		Replies	Views
Floating point operations IEE compliance and debug mode CUDA Programming and Performance	3	998	April 4, 2013
Why are the calculations different between CPU and GPU? CUDA Programming and Performance	2	853	February 7, 2020
Is there a difference between GPU double precision and CPU double precision? CUDA Programming and Performance	14	10723	November 26, 2009
GPU Code and CPU Code output not matching till machine precision (i.e. 13 decimals places) CUDA Programming and Performance	22	819	August 9, 2023
discrepancy between CPU and GPU after a division (accuracy issue) CUDA Programming and Performance	3	1510	June 10, 2015
CPU and GPU floating point calculations Results are different CUDA Programming and Performance	6	21965	August 7, 2010
Weird result difference between release and debug even with -fmad=false CUDA Programming and Performance cuda	8	645	June 30, 2022
Floating Point Accuracy CUDA Programming and Performance	11	30429	April 6, 2013
Different results in Debug and Release mode compile CUDA Programming and Performance	9	4034	October 20, 2015
Are double precision functions in CUDA MATH API only the copy-paste version of single precision func CUDA Programming and Performance	4	1955	June 28, 2014

Cube computing difference in GPU and CPU?

Related topics