Error when parallelizing this C++ code

Hi

I’ve tried to write a CUDA kernel to replace this C++ code:

for (size_t i = 0; i < i_bus; i++) {
    for (size_t k = 0; k < i_bus; k++) {
        P(i) = P(i) + V(i) * V(k) * (G(i,k) * cos(del(i) - del(k)) + B(i,k) * sin(del(i) - del(k)));
        Q(i) = Q(i) + V(i) * V(k) * (G(i,k) * sin(del(i) - del(k)) - B(i,k) * cos(del(i) - del(k)));
    }
}

My kernel is the following:

__global__ void computePQ(double* del, double* G, double* B, double* Q,
                          double* P, double* V, int i_bus)
{
    // One thread per bus: thread tid accumulates P[tid] and Q[tid].
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < i_bus) {
        for (int i = 0; i < i_bus; i++) {
            // G and B are stored row-major, so G(tid, i) is G[tid * i_bus + i].
            P[tid] += V[tid] * V[i] * (G[tid * i_bus + i] * cos(del[tid] - del[i]) + B[tid * i_bus + i] * sin(del[tid] - del[i]));
            Q[tid] += V[tid] * V[i] * (G[tid * i_bus + i] * sin(del[tid] - del[i]) - B[tid * i_bus + i] * cos(del[tid] - del[i]));
        }
    }
}
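
For context, a kernel like this would typically be launched roughly as follows. This is a simplified sketch, not my actual host code; the buffer names (d_del, d_G, etc.) and the bus count are illustrative, and the device buffers are assumed to be allocated and filled beforehand.

// Simplified launch sketch (illustrative only; d_del, d_G, d_B, d_Q, d_P, d_V
// are device buffers assumed to be allocated with cudaMalloc and filled via cudaMemcpy).
int i_bus = 16;                                   // example bus count
size_t vecBytes = i_bus * sizeof(double);

// The kernel accumulates with +=, so P and Q must start at zero on the device.
cudaMemset(d_P, 0, vecBytes);
cudaMemset(d_Q, 0, vecBytes);

int threads = 128;
int blocks  = (i_bus + threads - 1) / threads;    // one thread per bus
computePQ<<<blocks, threads>>>(d_del, d_G, d_B, d_Q, d_P, d_V, i_bus);
cudaDeviceSynchronize();                          // check cudaGetLastError() as well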

I’ve checked the data of every vector and matrix passed as a parameter; they are the same on host and device. However, I get two different sets of results. This is the CPU output:

-7.83789e+006
 8.15785e+006
 319957
 319957
 319957
 -2.98023e-008
 -8.9407e-008
 0
 319957
 319957
 0
 0
 0
 319957
 0
 0

And this is the CUDA output:

8.15785e+006
 -7.83789e+006
 319957
 319957
 319957
 0
 0
 0
 319957
 319957
 0
 0
 0
 319957
 0
 0

I don’t understand why the results are similar but not exactly the same. From my point of view, the kernel is quite simple.

Thanks for your help.

This is impossible to diagnose conclusively because the posted snippet is not complete, buildable, runnable code. The first two results appear to be swapped between CPU and GPU. That may point to a bug in your code, or it could be a small rounding difference that is later amplified by a large factor, i.e. a numerical issue.

Absent errors in your code, there are two likely sources of numerical differences between host and device computation here: (1) The CUDA toolchain routinely contracts a floating-point multiplication and a dependent floating-point addition into a single operation, a fused multiply-add (FMA for short). This improves performance and often improves accuracy. You can inhibit the contraction by compiling with -fmad=false. (2) Transcendental functions like sin() and cos() do not, in general, return bit-identical results across different platforms; this applies to different host platforms as well.
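
Rather than expecting bit-identical output, it is usually more useful to compare the host and device results against a tolerance. A minimal sketch of such a check (the tolerance values are placeholders you would tune for your data):

#include <cmath>
#include <cstdio>

// Compare CPU and GPU result vectors with a combined absolute/relative
// tolerance instead of exact equality (tolerances are illustrative).
bool nearlyEqual(const double* a, const double* b, int n,
                 double relTol = 1e-12, double absTol = 1e-6)
{
    for (int i = 0; i < n; i++) {
        double diff = std::fabs(a[i] - b[i]);
        double tol  = absTol + relTol * std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (diff > tol) {
            std::printf("mismatch at %d: %g vs %g (diff %g)\n", i, a[i], b[i], diff);
            return false;
        }
    }
    return true;
}

Recompiling the kernel with -fmad=false, as mentioned above, is also a quick way to check whether FMA contraction accounts for the remaining difference.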

I would suggest reading http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf as well as the papers it references.