Error when parallelizing this C++ code

Hi

I’ve tried to write a CUDA kernel to replace this C++ code:

for (size_t i = 0; i < i_bus; i++) {
    for (size_t k = 0; k < i_bus; k++) {
        P(i) = P(i) + V(i) * V(k) * (G(i,k) * cos(del(i) - del(k)) + B(i,k) * sin(del(i) - del(k)));
        Q(i) = Q(i) + V(i) * V(k) * (G(i,k) * sin(del(i) - del(k)) - B(i,k) * cos(del(i) - del(k)));
    }
}

My kernel is the following:

__global__ void computePQ(double* del, double* G, double* B, double* Q,
                          double* P, double* V, int i_bus)
{
    // One thread per bus: thread tid accumulates P[tid] and Q[tid].
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < i_bus) {
        for (int i = 0; i < i_bus; i++) {
            // G and B are stored row-major, so G(tid, i) is G[tid * i_bus + i].
            P[tid] += V[tid] * V[i] * (G[tid * i_bus + i] * cos(del[tid] - del[i]) + B[tid * i_bus + i] * sin(del[tid] - del[i]));
            Q[tid] += V[tid] * V[i] * (G[tid * i_bus + i] * sin(del[tid] - del[i]) - B[tid * i_bus + i] * cos(del[tid] - del[i]));
        }
    }
}
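
For context, a kernel like this would typically be launched roughly as follows. This is a simplified sketch, not my actual host code; the buffer names (d_del, d_G, etc.) and the bus count are illustrative, and the device buffers are assumed to be allocated and filled beforehand.

// Simplified launch sketch (illustrative only; d_del, d_G, d_B, d_Q, d_P, d_V
// are device buffers assumed to be allocated with cudaMalloc and filled via cudaMemcpy).
int i_bus = 16;                                   // example bus count
size_t vecBytes = i_bus * sizeof(double);

// The kernel accumulates with +=, so P and Q must start at zero on the device.
cudaMemset(d_P, 0, vecBytes);
cudaMemset(d_Q, 0, vecBytes);

int threads = 128;
int blocks  = (i_bus + threads - 1) / threads;    // one thread per bus
computePQ<<<blocks, threads>>>(d_del, d_G, d_B, d_Q, d_P, d_V, i_bus);
cudaDeviceSynchronize();                          // check cudaGetLastError() as well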

I’ve checked the data of every vector and matrix passed as a parameter; they are the same on host and device. However, I get two different sets of results. This is the CPU output:

-7.83789e+006
 8.15785e+006
 319957
 319957
 319957
 -2.98023e-008
 -8.9407e-008
 0
 319957
 319957
 0
 0
 0
 319957
 0
 0

And this is the CUDA output:

8.15785e+006
 -7.83789e+006
 319957
 319957
 319957
 0
 0
 0
 319957
 319957
 0
 0
 0
 319957
 0
 0

I don’t understand why the results are similar but not exactly the same. From my point of view, the kernel is quite simple.

Thanks for your help.

This is impossible to diagnose conclusively because the posted snippet is not complete, buildable, runnable code. The first two results appear to be swapped between CPU and GPU. That may point to a bug in your code, or it could be a small rounding difference that is later amplified by a large factor, i.e. a numerical issue.

Absent errors in your code, there are two likely sources of numerical differences between host and device computation here: (1) The CUDA toolchain routinely contracts a floating-point multiplication and a dependent floating-point addition into a single operation, a fused multiply-add (FMA for short). This improves performance and often improves accuracy. You can inhibit the contraction by compiling with -fmad=false. (2) Transcendental functions like sin() and cos() do not, in general, return bit-identical results across different platforms; this applies to different host platforms as well.
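
Rather than expecting bit-identical output, it is usually more useful to compare the host and device results against a tolerance. A minimal sketch of such a check (the tolerance values are placeholders you would tune for your data):

#include <cmath>
#include <cstdio>

// Compare CPU and GPU result vectors with a combined absolute/relative
// tolerance instead of exact equality (tolerances are illustrative).
bool nearlyEqual(const double* a, const double* b, int n,
                 double relTol = 1e-12, double absTol = 1e-6)
{
    for (int i = 0; i < n; i++) {
        double diff = std::fabs(a[i] - b[i]);
        double tol  = absTol + relTol * std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (diff > tol) {
            std::printf("mismatch at %d: %g vs %g (diff %g)\n", i, a[i], b[i], diff);
            return false;
        }
    }
    return true;
}

Recompiling the kernel with -fmad=false, as mentioned above, is also a quick way to check whether FMA contraction accounts for the remaining difference.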

I would suggest reading http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf as well as the papers it references.