Different results (CUDA Fortran)

Hello,

I have the following source code:

module test_gpu
   use cudafor
   implicit none
contains

   attributes(global) subroutine force_gpu(q, UEV1_m, block, koef, m3)
      real :: q(96), UEV1_m(block)     ! q is carried along but unused in this reduced example
      integer, value :: block, m3
      real, value :: koef
      real, shared :: bsum(32)
      real :: Ud_i, Udj, aq, UEV1
      integer :: i, j, tindex, tneighbor

      i = (blockidx%x - 1)*blockdim%x + threadidx%x

      ! Per-thread partial sum (threads with i > m3 contribute zero)
      UEV1 = 0.0
      if (i <= m3) then
         Ud_i = 0.0
         do j = 1, m3
            if (i /= j) then
               aq = koef/2
               Udj = aq
               Ud_i = Ud_i + Udj
            end if
         end do
         UEV1 = UEV1 + Ud_i
      end if

      ! Tree reduction of the 32 per-thread sums in shared memory
      tindex = threadidx%x
      bsum(tindex) = UEV1
      call syncthreads()
      tneighbor = 16
      do while (tneighbor >= 1)
         if (tindex <= tneighbor) then
            bsum(tindex) = bsum(tindex) + bsum(tindex + tneighbor)
         end if
         tneighbor = tneighbor/2
         call syncthreads()
      end do

      ! Thread 1 writes this block's result
      if (tindex == 1) UEV1_m(blockidx%x) = bsum(1)
   end subroutine force_gpu

end module test_gpu


program test
   use test_gpu
   implicit none
   real, device :: q_d(96)
   real :: q(96)
   real :: koef, UEV1, UEV2, Ud_i, Udj, aq
   integer :: block, m3, i, j, z, istat
   real, device, allocatable :: UEV1_m(:)
   real, allocatable :: UEV1_cp(:)

   koef = 1.30462647E-04
   m3 = 96
   do z = 1, 5000
      block = (m3/32) + 1   ! one block more than strictly needed; the i <= m3 guard handles it
      UEV1 = 0.0
      UEV2 = 0.0
      allocate (UEV1_m(block), UEV1_cp(block))

      ! Sum on the GPU: one partial sum per block, combined on the host
      call force_gpu<<<block, 32>>>(q_d, UEV1_m, block, koef, m3)
      istat = cudaThreadSynchronize()
      UEV1_cp = UEV1_m
      do i = 1, block
         UEV1 = UEV1 + UEV1_cp(i)
      end do
      deallocate (UEV1_m, UEV1_cp)

      ! The same sum computed sequentially on the host
      do i = 1, m3
         Ud_i = 0.0
         do j = 1, m3
            if (i /= j) then
               aq = koef/2
               Udj = aq
               Ud_i = Ud_i + Udj
            end if
         end do
         UEV2 = UEV2 + Ud_i
      end do
   end do
   write (*, *) UEV1
   write (*, *) UEV2
end program test

Results:
UEV1 = 0.5949091
UEV2 = 0.5949095

I do not understand why the results differ.

Floating-point addition is not associative, so if you add terms in a different order (which frequently happens when an algorithm is translated to the GPU), you get slightly different answers. Compiler and architecture differences, such as the use of fused multiply-add (FMA) instructions, can also cause slight changes.
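Here is a minimal sketch of the non-associativity itself. It uses plain Python floats (IEEE-754 doubles) rather than CUDA Fortran, since the effect is a property of the arithmetic, not of the language: the same three numbers, summed with two different groupings, give two different results because each intermediate sum is rounded.

```python
# Same three values, two different groupings of the additions.
left = (0.1 + 0.2) + 0.3   # rounds 0.1 + 0.2 first
right = 0.1 + (0.2 + 0.3)  # rounds 0.2 + 0.3 first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

This is exactly what happens between the host loop and the GPU reduction: both compute a mathematically identical sum, but the rounding of the intermediate results differs.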

You should take a look at this whitepaper for more details:

In addition to reading the whitepaper, I would also recommend having a look at the references it cites.

Note that differences due to non-associative floating-point addition can occur even on the host, for example when a summation is split into multiple partial sums that are combined at the end to allow for SIMD vectorization. In these cases results can differ between builds using full optimization that includes SIMD vectorization and builds without optimization. Depending on the sign and magnitude of the summands, summation order can have a significant impact on results.
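A small Python illustration of the partial-sum effect described above, with magnitudes deliberately exaggerated so the difference is visible at a glance (real SIMD or GPU reductions produce much smaller discrepancies, like the two results in the question):

```python
import math

data = [1.0, 1e100, 1.0, -1e100]  # exact mathematical sum is 2.0

# Left-to-right sum: 1e100 swallows the 1.0 terms, then cancels to 0.0.
seq = 0.0
for x in data:
    seq += x

# Two interleaved partial sums combined at the end, mimicking the
# summation order of a SIMD-vectorized loop (or per-thread GPU sums).
lanes = [0.0, 0.0]
for i, x in enumerate(data):
    lanes[i % 2] += x
par = lanes[0] + lanes[1]

print(seq)              # 0.0
print(par)              # 2.0
print(math.fsum(data))  # 2.0  (correctly rounded reference)
```

Same input, same operations, different order, different answer.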

If you need a sum computed very accurately, for example to minimize differences between platforms, you might want to look into using a compensated sum (see the Wikipedia article on “Kahan summation”, for example).
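A sketch of Kahan (compensated) summation in Python, again just to illustrate the idea rather than as a drop-in for the Fortran code. The running correction term `c` recovers the low-order bits that each individual addition would otherwise discard:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for x in values:
        y = x - c             # apply the correction from the previous step
        t = total + y         # low-order bits of y may be lost here...
        c = (t - total) - y   # ...so recover them algebraically
        total = t
    return total

# Naive left-to-right sum of 0.1 ten thousand times drifts away from 1000.0;
# the compensated sum stays within about an ulp of it.
naive = 0.0
for _ in range(10000):
    naive += 0.1

print(naive)                     # noticeably off from 1000.0 (~1e-10 drift)
print(kahan_sum([0.1] * 10000))  # very close to 1000.0
```

Note that in compiled languages the compensation step can be optimized away by aggressive floating-point optimization flags, so it usually needs to be compiled with strict IEEE semantics.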

Thanks for the answers