Result of reduction in GPU do not match with the CPU's, also GPU's result vary with blocksize

Hello,
I am performing a summation reduction on the GPU using CUDA. The GPU gives a different value from the CPU and it also differ when I change the block size of the kernel launch. Here is the code in fortran,

    module test
      implicit none
      integer, parameter :: dp = selected_real_kind(15,307)

      integer :: n
      real (dp), allocatable, dimension(:) :: Vec, tmp, Vec1

      integer, device :: n_d
      real (dp), device, allocatable, dimension(:) :: Vec_d, tmp_d

    contains
      attributes(global) subroutine reduction()
        implicit none
        integer :: i, tid, stride
       !real (dp) :: tmp
        real (dp), shared :: sdata(blockDim%x)

        i  = (blockIdx%x-1)*blockDim%x + threadIdx%x
        tid= threadIdx%x

       !sdata(tid) = 0.
        sdata(tid) = Vec_d(i)

        stride = blockDim%x/2
        do while (stride>=1)
           call syncthreads()
           if (tid<=stride)  sdata(tid) = sdata(tid) + sdata(tid + stride)
           
           stride = stride/2
        end do
        call syncthreads()

        if (tid==1)    tmp_d(blockIdx%x) = sdata(1)
      end subroutine reduction
    end module test

    program main
      use test
      implicit none

      integer :: i, threads, blocks
      real (dp):: total

      n=438048
      allocate(Vec(n))
      allocate(Vec_d(n))

      open(11, file='vector.dat')
      do i=1,n
         read(11,*) Vec(i)
      end do
      close(11)

      Vec_d=Vec

      threads=256
      blocks= (n-1)/threads + 1

      allocate(tmp(blocks))
      allocate(tmp_d(blocks))

      call reduction<<<blocks, threads, threads*8, 0 >>>()

      tmp=tmp_d
        
      total = 0.
      do i=1,blocks
         total = total + tmp(i)
      end do

      print*, 'sum gpu: ', total

      total = 0.
      do i=1,n
         total = total + Vec(i)
      end do

      print*, 'sum cpu: ', total
      print*, 'Blocks: ', blocks
      print*, 'Threads: ', threads

    end program main

The input vector is extracted from a different program, uploaded here,
(https://mega.nz/file/5r4F1AYA#DcnxBHbWUqXSUIu9B2nnGj8iCmV2A31pWq71xycHe9c)

The sample output that I’ve got, (compiled by: pgfortran -Mr8 code.cuf)

    sum gpu:    9.9229944015915367E+028
    sum cpu:    9.9229944015916106E+028
    Blocks:          3423
    Threads:           128

    sum gpu:    9.9229944015915402E+028
    sum cpu:    9.9229944015916106E+028
    Blocks:          6845
    Threads:            64

    sum gpu:    9.9229944015915438E+028
    sum cpu:    9.9229944015916106E+028
    Blocks:         13689
    Threads:            32

The results are from Tesla V100 and Intel Xeon CPU. Why are the results different?
Any advice would be appreciated.
Thank you.

Hi DarkShadoW,

Why are the results different?

Short answer: rounding error and order of operations.

Given pointing point numbers can not be precisely represented, floating point operations do have a small amount of rounding error. For reductions, the round error gets amplified as the number of values accumulated increases.

Secondly, the order in which the operations are performed is important. For the CPU since you’re running this sequentially, the same order is used and hence the same accumulated error is computed. When run in parallel, the values are accumulated in a different order, hence have different rounding, so the accumulated error is different.

Comparisons like this are not expected to yield the exact same results and instead should be viewed within a tolerance level. The absolute tolerance is quite high, but only because the resulting values are very large. However the relative tolerance is actually quite low.

Hope this helps,
Mat

Hi mkcolg,

I would like to thank you for making such a clear and lucid explanation about the error related to floating point representation and how the order of operations matters. However I would like to ask some additional pointers on the issue.

Since the absolute error accumulation is quite large in parallel, the original algorithm branches out to different paths unless the block size is set lower like 32 or 64 threads. If I simply copy the vector to the CPU and then sum it, the algorithm works perfect, but it suffers a lot in the speedup due to the large sized copy and serial addition operation. So are there any other way to accomplish this in the GPU and not lose the speedup?

Thanks

Hi DarkShadoW,

Keep in mind that the issue here is not specific to a GPU. Performing a reduction in parallel, either on a GPU or CPU will yield slightly different result.

Also, the CPU result is “wrong” as well, in that there’s still rounding error that gets accumulated. The real question you’re asking is how to I get consistent answer between a parallel and serial reduction? And the answer is that you can’t guarantee this, at least not with this basic reduction. You may do a web search and find papers on alternate specialized algorithms, but the typically way of mitigating this is to increase the precision, but since your already using double precision and quad precision is not supported, you’ll need to decide which “wrong” answers, parallel or serial, is more correct.

To help illustrate this, I reworked your code to use OpenACC instead of CUDA Fortran so I could compile it to run on either a multi-core CPU or GPU, and run it either serially or in parallel.

    program main
      use test
      implicit none

      integer :: i
      real (dp):: ptotal,stotal,ctotal

      n=438048
      allocate(Vec(n))
      open(11, file='vector.dat')
      do i=1,n
         read(11,*) Vec(i)
      end do
      close(11)

      ptotal=0.
!$acc parallel loop reduction(+:ptotal)
      do i=1,n
         ptotal = ptotal + Vec(i)
      end do
      write(*,'(A20,F40.1)'), 'sum parallel:', ptotal

      stotal = 0.
!$acc serial loop
      do i=1,n
         stotal = stotal + Vec(i)
      end do
      write(*,'(A20,F40.1)'), 'sum serial:', stotal

      ctotal = 0.
      do i=1,n
         ctotal = ctotal + Vec(i)
      end do

      write(*,'(A20,F40.1)'), 'sum cpu:', ctotal

      print *, "ABS Tolerance:"
      write(*,'(A20,F40.1)'), 'cpu vs. parallel: ', abs(ctotal-ptotal)
      write(*,'(A20,F40.1)'), 'cpu vs. serial: ', abs(ctotal-stotal)
      print *, "REL Tolerance:"
      write(*,'(A20,F40.38,A)'), 'cpu vs. parallel: ', 100*(abs(ctotal-ptotal) / min(abs(ctotal),abs(ptotal))),'%'
      write(*,'(A20,F40.38,A)'), 'cpu vs. serial: ', 100*(abs(ctotal-stotal) / min(abs(ctotal),abs(stotal))),'%'

    end program main

First, let’s compile it for the GPU:

% nvfortran -ta=tesla test3.f90 ; a.out
       sum parallel:         99229944015915560813553778688.0
         sum serial:         99229944015916106171321155584.0
            sum cpu:         99229944015916106171321155584.0
 ABS Tolerance:
  cpu vs. parallel:                        545357767376896.0
    cpu vs. serial:                                      0.0
 REL Tolerance:
  cpu vs. parallel: 0.00000000000054958991742394378867308608%
    cpu vs. serial: 0.00000000000000000000000000000000000000%

As you can see, the serial GPU result matches the serial reduction on the CPU.

Recompiling targeting multi-core CPU (i.e. the parallel loop is run across multiple CPU cores), we see similar results:

% nvfortran -ta=multicore test3.f90 ; a.out
       sum parallel:         99229944015915578405739823104.0
         sum serial:         99229944015916106171321155584.0
            sum cpu:         99229944015916106171321155584.0
 ABS Tolerance:
  cpu vs. parallel:                        527765581332480.0
    cpu vs. serial:                                      0.0
 REL Tolerance:
  cpu vs. parallel: 0.00000000000053186121041026814350007200%
    cpu vs. serial: 0.00000000000000000000000000000000000000%

Another thing to keep in mind is that optimization can effect accuracy as well. Here I’m adding the “-fast” option:

% nvfortran -ta=multicore -fast test3.f90 ; a.out
       sum parallel:         99229944015915525629181689856.0
         sum serial:         99229944015915560813553778688.0
            sum cpu:         99229944015915560813553778688.0
 ABS Tolerance:
  cpu vs. parallel:                         35184372088832.0
    cpu vs. serial:                                      0.0
 REL Tolerance:
  cpu vs. parallel: 0.00000000000003545741402735122723715574%
    cpu vs. serial: 0.00000000000000000000000000000000000000%
% nvfortran -ta=tesla -fast test3.f90 ; a.out
       sum parallel:         99229944015915560813553778688.0
         sum serial:         99229944015916106171321155584.0
            sum cpu:         99229944015915560813553778688.0
 ABS Tolerance:
  cpu vs. parallel:                                      0.0
    cpu vs. serial:                        545357767376896.0
 REL Tolerance:
  cpu vs. parallel: 0.00000000000000000000000000000000000000%
    cpu vs. serial: 0.00000000000054958991742394378867308608%

As you can see, the “sum cpu” result changes and now is different between the serial GPU result since different optimization are applied. Interestingly, the parallel GPU result is now consistent with the host CPU, but don’t take that to mean that this would occur in all cases, just this particular case.

By no means is this a comprehensive or complete primer on floating point numerical accuracy for which I would suggest you do more research if you want to learn more. Though to simply things, I think of floating point arithmetic as always being “wrong”, it’s more a matter of how “wrong” it is. Many things effect accuracy such as the algorithm being used, order of operations, the floating point representation (meaning the precision), the compiler optimizations being applied, and the hardware that’s being used. For different configurations, strict bit-for-bit may not be achievable and instead should be compared within a relative tolerance, which in your case, is quite strict.

-Mat

Thanks a lot Mat.