Precision mismatch with CUDA Fortran

AshutoshLondhe · October 18, 2023, 8:48am

Hello,

I am currently working on Re-engineering the SENGA+ application using the OPS-DSL framework. OPS-DSL is collection of software library and code translation tools which enables automatic parallelization on multi-core and many-core architecture for multi-block structured mesh algorithms.

[GitHub - OP-DSL/OPS: OPS is an API with associated libraries and preprocessors to generate parallel executables for applications on mulit-block structured meshes.]

In OPS-DSL, we need to re-write the DO-LOOPS in form on OPS_PAR_LOOP call + Elemental kernel.

// Sample loop
DO kc = 1,nzsize
        DO jc = 1,nysize
            DO ic = 1,nxsize
      
                fdiffa = functn(ic+1,jc,kc) - functn(ic-1,jc,kc)
                fdiffb = functn(ic+2,jc,kc) - functn(ic-2,jc,kc)
                fdiffc = functn(ic+3,jc,kc) - functn(ic-3,jc,kc)
                fdiffd = functn(ic+4,jc,kc) - functn(ic-4,jc,kc)
                fdiffe = functn(ic+5,jc,kc) - functn(ic-5,jc,kc)
      
                fderiv(ic,jc,kc) = acoffx*fdiffa + bcoffx*fdiffb  &
                                          + ccoffx*fdiffc + dcoffx*fdiffd  &
                                          + ecoffx*fdiffe
      
            END DO
        END DO
    END DO


// Elemental Kernel
SUBROUTINE dfbydx_kernel_interior(functn, fderiv)
        IMPLICIT NONE

        REAL(kind=8), dimension(1), intent(in) :: functn
        REAL(kind=8), dimension(1) :: fderiv
        REAL(kind=8) :: fdiffa, fdiffb, fdiffc, fdiffd, fdiffe

        fdiffa = functn(OPS_ACC1(1,0,0)) - functn(OPS_ACC1(-1,0,0))
        fdiffb = functn(OPS_ACC1(2,0,0)) - functn(OPS_ACC1(-2,0,0))
        fdiffc = functn(OPS_ACC1(3,0,0)) - functn(OPS_ACC1(-3,0,0))
        fdiffd = functn(OPS_ACC1(4,0,0)) - functn(OPS_ACC1(-4,0,0))
        fdiffe = functn(OPS_ACC1(5,0,0)) - functn(OPS_ACC1(-5,0,0))

        fderiv(OPS_ACC2(0,0,0)) = acoffx*fdiffa + bcoffx*fdiffb  &
                                                    + ccoffx*fdiffc + dcoffx*fdiffd  &
                                                    + ecoffx*fdiffe

    END SUBROUTINE dfbydx_kernel_interior

// OPS_PAR_LOOP call
    rangexyz = (/1,nxsize, 1,nysize, 1,nzsize/)
    call ops_par_loop(dfbydx_kernel_interior, "dfbydx_interior_scheme", senga_grid, 3, rangexyz,  &
                      ops_arg_dat(d_functn, 1, s3d_p500_to_m500_x, "real(8)", OPS_READ),  &
                      ops_arg_dat(d_fderiv, 1, s3d_000, "real(8)", OPS_WRITE))

//OPS_ARG_DAT (dataset, dim, stencil, data_type, access)

I pasted the code above showing how this loop transformation is done. Both these call and elemental kernel is then parsed by OPS code translation tool and relevant architecture specific template is generated for each loop.

Following is sample template generated for CPU version

MODULE DFBYDX_KERNEL_INTERIOR_MODULE
        USE OPS_FORTRAN_DECLARATIONS
        USE OPS_FORTRAN_RT_SUPPORT

        USE OPS_CONSTANTS
        USE ISO_C_BINDING
        IMPLICIT NONE

        contains

        !user function
        !DEC$ ATTRIBUTES FORCEINLINE :: dfbydx_kernel_interior
        SUBROUTINE dfbydx_kernel_interior(functn, fderiv)
	! function body from elemental kernel
        END SUBROUTINE

        SUBROUTINE dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
 dat1_base, dat2_base, start, end )
  	  …
  	  DO n_z = 1, end(3)-start(3)+1
    	        DO n_y = 1, end(2)-start(2)+1
      	             !DIR$ SIMD
      	              DO n_x = 1, end(1)-start(1)+1
        		        call dfbydx_kernel_interior( &
        			   & opsDat1Local(dat1_base+(n_x-1)*1 + (n_y-1)*xdim1*1 + &
                                              & (n_z-1)*ydim1*xdim1*1), &
                                              & opsDat2Local(dat2_base+(n_x-1)*1 + (n_y-1)*xdim2*1 + &
                                              & (n_z-1)*ydim2*xdim2*1) )
                            END DO
    	        END DO
  	  END DO
        END SUBROUTINE

        !host subroutine
        SUBROUTINE dfbydx_kernel_interior_host( userSubroutine, block, dim, range, &
              opsArg1, opsArg2)
  	  …	
  	  opsArgArray(1) = opsArg1
  	  opsArgArray(2) = opsArg2
	  …
  call ops_halo_exchanges(opsArgArray,2,range)
	  …
  	  call dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
  					      dat1_base, dat2_base, start, end )
	  …
  	  call ops_set_dirtybit_host(opsArgArray, 2)
  	  call ops_set_halo_dirtybit3(opsArg2,range)
	  …
  	  !Timing and data movement
  …
        END SUBROUTINE
END MODULE DFBYDX_KERNEL_INTERIOR_MODULE

Following is sample template generated for GPU

MODULE DFBYDX_KERNEL_INTERIOR_MODULE
       USE OPS_FORTRAN_DECLARATIONS
       USE OPS_FORTRAN_RT_SUPPORT

       USE OPS_CONSTANTS
       USE ISO_C_BINDING
       USE CUDAFOR
       
        contains

        ! user function
        attributes (device) SUBROUTINE dfbydx_kernel_interior_gpu(functn, fderiv)
    ! function body from elemental kernel
        END SUBROUTINE

        ! CUDA kernel function -- wrapper calling user kernel
        attributes (global) SUBROUTINE dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
                    dat1_base, dat2_base, &
                    size1, size2, size3 )
 	  …
                 n_z = blockDim%z * (blockIdx%z-1) + threadIdx%z
                 n_y = blockDim%y * (blockIdx%y-1) + threadIdx%y
                 n_x = blockDim%x * (blockIdx%x-1) + threadIdx%x
	   …
                 arg1 = (n_x-1) * 1*1 + (n_y-1) * 1*1 * xdim1_dfbydx_kernel_interior + (n_z-1) * 1*1 * &
               xdim1_dfbydx_kernel_interior * ydim1_dfbydx_kernel_interior
                 arg2 = (n_x-1) * 1*1 + (n_y-1) * 1*1 * xdim2_dfbydx_kernel_interior + (n_z-1) * 1*1 * &
               xdim2_dfbydx_kernel_interior * ydim2_dfbydx_kernel_interior
	   …
                 IF ((n_x-1) < size1 .AND. (n_y-1) < size2 .AND. (n_z-1) < size3) THEN
    	             call dfbydx_kernel_interior_gpu( opsDat1Local(dat1_base+arg1), &              
                                                                                       opsDat2Local(dat2_base+arg2) )
                 END IF
	   …
        END SUBROUTINE

        !host subroutine
        attributes (host) SUBROUTINE dfbydx_kernel_interior_host( userSubroutine, block, dim, & 
 range, opsArg1, opsArg2)
  	  …
     	  opsArgArray(1) = opsArg1
     	  opsArgArray(2) = opsArg2
	  …
  	  x_size = MAX(0,end(1)-start(1)+1)
  	  y_size = MAX(0,end(2)-start(2)+1)
  	  z_size = MAX(0,end(3)-start(3)+1)
 …
  	  grid = dim3( (x_size-1)/getOPS_block_size_x()+ 1, (y_size-1)/getOPS_block_size_y() + 1, 
                                        z_size)
  	  tblock = dim3(getOPS_block_size_x(),getOPS_block_size_y(),1)
  …
  	  !halo exchanges
  	  call ops_halo_exchanges(opsArgArray,2,range)
	  ! call kernel wrapper function, passing in data pointers
  	  call dfbydx_kernel_interior_wrap <<<grid, tblock>>> (opsDat1Local, opsDat2Local, &
  							            dat1_base, dat2_base, &
  						            	             x_size, y_size, z_size )
	  …
  	  istat = cudaDeviceSynchronize()
  	  …
  	  call ops_set_dirtybit_device(opsArgArray, 2)
 	  call ops_set_halo_dirtybit3(opsArg2,range)
	  …
  	  !Timing and data movement
  	  …
      END SUBROUTINE
END MODULE DFBYDX_KERNEL_INTERIOR_MODULE

Initially, when we compared the results of the original SENGA+ with the generated CPU version, we noticed differences only after the 12th or 13th decimal places, which was well within our expectations since we are using double precision type. However, when comparing these results with CUDA-accelerated computations, discrepancies began to surface as early as the 2nd or 3rd decimal places.

In our exploration of this issue, we discovered that the “Implicit none” statement inside the elemental kernel function was a contributing factor. Upon its removal, we achieved an exact match for the CPU version with the original results, which was a significant step forward.

However, we are still encountering challenges with the CUDA implementation, as differences persist after the 7th or 8th decimal places. This level of discrepancy falls short of our precision requirements. Our objective is to attain a precision match of at least the first 10 decimal places for the CUDA-accelerated computations.

Following is the information about the compiler we are using:

nvhpc/23.1

pgfortran (aka nvfortran) 23.1-0 64-bit target on x86-64 Linux -tp sandybridge

PGI Compilers and Tools

Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES

Compiler Flags: (Tesla P100 GPU)

-O3 -fast -gopt -Mdclchk -Kieee -nofma -Mnofpapprox -mcmodel=medium -Ktrap=unf -Mnofprelaxed -m64 -pc 64 -Mcuda=cc60 -Mcuda=cc60

Please provide suggestions on how we can achive the precision
Do Let me know if any more information is needed.

MatColgrove · October 18, 2023, 4:53pm

While I don’t know the specific of your code, accuracy difference between a serial code run on a CPU versus a massively parallel code run on a GPU are expected. This is primarily due to the order of operations and how rounding error get applied. Algorithmic changes could also be a factor.

While very slow, if you change your grid and block size to 1, the code will run serially on the device and you can see how much difference is due to parallelism.

Another possibility is due to FMA. While FMA often gives more accurate results (less rounding error), it can give different results than those generated without. Here you do disable FMA via the “-nofma” flag. Are the CPU results you’re comparing also without FMA?

Other possibilities are use of math intrinsics (sqrt, division, sin, cos, etc.) or errors in the code such as race conditions, data not sync’d between the host and device, out-of-bounds access, etc.

-Mat

AshutoshLondhe · October 19, 2023, 2:18pm

Hi Mat,

Following is a summary for differnces we are observing.
I re-ran the experiments with following flags

-O3 -fast -gopt  -Mdclchk -Kieee -Mnofpapprox -mcmodel=medium -Ktrap=unf -Mnofprelaxed -m64 -pc 64

For comparing the datasets, i was using h5diff commnad with --delta option

Here is a summary:

When comparing the Original SENGA with CPU version generated using OPS-DSL library

No differences observed when using --delta=1e-14

When comparing the Original SENGA with GPU version (CUDA) generated using OPS-DSL library

For most of the datasets no difference observed using --delta=1e-12
For 2 datasets no differences were observed using --delta=1e-7

When Comparing the Original SENGA with Mix-GPU version (kernels involving math fiunction ran on CPU and remaining on GPU) generated by OPS-DSL library

For most of the datasets no difference observed using --delta=1e-13
For 2 datasets no differences were observed using --delta=1e-10

Also as per your suggestion, I also ran GPU version with different block sizes. I used (x=32,y=2) and (x=64,y=4) blocksizes and compared the results of these two. Its a exact match, which suggest that parallelization is not causing an issue here.

Please take into consideration that the comparison I’ve performed is based on data from only 128 iterations, whereas in real-world scenarios, datasets like the Taylor Green Vortex evolve over millions of iterations. I am worried that in the case the errors can accumulate over time, and differences may start emerging from the very beginning decimal places as the simulation progresses.

Topic		Replies	Views
Rounding error problems CUDA Programming and Performance	22	2269	June 28, 2017
Accuracy in GPU floating point calculations CUDA Programming and Performance	35	8229	September 9, 2011
Fermi architecture details where can I find them? CUDA Programming and Performance	16	4005	April 8, 2012
GPU/CPU precision comparison and Kernel instructions question CUDA Programming and Performance	5	679	April 4, 2017
performance difference for cuda between experiments and the documentation for float/double data type... CUDA Programming and Performance	8	1900	October 28, 2016
Accuracy of 1D linear interpolation by CUDA texture interpolation CUDA Programming and Performance	25	14623	January 29, 2013
CUDA book by Kirk & Whu available CUDA Programming and Performance	44	12109	February 10, 2010
What can't you do in CUDA that you'd like? Requests for the future CUDA Programming and Performance	407	134568	May 26, 2010
Unacceptably low accuracy of calculations on the GPU CUDA Programming and Performance cuda	12	1695	May 15, 2024
What are you guys doing with cuda? just wanna find a way to go CUDA Programming and Performance	81	56090	February 7, 2013

Precision mismatch with CUDA Fortran

Related topics