Precision mismatch with CUDA Fortran

Hello,

I am currently working on Re-engineering the SENGA+ application using the OPS-DSL framework. OPS-DSL is collection of software library and code translation tools which enables automatic parallelization on multi-core and many-core architecture for multi-block structured mesh algorithms.

[GitHub - OP-DSL/OPS: OPS is an API with associated libraries and preprocessors to generate parallel executables for applications on mulit-block structured meshes.]

In OPS-DSL, we need to re-write the DO-LOOPS in form on OPS_PAR_LOOP call + Elemental kernel.

// Sample loop
DO kc = 1,nzsize
        DO jc = 1,nysize
            DO ic = 1,nxsize
      
                fdiffa = functn(ic+1,jc,kc) - functn(ic-1,jc,kc)
                fdiffb = functn(ic+2,jc,kc) - functn(ic-2,jc,kc)
                fdiffc = functn(ic+3,jc,kc) - functn(ic-3,jc,kc)
                fdiffd = functn(ic+4,jc,kc) - functn(ic-4,jc,kc)
                fdiffe = functn(ic+5,jc,kc) - functn(ic-5,jc,kc)
      
                fderiv(ic,jc,kc) = acoffx*fdiffa + bcoffx*fdiffb  &
                                          + ccoffx*fdiffc + dcoffx*fdiffd  &
                                          + ecoffx*fdiffe
      
            END DO
        END DO
    END DO


// Elemental Kernel
SUBROUTINE dfbydx_kernel_interior(functn, fderiv)
        IMPLICIT NONE

        REAL(kind=8), dimension(1), intent(in) :: functn
        REAL(kind=8), dimension(1) :: fderiv
        REAL(kind=8) :: fdiffa, fdiffb, fdiffc, fdiffd, fdiffe

        fdiffa = functn(OPS_ACC1(1,0,0)) - functn(OPS_ACC1(-1,0,0))
        fdiffb = functn(OPS_ACC1(2,0,0)) - functn(OPS_ACC1(-2,0,0))
        fdiffc = functn(OPS_ACC1(3,0,0)) - functn(OPS_ACC1(-3,0,0))
        fdiffd = functn(OPS_ACC1(4,0,0)) - functn(OPS_ACC1(-4,0,0))
        fdiffe = functn(OPS_ACC1(5,0,0)) - functn(OPS_ACC1(-5,0,0))

        fderiv(OPS_ACC2(0,0,0)) = acoffx*fdiffa + bcoffx*fdiffb  &
                                                    + ccoffx*fdiffc + dcoffx*fdiffd  &
                                                    + ecoffx*fdiffe

    END SUBROUTINE dfbydx_kernel_interior

// OPS_PAR_LOOP call
    rangexyz = (/1,nxsize, 1,nysize, 1,nzsize/)
    call ops_par_loop(dfbydx_kernel_interior, "dfbydx_interior_scheme", senga_grid, 3, rangexyz,  &
                      ops_arg_dat(d_functn, 1, s3d_p500_to_m500_x, "real(8)", OPS_READ),  &
                      ops_arg_dat(d_fderiv, 1, s3d_000, "real(8)", OPS_WRITE))

//OPS_ARG_DAT (dataset, dim, stencil, data_type, access)

I pasted the code above showing how this loop transformation is done. Both these call and elemental kernel is then parsed by OPS code translation tool and relevant architecture specific template is generated for each loop.

Following is sample template generated for CPU version

MODULE DFBYDX_KERNEL_INTERIOR_MODULE
        USE OPS_FORTRAN_DECLARATIONS
        USE OPS_FORTRAN_RT_SUPPORT

        USE OPS_CONSTANTS
        USE ISO_C_BINDING
        IMPLICIT NONE

        contains

        !user function
        !DEC$ ATTRIBUTES FORCEINLINE :: dfbydx_kernel_interior
        SUBROUTINE dfbydx_kernel_interior(functn, fderiv)
	! function body from elemental kernel
        END SUBROUTINE

        SUBROUTINE dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
 dat1_base, dat2_base, start, end )
  	  …
  	  DO n_z = 1, end(3)-start(3)+1
    	        DO n_y = 1, end(2)-start(2)+1
      	             !DIR$ SIMD
      	              DO n_x = 1, end(1)-start(1)+1
        		        call dfbydx_kernel_interior( &
        			   & opsDat1Local(dat1_base+(n_x-1)*1 + (n_y-1)*xdim1*1 + &
                                              & (n_z-1)*ydim1*xdim1*1), &
                                              & opsDat2Local(dat2_base+(n_x-1)*1 + (n_y-1)*xdim2*1 + &
                                              & (n_z-1)*ydim2*xdim2*1) )
                            END DO
    	        END DO
  	  END DO
        END SUBROUTINE

        !host subroutine
        SUBROUTINE dfbydx_kernel_interior_host( userSubroutine, block, dim, range, &
              opsArg1, opsArg2)
  	  …	
  	  opsArgArray(1) = opsArg1
  	  opsArgArray(2) = opsArg2
	  …
  call ops_halo_exchanges(opsArgArray,2,range)
	  …
  	  call dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
  					      dat1_base, dat2_base, start, end )
	  …
  	  call ops_set_dirtybit_host(opsArgArray, 2)
  	  call ops_set_halo_dirtybit3(opsArg2,range)
	  …
  	  !Timing and data movement
  …
        END SUBROUTINE
END MODULE DFBYDX_KERNEL_INTERIOR_MODULE

Following is sample template generated for GPU

MODULE DFBYDX_KERNEL_INTERIOR_MODULE
       USE OPS_FORTRAN_DECLARATIONS
       USE OPS_FORTRAN_RT_SUPPORT

       USE OPS_CONSTANTS
       USE ISO_C_BINDING
       USE CUDAFOR
       
        contains

        ! user function
        attributes (device) SUBROUTINE dfbydx_kernel_interior_gpu(functn, fderiv)
    ! function body from elemental kernel
        END SUBROUTINE

        ! CUDA kernel function -- wrapper calling user kernel
        attributes (global) SUBROUTINE dfbydx_kernel_interior_wrap( opsDat1Local, opsDat2Local, &
                    dat1_base, dat2_base, &
                    size1, size2, size3 )
 	  …
                 n_z = blockDim%z * (blockIdx%z-1) + threadIdx%z
                 n_y = blockDim%y * (blockIdx%y-1) + threadIdx%y
                 n_x = blockDim%x * (blockIdx%x-1) + threadIdx%x
	   …
                 arg1 = (n_x-1) * 1*1 + (n_y-1) * 1*1 * xdim1_dfbydx_kernel_interior + (n_z-1) * 1*1 * &
               xdim1_dfbydx_kernel_interior * ydim1_dfbydx_kernel_interior
                 arg2 = (n_x-1) * 1*1 + (n_y-1) * 1*1 * xdim2_dfbydx_kernel_interior + (n_z-1) * 1*1 * &
               xdim2_dfbydx_kernel_interior * ydim2_dfbydx_kernel_interior
	   …
                 IF ((n_x-1) < size1 .AND. (n_y-1) < size2 .AND. (n_z-1) < size3) THEN
    	             call dfbydx_kernel_interior_gpu( opsDat1Local(dat1_base+arg1), &              
                                                                                       opsDat2Local(dat2_base+arg2) )
                 END IF
	   …
        END SUBROUTINE

        !host subroutine
        attributes (host) SUBROUTINE dfbydx_kernel_interior_host( userSubroutine, block, dim, & 
 range, opsArg1, opsArg2)
  	  …
     	  opsArgArray(1) = opsArg1
     	  opsArgArray(2) = opsArg2
	  …
  	  x_size = MAX(0,end(1)-start(1)+1)
  	  y_size = MAX(0,end(2)-start(2)+1)
  	  z_size = MAX(0,end(3)-start(3)+1)
 …
  	  grid = dim3( (x_size-1)/getOPS_block_size_x()+ 1, (y_size-1)/getOPS_block_size_y() + 1, 
                                        z_size)
  	  tblock = dim3(getOPS_block_size_x(),getOPS_block_size_y(),1)
  …
  	  !halo exchanges
  	  call ops_halo_exchanges(opsArgArray,2,range)
	  ! call kernel wrapper function, passing in data pointers
  	  call dfbydx_kernel_interior_wrap <<<grid, tblock>>> (opsDat1Local, opsDat2Local, &
  							            dat1_base, dat2_base, &
  						            	             x_size, y_size, z_size )
	  …
  	  istat = cudaDeviceSynchronize()
  	  …
  	  call ops_set_dirtybit_device(opsArgArray, 2)
 	  call ops_set_halo_dirtybit3(opsArg2,range)
	  …
  	  !Timing and data movement
  	  …
      END SUBROUTINE
END MODULE DFBYDX_KERNEL_INTERIOR_MODULE

Initially, when we compared the results of the original SENGA+ with the generated CPU version, we noticed differences only after the 12th or 13th decimal places, which was well within our expectations since we are using double precision type. However, when comparing these results with CUDA-accelerated computations, discrepancies began to surface as early as the 2nd or 3rd decimal places.

In our exploration of this issue, we discovered that the “Implicit none” statement inside the elemental kernel function was a contributing factor. Upon its removal, we achieved an exact match for the CPU version with the original results, which was a significant step forward.

However, we are still encountering challenges with the CUDA implementation, as differences persist after the 7th or 8th decimal places. This level of discrepancy falls short of our precision requirements. Our objective is to attain a precision match of at least the first 10 decimal places for the CUDA-accelerated computations.

Following is the information about the compiler we are using:

nvhpc/23.1

pgfortran (aka nvfortran) 23.1-0 64-bit target on x86-64 Linux -tp sandybridge

PGI Compilers and Tools

Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES

Compiler Flags: (Tesla P100 GPU)

-O3 -fast -gopt -Mdclchk -Kieee -nofma -Mnofpapprox -mcmodel=medium -Ktrap=unf -Mnofprelaxed -m64 -pc 64 -Mcuda=cc60 -Mcuda=cc60

Please provide suggestions on how we can achive the precision
Do Let me know if any more information is needed.

While I don’t know the specific of your code, accuracy difference between a serial code run on a CPU versus a massively parallel code run on a GPU are expected. This is primarily due to the order of operations and how rounding error get applied. Algorithmic changes could also be a factor.

While very slow, if you change your grid and block size to 1, the code will run serially on the device and you can see how much difference is due to parallelism.

Another possibility is due to FMA. While FMA often gives more accurate results (less rounding error), it can give different results than those generated without. Here you do disable FMA via the “-nofma” flag. Are the CPU results you’re comparing also without FMA?

Other possibilities are use of math intrinsics (sqrt, division, sin, cos, etc.) or errors in the code such as race conditions, data not sync’d between the host and device, out-of-bounds access, etc.

-Mat

Hi Mat,

Following is a summary for differnces we are observing.
I re-ran the experiments with following flags

-O3 -fast -gopt  -Mdclchk -Kieee -Mnofpapprox -mcmodel=medium -Ktrap=unf -Mnofprelaxed -m64 -pc 64

For comparing the datasets, i was using h5diff commnad with --delta option

Here is a summary:

  1. When comparing the Original SENGA with CPU version generated using OPS-DSL library

No differences observed when using --delta=1e-14

  1. When comparing the Original SENGA with GPU version (CUDA) generated using OPS-DSL library

For most of the datasets no difference observed using --delta=1e-12
For 2 datasets no differences were observed using --delta=1e-7

  1. When Comparing the Original SENGA with Mix-GPU version (kernels involving math fiunction ran on CPU and remaining on GPU) generated by OPS-DSL library

For most of the datasets no difference observed using --delta=1e-13
For 2 datasets no differences were observed using --delta=1e-10

Also as per your suggestion, I also ran GPU version with different block sizes. I used (x=32,y=2) and (x=64,y=4) blocksizes and compared the results of these two. Its a exact match, which suggest that parallelization is not causing an issue here.

Please take into consideration that the comparison I’ve performed is based on data from only 128 iterations, whereas in real-world scenarios, datasets like the Taylor Green Vortex evolve over millions of iterations. I am worried that in the case the errors can accumulate over time, and differences may start emerging from the very beginning decimal places as the simulation progresses.