About GPU calculation accuracy

Hello, developers:
I have a computational accuracy problem in my CUDA Fortran program (compiled with PGI 19.10). The kernel that performs the calculation is below:

attributes(global) subroutine diff_u(he,h,u,qx,sx,sy,manning,eps,nx,ny,mbc_1,mbc_2,mbc_3,mbc_4)
  integer, value :: nx,ny,mbc_1,mbc_2,mbc_3,mbc_4
  real*8 :: he(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8 :: h(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8 :: u(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8 :: qx(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8 :: sx(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8 :: sy(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8 :: manning(1:nx+mbc_1+mbc_2,1:ny+mbc_3+mbc_4)
  real*8, value :: eps
  integer :: i,j
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  u(i,j) = (-(sx(i,j)+eps)/(abs(sx(i,j)+eps))) *(1/manning(i,j))*he(i,j)**(2./3.) &
end subroutine diff_u

The values of u computed on the GPU and on the CPU differ at some points, such as:

If you have encountered a similar problem, please share the solution. Thank you.

Bit-for-bit exactness should not be expected between results from the CPU and GPU, especially since you’re using square root, divides, and pow. Instead, you should check whether the results agree to within an acceptable tolerance.
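As a rough illustration of a tolerance check, here is a minimal sketch in plain Fortran. The names within_tol, u_cpu, u_gpu, rtol and atol are made up for this example and do not come from the original program, and the tolerance values are placeholders that you would choose to suit your own problem.

```fortran
! Sketch: compare CPU and GPU results with a mixed absolute/relative tolerance.
! within_tol, u_cpu, u_gpu, rtol and atol are hypothetical names.
module compare_mod
  implicit none
contains
  logical function within_tol(u_cpu, u_gpu, rtol, atol)
    real(8), intent(in) :: u_cpu(:,:), u_gpu(:,:)
    real(8), intent(in) :: rtol, atol
    ! Pass only if every element satisfies |cpu - gpu| <= atol + rtol*|cpu|
    within_tol = all(abs(u_cpu - u_gpu) <= atol + rtol*abs(u_cpu))
  end function within_tol
end module compare_mod

program demo
  use compare_mod
  implicit none
  real(8) :: a(2,2), b(2,2)
  a = reshape([1.0d0, 2.0d0, 3.0d0, 4.0d0], [2,2])
  b = a
  b(1,1) = b(1,1) + 1.0d-14      ! simulate a difference in the last digits
  print *, 'within tolerance: ', within_tol(a, b, 1.0d-12, 1.0d-14)
  print *, 'max abs difference:', maxval(abs(a - b))
end program demo
```

The absolute term keeps the test meaningful where the reference value is close to zero, and the relative term scales the allowed difference with the magnitude of the result.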

That said, you may wish to try adding the compilation flags “-Kieee -Mnofma”. -Kieee uses more precise versions of pow and dsqrt, and -Mnofma disables FMA (fused multiply-add) instructions. The GPU uses FMA by default, while on the CPU FMA may or may not be used, depending on the processor. Note that these flags may slow the code down.

The problem has been solved by your method. Really appreciate your help.

In my program, I used managed data to store data used by both the device and the host.
It compiles and runs under Windows, but when it is compiled and run under Linux, the following problem occurred.

How can I solve this problem?

“Illegal memory address error” is a generic error meaning that a bad memory address was accessed on the GPU. You’ll need to do some debugging to determine the cause. Does it occur when not using nvprof?
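As one possible starting point for the debugging, the sketch below checks the error status after a kernel launch and guards the kernel against out-of-range thread indices. Out-of-range indices are a common cause of this error when the grid is rounded up past the array size, though that may not be the cause here. The kernel scale_kernel, the array a_d, the size n and the launch configuration are all hypothetical.

```fortran
! Sketch only: scale_kernel, a_d, n and the launch configuration are hypothetical.
module kernels
  use cudafor
  implicit none
contains
  attributes(global) subroutine scale_kernel(a, n)
    integer, value :: n
    real(8) :: a(n)
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    ! Guard: threads past the end of the array must not touch memory
    if (i <= n) a(i) = 2.0d0*a(i)
  end subroutine scale_kernel
end module kernels

program check_errors
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 1000
  real(8), device :: a_d(n)
  integer :: istat

  a_d = 1.0d0
  call scale_kernel<<<(n + 255)/256, 256>>>(a_d, n)

  istat = cudaGetLastError()          ! errors from the launch itself
  if (istat /= cudaSuccess) print *, 'launch: ', trim(cudaGetErrorString(istat))

  istat = cudaDeviceSynchronize()     ! errors raised while the kernel ran
  if (istat /= cudaSuccess) print *, 'kernel: ', trim(cudaGetErrorString(istat))
end program check_errors
```

Checking after each launch narrows down which kernel triggers the bad access, since the error is otherwise reported at whichever later CUDA call happens to notice it.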

Note that CUDA Unified Memory only manages dynamically allocated data. Static data still needs to be managed manually.
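To make the distinction concrete, here is a minimal CUDA Fortran sketch that handles an allocatable managed array and a static array side by side. The names add_one, a, b, b_d and n are invented for this example, and it assumes the static array gets an explicit device copy, as described above.

```fortran
! Sketch only: add_one, a, b, b_d and n are hypothetical names.
module kernels_m
  use cudafor
  implicit none
contains
  attributes(global) subroutine add_one(x, n)
    integer, value :: n
    real(8) :: x(n)
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) x(i) = x(i) + 1.0d0
  end subroutine add_one
end module kernels_m

program managed_vs_static
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 1024
  real(8), managed, allocatable :: a(:)   ! dynamically allocated, managed
  real(8) :: b(n)                         ! static host array
  real(8), device :: b_d(n)               ! explicit device copy of b
  integer :: istat

  allocate(a(n))
  a = 0.0d0
  b = 0.0d0

  ! Managed array: the same variable is used on host and device
  call add_one<<<(n + 255)/256, 256>>>(a, n)
  istat = cudaDeviceSynchronize()         ! finish before the host reads a

  ! Static array: copy to the device, run, copy back
  b_d = b                                 ! host -> device
  call add_one<<<(n + 255)/256, 256>>>(b_d, n)
  b = b_d                                 ! device -> host

  print *, a(1), b(1)
  deallocate(a)
end program managed_vs_static
```

The synchronize call before reading a on the host matters because kernel launches are asynchronous, whereas the assignment b = b_d is a blocking copy.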