GPU code and CPU code output not matching to machine precision (i.e. 13 decimal places)

I am currently working on parallelising a code, and I am facing an issue with the accuracy of the GPU code output compared with the serial code output. I am comparing residue values after 2000 iterations, and the results are as follows:
Serial code: -1.724213441322336
GPU code:    -1.724213441293091

I have ensured double precision at all steps by suffixing constants with d0 and even writing exponents with a 'd' instead of an 'e'.
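
For illustration, this is the kind of thing I mean (hypothetical snippet):

```
real*8 :: x
x = 0.1      ! default-real literal: single-precision value widened to double
x = 0.1d0    ! true double-precision literal
x = 1.5e-8   ! 'e' exponent: single-precision literal
x = 1.5d-8   ! 'd' exponent: double-precision literal
```
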
A common fix suggested on this forum was to disable FMA (fused multiply-add). While that improved the accuracy, the GPU code output kept diverging as more iterations were run.

I am running the code on an A100 GPU (compute capability 8.0).

I appreciate any help you can provide.

PS: If there is a need to look at the problematic modules, I would be happy to provide them.

Floating-point math is not associative. If the executed order of operations does not exactly match the order in the CPU version, the GPU results may differ. This may be the case, for example, in a parallel reduction.
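
A toy example of the effect (illustrative values, not your code):

```
program assoc_demo
implicit none
real*8 :: a, b, c
a = 1.0d100
b = -1.0d100
c = 1.0d0
! one association: (a + b) + c = 0 + 1 = 1
print *, (a + b) + c
! another association, as a differently-ordered reduction might compute:
! b + c rounds back to b, so a + (b + c) = 0
print *, a + (b + c)
end program assoc_demo
```
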

I have ensured that the GPU code's order of operations matches the CPU code's. The actual problem arises when a particular if condition executes: there is a flag that, when turned on, starts producing diverging outputs. If that flag is set to 0, there is no precision problem, but if it is set to 1, the output loses precision.

I was wondering whether the presence of very small operands could be causing the precision issue.

Can you show a minimal reproducer for your observations?

Here are the problematic parts:

```
if(flag .eq. 1) then
!
   do j = 1, yes(i)
!
      k = yls(j,i)
!
      x_k = x1(k)
      y_k = y1(k)
      z_k = z1(k)
!
      delx = x_k - x_i
      dely = y_k - y_i
      delz = z_k - z_i
!
      temp = delx*dq2(1,:,k) + dely*dq2(2,:,k) + delz*dq2(3,:,k)
      var  = q2(:,k) - 0.5d0*temp
      call func(var, f_k, i, q2, qm1, d)
!
      temp = delx*dq2(1,:,i) + dely*dq2(2,:,i) + delz*dq2(3,:,i)
      var  = q2(:,i) - 0.5d0*temp
      call func(var, f_i, i, q2, qm1, d)

      do r = 1, 5
         if(f_k(r) .lt. f(r)) then
            f(r) = f_k(r)
         endif
         if(f_i(r) .lt. f(r)) then
            f(r) = f_i(r)
         endif
      enddo
   enddo
endif
```

I am also attaching the function that is embedded inside:

```
attributes(device) subroutine func(var, f_func, k, q3, qm3, d_func)
!
implicit none
!
integer :: r
real*8 :: q, dn, dp
real*8 :: max_q, min_q, ds, epsi, num, den, temp

integer :: k
real*8 :: var(5), f_func(5)

real*8, dimension(:,:,:), device :: qm3
real*8, dimension(:,:), device :: q3
real*8, dimension(:), device :: d_func

real*8, constant :: const = 0.1d0
!
do r = 1, 5
   q  = q3(r, k)
   dn = var(r) - q
!
   if(dabs(dn) .le. 10d-6) then
      f_func(r) = 1.0d0
   else if(dabs(dn) .gt. 10d-6) then
      if(dn .gt. 0.0d0) then
         dp = qm3(1,r,k) - q
      else if(dn .lt. 0.0d0) then
         dp = qm3(2,r,k) - q
      endif
!
      epsi = const*d_func(k)
      epsi = epsi**3.0d0

      num = (dp*dp) + (epsi*epsi)   ! Numerator ..
      num = num*dn + 2.0d0*dn*dn*dp

      den = dp*dp + 2.0d0*dn*dn     ! Denominator ..
      den = den + dn*dp + epsi*epsi
      den = den*dn

      temp = num/den
!
      if(temp .lt. 1.0d0) then
         f_func(r) = temp
      else
         f_func(r) = 1.0d0
      endif
   endif
enddo
!
end subroutine
```

By minimal reproducer I meant a minimal program that can be compiled and run and that shows the differences between GPU and CPU. I cannot help you with Fortran code, though.

If you have ensured that there are no bugs in your code, it is still possible that CPU and GPU results may differ. That does not necessarily mean that one or the other is incorrect. This CUDA document may be worth a read: Floating Point and IEEE 754 Compliance

Thank you!

Let me challenge the statement that accuracy was improved by disabling FMA contraction with -fmad=false. Unless it can be demonstrated (by comparison with a high-precision reference, for example) that the serial code produces results that are in fact more accurate than the GPU code, I would claim that all that has been shown is that the difference between the result from serial code and GPU code was reduced.

While FMA contraction can on occasion lead to a reduction in the accuracy of a computation, the far more frequent case is that it improves accuracy through two mechanisms: (1) by incorporating the unrounded, fully accurate product into the sum and rounding once at the end, accumulation of round-off error is reduced (2) by incorporating the unrounded, fully accurate product into the sum it offers protection against certain cases of subtractive cancellation.
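
A minimal sketch of mechanism (1), assuming a Fortran 2018 compiler that provides IEEE_FMA (and compiled with FMA contraction disabled, so the plain expression is not itself contracted):

```
program fma_demo
use, intrinsic :: ieee_arithmetic, only: ieee_fma
implicit none
real*8 :: x
x = 1.0d0 + 2.0d0**(-27)
! x*x is rounded before the subtraction, so the 2^-54 term of the
! exact square is lost:
print *, x*x - x*x                 ! 0.0
! the FMA uses the unrounded product, exposing that rounding error:
print *, ieee_fma(x, x, -(x*x))    ! 2^-54, approx. 5.55e-17
end program fma_demo
```
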

While use of an arbitrary-precision library is an imperfect substitute for true mathematical computation, I have found in practice that computing with three times the target precision provides a good “golden” reference to assess the accuracy of the computation under test. So when checking the accuracy of double-precision computation, I set the precision of my arbitrary-precision library to > 150 bits.
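
As an illustration of the workflow (not of the >150-bit recommendation), here is a sketch that uses quad precision (113 bits) via REAL128 as a stand-in reference for the num/den computation from func(). The input values are made up, and the code is meant to run on the host with a compiler that supports REAL128:

```
program golden_ref
use, intrinsic :: iso_fortran_env, only: real64, real128
implicit none
real(real64)  :: dn, dp, epsi, num, den, t
real(real128) :: dnq, dpq, epsiq, numq, denq, tq
dn = 1.0d-3; dp = -9.9d-4; epsi = 1.0d-6   ! hypothetical inputs
! computation under test, in double precision
num = (dp*dp + epsi*epsi)*dn + 2.0d0*dn*dn*dp
den = (dp*dp + 2.0d0*dn*dn + dn*dp + epsi*epsi)*dn
t   = num/den
! the same computation in quad precision as the reference
dnq = dn; dpq = dp; epsiq = epsi
numq = (dpq*dpq + epsiq*epsiq)*dnq + 2.0_real128*dnq*dnq*dpq
denq = (dpq*dpq + 2.0_real128*dnq*dnq + dnq*dpq + epsiq*epsiq)*dnq
tq   = numq/denq
print *, 'double  :', t
print *, 'quad ref:', tq
print *, 'rel err :', abs((real(t, real128) - tq)/tq)
end program golden_ref
```
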

It is not clear (to me, anyhow) what func() computes and whether it does so in the most numerically advantageous way. I notice multiple sums of products, which may benefit from an FMA-enhanced idiom, as I explained here:

Note: In order for code to appear readable in these forums, the appropriate markup must be used. One method is to highlight the code and then click the </> button at the top of the edit window. Another method (the one I prefer) is to precede and follow the code with a line consisting only of three back ticks: ```


Thank you for the detailed explanation!
Also, I might have misused the word "accuracy"; I meant matching the GPU code output to the CPU code output. I will look into the link that you provided and try to make changes accordingly.

Note that I am recommending to assess accuracy accurately before making any changes to the code. In other words, you would want to establish a baseline with respect to the accuracy of the computation.


Right, I got it

I will let you know if I'm facing issues or have some doubts about this

Thank you

Under quite benign assumptions (dp, dn, epsi uniformly distributed in [-1, 1]) it appears that the result of func() can incur large amounts of relative error.

The most advantageous arrangement of the computation from an accuracy and performance perspective that I have identified so far is:

```
num = fma (fma (2, dn, dp), dp, epsi * epsi);
den = fma (dp, dp + dn, fma (2 * dn, dn, epsi * epsi));
temp = num / den;
```

Not sure whether this can be expressed in Fortran; it seems even modern Fortran still lacks an intrinsic FMA function?

I could use the intrinsic fma function by including the libm module in CUDA Fortran. The residue value that I am getting now is -1.724213441419504.
Is it necessary to disable FMA contraction using -Mnofma or -fmad=false before using this arrangement?
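
For reference, this is roughly how I wrote it (a sketch; assumes the fma interface that CUDA Fortran's libm module provides for device code):

```
! inside func(), after adding: use libm
! njuffa's arrangement; the common factor dn has been cancelled from
! numerator and denominator, which leaves the ratio unchanged since
! dn /= 0 is guaranteed by the dabs(dn) test above
num  = fma(fma(2.0d0, dn, dp), dp, epsi*epsi)
den  = fma(dp, dp + dn, fma(2.0d0*dn, dn, epsi*epsi))
temp = num/den
```
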

The CUDA C++ compiler does not touch FMA operations specified by the programmer via calls to fma(). With -fmad=true, which is the nvcc default, the compiler is allowed to generate FMA operations by itself by contracting dependent FMUL and FADD operations. No other re-association of floating-point computation takes place, as far as I know.

Note that in even moderately complex expressions, there is often more than one way to perform FMA contraction, and the compiler will pick the one it likes best, which may not be the one the programmer has in mind. Consider a * b + c * d: this could be contracted as fma (a, b, c * d) or as fma (c, d, a * b), and the two versions will generally not produce identical results, because floating-point arithmetic, unlike real arithmetic, is not associative.

I don't know what CUDA Fortran does. In general, the Fortran standard is much more permissive regarding the re-association of floating-point expressions; as best I know, any compiler transformation that is mathematically equivalent is allowed. This can lead to some very undesirable behavior, for example when coding around potential overflow in intermediate computation, only to encounter overflow at run time because the compiler re-arranged the computation so that it is sensitive to the overflow issue again.

IMHO, C++ is clearly the superior choice compared to Fortran when trying to control numerical properties of floating-point computations.


Hmmm, that puts me in a soup then.
I have realised that any change I make to func() regarding the order of operations seems to change the output.
I will probably need to figure out how Fortran interprets this order of operations when run on the GPU.

This realization is valuable in and of itself. This computation seems to be somewhat ill-conditioned, and that has an impact on your final results, which are only accurate to about 10 digits with double-precision computation, the rest being noise.

I do not know what is being computed here. I tried to find out, under the assumption that this is part of an interpolation or integration scheme or part of a solver for differential equations, but could not find anything in the literature that matches.

You may want to revisit the issue at the higher algorithmic level to research whether there are more stable methods of computing whatever it is you are computing here. Or you may decide that, based on what the results are being used for, 10 decimal digits is sufficiently accurate (anything being measured or controlled in the physical world is typically fine with that kind of accuracy).

Thanks a lot for your advice.
I will look into the numerical instability aspect.
I also suspect some compiler-related flags may need to be turned on or off to better suit the computation.

As opposed to various host compilers, the CUDA C++ compiler treats floating-point computation quite conservatively. Double-precision computation should not be affected by flags other than -fmad. This defaults to true, primarily due to the central importance of FMA operations for GPU performance, but also due to the generally positive effect on overall computational accuracy.

The story is a bit different for single-precision computation, which is also affected by the -ftz, -prec-div, -prec-sqrt, and -use_fast_math flags; however, these all default to conservative settings.
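
For reference, a summary of the switches mentioned in this thread (assuming the NVIDIA toolchain; check your compiler's documentation for the authoritative list):

```
nvcc -fmad=false app.cu           # CUDA C++: disable FMA contraction (default is -fmad=true)
nvfortran -cuda -Mnofma app.cuf   # CUDA Fortran: the analogous switch used earlier in this thread
nvcc -use_fast_math app.cu        # single precision only: implies -ftz=true, -prec-div=false, -prec-sqrt=false
```
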

I had one question:
If FMA is inherently invoked in the GPU code, what is the need for the explicit use of the intrinsic fma() function?
How does it help with the output matching?

First off, “output matching” between host and device code is a Bad Practice. Unless you work at a bank maybe, where bug-for-bug compatibility is a thing, or so I have been told. The goal should be to generate results that are as accurate as possible, and yes, this means that in general results will not be bit-wise identical between different platforms. That is what documented error bounds are for. This was considered a perfectly normal state of affairs until 1990 or so.

When dealing with numerically sensitive floating-point code, specifying where to use FMAs manually takes the decision of how to re-associate a floating-point computation out of the hands of the compiler. As I showed above for the trivial computation a*b+c*d, the compiler has choices about where to use FMAs, and each choice (1) can be highly relevant to the accuracy of the computation and (2) can change without notice with every revision of the compiler. So for numerically sensitive code, I consider manual use of fma() a Best Practice. Examples within CUDA are the emulation sequences for divisions and square roots, and pretty much the entire standard math library.
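
A sketch that makes the two contraction choices visible, again assuming the Fortran 2018 IEEE_FMA function:

```
program contract_demo
use, intrinsic :: ieee_arithmetic, only: ieee_fma
implicit none
real*8 :: a, b, c, d
a = 1.0d0 + 2.0d0**(-27)
b = a                           ! exact a*b = 1 + 2^-26 + 2^-54
c = -(1.0d0 + 2.0d0**(-26))
d = 1.0d0
! the two legal contractions of a*b + c*d:
print *, ieee_fma(a, b, c*d)    ! unrounded a*b: prints 2^-54
print *, ieee_fma(c, d, a*b)    ! a*b rounded first: prints 0
end program contract_demo
```
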

In this case I ran a few quick experiments that already showed that the accuracy of func() is quite sensitive to re-association of the floating-point computation, and even with my simplistic assumption (arguments uniformly distributed across [-1,1]) I found it impossible to correctly predict which of the half dozen variants I tried was the most advantageous; I basically had to figure it out experimentally. There is a free online tool that attempts to tune floating-point expressions for accuracy automagically, but (much to my chagrin) it has never found solutions better than the ones I puzzled out myself, and I have tried it quite a few times over the past five years or so.
