I am currently parallelising a code for the GPU, and the GPU output does not match the serial output closely enough. Comparing the residual values after 2000 iterations, I get:
Serial code: -1.724213441322336
GPU code: -1.724213441293091
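To put the gap in perspective, here is a minimal Python sketch (not part of the actual code) that measures the difference between the two values quoted above, both in relative terms and in multiples of double-precision machine epsilon. The variable names are only illustrative.

```python
import sys

serial = -1.724213441322336  # serial result quoted above
gpu = -1.724213441293091     # GPU result quoted above

# Absolute and relative difference between the two runs.
abs_diff = abs(serial - gpu)
rel_diff = abs_diff / abs(serial)

# Express the relative difference in units of double-precision epsilon (~2.2e-16).
eps = sys.float_info.epsilon
rel_in_eps = rel_diff / eps

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3e}")
print(f"relative difference in units of epsilon: {rel_in_eps:.0f}")
```

The two results agree to about 10 significant digits. The relative difference is on the order of 1e-11, which is several orders of magnitude above a single double-precision rounding error and so reflects rounding differences accumulated over the 2000 iterations.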
I have ensured double precision at every step by appending d0 to constants and writing exponents with a ‘d’ instead of an ‘e’.
A common fix suggested on this forum is to disable FMA (fused multiply-add). That reduced the discrepancy, but the GPU output still diverged further from the serial output as I ran more iterations.
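For context on why FMA changes results at all, the following is a small Python sketch, assuming hand-picked illustrative values rather than anything from the actual code. A fused multiply-add rounds once after computing a*b + c exactly, while separate multiply and add instructions round twice. Here the single rounding is emulated with exact rational arithmetic from the standard `fractions` module.

```python
from fractions import Fraction

# Illustrative values (an assumption, not taken from the real code).
# All three are exactly representable in double precision.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Unfused: a*b is rounded to double first, then the add is rounded again.
unfused = a * b + c

# Emulated FMA: compute a*b + c exactly, then round once to double.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print("unfused:", unfused)
print("fused:  ", fused)
print("equal:  ", unfused == fused)
```

In this example the exact product is 1 - 2^-60, which rounds to 1.0 in double precision, so the unfused result loses the small term entirely while the fused result keeps it. Neither result is a bug: the two instruction sequences simply round differently, and such differences can grow over many iterations.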
I am running the code on an A100 GPU (compute capability 8.0).
I appreciate any help you can provide.
PS: If it would help to look at the problematic modules, I would be happy to provide them.