I am porting the serial version of my code to the GPU using OpenACC. Although the speedup is good and the solution output looks correct, I am seeing a slight (very slight) difference in the solution output (in my case, the density of a flow field) between the CPU (serial) and GPU (parallel) runs.
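For context, here is a minimal, simplified sketch (hypothetical names, not my actual kernel) of the kind of OpenACC loop I am offloading. A parallel sum reduction like this accumulates in a different order on the GPU than the sequential CPU loop does, so I would expect the low-order bits of the result to differ:

```fortran
! Simplified sketch, not my actual code: a parallel reduction is the
! typical place where CPU and GPU results diverge in the last bits,
! because floating-point addition is not associative and the GPU
! sums the terms in a different order than the serial loop.
program reduction_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000
  real(dp) :: rho(n), total
  integer :: i

  do i = 1, n
     rho(i) = 1.0_dp / real(i, dp)
  end do

  total = 0.0_dp
  !$acc parallel loop reduction(+:total) copyin(rho)
  do i = 1, n
     total = total + rho(i)
  end do
  !$acc end parallel loop

  print *, 'sum = ', total
end program reduction_demo
```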
I want to know whether this is common and observed by others too, or whether I have a small bug in my GPU port that I should track down. Please also let me know if any of you have experienced something similar while porting your code.
It is worth noting that the program I am testing is very sensitive to the initial conditions (because of the nature of the underlying equations) and also to the intermediate states reached during the time loop: even small changes in intermediate values during execution can alter the final solution to an observable extent.
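To quantify "very slight", I compare the two outputs with a relative tolerance rather than expecting bitwise equality. A sketch of the kind of check I mean, assuming the two density fields are available as plain arrays (names are placeholders):

```fortran
! Hypothetical check (array names are placeholders): report the largest
! relative difference between the CPU and GPU density fields, instead of
! testing for bitwise equality.
subroutine compare_fields(rho_cpu, rho_gpu, n)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in) :: n
  real(dp), intent(in) :: rho_cpu(n), rho_gpu(n)
  real(dp) :: rel, max_rel
  integer :: i

  max_rel = 0.0_dp
  do i = 1, n
     ! Guard the denominator so zero-valued cells do not divide by zero.
     rel = abs(rho_gpu(i) - rho_cpu(i)) / max(abs(rho_cpu(i)), tiny(1.0_dp))
     max_rel = max(max_rel, rel)
  end do
  print *, 'max relative difference = ', max_rel
end subroutine compare_fields
```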
I am running the program in Fortran, in double precision, on an NVIDIA V100, compiled with pgfortran.
I posted this question on Stack Overflow, but it was removed for some reason. I hope my question is clear.