I have a huge program in CUDA Fortran, and I tried to parallelize a new subroutine. My problem is that it works perfectly when I build it in emulation mode using -o -Mcuda=emu, but when I build and run it without -Mcuda=emu, it runs with no errors yet the answers are not what they should be. I have tried many different things and I really don't know what I should do to find the problem. Are there particular reasons why a program would work well in emulation mode but not otherwise?
While emulation mode is very helpful in discovering some bugs, one thing it can't do is recreate the massively parallel environment of the GPU. So one possible cause is a race condition or some other memory (global or shared) contention issue. Also, the GPU may use different arithmetic instructions (such as FMA, fused multiply-add), which can yield slightly different answers.
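To make the race-condition case concrete, here is a minimal sketch of the kind of bug that may go unnoticed in emulation but give wrong answers on the GPU. The module, kernel, and variable names (race_demo, sum_racy, sum_atomic, x_d, total_d) are made up for illustration and are not taken from your code.

```fortran
module race_demo
  use cudafor
  implicit none
contains
  ! Racy version: many threads read-modify-write total(1) at the same time,
  ! so updates can be lost when the kernel runs on a real GPU.
  attributes(global) subroutine sum_racy(n, x, total)
    integer, value :: n
    real, device :: x(n), total(1)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) total(1) = total(1) + x(i)
  end subroutine sum_racy

  ! One possible fix: let the hardware serialize the update with atomicadd.
  attributes(global) subroutine sum_atomic(n, x, total)
    integer, value :: n
    real, device :: x(n), total(1)
    integer :: i
    real :: old
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) old = atomicadd(total(1), x(i))
  end subroutine sum_atomic
end module race_demo

program race_check
  use cudafor
  use race_demo
  implicit none
  integer, parameter :: n = 4096      ! hypothetical problem size
  real :: total(1)
  real, device :: x_d(n), total_d(1)

  x_d = 1.0                           ! the correct sum is therefore n

  total_d = 0.0
  call sum_racy<<<(n + 255) / 256, 256>>>(n, x_d, total_d)
  total = total_d
  print *, 'racy sum   =', total(1), '  expected =', real(n)

  total_d = 0.0
  call sum_atomic<<<(n + 255) / 256, 256>>>(n, x_d, total_d)
  total = total_d
  print *, 'atomic sum =', total(1), '  expected =', real(n)
end program race_check
```

If your kernel has any spot where more than one thread writes to the same global or shared location, or where one thread reads shared memory that another thread writes without a syncthreads() in between, that is the first place I would look.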
The first thing to try is adding the flag “-Mcuda=nofma” to disable the use of fused multiply-add instructions.
If that doesn’t fix the issue, create temporary arrays that capture intermediate values in your kernel. Then compare these values with the same ones captured when running in emulation mode. While this method does take a bit of hunting, it will help narrow down where in the code the numbers diverge.
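As a rough sketch of the temporary-array approach, assuming a simple kernel that stands in for yours: the kernel below takes an extra device array, dbg, and stores an intermediate value in it. The names (debug_kernels, saxpy_debug, dbg, dbg_values.txt) and the computation itself are hypothetical placeholders.

```fortran
module debug_kernels
  use cudafor
  implicit none
contains
  ! Stand-in kernel computing y = a*x + y, with the intermediate product
  ! a*x(i) saved in dbg(i) so it can be inspected on the host afterwards.
  attributes(global) subroutine saxpy_debug(n, a, x, y, dbg)
    integer, value :: n
    real, value :: a
    real, device :: x(n), y(n), dbg(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      dbg(i) = a * x(i)          ! captured intermediate value
      y(i) = dbg(i) + y(i)
    end if
  end subroutine saxpy_debug
end module debug_kernels

program capture_intermediates
  use cudafor
  use debug_kernels
  implicit none
  integer, parameter :: n = 1024      ! hypothetical problem size
  real :: x(n), y(n), dbg(n)
  real, device :: x_d(n), y_d(n), dbg_d(n)
  integer :: i

  x = [(real(i), i = 1, n)]
  y = 1.0
  x_d = x
  y_d = y

  call saxpy_debug<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d, dbg_d)

  ! Copy the captured intermediates and the final result back to the host.
  dbg = dbg_d
  y = y_d

  ! Write one line per element so two runs can be compared with a diff tool.
  open(unit=10, file='dbg_values.txt', status='replace')
  do i = 1, n
    write(10, '(i6, 2es16.8)') i, dbg(i), y(i)
  end do
  close(10)
end program capture_intermediates
```

Build and run it once with -Mcuda=emu and once without, renaming the output file between runs, then compare the two files. The first index where the values differ tells you which thread and which intermediate to look at. Keep in mind that storing an intermediate can itself change how the compiler generates code (for example, whether an FMA is used), so treat this as a way to narrow things down rather than as an exact replay.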
Hope this helps,