I've run into a very strange problem. Every time I run my CUDA code, the results differ slightly from run to run, but in device emulation mode they are consistent. The code is somewhat complicated, but it is essentially a sparse matrix times a vector, with absolutely no randomization. Does anybody have any insight into this problem?
It’s hard to tell from the description. Possible explanations include:
- The kernel is not successfully running, but aborting early. Are you checking error codes?
- The kernel has a race condition which only manifests on the GPU. (Device emulation is not really a good emulation of all the behavior of the device.)
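To make the first point concrete, here is a minimal sketch of checking error codes around a kernel launch. The kernel `scale` and the helpers `check` and `run_scale` are hypothetical names made up for illustration, not anything from the original poster's code. It assumes a reasonably recent toolkit where `cudaDeviceSynchronize` is available (older toolkits used `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: scales each element of x in place.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Print a message if a CUDA call failed; return true on success.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Launch the kernel on device memory d_x (n > 0) and check both kinds of error.
bool run_scale(float *d_x, int n, float a)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, n, a);

    // Errors detected at launch time, such as an invalid configuration.
    if (!check(cudaGetLastError(), "kernel launch")) return false;

    // Errors that occur while the kernel runs are only reported
    // once the host waits for the kernel to finish.
    return check(cudaDeviceSynchronize(), "kernel execution");
}
```

A kernel launch returns immediately, so checking only the launch can miss a kernel that aborts partway through. Checking the result of the synchronization call as well is what catches that case.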
Thank you for your reply. I checked for errors and did not find any. I am worried that there is a race condition on the GPU, but I don't know how to check for one. Do you have any ideas?
I would also like to know how to check for race conditions. As far as I know, atomic operations can help avoid them, but I understand that atomic operations are a performance killer.
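As a rough illustration of the kind of race being discussed, here is a sketch of a sparse matrix-vector product in coordinate (COO) format with one thread per nonzero. This layout is an assumption chosen for the example; the original poster's kernel was not shown. All names (`spmv_coo_racy`, `spmv_coo_atomic`, `spmv_coo_host`) are hypothetical. Floating-point `atomicAdd` requires hardware of compute capability 2.0 or later.

```cuda
#include <cuda_runtime.h>

// Racy version: threads handling nonzeros in the same row perform an
// unsynchronized read-modify-write on the same y[row[k]], so updates can be lost.
__global__ void spmv_coo_racy(const int *row, const int *col, const float *val,
                              const float *x, float *y, int nnz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nnz) y[row[k]] += val[k] * x[col[k]];
}

// Atomic version: each update to y[row[k]] is applied indivisibly.
__global__ void spmv_coo_atomic(const int *row, const int *col, const float *val,
                                const float *x, float *y, int nnz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nnz) atomicAdd(&y[row[k]], val[k] * x[col[k]]);
}

// Host wrapper (nnz > 0): copies inputs to the device, zeroes y, runs the
// atomic kernel, and copies y back. Returns false if any CUDA call fails.
bool spmv_coo_host(const int *row, const int *col, const float *val, int nnz,
                   const float *x, int ncols, float *y, int nrows)
{
    int *d_row = nullptr, *d_col = nullptr;
    float *d_val = nullptr, *d_x = nullptr, *d_y = nullptr;
    size_t ni = nnz * sizeof(int), nf = nnz * sizeof(float);

    bool ok =
        cudaMalloc(&d_row, ni) == cudaSuccess &&
        cudaMalloc(&d_col, ni) == cudaSuccess &&
        cudaMalloc(&d_val, nf) == cudaSuccess &&
        cudaMalloc(&d_x, ncols * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_y, nrows * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_row, row, ni, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_col, col, ni, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_val, val, nf, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_x, x, ncols * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_y, 0, nrows * sizeof(float)) == cudaSuccess;

    if (ok) {
        int threads = 128;
        int blocks = (nnz + threads - 1) / threads;
        spmv_coo_atomic<<<blocks, threads>>>(d_row, d_col, d_val, d_x, d_y, nnz);
        // The blocking copy back waits for the kernel on the default stream.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(y, d_y, nrows * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree on a null pointer is a no-op, so this is safe after a partial failure.
    cudaFree(d_row); cudaFree(d_col); cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
    return ok;
}
```

Note that atomics remove the lost updates but do not fix the order in which the additions happen. Because floating-point addition is not associative, the low-order bits of a sum accumulated this way may still vary from run to run.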
The Ocelot emulator will report all shared memory race conditions. http://code.google.com/p/gpuocelot/