Different results in emu vs. release mode

Hi everyone,

In my kernel implementation, I see different numerical results when running in emulation mode vs. release mode. Are there any known issues regarding this?


There are lots. Differences in precision for single-precision transcendentals, double precision, serialization of threads, lack of MAD on the CPU, storing intermediate results in x87 registers (so operations are effectively performed in DP)… yeah, there are lots, and this is a very brief list.
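
To make the intermediate-precision point concrete, here is a small host-side C sketch (not CUDA, and not taken from the emulator itself). The function names are made up for illustration. It evaluates a*a + c twice: once with the product rounded to single precision before the add, and once with the product held in double, roughly what happens when a host compiler keeps intermediates in a wider register.

```c
#include <stdio.h>

/* a*a + c with the product rounded to single precision before the add.
   The volatile store forces the intermediate to be rounded to float. */
float square_add_single(float a, float c)
{
    volatile float p = a * a;
    return p + c;
}

/* Same expression, but the intermediate product is kept in double.
   The product of two floats is exact in double (24 x 24 bits fits in 53). */
float square_add_wide(float a, float c)
{
    double p = (double)a * (double)a;
    return (float)(p + (double)c);
}

/* Hypothetical helper: prints both results for one pair of inputs. */
void show_difference(float a, float c)
{
    printf("single intermediate: %g\n", square_add_single(a, c));
    printf("wide intermediate:   %g\n", square_add_wide(a, c));
}
```

Assuming IEEE 754 round-to-nearest on the host, calling show_difference(0x1.001p0f, -0x1.002p0f), i.e. a = 1 + 2^-12 and c = -(1 + 2^-11), should give 0 for the first version and 2^-24 (about 5.96e-8) for the second. Same source expression, two different answers, purely because of where the rounding happens.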

-deviceemu isn’t really a device emulator, from what I can tell. If you disassemble the binary, it appears that the kernel is just passed to your C compiler, with inline functions for all the CUDA-specific stuff. It then launches one pthread per thread in your kernel launch, which is not so fun.

99% of the time it’s best to just write your kernel piece by piece: test it for hangs and out-of-bounds accesses in deviceemu mode, then test the intermediate results on the GPU, and write a little bit more. You’ll hit tons of bugs and have to write workarounds (less efficient ways of doing things, but they work).
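
When checking intermediate results from the GPU against a host reference, a bitwise comparison will fail for the reasons listed earlier in the thread, so a tolerance check is usually more useful. Below is a minimal sketch in plain C. The name first_mismatch and the way the tolerances are combined are my own choices for illustration, not anything from the CUDA toolkit, and sensible tolerance values depend on your kernel.

```c
#include <stddef.h>

/* Returns the index of the first element where gpu[i] and ref[i] differ by
   more than abs_tol + rel_tol * |ref[i]|, or -1 if all n elements agree.
   NaN values are reported as mismatches. */
long first_mismatch(const float *gpu, const float *ref, size_t n,
                    float abs_tol, float rel_tol)
{
    for (size_t i = 0; i < n; ++i) {
        float diff = gpu[i] - ref[i];
        if (diff < 0.0f) diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        /* Written as !(diff <= bound) so that a NaN diff also fails. */
        if (!(diff <= abs_tol + rel_tol * mag))
            return (long)i;
    }
    return -1;
}
```

After copying an intermediate buffer back from the device, something like first_mismatch(gpu_buf, ref_buf, n, 1e-6f, 1e-5f) tells you where the two first disagree, which helps narrow down which piece of the kernel to look at.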