precision difference between emulation mode and cpu


I have a certain piece of CUDA code and an ‘equivalent’ C host code (I mean code meant to do the same task).

Let me define –
Device emu result: output obtained when the CUDA code is compiled and run in the device emulation mode,
CPU result: output obtained from the equivalent C code,
GPU result: output obtained by compiling and running CUDA code on GPU.

Now I find that whereas the device emu result and the GPU result match, the device emu and the CPU results do not match perfectly. My understanding is that in the device emu mode the computations are performed on the CPU. How should I understand this difference then? Is the only conclusion to be drawn that there is a problem in my code?

Thanks in advance!!

That probably depends on the magnitude of the error. Floating point arithmetic isn’t associative (presuming this is floating point you are asking about), so the order in which operations are performed changes the rounded result. Between the three versions of your code you effectively have

  1. CUDA using the device FPU, running with a warp size of 32

  2. CUDA using the host FPU, running with a warp size of 1

  3. host FPU running serially

There are plenty of ways each can disagree, and it would be completely normal for them to diverge. The verification/correctness question should not be whether they diverge, but whether the magnitude of the divergence is acceptable and explainable.