Would these two be equivalent? I sometimes get slightly different results and am not sure whether this is down to using different compilers, numerical accuracy, etc.
Also, when running the code, Windows XP shuts down now and then. This happens at arbitrary points. I’m using an
NVIDIA GTX 285. Each kernel runs for much less than 1 second, has __syncthreads() at the end, and memory usage should not be a problem.
Has anyone else experienced a similar problem? Is this perhaps related to the GPU getting too hot? Fortunately, the computer
has recovered every time so far.
Other than this, I’m quite impressed with the speedup that can be achieved.
In C (“CUDA” is just standard C90 with a handful of syntax extensions to cover dealing with kernels and the non-flat memory space of the GPU), you would want something like:
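Since the original code isn’t quoted in this thread, here is only a sketch of the usual shape of such a translation. The kernel name (saxpy), the operation (y[i] = a*x[i] + y[i]), the array sizes and the launch configuration are all assumptions for illustration, not the poster’s actual code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Hypothetical element-wise kernel: y[i] = a * x[i] + y[i].
   One thread handles one array element. */
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          /* guard threads past the end of the array */
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    /* host-side input data */
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    /* device allocations and host-to-device copies */
    float *dx, *dy;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    /* launch enough blocks of 256 threads to cover all n elements */
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);
    cudaThreadSynchronize();            /* wait for the kernel to finish */

    /* copy the result back and spot-check it */
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    free(hx); free(hy);
    return 0;
}

Note that __syncthreads() only synchronizes threads within a block; it is the host-side synchronization (or an implicit one on the next blocking call) that guarantees the whole kernel has finished.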
But you should still expect some differences between the floating point results produced on the GPU and the CPU, mostly because the CPU will be using either 80-bit extended precision or IEEE-754 double precision internally and then rounding the results back to single precision afterwards. You should also remember that floating point arithmetic isn’t associative, and decomposing an algorithm into parallel steps often produces a different result than executing the same code serially.
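A minimal host-side illustration of that last point (not from the original thread): regrouping the same single-precision additions, which is effectively what a parallel reduction does with its per-block partial sums, can change the result in the low-order bits.

#include <stdio.h>

int main(void)
{
    /* Single-precision addition is not associative: the same three
       operands summed in different groupings give different results. */
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float left_to_right = (a + b) + c;  /* (1e8 - 1e8) + 1 = 1.0 */
    float regrouped     = a + (b + c);  /* 1.0 is absorbed into -1e8,
                                           so the sum comes out 0.0 */

    printf("(a + b) + c = %f\n", left_to_right);
    printf("a + (b + c) = %f\n", regrouped);
    return 0;
}

So a GPU reduction and a serial CPU loop over the same data can both be correct and still disagree slightly.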