While learning a lot from the quite interesting SDK examples, I encountered simpleMultiGPU, which tells me that
CUDA-capable device count: 3
main(): generating input data…
main(): waiting for GPU results…
GPU Processing time: 309.846985 (ms)
Checking the results…
CPU Processing time: 38.532001 (ms)
GPU sum: 16777280.000000; CPU sum: 16777294.395033
Relative difference: 8.580068E-07
OK, I’m puzzled by the CPU being an order of magnitude FASTER than the GPU, but that’s another issue… (GPU memory latency, or the overhead of synchronizing the 3 host threads? It must be the latter…)
What I want to mention instead is that the numerical accuracy supposedly shown by the 8.58e-7 relative error has not, in fact, been demonstrated.
The size of the problem in the SDK sample is DATA_N = 1048576*32 floats added together. But whenever you try to add more than about 16.8 million floats (e.g., 1.0f’s) in a serial loop or an equivalent scheme, the sum stops accumulating once it reaches 2^24 = 16777216, on a CPU, a GPU, or any other PU, as long as you use single-precision arithmetic. (You can try any language/compiler whatsoever.) The reason: at that magnitude the spacing between consecutive representable floats is 2, so adding 1.0f rounds right back to the same value.
I suggest that the authors of this CPU+multiGPU accuracy test change it slightly to provide a more meaningful test.
I’m not doubting the capability of GPUs, by the way. Quite the opposite. On a GPU, try any parallel scan (reduction) of a huge array of floats, say 1 billion values all equal to 1.f, and you’ll be amazed. Not only does the speed become what it should be, the accuracy also improves incredibly. As a matter of fact, I’m still a bit dizzy after seeing that the floating-point addition of 1G of 1.f’s gives the precise answer to, you know, 9 digits, which far exceeds (in this lucky case) the usual single-precision machine accuracy of about 7 digits.
The explanation is simple: the parallel reduction effectively degrades accuracy only log_2(N) times, not N times, where N is the number of 1’s summed, because log_2(N) is the depth of the reduction tree — the number of additions any single input value participates in on its way to the final sum.