SDK program simpleMultiGPU isn't testing what it's supposed to

While learning a lot from the SDK examples (which are quite interesting), I came across simpleMultiGPU, which reports the following:

CUDA-capable device count: 3
main(): generating input data…
main(): waiting for GPU results…
GPU Processing time: 309.846985 (ms)
Checking the results…
CPU Processing time: 38.532001 (ms)
GPU sum: 16777280.000000; CPU sum: 16777294.395033
Relative difference: 8.580068E-07
TEST PASSED
Shutting down…

OK, I’m puzzled by the CPU being an order of magnitude FASTER than the GPUs, but that’s another issue… (GPU RAM latency, or the synchronization of the 3 host threads? It must be the latter…)

What I want to mention instead is that the numerical accuracy supposedly shown by the 8.58e-7 relative error has not, in fact, been demonstrated.

The size of the problem in the SDK sample is DATA_N = 1048576*32 floats added together. But whenever you try to add more than about 16.8 million floats (e.g., 1.f’s) in a serial loop or an equivalent scheme, the sum stops growing once it reaches roughly 2^24 = 16777216, on CPU, GPU, or any other PU, as long as you use single-precision arithmetic: adding 1.0 to a float of that magnitude simply rounds back to the same value. (You can try any language/compiler whatsoever.)
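Here is a minimal host-side illustration of that ceiling (my own sketch, not code from the SDK sample):

    #include <cstdio>

    int main()
    {
        // Serial single-precision accumulation of 32M ones (same count as DATA_N).
        // Once sum reaches 2^24 = 16777216.0f, adding 1.0f rounds back to the
        // same value, so the sum stops growing roughly halfway through.
        float sum = 0.0f;
        for (long i = 0; i < 1048576L * 32; ++i)
            sum += 1.0f;
        printf("serial float sum of 32M ones: %f\n", sum);  // prints 16777216.000000
        return 0;
    }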

I suggest that the authors of this CPU+multiGPU accuracy test change it slightly to provide a more meaningful test.

I’m not doubting the capability of GPUs, by the way. Quite the opposite. On the GPU, try a parallel scan (reduction) of a huge array of floats, say 1 billion values all equal to 1.f, and you’ll be amazed. Not only does the speed become what it should be, the accuracy also improves incredibly. As a matter of fact, I’m still a bit dizzy after seeing that the floating-point addition of 1G of 1.f’s gives a precise answer to, you know, 9 digits, which far exceeds (in this lucky case) the normal single-precision machine accuracy.
The explanation is simple: a parallel (tree) reduction degrades accuracy only about log_2(N) times, not N times, where N is the number of 1’s summed, because log_2(N) is the number of times any single input value gets added into another number; and in this lucky case every partial sum of equal values is itself exactly representable.
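To make the same point on the host, here is a sketch of a pairwise (tree) summation; pairwise_sum is a name I made up for illustration, nothing from the SDK:

    #include <cstdio>

    // Illustrative sketch only: recursive pairwise summation. Each input value
    // passes through only log2(N) additions; for N equal values of 1.0f every
    // partial sum is an exact power of two, so the final result is exact.
    static float pairwise_sum(const float *x, long n)
    {
        if (n == 1)
            return x[0];
        long half = n / 2;
        return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
    }

    int main()
    {
        const long N = 1L << 26;                 // 64M ones (256 MB of floats)
        float *data = new float[N];
        for (long i = 0; i < N; ++i)
            data[i] = 1.0f;
        printf("pairwise sum: %f\n", pairwise_sum(data, N));  // 67108864.000000
        delete[] data;
        return 0;
    }

For inputs that are not all equal the result is no longer exact, of course, but the rounding error still grows roughly like log_2(N) rather than N.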

Interesting find, thank you for giving some insight into the floating point precision issues.

It’s probably the job of interns at nVidia to provide the SDK samples ;)

Probably?

...

	  for(unsigned int i = 0; i < memSize/sizeof(unsigned char); i++)

(excerpt from bandwidthTest, where the peculiarity is that sizeof(unsigned char) is by definition always == 1)

ACCURACY IS FINE – TIMING STILL BAD.


UH OH, SORRY! THE TEST IS FINE. The 16777280 was so suggestive that I made the silly mistake of not reading the test code more thoroughly. It does, in fact, compute many partial sums (not quite a log2(N) scheme, but enough partial sums to preserve the digits) and compares them with a DOUBLE-precision CPU sum. I thought it ran one big loop in each thread and then compared against float CPU results.
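For the record, the structure of the check as I now understand it is roughly the following (my paraphrase, not the SDK source; the function and variable names below are placeholders I made up):

    #include <cmath>

    // h_SumGPU: the partial sums copied back from all GPUs (only a few thousand values);
    // h_Data:   the original input array of dataN floats.
    double relative_difference(const float *h_SumGPU, int partialN,
                               const float *h_Data, int dataN)
    {
        // The host folds the modest number of GPU partial sums together…
        double sumGPU = 0.0;
        for (int i = 0; i < partialN; ++i)
            sumGPU += h_SumGPU[i];

        // …while the CPU reference is accumulated in double precision,
        // so it never hits the 2^24 single-precision ceiling.
        double sumCPU = 0.0;
        for (int i = 0; i < dataN; ++i)
            sumCPU += h_Data[i];

        return fabs(sumGPU - sumCPU) / fabs(sumCPU);
    }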

As it is, you can increase the size of the problem up to 512M elements (on a 3-GPU machine):

CUDA-capable device count: 3
main(): generating input data…
main(): waiting for GPU results…
GPU Processing time: 608.094971 (ms)
Checking the results…
CPU Processing time: 615.947998 (ms)
GPU sum: 268443136.000000; CPU sum: 268443072.036977
Relative difference: 2.382741E-07
TEST PASSED
Shutting down…

The timing problem, however, is still with us… how can we speed up the multi-GPU calculation? Right now it’s no faster than the CPU!
Let me see whether this is because it’s the first invocation of the threads, or something like that.
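One thing I plan to try, purely a guess on my part (warm_up_devices is my own helper, not something from the sample): force the CUDA context on each device to be created before the timer starts, so that one-time initialization isn’t counted in the measured run.

    #include <cuda_runtime.h>

    // Touch every device once before starting the timer, so CUDA context
    // creation and driver initialization are not billed to the measured run.
    void warm_up_devices(int gpuCount)
    {
        for (int dev = 0; dev < gpuCount; ++dev)
        {
            cudaSetDevice(dev);
            cudaFree(0);               // forces context creation on this device
            cudaDeviceSynchronize();   // wait until setup on this device is done
        }
        cudaSetDevice(0);              // restore the default device
    }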

Could it be that they are using the most naive parallel reduction approach for their simple multiGPU sample?

Have a look at the reduction SDK sample (and whitepaper) and the tricks they’re using to speed this up by an order of magnitude.
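For anyone curious, here is a rough sketch in the spirit of the tricks that whitepaper describes (my own code, not the sample’s): each thread first accumulates many elements with a grid-stride loop, then the block does a shared-memory tree reduction with sequential addressing.

    // Assumes blockDim.x is a power of two.
    __global__ void reduce_partial(const float *in, float *partial, unsigned int n)
    {
        extern __shared__ float sdata[];

        unsigned int tid = threadIdx.x;
        float sum = 0.0f;

        // Phase 1: each thread folds many input elements in a register
        // before touching shared memory.
        for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
             i += blockDim.x * gridDim.x)
            sum += in[i];

        sdata[tid] = sum;
        __syncthreads();

        // Phase 2: tree reduction with sequential addressing
        // (no bank conflicts, no divergent modulo arithmetic).
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            partial[blockIdx.x] = sdata[0];   // one partial sum per block
    }

You would launch it with something like reduce_partial<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n) and finish the handful of per-block partials on the host (in double, if you like).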