The performance of Nvidia GTX 650 in nbody example

Hello - I recently installed a GTX 650 in my desktop and wanted to test its performance by running the CUDA SDK nbody example. At the default settings (number of bodies = 2048), the example ran at around 27 GFLOPS. That is even lower than my old 8600 GTS. Any ideas?

I am using CUDA 5.0 on 32-bit Windows XP, with an Intel Xeon 2.8 GHz, 1 GB RAM, and the GTX 650.

I am not familiar with that card, but I suspect the default number of bodies may be too small a problem size to exploit all available parallelism on this GPU. The low default setting of the nbody sample app was probably chosen so that the app still runs OK with the lowest-end GPUs out there.

I would suggest trying with something like 65536 bodies instead. BTW, are you running this in single-precision or double-precision mode?

For performance comparisons, you would also want to run with the -benchmark switch that turns off the graphics output. Please note that the SDK apps are not designed for benchmarking purposes, so results should be interpreted with caution.
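As a rough illustration of why 2048 bodies may underfill a GPU, here is a back-of-envelope sketch. It assumes the sample uses one thread per body, and the SM count and resident-thread limit below are assumed values for a small Kepler part, not figures from this thread. Check them against deviceQuery for your own card.

```python
def fill_fraction(num_bodies, num_sms, max_threads_per_sm):
    """Fraction of the GPU's resident-thread capacity in use,
    assuming one thread per body (an assumption about the sample)."""
    capacity = num_sms * max_threads_per_sm
    return min(1.0, num_bodies / capacity)

# Assumed hardware values, for illustration only.
ASSUMED_SMS = 2
ASSUMED_THREADS_PER_SM = 2048

for n in (2048, 65536):
    print(n, fill_fraction(n, ASSUMED_SMS, ASSUMED_THREADS_PER_SM))
```

Under these assumptions the default problem size fills only part of the machine, while 65536 bodies oversubscribes it several times over. Thread count is only one factor, so treat this as a sanity check rather than a performance model.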

Thanks. I tried 65536 bodies and it gave me around 300 GFLOPS. That’s much better than my old 8600 GTS at ~45 GFLOPS.

Another interesting thing is that the graphics would not work for certain body counts. For example, 2017 bodies launched the graphics successfully, but 2016 did not. In that case, I had to use the -benchmark option so that no graphics were used.

It works for all powers of 2 though, e.g. 1024, 2048, 65536.

This nbody demo (single precision) is a handy indicator for our own kernels. Could someone with a K10 or K20 run this and report the GFLOPS they get with CUDA 5 and -numbodies=65536?

For reference, I just tried this with an EVGA GTX 680 4GB Classified (1.1 GHz):

[A] nbody.exe -numbodies=65536 [1275 GFLOPs]
[B] nbody.exe -benchmark -numbodies=65536 [1308 GFLOPs]

thanks!

Note that you would want to increase the -numbodies argument with the size of the GPU to fully exploit all available parallelism. A good heuristic may be 16384 bodies per SM/SMX. Note also that there are multiple models of the K20. The K20c (the actively cooled card for workstations) has 13 SMXs, the K20X module has 15 SMXs. So for roughly comparable numbers, I would suggest the following for single-precision runs:

-numbodies=212992 for K20c
-numbodies=245760 for K20X
-numbodies=262144 for GTX 680
-numbodies=262144 for M2090
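The 16384-bodies-per-SM heuristic can be sketched as a one-line calculation. This is only an illustration of the arithmetic: the function name is made up here, and the SM counts passed in are the ones quoted in this post.

```python
BODIES_PER_SM = 16384  # heuristic suggested in this post

def suggested_numbodies(sm_count, bodies_per_sm=BODIES_PER_SM):
    """Body count to pass to -numbodies for a GPU with sm_count SMs/SMXs."""
    return sm_count * bodies_per_sm

# SMX counts as stated in this post.
for name, sms in (("K20c", 13), ("K20X", 15)):
    print(f"-numbodies={suggested_numbodies(sms)} for {name}")
```

For another card, the multiprocessor count reported by the deviceQuery sample would be the value to plug in.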

The caveats about using SDK example apps for benchmarking still apply. In particular, the SDK apps are unlikely to be fully tuned for any particular platform, and may achieve different percentages of peak performance on different GPUs.

device 0 is a K20c
device 1 is an EVGA GTX 680 FTW, 1150 MHz

nbody -benchmark -numbodies=212992 -device=0
[1305.439 single-precision GFLOP/s at 20 flops per interaction]

nbody -benchmark -numbodies=212992 -device=0 -fp64
[604.975 double-precision GFLOP/s at 30 flops per interaction]

nbody -benchmark -numbodies=262144 -device=1
[1333.898 single-precision GFLOP/s at 20 flops per interaction]

nbody -benchmark -numbodies=262144 -device=1 -fp64
[113.706 double-precision GFLOP/s at 30 flops per interaction]
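For anyone wondering where a figure like "GFLOP/s at 20 flops per interaction" comes from, here is a minimal sketch of the arithmetic. It assumes the sample counts all N x N pairwise interactions per iteration and multiplies by the stated flops per interaction. The timing used in the example call is hypothetical, not a measurement from this thread.

```python
def nbody_gflops(num_bodies, iterations, elapsed_ms, flops_per_interaction=20):
    """Estimate GFLOP/s from body count, iteration count, and elapsed time.

    Assumes every body interacts with every body once per iteration
    (N * N interactions), which is an assumption about the sample.
    """
    interactions = num_bodies * num_bodies * iterations
    seconds = elapsed_ms / 1000.0
    return interactions * flops_per_interaction / seconds / 1e9

# Hypothetical timing: 10 iterations of 65536 bodies in 1000 ms.
print(nbody_gflops(65536, 10, 1000.0))                            # single precision
print(nbody_gflops(65536, 10, 1000.0, flops_per_interaction=30))  # double precision
```

Because the work grows with the square of the body count, the reported rate depends strongly on the chosen -numbodies, which is one reason to compare runs at matching sizes.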

Thanks rjl - those 4 data points are very helpful!

Interesting. I ran this benchmark on a server with an Nvidia GTX Titan and got better results than your K20c numbers above. The server has two 10-core E5-2690 v2 CPUs, PCIe v3, and 8x8 GB of 1866 MHz DDR3 RAM. CUDA is 5.5 and the Nvidia driver version is 319.49.

nbody -benchmark -numbodies=212992 -device=0
1868.738 single-precision GFLOP/s at 20 flops per interaction

nbody -benchmark -numbodies=212992 -device=0 -fp64
742.626 double-precision GFLOP/s at 30 flops per interaction

Even more remarkable: with two GTX Titans I got more than linear speedup:

nbody -benchmark -numbodies=212992 -numdevices=2
7222.349 single-precision GFLOP/s at 20 flops per interaction

nbody -benchmark -numbodies=212992 -numdevices=2 -fp64
2661.470 double-precision GFLOP/s at 30 flops per interaction

Is this for real? Or is this nbody benchmark unreliable?
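To put numbers on "more than linear", here is a quick sketch that divides the two-device figures by the single-device figures quoted above. It is only arithmetic on the posted results, not a diagnosis of the sample.

```python
def speedup(multi_gflops, single_gflops):
    """Ratio of multi-device throughput to single-device throughput."""
    return multi_gflops / single_gflops

NUM_DEVICES = 2

# Figures copied from the runs above.
sp = speedup(7222.349, 1868.738)   # single precision
dp = speedup(2661.470, 742.626)    # double precision

for label, s in (("single", sp), ("double", dp)):
    print(f"{label}: {s:.2f}x on {NUM_DEVICES} devices, "
          f"{s / NUM_DEVICES:.2f}x per device")
```

Both ratios come out well above 2x, so the per-device figure exceeds what one Titan reported on its own. That is a good reason to be suspicious of the two-device numbers.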