The performance of Nvidia GTX 650 in nbody example

cygnusA · December 12, 2012, 9:48pm

Hello - I recently installed a GTX 650 in my desktop and would like to test its performance by running the CUDA SDK nbody example. At the default settings (number of bodies = 2048) the example was running at around 27 GFLOPS. That is even lower than my old 8600 GTS. Any idea why?

I was using CUDA 5.0 on Windows XP 32 bit, Intel Xeon 2.8GHz with 1GB RAM, GTX 650.

njuffa · December 13, 2012, 1:48am

I am not familiar with that card, but I suspect the default number of bodies may be too small a problem size to exploit all available parallelism on this GPU. The low default setting of the nbody sample app was probably chosen so that the app still runs OK with the lowest-end GPUs out there.

I would suggest trying with something like 65536 bodies instead. BTW, are you running this in single-precision or double-precision mode?

For performance comparisons, you would also want to run with the -benchmark switch that turns off the graphics output. Please note that the SDK apps are not designed for benchmarking purposes, so results should be interpreted with caution.

cygnusA · December 13, 2012, 4:16pm

Thanks. I tried 65536 bodies and it gave me around 300 GFLOPS. That’s much better than my old 8600 GTS at ~45 GFLOPS.

Another interesting thing is that the graphics wouldn’t work for some specific numbers of bodies. For example, 2017 bodies would successfully launch the graphics, but 2016 wouldn’t. In that case, I had to use the -benchmark option so that no graphics are used.

It works for all powers of 2 though, e.g. 1024, 2048, 65536.

nnunn · December 26, 2012, 5:38pm

This nbody demo (single precision) is a handy indicator for our own kernels. Could someone with a K10 or K20 run this, and report how many GFLOPS with CUDA 5 and -numbodies=65536 ?

For reference, I just tried this with an EVGA GTX 680 4GB Classified (1.1 GHz):

[A] nbody.exe -numbodies=65536 [1275 GFLOPs]
[B] nbody.exe -benchmark -numbodies=65536 [1308 GFLOPs]

thanks!

njuffa · December 26, 2012, 9:58pm

Note that you would want to increase the -numbodies argument with the size of the GPU to fully exploit all available parallelism. A good heuristic may be 16384 bodies per SM/SMX. Note that there are multiple models of K20. The K20c (the actively cooled card for workstations) has 13 SMXs; the K20X module has 15 SMXs. So for roughly comparable numbers, I would suggest the following for single-precision runs:

-numbodies=212992 for K20c
-numbodies=245760 for K20X
-numbodies=262144 for GTX680
-numbodies=262144 for M2090
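
The heuristic above is simple arithmetic, and a minimal sketch of it is shown here. The SMX counts for the K20c and K20X are the ones stated in this post; the function and variable names are made up for illustration and are not part of the nbody sample.

```python
# Rule of thumb from the post: roughly 16384 bodies per SM/SMX.
BODIES_PER_SM = 16384


def suggested_numbodies(sm_count, bodies_per_sm=BODIES_PER_SM):
    """Return a problem size scaled to the number of multiprocessors."""
    if sm_count < 1:
        raise ValueError("sm_count must be at least 1")
    return sm_count * bodies_per_sm


# SMX counts as stated in the post (hypothetical lookup table).
SMX_COUNTS = {"K20c": 13, "K20X": 15}

for name, smx in SMX_COUNTS.items():
    print(f"-numbodies={suggested_numbodies(smx)} for {name}")
```

For other GPUs, the multiprocessor count can be read from the output of the deviceQuery sample and passed in as `sm_count`.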

The caveats about using SDK example apps as benchmarking apps apply. In particular the SDK apps are unlikely to be fully tuned to any particular platform, and may happen to achieve different percentages of peak performance.

rjl · December 31, 2012, 6:06pm

device 0 is K20c
device 1 is EVGA GTX680 FTW, 1150 MHz

nbody -benchmark -numbodies=212992 -device=0
[1305.439 single-precision GFLOP/s at 20 flops per interaction]

nbody -benchmark -numbodies=212992 -device=0 -fp64
[604.975 double-precision GFLOP/s at 30 flops per interaction]

nbody -benchmark -numbodies=262144 -device=1
[1333.898 single-precision GFLOP/s at 20 flops per interaction]

nbody -benchmark -numbodies=262144 -device=1 -fp64
[113.706 double-precision GFLOP/s at 30 flops per interaction]
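
For anyone wanting to relate these figures to actual run time, here is a rough sketch. It assumes the usual all-pairs convention of N² body-body interactions per iteration, combined with the flops-per-interaction constant printed in the output above; the function names are illustrative and not taken from the sample's source.

```python
def interactions_per_second(gflops, flops_per_interaction):
    """Convert a reported GFLOP/s figure back to body-body interactions per second."""
    return gflops * 1e9 / flops_per_interaction


def seconds_per_iteration(num_bodies, gflops, flops_per_interaction):
    """Estimate the time of one all-pairs step, assuming N*N interactions per step."""
    interactions = num_bodies * num_bodies
    return interactions / interactions_per_second(gflops, flops_per_interaction)


# K20c single-precision figures from the post above.
t_sp = seconds_per_iteration(212992, 1305.439, 20)
# K20c double-precision figures from the post above.
t_dp = seconds_per_iteration(212992, 604.975, 30)

print(f"K20c fp32: about {t_sp:.3f} s per iteration")
print(f"K20c fp64: about {t_dp:.3f} s per iteration")
```

Because single and double precision are counted with different flops per interaction (20 vs. 30), comparing time per iteration can be more informative than comparing the GFLOP/s numbers directly.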

nnunn · January 2, 2013, 5:14pm

Thanks rjl - those 4 data points are very helpful!

enok71 · October 2, 2013, 7:31am

Interesting, I ran this benchmark on a server with an Nvidia GTX Titan and got better results than your K20c numbers above. The server has 2x 10-core E5-2690 v2 CPUs with PCIe v3 and 8x 8GB of 1866MHz DDR3 RAM. CUDA is 5.5 and the Nvidia driver version is 319.49.

nbody -benchmark -numbodies=212992 -device=0
1868.738 single-precision GFLOP/s at 20 flops per interaction

nbody -benchmark -numbodies=212992 -device=0 -fp64
742.626 double-precision GFLOP/s at 30 flops per interaction

Even more remarkable: with two GTX Titans I got more than linear speedup:

nbody -benchmark -numbodies=212992 -numdevices=2
7222.349 single-precision GFLOP/s at 20 flops per interaction

nbody -benchmark -numbodies=212992 -numdevices=2 -fp64
2661.470 double-precision GFLOP/s at 30 flops per interaction

Is this for real? Or is this nbody benchmark unreliable?
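
To put numbers on the "more than linear" observation, here is a small sketch that computes speedup and parallel efficiency from the figures reported in this post. The helper names are illustrative; an efficiency above 1.0 means the multi-GPU figure exceeds what ideal linear scaling would give, which is a reason to treat the reported number with caution rather than an explanation of it.

```python
def speedup(single_gflops, multi_gflops):
    """Ratio of multi-device throughput to single-device throughput."""
    return multi_gflops / single_gflops


def parallel_efficiency(single_gflops, multi_gflops, num_devices):
    """Speedup divided by device count; 1.0 corresponds to ideal linear scaling."""
    return speedup(single_gflops, multi_gflops) / num_devices


# Figures reported above for one and two GTX Titans at -numbodies=212992.
runs = {
    "fp32": (1868.738, 7222.349),
    "fp64": (742.626, 2661.470),
}

for label, (one_gpu, two_gpus) in runs.items():
    s = speedup(one_gpu, two_gpus)
    e = parallel_efficiency(one_gpu, two_gpus, 2)
    print(f"{label}: speedup {s:.2f}x, efficiency {e:.2f}")
```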
