Measuring performance scaling: single vs. multi-GPU?

Dear Masters,

I’m a newbie in GPU computing and want to start my research on measuring performance differences between a single GPU and multiple GPUs.

Because I’m still having trouble writing my own programs, I’m looking for programs that can run on either a single GPU or multiple GPUs (selectable). If there are any examples I can test, I’d be glad to have them.
I’ve tried the example programs in the CUDA SDK, but none of them seem to run on both single- and multi-GPU setups.

If you have a program that can run on a single GPU or, optionally, on multiple GPUs, and you need a beta tester, I will do it for you.

I really, really need help.

I am with you brother… :)

To me, multi-GPU utilization is a key success factor in the GPU Revolution.

What I have found out…

  1. Multi-GPU use is in general hard to implement, because each GPU has its own memory and doesn’t really talk to the other GPUs.

  2. How the programmer opts to implement multi-GPU usage in his app determines its efficiency and how the workload is shared.

  3. Some implement multi-GPU use extremely well, like the OptiX people.
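Point 1 above is why most multi-GPU code ends up explicitly splitting its data: each device only sees the chunk you hand it. Here’s a minimal sketch of that partitioning step in C++ (all the names are mine, not from any particular app; the CUDA calls are shown as comments so the splitting logic stands on its own):

```cpp
#include <cstddef>
#include <vector>

// One contiguous chunk of the problem, assigned to one GPU.
struct Chunk {
    std::size_t offset;  // first work item this GPU owns
    std::size_t count;   // number of work items it processes
};

// Split `total` work items across `numGpus` devices as evenly as possible.
// In a real CUDA app each chunk would then get, e.g.:
//   cudaSetDevice(i); cudaMemcpy(...); kernel<<<...>>>(...); cudaMemcpy(...);
std::vector<Chunk> partition(std::size_t total, int numGpus) {
    std::vector<Chunk> chunks;
    std::size_t base  = total / numGpus;  // everyone gets at least this much
    std::size_t extra = total % numGpus;  // first `extra` GPUs get one more
    std::size_t offset = 0;
    for (int i = 0; i < numGpus; ++i) {
        std::size_t count =
            base + (static_cast<std::size_t>(i) < extra ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

The “workload share” a benchmark reports is essentially how well this split (plus the per-device transfer and launch costs) matches each GPU’s actual speed.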

My favorite (so far) pre-made app for benchmarking multi-GPU is this: …html#post87400

It does give a charming ‘workload share report’, but I feel it needs further optimization to do us Nvidia boys justice.

I too am doing my level best to help Pat (the developer) in any way. ;)

We need a ‘BETA Testers Standing Ready for Nvidia Developers’ section.

I would be a resource for them too.

My favorite multi-GPU program is Mandelbulb. Because it uses the OptiX libraries, it does an outstanding job of using all of the GPUs installed in the system. :heart:

(That’s both GPUs working, in SLI mode and in dedicated PhysX mode.)

The CPU side will also make use of multiple cores if available. …50985&st=20

To download: post 29, page 2.

My GPU workload distribution with the app running:

I wish more CUDA libraries would just handle that as well as OptiX apparently does.

It is rather rare to find apps that task multiple GPUs well…

We need more. ;)

Performance on a single GPU is still a lot of black magic, and largely dependent on the type of calculations the app is doing, I believe…


We can take this OpenCL ray-tracing program that uses only one GPU…

The ATI GPUs seem to do well on it, right?

User             GPU             Samples/sec   Ver    Clocks
freeloader       5850            17,298.6K     v1.5   GPU=1007, M=1152
freeloader       5850            13,719.6K     v1.4   GPU=1007, M=1152
Toysoldier       5870            13,719.6K     v1.4   GPU=875, M=1300
fellix bg        5870            13,719.6K     v1.4   GPU=900, M=1250
safan80          5970            11,012.8K     v1.4   Unknown
SocketMan        5770             7,535.1K     v1.4   GPU=950, M=1200
mattkosem        4890             7,520.9K     v1.4   GPU=1056, M=1000
BeepBeep2        4850             7,172.0K     v1.5   GPU=800, M=2250
Mechromancer     4870             6,955.5K     v1.5   GPU=790, M=900
PyrO             1/2 a 4870X2     6,955.5K     v1.5   GPU=790, M=915
redrumy3         4870             6,375.8K     v1.4   GPU=875, M=1100
PyrO             1/2 a 4870X2     5,796.2K     v1.4   GPU=790, M=915
NovoRei          4870             5,616.1K     v1.4   512mb, 790mhz
Talonman         1/2 a 295        2,898.1K     v1.5   C=621, SH=1512, M=1152
Chumbucket843    GTX 260          2,068.7K     v1.5   C=602, SH=1369, M=1159
Talonman         1/2 a 295        1,159.2K     v1.4   C=621, SH=1512, M=1152
Chumbucket843    GTX 260          1,123.2K     v1.4   C=602, SH=1369, M=1159
DosDuoNo         GTX 260          1,093.2K     v1.4   C=655, SH=1125, M=1125

We take the exact same program, run the ‘RUN_SCENE_SIMPLE_64SIZE’ scene, and half of my 295 produces 114,154.1K samples/sec.

The ATI GPUs can’t get near that running the same SIMPLE_64SIZE .bat file…

Best conclusion so far: the simple scene consists of a lot of “nothing to do” rays, I guess (off to infinity and beyond). That should mean each kernel invocation is short. The variation with CPU clock suggests that CPU-side work is some kind of bottleneck. Finally, the lower ATI performance on this scene suggests that kernel launch overhead is higher on ATI.
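That launch-overhead theory is easy to sanity-check with arithmetic. A toy cost model (the numbers below are made up for illustration, not measured from this app or any real driver): if each launch costs a fixed overhead and each ray a fixed compute time, then splitting the same work into many short kernels makes the launch overhead an ever-larger fraction of the total — so a platform with higher launch overhead falls further behind exactly on the “short kernel” scene.

```cpp
#include <cstddef>

// Toy cost model: total time (in microseconds) for `work` items split
// across `launches` kernel invocations, with a fixed per-launch overhead
// `launchUs` and per-item compute cost `itemUs`. All numbers illustrative.
double totalTimeUs(std::size_t work, std::size_t launches,
                   double launchUs, double itemUs) {
    return static_cast<double>(launches) * launchUs
         + static_cast<double>(work) * itemUs;
}
```

With, say, a 20 µs launch cost and 0.5 µs per ray, one big launch over a million rays costs 500,020 µs, while the same work in 1,000 short launches costs 520,000 µs — and doubling only the launch cost (the “ATI is slower to launch” hypothesis) barely touches the first case but visibly hurts the second.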

Another possibility, based on this post…

We’re also toying with the idea that the app might be using alternating memory buffers that are updated with blocking sync enabled, causing a spinlock wait.
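For anyone unfamiliar with that pattern: “alternating memory buffers” usually means ping-pong double buffering — the GPU reads one buffer while the host refills the other, then they swap, and the swap is where a blocking sync would make the CPU spin. A minimal sketch of just the buffer bookkeeping (my own names; whether this app actually works this way is the open question):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Ping-pong double buffer: the "GPU" reads buf[front] while the host
// fills the other buffer, then the two roles swap each frame.
struct DoubleBuffer {
    std::array<std::vector<float>, 2> buf;
    int front = 0;  // index of the buffer the GPU is currently reading

    explicit DoubleBuffer(std::size_t n) {
        buf[0].resize(n);
        buf[1].resize(n);
    }

    // The buffer the host is free to write into this frame.
    std::vector<float>& backBuffer() { return buf[1 - front]; }

    // In a real CUDA app, the sync before this swap is the suspect:
    // with blocking sync enabled, the CPU spin-waits here for the GPU.
    void swap() { front = 1 - front; }
};
```

If that spin-wait were the bottleneck, you’d expect one CPU core pegged at 100% whenever the kernels are short — which is what makes the CPU-utilization observation below interesting.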

The problem I now see with that theory is that when I generate my 114,154.1K samples/sec running the SIMPLE_64SIZE scene, my CPU utilization doesn’t drop.

I would have expected it to, yet I still got stunning performance. The screen does indeed render quickly!

Bottom line…

Both for single- and dual-GPU performance, my impression is that a lot of fine-tuning is required for GPU-accelerated apps to be the best they can be. :shock:

However, when you get them dialed in, you can get crazy performance. I also think it’s largely going to be app-dependent, both on the app’s level of parallelism and on how well the programmer was able to fine-tune it.

(I should warn you that I am not a CUDA programmer.)

DYL-280 and BASIC are the only languages that I have messed with.

Thanks for your answer anyway.