Measuring performance scaling: single vs. multi-GPU?

Dear Masters,

I’m a newbie in GPU computing and want to start my research on measuring performance differences between a single GPU and multiple GPUs.

Because I’m still having trouble writing my own programs, I’m looking for programs that can run on either a single GPU or multiple GPUs (selectable). If there are any examples I can test, I’d be glad to have them.
I’ve tried the example programs in the CUDA SDK, but none of them seem to run on both single- and multi-GPU setups.

If you have a program that can run on a single GPU or, optionally, on multiple GPUs, and you need a beta tester, I will do it for you.

I really, really need help.

I am with you brother… :)

To me, multi-GPU utilization is a key success factor in the GPU Revolution.

What I have found out…

  1. Multi-GPU use is in general hard to implement, because each GPU has its own memory and doesn’t really talk to the other GPUs.

  2. How the programmer opts to implement multi-GPU usage in his app determines its efficiency and how the workload is shared.

  3. Some implement multi-GPU use extremely well, like the OptiX people.
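Point 1 above is why most multi-GPU code ends up explicitly splitting its data: each device only sees the chunk you hand it. Here’s a minimal sketch of that partitioning step in C++ (all the names are mine, not from any particular app; the CUDA calls are shown as comments so the splitting logic stands on its own):

```cpp
#include <cstddef>
#include <vector>

// One contiguous chunk of the problem, assigned to one GPU.
struct Chunk {
    std::size_t offset;  // first work item this GPU owns
    std::size_t count;   // number of work items it processes
};

// Split `total` work items across `numGpus` devices as evenly as possible.
// In a real CUDA app each chunk would then get, e.g.:
//   cudaSetDevice(i); cudaMemcpy(...); kernel<<<...>>>(...); cudaMemcpy(...);
std::vector<Chunk> partition(std::size_t total, int numGpus) {
    std::vector<Chunk> chunks;
    std::size_t base  = total / numGpus;  // everyone gets at least this much
    std::size_t extra = total % numGpus;  // first `extra` GPUs get one more
    std::size_t offset = 0;
    for (int i = 0; i < numGpus; ++i) {
        std::size_t count =
            base + (static_cast<std::size_t>(i) < extra ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

The “workload share” a benchmark reports is essentially how well this split (plus the per-device transfer and launch costs) matches each GPU’s actual speed.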

My favorite (so far) pre-made app for benchmarking multi-GPU is this: …html#post87400

It does give a charming ‘workload share report’, but I feel it needs further optimization to do us Nvidia boys justice.

I too am doing my level best to help Pat (the developer) in any way. ;)

We need a ‘BETA Testers Standing Ready for Nvidia Developers’ section.

I would be a resource for them too.

My favorite multi-GPU program is Mandelbulb. Because it uses the OptiX libraries, it does an outstanding job of using all of the GPUs installed in the system. :heart:

(That’s both GPUs working, in SLI mode and in dedicated PhysX mode.)

The CPU side will also make use of multiple cores if available. …50985&st=20

To download: post 29, page 2.

My GPU workload distribution with the app running:

I wish more CUDA libraries would just handle that as well as OptiX apparently does.

It is rather rare to find apps that task multiple GPUs well…

We need more. ;)

Performance on a single GPU is still a lot of black magic, and largely dependent on the type of calculations the app is doing, I believe…


We can take this OpenCL ray-tracing program that uses only one GPU…

The ATI GPUs seem to do well on it, right?

User             GPU             Samples/sec   Ver    Clocks
freeloader       5850            17,298.6K     v1.5   GPU=1007, M=1152
freeloader       5850            13,719.6K     v1.4   GPU=1007, M=1152
Toysoldier       5870            13,719.6K     v1.4   GPU=875, M=1300
fellix bg        5870            13,719.6K     v1.4   GPU=900, M=1250
safan80          5970            11,012.8K     v1.4   Unknown
SocketMan        5770             7,535.1K     v1.4   GPU=950, M=1200
mattkosem        4890             7,520.9K     v1.4   GPU=1056, M=1000
BeepBeep2        4850             7,172.0K     v1.5   GPU=800, M=2250
Mechromancer     4870             6,955.5K     v1.5   GPU=790, M=900
PyrO             1/2 a 4870X2     6,955.5K     v1.5   GPU=790, M=915
redrumy3         4870             6,375.8K     v1.4   GPU=875, M=1100
PyrO             1/2 a 4870X2     5,796.2K     v1.4   GPU=790, M=915
NovoRei          4870             5,616.1K     v1.4   512mb, 790mhz
Talonman         1/2 a 295        2,898.1K     v1.5   C=621, SH=1512, M=1152
Chumbucket843    GTX 260          2,068.7K     v1.5   C=602, SH=1369, M=1159
Talonman         1/2 a 295        1,159.2K     v1.4   C=621, SH=1512, M=1152
Chumbucket843    GTX 260          1,123.2K     v1.4   C=602, SH=1369, M=1159
DosDuoNo         GTX 260          1,093.2K     v1.4   C=655, SH=1125, M=1125

We take the exact same program, run the ‘RUN_SCENE_SIMPLE_64SIZE’ scene, and half of my 295 produces 114,154.1K samples/sec.

The ATI GPUs can’t get near that running the same SIMPLE_64SIZE .bat file…

Best conclusion so far: the simple scene consists of a lot of “nothing to do” rays, I guess (off to infinity and beyond). That should mean each kernel invocation is short. The variation with CPU clock suggests that CPU-side work is some kind of bottleneck. Finally, the lower ATI performance on this scene suggests that kernel launch overhead is higher on ATI.
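That launch-overhead theory is easy to sanity-check with arithmetic. A toy cost model (the numbers below are made up for illustration, not measured from this app or any real driver): if each launch costs a fixed overhead and each ray a fixed compute time, then splitting the same work into many short kernels makes the launch overhead an ever-larger fraction of the total — so a platform with higher launch overhead falls further behind exactly on the “short kernel” scene.

```cpp
#include <cstddef>

// Toy cost model: total time (in microseconds) for `work` items split
// across `launches` kernel invocations, with a fixed per-launch overhead
// `launchUs` and per-item compute cost `itemUs`. All numbers illustrative.
double totalTimeUs(std::size_t work, std::size_t launches,
                   double launchUs, double itemUs) {
    return static_cast<double>(launches) * launchUs
         + static_cast<double>(work) * itemUs;
}
```

With, say, a 20 µs launch cost and 0.5 µs per ray, one big launch over a million rays costs 500,020 µs, while the same work in 1,000 short launches costs 520,000 µs — and doubling only the launch cost (the “ATI is slower to launch” hypothesis) barely touches the first case but visibly hurts the second.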

Another possibility, based on this post…

We’re also toying with the idea that the app might be using alternating memory buffers that are updated with blocking sync enabled, causing a spinlock wait.
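For anyone unfamiliar with that pattern: “alternating memory buffers” usually means ping-pong double buffering — the GPU reads one buffer while the host refills the other, then they swap, and the swap is where a blocking sync would make the CPU spin. A minimal sketch of just the buffer bookkeeping (my own names; whether this app actually works this way is the open question):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Ping-pong double buffer: the "GPU" reads buf[front] while the host
// fills the other buffer, then the two roles swap each frame.
struct DoubleBuffer {
    std::array<std::vector<float>, 2> buf;
    int front = 0;  // index of the buffer the GPU is currently reading

    explicit DoubleBuffer(std::size_t n) {
        buf[0].resize(n);
        buf[1].resize(n);
    }

    // The buffer the host is free to write into this frame.
    std::vector<float>& backBuffer() { return buf[1 - front]; }

    // In a real CUDA app, the sync before this swap is the suspect:
    // with blocking sync enabled, the CPU spin-waits here for the GPU.
    void swap() { front = 1 - front; }
};
```

If that spin-wait were the bottleneck, you’d expect one CPU core pegged at 100% whenever the kernels are short — which is what makes the CPU-utilization observation below interesting.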

The problem I now see with that theory is that when I generate my 114,154.1K samples/sec running the SIMPLE_64SIZE scene, my CPU utilization doesn’t drop.

I would have expected it to, yet I still got stunning performance. The screen does indeed render quickly!

Bottom line…

Both for single- and dual-GPU performance, my impression is that a lot of fine-tuning is required for GPU-accelerated apps to be the best they can be. :shock:

However, when you get them dialed in, you can get crazy performance. I also think it’s largely going to be app-dependent, both on the app’s level of parallelism and on how well the programmer was able to fine-tune it.

(I should warn you that I am not a CUDA programmer.)

DYL-280 and BASIC are the only languages that I have messed with.

Thanks for your answer anyway.