MultiGPU performance

I'm working on porting an existing application to run on multiple GPUs. It already works on a single GPU, but I have a 9800GTX and a GTX280 in this PC at work, so I might as well see if multi-GPU can help speed things up.

I've pretty much copied the multiGPU example in the SDK. I get (way) worse performance, though.

The code used to run on the GTX280 alone, and the profiler reports 430 ms. As a first try, I've split the workload in half between the GTX280 and the 9800GTX.
The profiler now reports 1.5 s for half the work being done on the GTX280. I've also tried limiting the number of GPUs used to 1, thus basically doing all the work on the GTX280 but still going through the host thread creation, and the computation time logically rose to around 3 s.

I don't get why this is much slower than an application without CPU threading where all the work is done on the GTX280. Any ideas?

I'm using constant memory and textures defined in another file, in case this can cause problems… Each host thread populates both.

From my experience, the first thing to do is check what takes the most time. Use cutil timers, CUDA events, a stopwatch, or any other means to measure which part of your code takes most of the time (setDevice, preparing the data, copying data to the device, the kernel, copying data back from the GPU). Once you know this, you'll be better able to understand what the problem is.
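
As a rough illustration of the "measure each phase" advice, here is a minimal host-side sketch in plain C++. `PhaseTimer` is a hypothetical helper, not part of CUDA or cutil. It uses wall-clock time only, so when timing a kernel the wrapped function would have to wait for the GPU to finish before returning (kernel launches are asynchronous). CUDA events would be the more precise tool for the kernel itself.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper (not a CUDA or cutil API): accumulates wall-clock
// time per named phase so the slowest phase can be identified.
class PhaseTimer {
public:
    // Runs fn() and adds its elapsed wall-clock time to the phase `name`.
    // For GPU work, fn must block until the device is done, otherwise only
    // the launch overhead is measured.
    template <typename Fn>
    void measure(const std::string& name, Fn&& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        totals_[name] +=
            std::chrono::duration<double, std::milli>(stop - start).count();
    }

    // Total milliseconds recorded for `name` (0 if it was never measured).
    double total_ms(const std::string& name) const {
        const auto it = totals_.find(name);
        return it == totals_.end() ? 0.0 : it->second;
    }

    // Name of the phase with the largest accumulated time.
    std::string slowest() const {
        assert(!totals_.empty());
        auto best = totals_.begin();
        for (auto it = totals_.begin(); it != totals_.end(); ++it) {
            if (it->second > best->second) best = it;
        }
        return best->first;
    }

private:
    std::map<std::string, double> totals_;
};
```

Wrapping each step (device selection, upload, kernel, download) in its own `measure` call and then asking for `slowest()` shows where the time actually goes, per host thread.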


The strength of a chain is equal to the strength of its weakest link.

One of your cards takes 1.5 s to complete half of the work, and the overall result can be no faster than that.

You need to load-balance correctly on a multi-GPU setup. Otherwise you won't get the desired results.

It happened to us when using the personal supercomputer with four Teslas. The final output was as slow as the little "nView" card installed in that box.

Fixing that resulted in sooper-dooper performance.
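
One simple way to load-balance is to hand each card a share of the work proportional to its measured speed, instead of an even split. The sketch below is only an illustration in plain C++. `split_by_speed` is a hypothetical function, and it assumes you have already measured a per-item cost for each device and that the cost scales roughly linearly with the number of items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits total_items across devices in proportion to
// their speed. seconds_per_item[i] is a measured cost for device i
// (assumption: cost grows linearly with the item count).
std::vector<std::size_t> split_by_speed(
    std::size_t total_items, const std::vector<double>& seconds_per_item) {
    assert(!seconds_per_item.empty());

    double total_rate = 0.0;  // sum of items-per-second over all devices
    std::size_t fastest = 0;  // index of the device with the lowest cost
    for (std::size_t i = 0; i < seconds_per_item.size(); ++i) {
        assert(seconds_per_item[i] > 0.0);
        total_rate += 1.0 / seconds_per_item[i];
        if (seconds_per_item[i] < seconds_per_item[fastest]) fastest = i;
    }

    std::vector<std::size_t> counts(seconds_per_item.size(), 0);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i < seconds_per_item.size(); ++i) {
        const double rate = 1.0 / seconds_per_item[i];
        const auto want = static_cast<std::size_t>(
            static_cast<double>(total_items) * rate / total_rate);
        // Clamp so rounding can never hand out more than total_items.
        counts[i] = std::min(want, total_items - assigned);
        assigned += counts[i];
    }

    // Rounding down can leave a few items over; give them to the fastest.
    counts[fastest] += total_items - assigned;
    return counts;
}
```

For example, with two devices where the second is three times slower per item, the second would receive about a quarter of the items, so both finish at roughly the same time.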

First, have you seen the concurrent bandwidth test?

The bandwidth to each card is less than the maximum bandwidth when the cards are being used concurrently, which is something to keep in mind.

Also, are you creating and destroying threads over and over, or are you reusing threads? This can make a big difference.
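
To make the "reuse threads" point concrete, here is a minimal sketch of one long-lived host thread that accepts jobs, written in plain C++ with no CUDA calls. `PersistentWorker` is a hypothetical class. The idea, assuming one such worker per GPU, is that per-thread setup (selecting the device, uploading constants and textures) happens once when the thread starts rather than on every call.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: a single host thread that stays alive and runs
// submitted jobs in order, instead of being created for every job.
class PersistentWorker {
public:
    PersistentWorker() : thread_([this] { run(); }) {}

    ~PersistentWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        thread_.join();  // remaining queued jobs are run before exit
    }

    // Queues a job; it runs on the worker thread.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

    // Blocks until every submitted job has finished.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return jobs_.empty() && !busy_; });
    }

private:
    void run() {
        // One-time per-thread setup (e.g. choosing the GPU) would go here.
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
            if (jobs_.empty()) return;  // stopping and nothing left to do
            std::function<void()> job = std::move(jobs_.front());
            jobs_.pop();
            busy_ = true;
            lock.unlock();
            job();  // run without holding the lock
            lock.lock();
            busy_ = false;
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    bool busy_ = false;
    std::thread thread_;  // declared last so the other members exist first
};
```

Each frame or batch then becomes a `submit` followed by `wait_idle`, and the cost of thread creation and per-thread initialization is paid only once.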

Thanks for the replies, guys.

As was to be expected, my mistake had nothing to do with the multi-GPU code! I was just forgetting to initialize one of the constants in the single-GPU configuration, which caused the whole kernel execution to bypass the costly branch in the code and thus finish much quicker.
My face is indeed quite red.

Now I just need another GTX280. The 9800GTX is slower at doing half the work than the GTX280 is at doing the whole work, so this is quite useless :)