Performance with multiGPU ... and the 9800 GX2.

Hi guys.

I am running the multiGPU example in SDK on a 9800 GX2 card. Note that this card houses two GPUs, but fits in a single PCIe slot. The performance numbers I’m getting from the demo app are:

1 GPU: 596.4 (ms)
2 GPUs: 619.3 (ms)

These numbers don’t make sense to me. I would think that using both GPUs in parallel would roughly cut processing time in half; instead, these numbers suggest that the GPUs are executing in series rather than in parallel. Could this be because they both share the same PCIe interface? Or could it be for another reason? I wonder what the numbers would look like if this were executed on two physically separate cards, like a pair of 8800 GTs, for example?

Best,

  • Kor

Which multiGPU example is this? I see a simpleMultiGPU and a MonteCarloMultiGPU in the SDK. I can try either on my pair of 8800 GTX cards, if you’d like. They are both PCI-Express 1.0 cards, but they are installed in a motherboard with dedicated 16x links to both slots, so they’ll run at full bus speed.

I was referring to the regular multiGPU one, which might be old. However, simpleMultiGPU exhibits the exact same behavior. These are my GPU processing times for simpleMultiGPU:

1 GPU: 112.4 (ms)
2 GPUs: 153.6 (ms)

Note that for the 1-GPU case, I hard-coded the GPU_N variable to 1 immediately after the CUDA_SAFE_CALL(cudaGetDeviceCount(&GPU_N)) call.
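
For reference, the change looked roughly like this (the surrounding line is paraphrased from the sample, from memory):

    CUDA_SAFE_CALL( cudaGetDeviceCount(&GPU_N) );
    GPU_N = 1;  // force the single-GPU code path for the baseline timing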

Please try the same test on your pair of GTXs and let me know what your results are. Many thanks in advance.

  • Kor

1 GPU = 139 ms
2 GPUs = 185 ms

Reading through the code, I suspect the reason for this difference is that the timer also includes the startup time for each thread. The amount of data (DATA_N) processed by default is only 32 MB (16 MB per card in the 2-GPU case), which is quite small, so the overhead of CPU thread creation is substantial relative to the actual GPU work.
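
To see what I mean, the timed region in the sample is shaped roughly like this (paraphrasing from memory, so the exact SDK source may differ):

    cutStartTimer(hTimer);               // timer starts before any worker thread exists
    for (i = 0; i < GPU_N; i++)
        threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));
    cutWaitForThreads(threadID, GPU_N);  // join: thread startup, copies, and kernel all counted
    cutStopTimer(hTimer);

With two GPUs you pay for two thread launches inside the measured region, and at 16 MB per card there isn’t enough kernel work to amortize them.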

If you modify the data size to be 128 MB total, then you get this timing:

1 GPU = 325 ms
2 GPUs = 309 ms

(Notice this isn’t even a linear scaling from 32 MB. Lots of fixed overhead buried in here…)

Going much larger than that seems to make the GPU kernel fail to run due to hard-coded grid and block sizes.
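
If you want to confirm it really is the launch that fails (rather than, say, an allocation), checking the error status right after the launch will tell you; something like this (the placement inside the solver thread is my assumption):

    // immediately after the kernel launch:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));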

Anyway, it seems simpleMultiGPU is not really suitable for benchmarking, because it spawns two CPU threads and each makes only a single, relatively short kernel call. Of course, doing more than that would not make it very simple. :)

Interesting. I re-ran the test after changing DATA_N = 1048576*32 to DATA_N = 1048576*112, which yielded:

1 GPU: 594 ms
2 GPUs: 317 ms

I wasn’t able to go all the way up to 128, since I got an error with anything over 112. Also, the comments at the top of the file say:

 * Creating CPU threads has a certain overhead. So, this is only worth when you
 * have a significant amount of work to do per thread. It's also recommended to
 * create a pool of threads and reuse them to avoid this overhead.

I guess they weren’t kidding about this! Now the question is: how do you actually go about creating such a pool of threads and reusing them? This might be material for a new post, unless we continue here.
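
In case it helps as a starting point, here is a minimal sketch of one persistent worker thread per GPU, using POSIX threads. All the names here (WorkerCtx, gpuWorker, submit) are made up for illustration; this is not SDK code:

    #include <pthread.h>
    #include <cuda_runtime.h>

    typedef struct {
        int             device;           // GPU owned by this worker
        pthread_t       thread;
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             hasWork;          // a job is pending
        int             shutdown;         // worker should exit
        void          (*job)(int device); // work to run on this GPU
    } WorkerCtx;

    static void *gpuWorker(void *arg)
    {
        WorkerCtx *ctx = (WorkerCtx *)arg;
        cudaSetDevice(ctx->device);       // bind the CUDA context to this thread once, up front

        pthread_mutex_lock(&ctx->lock);
        for (;;) {
            while (!ctx->hasWork && !ctx->shutdown)
                pthread_cond_wait(&ctx->cond, &ctx->lock);
            if (ctx->shutdown)
                break;
            ctx->hasWork = 0;
            pthread_mutex_unlock(&ctx->lock);
            ctx->job(ctx->device);        // kernel launches, memcpys, etc.
            pthread_mutex_lock(&ctx->lock);
        }
        pthread_mutex_unlock(&ctx->lock);
        return NULL;
    }

    // Hand a job to an already-running worker; no thread creation cost here.
    static void submit(WorkerCtx *ctx, void (*job)(int device))
    {
        pthread_mutex_lock(&ctx->lock);
        ctx->job = job;
        ctx->hasWork = 1;
        pthread_cond_signal(&ctx->cond);
        pthread_mutex_unlock(&ctx->lock);
    }

The worker (and its CUDA context) is created once with pthread_create(&ctx->thread, NULL, gpuWorker, ctx), so each subsequent launch only pays for a mutex/condvar handshake instead of a fresh thread plus context setup.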