multi-GPU parallel operation


I just started testing on a dual-GPU box (Dell H2C with dual 8800 GTX cards, 680i chipset, running 32-bit Ubuntu), and I can’t seem to get both GPUs to work in parallel. I have two threads, each working on a compute-bound task on a different device. However, the behavior I get is similar to what I see with a single GPU: each thread is busy about half the time and appears to be waiting for the GPU for the other half.

I did call cudaSetDevice() first, and I called cudaGetDevice() to confirm that each thread has a different device number. The task involves virtually no I/O off the card, although there are a number of device-to-device copies and many different kernels running in sequence.
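For reference, my per-thread setup looks roughly like this (a simplified sketch, not my actual code; the kernel, sizes, and iteration count are placeholders):

```cuda
#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void busyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 1.0001f + 0.5f;
}

static void *worker(void *arg) {
    int dev = *(int *)arg;
    /* cudaSetDevice() is the first CUDA call in this thread, so the
       thread's context should be bound to the requested device. */
    cudaSetDevice(dev);

    int check = -1;
    cudaGetDevice(&check);
    printf("thread asked for device %d, bound to device %d\n", dev, check);

    const int n = 1 << 20;
    float *d = 0;
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int iter = 0; iter < 1000; ++iter)
        busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();  /* blocks only this thread's context */
    cudaFree(d);
    return 0;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], 0, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], 0);
    return 0;
}
```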

Is there any way to determine whether the two threads are really executing on two GPUs, as opposed to sharing one?

Are there any calls I could be making that inadvertently make one thread wait for the other?

If anyone has suggestions, please let me know.

I’d suggest querying the device properties for the two GPUs and verifying that you’re getting the props that match the two GPUs you’re using (if they aren’t identical, it’d be easy to tell).
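Something along these lines (a minimal sketch using the runtime API) prints enough of the properties to tell the boards apart:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, compute capability %d.%d, "
               "%lu bytes global memory\n",
               dev, prop.name, prop.major, prop.minor,
               (unsigned long)prop.totalGlobalMem);
    }
    return 0;
}
```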

Another thing to check is whether your thread code has a mutex in it that’s effectively preventing both from running concurrently. Another trick would be to have your first thread allocate all of the GPU memory, then query and print the amount of free GPU memory in both threads as a means of determining whether they got to the correct device, etc. Some of these you’d have to do with the driver API, but you get the idea…
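A rough sketch of the free-memory trick, using the driver API’s cuMemGetInfo() (note the signature has varied between CUDA releases; older ones take unsigned int* rather than size_t*):

```cuda
#include <stdio.h>
#include <cuda.h>          /* driver API */
#include <cuda_runtime.h>

/* Call from inside each worker thread, after cudaSetDevice() and after
   the first thread has made its large allocation. cuMemGetInfo() reports
   on the current context's device, so if the threads really landed on
   different GPUs, the free-memory numbers should differ. */
static void reportFreeMem(int dev) {
    size_t freeB = 0, totalB = 0;
    cuMemGetInfo(&freeB, &totalB);
    printf("device %d: %lu of %lu bytes free\n",
           dev, (unsigned long)freeB, (unsigned long)totalB);
}
```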


John Stone

Thanks for the suggestions. I have confirmed via cuMemGetInfo() that the two threads are allocating their memory on different devices. So it looks like the execution is split between the two GPUs as it’s supposed to be, except that the two GPUs aren’t both processing at the same time.

This would seem to indicate an inter-thread sync issue, but my two threads do not communicate with each other at all; they are just created at the start of execution and process two different data sets separately. That makes me think CUDA is doing some kind of synchronization behind the scenes. Do any of the CUDA utility functions or macros (like cudaThreadSynchronize() or CUDA_SAFE_CALL()) actually sync across both GPUs instead of just the context from which they’re called? Or is this perhaps a driver issue?

There shouldn’t be syncing between CUDA contexts. Can you try your code without cutil macros? A while ago I was able to drive 4 GPUs (2 Tesla D870) from a single box and didn’t run into the problem you’re describing.
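To take cutil out of the picture, you can wrap calls in a plain error-check macro instead of CUDA_SAFE_CALL (a sketch; the macro name is arbitrary):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Drop-in replacement for CUDA_SAFE_CALL with no cutil dependency:
   checks the return code and aborts with file/line on failure. */
#define CHECK(call) do {                                          \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                cudaGetErrorString(err), __FILE__, __LINE__);     \
        exit(1);                                                  \
    }                                                             \
} while (0)

/* usage:  CHECK(cudaMalloc((void **)&d, bytes)); */
```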


I don’t have any of these problems either, and I’ve been doing multi-GPU code for over a year starting with early beta versions. I presume there’s something specific to his kernel that’s creating the problem…