Language confusion with multi-gpu


I’m working on porting a little app of ours from one GPU to multiple GPUs, and I’ve run into some strange behavior with CUDA.

I understand that I can use cudaSetDevice to change what device is currently being used, however this does not seem sufficient.

I can create a thread that controls each card, but this is wasteful, as the CPU could be doing other things besides controlling a GPU. When I attempt to create a single thread that controls multiple GPUs (by calling cudaSetDevice repeatedly) I run into trouble: I receive incorrect results from each of the GPUs being used. Can cudaSetDevice be used in this way?

I cannot share our code, but I can give an idea of what we are doing:

for (x = 0; x < num_gpus; x++) {
    cudaSetDevice(x);
    /* allocate device buffers and copy input data to GPU x */
}

for (x = 0; x < num_gpus; x++) {
    cudaSetDevice(x);
    /* launch the kernel on GPU x */
}

for (x = 0; x < num_gpus; x++) {
    cudaSetDevice(x);
    /* copy the results back from GPU x */
}

Also, from a language standpoint how would I create a single application that allocates different amounts of shared memory on each GPU?
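For the shared-memory part of the question, assuming "shared memory" means dynamic `__shared__` memory: a sketch of one way this could work is to pass a different size as the third launch parameter for each device. `myKernel` and `shmem_bytes` are hypothetical names, not from the original post.

```cuda
// Sketch: the third <<<>>> launch parameter sets the per-block dynamic
// shared-memory size at launch time, so it can differ per GPU.
extern __shared__ float buf[];   // sized by the launch parameter below

__global__ void myKernel(float *data)
{
    /* ... buf has as many bytes as the launch requested ... */
}

void launch_all(int num_gpus, size_t *shmem_bytes, float **d_data)
{
    for (int x = 0; x < num_gpus; x++) {
        cudaSetDevice(x);
        // shmem_bytes[x] can be a different amount for each device
        myKernel<<<64, 256, shmem_bytes[x]>>>(d_data[x]);
    }
}
```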


You really need to create a thread per GPU; I don’t think there is a way around it, and it will not be wasteful: threads that spend their time waiting for the GPU hardly consume any resources.
Switching contexts all the time with cudaSetDevice also adds quite a bit of overhead, I think.
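A minimal sketch of the thread-per-GPU approach, assuming POSIX threads (the worker body and the 16-thread cap are placeholders, not from any post here):

```cuda
#include <pthread.h>

// Each thread binds itself to one device with cudaSetDevice and keeps
// that binding for its whole lifetime.
void *gpu_worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);   // bind this thread to device 'dev'
    /* ... allocate, launch kernels, copy results for this device ... */
    return 0;
}

void run(int num_gpus)
{
    pthread_t tid[16];    // assumes num_gpus <= 16
    for (int x = 0; x < num_gpus; x++)
        pthread_create(&tid[x], 0, gpu_worker, (void *)(long)x);
    for (int x = 0; x < num_gpus; x++)
        pthread_join(tid[x], 0);
}
```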

Hi Wumpus, in my experience, when the CPU thread is waiting for the GPU, it is 100% busy. An example:

  1. Launch kernel

  2. cudaThreadSynchronize()

Another example:

  1. Launch kernel 1

  2. Launch kernel 2 or use cudaMemcpy (CUDA implicitly invokes cudaThreadSynchronize())

CUDA seems to use some form of busy-wait that uses 100% of the CPU when it is waiting for the GPU. I’d be interested to hear if you or anyone else has a way around this problem.

I was assuming this was fixed in Cuda 1.0, as kernels are launched asynchronously. If it still does busy waiting I see no way around it.

From what I have tested, the kernels are indeed launched asynchronously.

But the synchronization that occurs on the next call to the device uses 100% cpu.

I.e., you can do some amount of processing in parallel on the CPU and the GPU, if you can make the work on the GPU and the CPU last about the same time.

But allowing a thread to sleep until the device has completed its task would be a nice feature…

What one can do then, I guess, is not to call cudaThreadSynchronize() immediately after launching the kernel.

Let’s say you can estimate how long the kernel will run. Then you launch the kernel, let the thread sleep, and when you think the kernel is finished, call cudaThreadSynchronize() to make sure it is.

Don’t think of cudaThreadSynchronize() as a method to wait for the kernel but as a method to make sure that the kernel finished and that the kernel’s results are waiting for you to pick up.

I guess that’s basically what you’re saying… :-)
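The sleep-then-synchronize pattern described above could be sketched like this (`myKernel` and `est_usecs` are hypothetical; the launch configuration is arbitrary):

```cuda
#include <unistd.h>

// Sketch: sleep for the estimated kernel run time instead of spinning,
// then synchronize only to confirm completion and pick up the results.
void launch_and_wait(unsigned est_usecs)
{
    myKernel<<<64, 256>>>(/* args */);  // returns immediately (async launch)
    usleep(est_usecs);                  // CPU sleeps instead of busy-waiting
    cudaThreadSynchronize();            // confirm the kernel finished; any
                                        // remaining busy-wait should be short
}
```

The obvious drawback is that a bad estimate either wastes time (sleep too long) or brings back the busy-wait (sleep too short).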

The next version of CUDA will make the GPU even less synced to the CPU. This might make it easier to use multiple cards from one thread.

What NVIDIA should do, however, is make a version of SLI for CUDA (in addition to making it easier to control multiple cards from one thread). Since most of the Tesla products connect many cards to one server, shouldn’t this be a high-priority category of features?

Personally I feel that this would be a bad way to go. Unless the bus between cards allows same-speed memory reads from the PEs of either card to the memory of the other, it would introduce some amount of NUMA-ness to the video card. Right now, one of the things the video card has going for it is uniform, fast, well-hidden memory access.

Instead, it would be nice to add a way to split a program across multiple video cards in a more language centric way. A programmer could specify a split in the data being used by threads and blocks in such a way that it could be split across different cards.

Agreed. I applaud NVIDIA for making the unusual but good choice of creating a slightly new language rather than implementing an API that is traditional but complex (or rather, I applaud it for doing both). A language-centric method for using multiple cards would be a nice continuation of that innovation.

CUDA SLI could be NUMA, in which case it’s still better than nothing. It could also be like graphics SLI, where each card keeps a copy of memory. Read performance would then be stellar. Writes aren’t ever coherent anyway, so that concern nicely falls away too.

Well, having to “guess how long the kernel will work” just sucks.

It is neither efficient nor practical.

If nvidia cannot provide a sleeping synchronization function (which would probably require an interrupt to be sent by the GPU), a simple function to check whether the work is finished would be a first step.

A sleeping synchronization function would introduce a delay between the time the kernel finishes executing on the GPU and the time the finished condition is rechecked. The current busy-wait behavior gives the best GPU performance with short kernels.

I would recommend setting your thread priority low to allow other threads to do some work while the synchronization busy wait runs. Certainly a non-blocking polling function would be nice, though.

Why would a non-blocking polling function introduce less delay than a sleeping synchronization function? In my opinion, non-blocking polling would be less efficient in terms of both delay and CPU load. And since you describe the sleeping sync as "rechecking a finished condition", it seems to me that this is exactly the same thing as non-blocking polling.
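For what it’s worth, the CUDA event API already allows a polling loop of this kind: cudaEventQuery returns cudaErrorNotReady while work recorded before the event is still pending. A sketch, with a hypothetical `myKernel` and an arbitrary 100 µs poll interval trading latency against CPU load:

```cuda
#include <unistd.h>

// Sketch: poll an event instead of calling cudaThreadSynchronize(),
// sleeping between polls so the CPU is not 100% busy.
void wait_polling(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);
    myKernel<<<64, 256>>>(/* args */);   // async launch
    cudaEventRecord(done, 0);            // 0 = default stream
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(100);                     // yield the CPU between polls
    cudaEventDestroy(done);
}
```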