Language confusion with multi-gpu


I’m working on porting a little app of ours from one GPU to multiple GPUs, and I’ve run into some strange behavior with CUDA.

I understand that I can use cudaSetDevice to change what device is currently being used, however this does not seem sufficient.

I can create a thread that controls each card, but this is wasteful, as the CPU could be doing other things besides controlling a GPU. When I attempt to create a single thread that controls multiple GPUs (by calling cudaSetDevice repeatedly) I run into trouble: I receive incorrect results from each of the GPUs being used. Can cudaSetDevice be used in this way?

I cannot share our code, but I can give an idea of what we are doing:

for (x = 0; x < num_gpus; x++) {
    cudaSetDevice(x);
    /* allocate device buffers and copy input data to GPU x */
}

for (x = 0; x < num_gpus; x++) {
    cudaSetDevice(x);
    /* launch the kernel on GPU x */
}

for (x = 0; x < num_gpus; x++) {
    cudaSetDevice(x);
    /* copy the results back from GPU x */
}

Also, from a language standpoint how would I create a single application that allocates different amounts of shared memory on each GPU?
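For the shared-memory part of the question, assuming "shared memory" means dynamic `__shared__` memory: a sketch of one way this could work is to pass a different size as the third launch parameter for each device. `myKernel` and `shmem_bytes` are hypothetical names, not from the original post.

```cuda
// Sketch: the third <<<>>> launch parameter sets the per-block dynamic
// shared-memory size at launch time, so it can differ per GPU.
extern __shared__ float buf[];   // sized by the launch parameter below

__global__ void myKernel(float *data)
{
    /* ... buf has as many bytes as the launch requested ... */
}

void launch_all(int num_gpus, size_t *shmem_bytes, float **d_data)
{
    for (int x = 0; x < num_gpus; x++) {
        cudaSetDevice(x);
        // shmem_bytes[x] can be a different amount for each device
        myKernel<<<64, 256, shmem_bytes[x]>>>(d_data[x]);
    }
}
```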


You really need to create a thread per GPU; I don’t think there is a way around it, and it will not be wasteful: threads that spend their time waiting for the GPU hardly consume any resources.
Switching contexts all the time with cudaSetDevice also adds quite a bit of overhead, I think.
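A minimal sketch of the thread-per-GPU approach, assuming POSIX threads (the worker body and the 16-thread cap are placeholders, not from any post here):

```cuda
#include <pthread.h>

// Each thread binds itself to one device with cudaSetDevice and keeps
// that binding for its whole lifetime.
void *gpu_worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);   // bind this thread to device 'dev'
    /* ... allocate, launch kernels, copy results for this device ... */
    return 0;
}

void run(int num_gpus)
{
    pthread_t tid[16];    // assumes num_gpus <= 16
    for (int x = 0; x < num_gpus; x++)
        pthread_create(&tid[x], 0, gpu_worker, (void *)(long)x);
    for (int x = 0; x < num_gpus; x++)
        pthread_join(tid[x], 0);
}
```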

Hi Wumpus, in my experience, when the CPU thread is waiting for the GPU, it is 100% busy. An example:

  1. Launch kernel

  2. cudaThreadSynchronize()

Another example:

  1. Launch kernel 1

  2. Launch kernel 2 or use cudaMemcpy (CUDA implicitly invokes cudaThreadSynchronize())

CUDA seems to use some form of busy-wait that uses 100% of the CPU when it is waiting for the GPU. I’d be interested to hear if you or anyone else has a way around this problem.

I was assuming this was fixed in Cuda 1.0, as kernels are launched asynchronously. If it still does busy waiting I see no way around it.

From what I have tested, the kernels are indeed launched asynchronously.

But the synchronization that occurs on the next call to the device uses 100% cpu.

I.e., you can do some amount of processing in parallel on the CPU and the GPU, if you can make the work on the GPU and the CPU last about the same time.

But allowing a thread to sleep until the device has completed its task would be a nice feature…

What one can do then, I guess, is not to call cudaThreadSynchronize() immediately after launching the kernel.

Let’s say you can estimate how long the kernel will run. Then you launch the kernel, let the thread sleep, and when you think the kernel is finished, call cudaThreadSynchronize() to make sure it is.

Don’t think of cudaThreadSynchronize() as a method to wait for the kernel but as a method to make sure that the kernel finished and that the kernel’s results are waiting for you to pick up.

I guess that’s basically what you’re saying… :-)
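The sleep-then-synchronize pattern described above could be sketched like this (`myKernel` and `est_usecs` are hypothetical; the launch configuration is arbitrary):

```cuda
#include <unistd.h>

// Sketch: sleep for the estimated kernel run time instead of spinning,
// then synchronize only to confirm completion and pick up the results.
void launch_and_wait(unsigned est_usecs)
{
    myKernel<<<64, 256>>>(/* args */);  // returns immediately (async launch)
    usleep(est_usecs);                  // CPU sleeps instead of busy-waiting
    cudaThreadSynchronize();            // confirm the kernel finished; any
                                        // remaining busy-wait should be short
}
```

The obvious drawback is that a bad estimate either wastes time (sleep too long) or brings back the busy-wait (sleep too short).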

The next version of CUDA will make the GPU even less synced to the CPU. This might make it easier to use multiple cards from one thread.

What NVIDIA should do, however, is make a version of SLI for CUDA (in addition to making it easier to control multiple cards from one thread). Since most of the Tesla products connect many cards to one server, shouldn’t this be a high-priority category of features?

Personally I feel that this would be a bad way to go. Unless the bus between cards allows same-speed memory reads from the PEs of either card to the memory of the other, it would introduce some amount of NUMA-ness to the video card. Right now, one of the things the video card has going for it is uniform, fast, well-hidden memory access.

Instead, it would be nice to add a way to split a program across multiple video cards in a more language centric way. A programmer could specify a split in the data being used by threads and blocks in such a way that it could be split across different cards.

Agreed. I applaud NVIDIA for making the unusual but good choice of creating a slightly new language rather than implementing an API that is traditional but complex (or rather, I applaud it for doing both). A language-centric method for using multiple cards would be a nice continuation of that innovation.

CUDA SLI could be NUMA, in which case it’s still better than nothing. It could also be like graphics SLI, where each card keeps a copy of memory. Read performance would then be stellar. Writes aren’t ever coherent anyway, so that concern nicely falls away too.

Well, having to “guess how long the kernel will work” just sucks.

It is neither efficient nor practical.

If nvidia cannot provide a sleeping synchronization function (which would probably require an interrupt to be sent by the GPU), a simple function to check whether the work is finished would be a first step.

A sleeping synchronization function would introduce a delay between the time the kernel finishes executing on the GPU and the time the finished condition is rechecked. The current busy-wait behavior gives the best GPU performance with short kernels.

I would recommend setting your thread priority low to allow other threads to do some work while the synchronization busy wait runs. Certainly a non-blocking polling function would be nice, though.

Why would a non-blocking polling function introduce less delay than a sleeping synchronization function? In my opinion, non-blocking polling would be less efficient in terms of both delay and CPU load. And since you describe the sleeping sync as "rechecking a finished condition", it seems to me that this is exactly the same thing as non-blocking polling.
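For what it’s worth, the CUDA event API already allows a polling loop of this kind: cudaEventQuery returns cudaErrorNotReady while work recorded before the event is still pending. A sketch, with a hypothetical `myKernel` and an arbitrary 100 µs poll interval trading latency against CPU load:

```cuda
#include <unistd.h>

// Sketch: poll an event instead of calling cudaThreadSynchronize(),
// sleeping between polls so the CPU is not 100% busy.
void wait_polling(void)
{
    cudaEvent_t done;
    cudaEventCreate(&done);
    myKernel<<<64, 256>>>(/* args */);   // async launch
    cudaEventRecord(done, 0);            // 0 = default stream
    while (cudaEventQuery(done) == cudaErrorNotReady)
        usleep(100);                     // yield the CPU between polls
    cudaEventDestroy(done);
}
```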