cuda with multicore (multitasking): multicore CPU (for multitasking) and CUDA

Blokkolnam · August 29, 2008, 9:02am

Hi All,

We have an Intel Core 2 Quad (recently running Linux). If we put a video card (or a Tesla) in our PC, can we use the CUDA device in multitasking mode? I mean: if I run, for instance, four simulations (with different parameters, for example) on the 4 CPU cores at the same time, can the four CPU programs (each including some CUDA code) share the same CUDA device? Or does this work only if we have 4 NVIDIA cards?
Thanks in advance,
Best Regards,

András

theMarix · August 29, 2008, 10:37am

It should work as long as you don’t run out of memory on the device.

Blokkolnam · August 29, 2008, 12:18pm

Okay, and what about when the 4 CPU programs try to launch 4 kernels at the same time? If they fit in GPU memory, can the 4 kernels run simultaneously?

Or the 4 kernels of the 4 CPU apps have to wait for each other to finish execution on the GPU?

I read this in the 1.0 FAQ:

Because if the kernels cannot run at the same time, the 4 CPU tasks have to wait for each other. But I guess even in this case there is a good speedup compared to the simple CPU multitasking we use now…

Thanks much,

Regards,

András

MisterAnderson42 · August 29, 2008, 12:36pm

It works best if you have 4 video cards. I wouldn’t recommend running more than 1 app on a single GPU at a time, for performance reasons. If I run 2 instances of my app on the same GPU, each instance runs at 1/4 of the speed it would reach as a single instance, so the overall loss of performance is ~1/2.
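The arithmetic behind that ~1/2 figure can be checked with a minimal sketch. The function name `aggregate_throughput` is hypothetical and only illustrates the reasoning; the inputs (2 instances, each at 1/4 speed) are the numbers quoted in the post.

```python
def aggregate_throughput(instances: int, per_instance_fraction: float) -> float:
    """Combined throughput, relative to one instance owning the GPU alone."""
    return instances * per_instance_fraction


# Numbers from the post: 2 instances, each at 1/4 of its solo speed.
combined = aggregate_throughput(2, 0.25)
loss = 1.0 - combined

print(f"combined throughput: {combined:.2f}x, loss: {loss:.0%}")
# prints "combined throughput: 0.50x, loss: 50%"
```

In other words, the two instances together get through half as much work per second as a single instance would on its own.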

Blokkolnam · August 29, 2008, 5:39pm

I understand it now, thanks very much,

regards,

András

alexao · February 20, 2009, 1:56pm

I am bumping this thread as I am doing some experiments on running apps concurrently on one device. Is there a reason why two instances of the same application or two different applications running concurrently should kill performance? Apart from the different applications blocking when doing cudaMemcpy?

MisterAnderson42 · February 20, 2009, 2:30pm

Context switching is expensive.

Realize that you have resurrected a very old post. I believe the overhead is less now in CUDA 2.1. I just ran 2 benchmark jobs on one 8800 GTX and the performance dropped from 191 steps/s to 90 steps/s. This is a code that is running almost fully on the GPU with very little memcpy host<->device.

alexao · February 20, 2009, 9:02pm

I realize that it is an old thread, which is why I am bumping it. Context switching is usually expensive, but I am not sure how it is performed in CUDA, which is one of the reasons I am asking about it. Is it the device driver that has to decide which application’s thread blocks are given to the SIMT units?

alexao · February 23, 2009, 1:09pm

Okay, I have done some pretty simple tests to see how concurrent requests to the CUDA driver are handled. It seems that they are serialized by the driver, which makes the applications think that they run concurrently while giving a slower total runtime. I do not, however, see any penalty once an application has been scheduled by the driver to run on the GPU: I then get the same gputime as when I ran the application in isolation.

Are these the same kind of results you have gotten, MisterAnderson42? If not, I am interested to know how you benchmarked.

MisterAnderson42 · February 23, 2009, 4:10pm

I benchmarked this just as I said in the previous post. I ran two instances of the same HOOMD benchmark at once. In CUDA 2.1 the performance of each benchmark dropped from 191 to 90 steps/s, so the performance per process is 94% of what it would be in an ideal world with no context switching overhead.
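The 94% figure follows if the ideal case is taken to be an even split of the solo rate between the two processes. A minimal sketch of that calculation (the helper name `sharing_efficiency` is hypothetical; the rates are the ones quoted above):

```python
def sharing_efficiency(solo_rate: float, shared_rate: float, processes: int) -> float:
    """Measured per-process rate divided by the ideal even share of the GPU."""
    ideal_share = solo_rate / processes  # 191 / 2 = 95.5 steps/s
    return shared_rate / ideal_share


# 191 steps/s alone, 90 steps/s each when two processes share the GPU.
eff = sharing_efficiency(191.0, 90.0, 2)
print(f"{eff:.1%}")  # prints "94.2%"
```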

I’ve never tried to isolate what the driver does myself; I just took it as a given that the driver would serialize the calls. In the HPC world, nobody will ever run more than one process/host thread per GPU, so I have never really cared.

alexao · February 23, 2009, 4:29pm

okay, thanks for your input.

I am looking at CUDA from a GPGPU perspective, so for me it is interesting to know. It seems the processes are context switched normally until a kernel launch occurs; then the first launch that occurs blocks the CPU until completion before letting the next process launch. So, indirectly, the driver accepts only one request at a time.

tmurray · February 23, 2009, 5:33pm

Blocks the CPU until completion? No, that’s not true (or certainly shouldn’t be). It will consume the GPU until it’s completed, that’s true. The driver will still queue plenty of kernel/memcpy calls while the GPU is in use.
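The distinction here, that a launch occupies the GPU but does not block the host, can be illustrated with a toy model. This is not the CUDA API and does not show how the driver is implemented: `ToyDevice`, `launch`, and `synchronize` are hypothetical names, and the model only assumes a FIFO work queue drained by a single worker standing in for the GPU.

```python
import queue
import threading


class ToyDevice:
    """Toy stand-in for a GPU: one worker drains a FIFO queue of submitted work."""

    def __init__(self):
        self._work = queue.Queue()
        self.completed = []
        threading.Thread(target=self._run, daemon=True).start()

    def launch(self, name, gate=None):
        """Enqueue work and return immediately; the caller is never blocked."""
        self._work.put((name, gate))

    def _run(self):
        while True:
            name, gate = self._work.get()
            if gate is not None:
                gate.wait()  # simulate a long-running kernel holding the device
            self.completed.append(name)
            self._work.task_done()

    def synchronize(self):
        """Block the caller until everything queued so far has finished."""
        self._work.join()


device = ToyDevice()
gate = threading.Event()
device.launch("kernel_A", gate)  # occupies the device until the gate opens
device.launch("kernel_B")        # still accepted while the device is busy
device.launch("memcpy_C")
queued_while_busy = list(device.completed)  # nothing has finished yet

gate.set()
device.synchronize()
print(queued_while_busy, device.completed)
# prints "[] ['kernel_A', 'kernel_B', 'memcpy_C']"
```

The host thread submits all three items without waiting, while the single worker still runs them one after another.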

alexao · February 23, 2009, 7:20pm

Yes, I did not mean "block" as in not allowing any more launches or memory operations towards the GPU. I meant that it blocks in the sense that no other CUDA process will get a chance to perform any tasks until the current task on the GPU is completed. Is this a correct understanding?

tmurray · February 23, 2009, 9:14pm

Not necessarily–you can still overlap kernels and memcpys implicitly from different contexts (sometimes).
