Multicore CPU (multitasking) and CUDA

Hi All,

We have an Intel Core 2 Quad machine (recently running Linux). If we put a video card (or a Tesla) in our PC, can we use the CUDA device in multitasking mode? I mean: if I run, for instance, four simulations (with different parameters) on the 4 CPU cores at the same time, can the four CPU programs (each including some CUDA code) share the same CUDA device? Or does this only work if we have 4 NVIDIA cards?
Thanks in advance,
Best Regards,


It should work as long as you don’t run out of memory on the device.

Okay, and what about when the 4 CPU programs try to call 4 kernels at the same time? If they fit in GPU memory, can the 4 kernels run simultaneously?

Or do the 4 kernels of the 4 CPU apps have to wait for each other to finish execution on the GPU?

I read this in the 1.0 FAQ:

Because if the kernels cannot run at the same time, the 4 CPU tasks have to wait for each other. But I guess even in this case there would be a good speedup compared to the plain CPU multitasking we use now…

Thanks much,



It works best if you have 4 video cards. I wouldn’t recommend running more than one app on a single GPU at a time, for performance reasons. If I run 2 instances of my app on the same GPU, each instance runs at about 1/4 of its single-instance speed, so the aggregate loss of throughput is ~1/2.

I understand it now, thanks very much,



I am bumping this thread because I am doing some experiments on running apps concurrently on one device. Is there a reason why two instances of the same application, or two different applications, running concurrently should kill performance, apart from the applications blocking each other when doing cudaMemcpy?

Context switching is expensive.

Realize that you have resurrected a very old post. I believe the overhead is lower now in CUDA 2.1. I just ran 2 benchmark jobs on one 8800 GTX and the per-process performance dropped from 191 steps/s to 90 steps/s. This is a code that runs almost fully on the GPU with very little host<->device memcpy.

I realize that it is an old thread; that is why I am bumping it. Context switching is usually expensive, but I am not sure how it is performed in CUDA, which is one of the reasons I am asking questions about it. Is it the device driver that decides which application’s thread blocks are handed to the SIMT units?

Okay, I have done some pretty simple tests to see how concurrent requests to the CUDA driver are handled. It seems they are serialized by the driver, which makes the applications appear to run concurrently while giving a slower total runtime. I do not, however, see any penalty once an application has actually been scheduled by the driver to run on the GPU: I get the same gputime as when I ran the application in isolation.

Are these the same kind of results you have gotten, MrAnderson? If not, I am interested to know how you benchmarked.
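For anyone wanting to repeat this kind of test, a sketch of a simple harness: run N copies of a benchmark back-to-back, then N copies at once, and compare wall times. The workload below is a CPU stand-in so the script runs anywhere; the binary name in the comment is hypothetical and you would substitute your actual CUDA benchmark.

```python
import subprocess
import sys
import time

# Stand-in workload so this sketch runs without a GPU. On a real system,
# replace it with your CUDA benchmark binary, e.g.
#   WORKLOAD = ["./my_cuda_bench"]   (hypothetical name)
WORKLOAD = [sys.executable, "-c", "sum(range(10**6))"]

def run_serial(n):
    """Run n copies one after another; return total wall time."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(WORKLOAD, check=True)
    return time.perf_counter() - start

def run_concurrent(n):
    """Run n copies at the same time; return total wall time."""
    start = time.perf_counter()
    procs = [subprocess.Popen(WORKLOAD) for _ in range(n)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 2
    t_ser = run_serial(n)
    t_con = run_concurrent(n)
    print(f"serial: {t_ser:.2f}s  concurrent: {t_con:.2f}s")
    # If the driver fully serializes the GPU work, t_con approaches t_ser;
    # if the runs overlap well, t_con approaches t_ser / n.
```

The interesting number is the ratio of the two times: close to 1/n means good overlap, close to 1 means full serialization.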

I benchmarked this just as I said in the previous post: I ran two instances of the same HOOMD benchmark at once. In CUDA 2.1 the performance of each instance dropped from 191 to 90 steps/s, so performance per process is ~94% of what it would be in an ideal world with no context-switching overhead.

I’ve never tried to isolate what the driver does myself; I just took it as a given that the driver would serialize the calls. In the HPC world, nobody will ever run more than one process/host thread per GPU, so I have never really cared.
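The 94% figure above works out as follows (a quick check using the numbers quoted in the post: 191 steps/s alone, 90 steps/s each with two instances sharing the GPU):

```python
# Benchmark figures from the post above.
single_rate = 191.0   # steps/s, one instance alone on the GPU
shared_rate = 90.0    # steps/s per instance, two instances at once

# Ideal sharing with zero overhead would split the rate evenly.
ideal_shared = single_rate / 2                 # 95.5 steps/s
per_process_efficiency = shared_rate / ideal_shared
aggregate_throughput = 2 * shared_rate / single_rate

print(f"per-process efficiency:  {per_process_efficiency:.1%}")   # ~94.2%
print(f"aggregate vs. one alone: {aggregate_throughput:.1%}")     # ~94.2%
```

So the two processes together lose only ~6% of total throughput to the sharing overhead, even though each individual run looks less than half as fast.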

Okay, thanks for your input.

I am looking at CUDA from a GPGPU perspective, so for me it is interesting to know. It seems the processes are context-switched normally until a kernel launch occurs; the first launch to occur then blocks the CPU until completion before letting the next process launch. So, indirectly, the driver accepts only one request at a time.

Blocks the CPU until completion? No, that’s not true (or certainly shouldn’t be). It will consume the GPU until it’s completed, that’s true. The driver will still queue plenty of kernel/memcpy calls while the GPU is in use.
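One way to picture the behaviour described here is a FIFO in front of a single worker: launches return to the caller immediately and get queued, while execution is serialized. A toy model of that idea (plain Python threads standing in for the driver and GPU; this is an illustration, not the actual driver implementation):

```python
import queue
import threading
import time

# Toy model: launches go into a FIFO queue and return at once; a single
# worker thread plays the role of the GPU, running one item at a time.
gpu_queue = queue.Queue()
finished = []

def gpu_worker():
    while True:
        item = gpu_queue.get()
        if item is None:          # shutdown sentinel
            break
        name, duration = item
        time.sleep(duration)      # "kernel execution" occupies the GPU
        finished.append(name)
        gpu_queue.task_done()

def launch(name, duration):
    """Returns immediately, like an asynchronous kernel launch."""
    gpu_queue.put((name, duration))

worker = threading.Thread(target=gpu_worker)
worker.start()

launch("context A: kernel", 0.05)   # two different "contexts" queue work;
launch("context B: memcpy", 0.02)   # neither call blocks the caller
print("both launches returned immediately")

gpu_queue.join()                    # wait for the queue to drain,
gpu_queue.put(None)                 # roughly like cudaThreadSynchronize()
worker.join()
print(finished)                     # executed strictly in queue order
```

In this model the CPU side is never blocked by a launch; only the order of execution on the single worker is serialized, which matches the "queue plenty of calls while the GPU is in use" description above.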

Yes, I did not mean block as in not allowing any more launches or memory operations towards the GPU. I meant that no other CUDA process will get a chance to perform any tasks until the current task on the GPU is completed. Is this a correct understanding?

Not necessarily–you can still overlap kernels and memcpys implicitly from different contexts (sometimes).