CUDA processor allocation

Hi all,

How does CUDA allocate its multiprocessors to different kernels started by multiple host threads? Say I have 2 host threads, and each one loads its own kernel for a different task. Assume also that each kernel has only 8 threads. Is CUDA smart enough to allocate different multiprocessors to different kernels?

In the current CUDA version, any running kernel occupies all available multiprocessors on the card, so it is not possible to run more than one kernel in parallel.
So if you have 2 host threads and each one starts its own kernel (I’m not sure you can actually ‘open’ the same device from 2 different threads), these kernels WILL NOT run in parallel.

Ouch. We were hoping it could be done. It would be a much more effective way to use the card.

Yeah, it’d be cool if you could mix a bandwidth-intensive kernel and an arithmetic-intensive one, for example. Or fire off many quick jobs without losing efficiency. Or just for convenience. I’m sure the driver could do it in theory, since there are never any dependencies between blocks anyway.

It looks like it can be done. At least with different users on the same system.

The only way I know to make different multiprocessors do different things in parallel is to have the kernel check blockIdx and do a different task depending on its value, like this:

if(blockIdx.x==0)
    ... // first task
else if(blockIdx.x==1)
    ... // second task
else ...

But this won’t meet the ‘multiple host threads’ requirement I guess…
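For anyone who wants to try the blockIdx approach, here is a minimal sketch of it. Everything named here (scale_task, offset_task, fused_kernel, run_fused) is made up for illustration, and it assumes a reasonably recent runtime API (cudaDeviceSynchronize). Block 0 runs one task and block 1 runs another, each with 8 threads as in the original question.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical task A: double every element (threads stride over the array).
__device__ void scale_task(float* a, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) a[i] *= 2.0f;
}

// Hypothetical task B: add 1 to every element.
__device__ void offset_task(float* b, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) b[i] += 1.0f;
}

// One launch, two tasks: the block index selects which task a block runs.
__global__ void fused_kernel(float* a, int na, float* b, int nb) {
    if (blockIdx.x == 0)
        scale_task(a, na);
    else if (blockIdx.x == 1)
        offset_task(b, nb);
    // any further blocks would simply do nothing in this sketch
}

// Host helper (assumed name): copies both arrays to the device, runs the
// fused kernel with 2 blocks of 8 threads, and copies the results back.
bool run_fused(std::vector<float>& a, std::vector<float>& b) {
    float *da = nullptr, *db = nullptr;
    size_t sa = a.size() * sizeof(float), sb = b.size() * sizeof(float);
    if (cudaMalloc(&da, sa) != cudaSuccess) return false;
    if (cudaMalloc(&db, sb) != cudaSuccess) { cudaFree(da); return false; }
    cudaMemcpy(da, a.data(), sa, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), sb, cudaMemcpyHostToDevice);

    fused_kernel<<<2, 8>>>(da, (int)a.size(), db, (int)b.size());
    cudaError_t err = cudaDeviceSynchronize();  // wait and pick up any launch error

    cudaMemcpy(a.data(), da, sa, cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), db, sb, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    return err == cudaSuccess;
}
```

Whether the two blocks actually land on different multiprocessors is up to the hardware scheduler; the code only guarantees that they do different work within one launch.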

About that link: yes, simultaneous CUDA applications are possible, but I don’t think they will really run at once on the multiprocessors. There will be GPU context switching along with CPU context switching, and only one thread can do CUDA stuff at once.

No, it won’t meet the ‘multiple host threads’ requirement. The point of having multiple host threads to launch different kernels is that the host can process the results in a timely fashion.

Different kernel launches, be they from one or multiple host threads, are executed one at a time on the device. While intermingling different kernel launches might seem like a good idea at first, a number of memory and synchronization issues creep up, bringing efficiency down.

In most cases, rethinking the parallelization approach helps. Perhaps you can have a 3rd host thread, which will be the only host thread communicating with the CUDA device. The other two host threads would then fill out the data structures and signal to the 3rd thread to launch CUDA kernels and memcopies.
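A rough sketch of that third-thread arrangement, in case it helps. This is only one way to do it, not a recommended implementation: GpuWorker, add_one, add_one_on_gpu and run_two_producers are hypothetical names, and the queue is plain C++11 (std::thread, std::mutex, std::condition_variable). Only the worker thread ever calls into CUDA; the other host threads just submit jobs.

```cuda
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: add 1 to each element.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// The single host thread that talks to the device, fed by a job queue.
class GpuWorker {
public:
    GpuWorker() : worker_([this] { loop(); }) {}
    ~GpuWorker() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();  // remaining jobs are drained before the thread exits
    }
    // Safe to call from any host thread.
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down and nothing left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // all CUDA calls happen here, on this one thread
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist first
};

// Hypothetical job: copy in, launch, copy back. Assumes v is not empty.
void add_one_on_gpu(std::vector<int>& v) {
    int* d = nullptr;
    int n = (int)v.size();
    size_t bytes = v.size() * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);
    add_one<<<(n + 63) / 64, 64>>>(d, n);
    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel is done
    cudaFree(d);
}

// Two producer threads each hand one buffer to the worker.
void run_two_producers(std::vector<int>& a, std::vector<int>& b) {
    GpuWorker gpu;
    std::thread t1([&] { gpu.submit([&a] { add_one_on_gpu(a); }); });
    std::thread t2([&] { gpu.submit([&b] { add_one_on_gpu(b); }); });
    t1.join();
    t2.join();
}  // ~GpuWorker runs here, so both buffers are updated on return
```

The kernels still execute one after another on the device; the gain is that the two producer threads never touch the device themselves and are free to process results while the worker handles launches and memcopies.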

Can you describe your application in more detail? If you don’t want to disclose details publicly, you can send me a message.