Threaded CUDA Multiple concurrent kernels?

Howsit guys

I was wondering what the support is for threaded apps running different independent kernels concurrently, especially on Win32 platforms. If you can point me to a link, that would also be great!

Creating multiple host threads is the only way to use multiple GPUs at the same time in CUDA, so this is well supported. I believe there is a multiGPU example in the SDK projects directory which demonstrates this.
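Roughly, the pattern looks like this (a minimal sketch, not the actual SDK sample — `workerKernel`, the pthread plumbing, and the problem size are all illustrative):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

// Illustrative kernel -- stands in for whatever work each GPU does.
__global__ void workerKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// One host thread per GPU: each thread binds its own CUDA context to a
// different device, so the GPUs run truly concurrently.
static void *gpuWorker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);   // must come before any other CUDA call in this thread

    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    workerKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count > 16) count = 16;

    pthread_t threads[16];
    int ids[16];
    for (int d = 0; d < count; ++d) {
        ids[d] = d;
        pthread_create(&threads[d], NULL, gpuWorker, &ids[d]);
    }
    for (int d = 0; d < count; ++d)
        pthread_join(threads[d], NULL);
    return 0;
}
```

The key point is that the `cudaSetDevice()` call happens inside each worker thread, so each thread's context is bound to a different GPU.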

If you want to have multiple host threads (or processes) access the same GPU, that will also work, but there is an efficiency cost. Each kernel “owns” all the multiprocessors on the GPU while it is running until it completes. If two threads are queuing up kernel launches, the driver will have to alternate between host threads each time a kernel finishes running. There is some overhead in the context switching, which means your total throughput will be lower if you use multiple threads or processes on one GPU. (I don’t have concrete numbers for this, but it has definitely gotten better with newer CUDA releases.)

That said, if each host thread is using the GPU less than 50% of the time, the context switching overhead might not be noticeable. I have no problem running 4 processes using one GPU when each process is only using the GPU 20% of the time.

(Disclaimer: All of my experience is on Linux. The context switching overhead could be different for Windows.)

Very informative! Thank you!

Isn’t there a way to run multiple kernels in parallel on the same GPU?

Not on any of the hardware you can buy today.

Well, you can always do something like:

__global__ void myForkingKernel(float *a, float *b, float *c)
{
  if (blockIdx.x < 2)
  {
    // first "subkernel": blocks 0 and 1 do one job (body illustrative)
    firstSubkernelWork(a, b, c);
  }
  else
  {
    // second "subkernel": the remaining blocks do another job (body illustrative)
    secondSubkernelWork(a, b, c);
  }
}
But all device functions are inlined, so each thread will use registers for variables from both the first and second “subkernels”.

I think, more accurately, the resources required will be the maximum of the resources required for each “subkernel”. If the first one requires 16 registers and 2000 bytes of smem, and the second requires 18 registers and 300 bytes of smem, the combined kernel will require 18 registers and 2000 bytes of smem (all per thread).

It will probably use the sum of the shared memory — so 18 regs and 2300 bytes of smem.

You can template the kernel — the compiler should create two (or more, as you request) versions of your kernel, each with only the amount of resources it needs.

You can look at the SDK samples for templated kernels.
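A minimal sketch of the idea (the two subkernel bodies here are just illustrative placeholders):

```cuda
// Each instantiation is compiled as a separate kernel, so each version
// gets only its own register/smem footprint rather than the worst case
// of both branches combined.
template <int WHICH>
__global__ void myKernel(float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (WHICH == 0)
        c[i] = a[i] + b[i];   // "first subkernel" (illustrative)
    else
        c[i] = a[i] * b[i];   // "second subkernel" (illustrative)
}

// Host side: launch whichever version you need.
//   myKernel<0><<<grid, block>>>(a, b, c);
//   myKernel<1><<<grid, block>>>(a, b, c);
```

Since `WHICH` is a compile-time constant, the dead branch is eliminated in each instantiation.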


You could, but that won’t allow you to run multiple independent subkernels concurrently, so it isn’t really relevant to this thread.