multiprocessor cores and SIMT

Newbie question here (I am probably asking the wrong thing). A grid of threadblocks always lives on a single multiprocessor, and any block must live on the same core I assume. Can a multiprocessor be running more than one kernel at a time (in parallel, on different cores)? If it is doing so is there a significant performance hit, is it as if the cores were basically taking turns? Can a core be running more than one kernel and if so what does it mean for performance?


AFAIK, the whole GPU device can execute only one kernel at once. A good example: when your kernel is holding the GPU in a few seconds, you cannot see the mouse dragged.

The grid of blocks is distributed over all multiprocessors in the device. (30, in the case of the GTX 275, 280, and 285)

A block is confined to run on one multiprocessor so that the threads in the block can communicate via the shared memory in that multiprocessor.

No, but a multiprocessor can run more than one block (up to 8) from the same kernel at once. The limit here is the shared memory and register usage of the block. If each block requires 5000 bytes of shared memory and 2304 registers, then 3 blocks will run at a time on each multiprocessor. (In this example, the shared memory is the resource limit.) If there are more blocks than can fit on all the multiprocessors in the device, the extra blocks sit in a queue and wait for active blocks to finish.

Although you can’t run more than one kernel at a time, you can run a “fat kernel” with no penalty. A fat kernel has a big if statement at the beginning that goes into completely different subroutines. As long as the branch depends on block ID number, there is no penalty for this:

__global__ void my_fat_kernel()


  if (blockIdx.x < 128)





In fact, you can branch on warp boundaries (threads 0-31 take one branch, threads 32-63 take another branch, and so on), also with no penalty. Of course, this is a really cumbersome way to make the GPU do two different things at once. You also run into the problem that if do_something1() takes more time than do_something2(), you have to wait until both are done before starting another kernel.

Mostly, the fat kernel is to illustrate that the hardware probably could (or is at least very close to) running multiple kernels at once, but there is no software interface to do so. You should probably not try to use the fat kernel technique given its limitations, though. Usually it is better to run kernels sequentially instead.

The ability to truly run multiple kernels at once has been requested in the forums a few times, but we haven’t had any confirmation that NVIDIA is working on it. This could mean anything from “It is impossible” to “You’ll see it in the next release.” NVIDIA doesn’t do roadmaps. :)