multiprocessor cores and SIMT

Newbie question here (I am probably asking the wrong thing). A grid of thread blocks always lives on a single multiprocessor, and any given block must live on the same core, I assume. Can a multiprocessor run more than one kernel at a time (in parallel, on different cores)? If so, is there a significant performance hit, or is it as if the cores were basically taking turns? Can a single core run more than one kernel, and if so, what does that mean for performance?

-John

AFAIK, the whole GPU device can execute only one kernel at a time. A good example: while your kernel holds the GPU for a few seconds, you can't even see the mouse cursor move.

The grid of blocks is distributed over all the multiprocessors in the device (30 of them, in the case of the GTX 275, 280, and 285).

A block is confined to run on one multiprocessor so that the threads in the block can communicate via the shared memory in that multiprocessor.

No, but a multiprocessor can run more than one block (up to 8) from the same kernel at once. The limit here is the shared memory and register usage of the block. If each block requires 5000 bytes of shared memory and 2304 registers, then 3 blocks will run at a time on each multiprocessor. (In this example, the shared memory is the resource limit.) If there are more blocks than can fit on all the multiprocessors in the device, the extra blocks sit in a queue and wait for active blocks to finish.

Although you can’t run more than one kernel at a time, you can run a “fat kernel” with no penalty. A fat kernel has a big if statement at the beginning that goes into completely different subroutines. As long as the branch depends on block ID number, there is no penalty for this:

__global__ void my_fat_kernel()
{
    if (blockIdx.x < 128)
        do_something1();
    else
        do_something2();
}

In fact, you can branch on warp boundaries (threads 0-31 take one branch, threads 32-63 take another branch, and so on), also with no penalty. Of course, this is a really cumbersome way to make the GPU do two different things at once. You also run into the problem that if do_something1() takes more time than do_something2(), you have to wait until both are done before starting another kernel.
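The warp-boundary variant can be sketched like this (device code only; `do_work_a` and `do_work_b` are hypothetical `__device__` functions standing in for the two tasks):

```cuda
// Sketch: the branch is chosen per warp, so there is no intra-warp divergence.
__device__ void do_work_a() { /* ... first task ... */ }
__device__ void do_work_b() { /* ... second task ... */ }

__global__ void warp_split_kernel()
{
    int warp_id = threadIdx.x / 32;  // warpSize is 32 on current hardware

    if (warp_id % 2 == 0)
        do_work_a();  // even warps: threads 0-31, 64-95, ...
    else
        do_work_b();  // odd warps: threads 32-63, 96-127, ...
}
```

Because every thread in a warp takes the same branch, the SIMT hardware never has to serialize the two paths within a warp, which is why there is no divergence penalty.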

Mostly, the fat kernel is to illustrate that the hardware probably could run multiple kernels at once (or is at least very close to being able to), but there is no software interface for it. You should probably not try to use the fat kernel technique given its limitations, though. Usually it is better to run kernels sequentially instead.

The ability to truly run multiple kernels at once has been requested in the forums a few times, but we haven’t had any confirmation that NVIDIA is working on it. This could mean anything from “It is impossible” to “You’ll see it in the next release.” NVIDIA doesn’t do roadmaps. :)