If a GPU kernel is composed of many blocks, each with many threads, then is there anything that can be said for sure about how those blocks will map onto the multiprocessors of a GPU? I ask because I’m working on code where each block needs some shared memory, and the shared memory on one multiprocessor is small enough that only a few blocks can fit at any one time. I know that multiple blocks can be active simultaneously on a single SM, and don’t want to overload shared memory.
If a single SM can fit 4 blocks' worth of data in its shared memory, but each thread only needs a few registers, do I have to force the number of blocks to be <= the number of SMs (and give each block 4x as many threads) in order to guarantee that a given SM never needs more than 4 blocks' worth of shared memory active?
The driver and hardware take care of that themselves. A block will only run if there are sufficient free registers, sufficient free shared memory, and the MP scheduler has enough free resources to manage the additional threads/warps. If your blocks are thread- and register-"light" but shared-memory-"heavy", then the scheduler will limit the number of active blocks per MP on the basis of shared memory.
EDIT: there is a nice occupancy calculator spreadsheet in the toolkit that you can play with to get a feeling for how the various resource limits affect occupancy.
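You can also query this at runtime with `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which reports how many blocks of a kernel fit on one SM given its register and shared-memory usage. A minimal sketch (the kernel and the 12 KB figure are made up for illustration):

```cuda
#include <cstdio>

// Hypothetical kernel that is register-light but shared-memory-heavy.
__global__ void heavyKernel(float *out)
{
    __shared__ float tile[12 * 1024 / sizeof(float)]; // 12 KB static shared memory
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of 128 threads can be resident per SM,
    // given this kernel's static shared memory and register usage
    // (last argument = dynamic shared memory per block, here none).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, heavyKernel, 128, 0);
    printf("Active blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

On a device with 48 KB of shared memory per SM, a 12 KB-per-block kernel would be capped at 4 resident blocks, exactly the situation you describe.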
Now to make things more complicated: multiple blocks have a lot of data in common, and I’m parking that common data in shared memory too. If the common data takes up more than half of shared memory, am I correct in assuming that the scheduler will only see enough room for one block at a time on each SM?
(I can’t park the common data in constant memory, because different threads need different bits of it at the same time and the total working set size exceeds the size of the constant cache)
You can't do that: shared memory is block-scoped and has the lifetime of a running block. It can only be shared among the threads within a running block, not between blocks and not between multiprocessors.
No, for the reasons outlined above. Shared memory is block-scoped; it does not exist independently of blocks.
Avid is correct, the block abstraction means you can’t reuse shared memory.
But I’ll elaborate on that:
Sometimes that shared-memory initialization is so common across work chunks, and so expensive, that it's worth the effort to manually schedule multiple work chunks within a single block. This can be done by putting a loop INSIDE the block.
So your blockwise code would do `Init(); Work();`, while the loop method would be `Init(); while (stuff_to_do) Work(loop++);`
This amortizes that shared memory setup time.
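A hedged sketch of that pattern, assuming a made-up kernel where each block strides over several chunks (`numChunks`, `common`, and the body of the work step are all illustrative names, not anything from your code):

```cuda
// One block amortizes its shared-memory init over several work chunks
// instead of launching one block per chunk.
__global__ void chunkedKernel(const float *in, float *out, int numChunks)
{
    __shared__ float common[1024];

    // Pay the shared-memory setup cost once per block... (the Init() step)
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        common[i] = in[i];
    __syncthreads();

    // ...then loop over the chunks this block is responsible for,
    // striding by gridDim.x so the chunks are spread across blocks.
    for (int chunk = blockIdx.x; chunk < numChunks; chunk += gridDim.x)
    {
        int idx = chunk * blockDim.x + threadIdx.x;
        out[idx] = common[threadIdx.x] * in[idx]; // stand-in for Work(chunk)
        __syncthreads(); // keep the block in step before the next chunk
    }
}
```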
Alternatively, you could even use global atomics to "fetch" new work inside the block. This works as well, but is more brittle, since performance becomes sensitive to your exact problem. The extra complexity can often create inefficiencies (especially since you subvert SM scheduling, so you may have a single block delaying the completion of a whole kernel even when you have lots of idle SMs).
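For completeness, a sketch of the atomic work-fetching variant. Everything here is hypothetical: `workCounter` is a single `int` in global memory that you would zero with `cudaMemset` before launch.

```cuda
// Blocks pull chunk indices from a global counter until the work runs out.
__global__ void workFetchKernel(float *out, int *workCounter, int numChunks)
{
    __shared__ int chunk; // all threads in the block work on the same chunk

    while (true)
    {
        // One thread per block claims the next chunk index.
        if (threadIdx.x == 0)
            chunk = atomicAdd(workCounter, 1);
        __syncthreads(); // everyone sees the claimed chunk

        if (chunk >= numChunks)
            break; // all work has been claimed; the whole block exits together

        out[chunk * blockDim.x + threadIdx.x] = (float)chunk; // stand-in for Work()
        __syncthreads(); // don't let thread 0 overwrite `chunk` early
    }
}
```

Because `chunk` is shared, the exit condition is uniform across the block, so the `break` is safe. This is the pattern that can leave one long-running block holding up the kernel while other SMs sit idle.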
Another independent way to do intrablock scheduling, depending on your problem, is to do your init and then assign subproblems per WARP, provided the warps don't need to intercommunicate. This is usually not as efficient, since you can't use syncthreads(), but it can still be an option, especially if your computations are simple and have uniform execution time.
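A minimal sketch of the per-warp variant, again with made-up names; after the last block-wide sync, each warp owns an independent subproblem and never synchronizes with the others:

```cuda
// After a shared init, each warp takes its own independent subproblem.
__global__ void perWarpKernel(const float *in, float *out)
{
    __shared__ float common[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        common[i] = in[i];
    __syncthreads(); // last block-wide sync: the shared init is done

    int warp = threadIdx.x / 32; // which subproblem this warp owns
    int lane = threadIdx.x % 32; // position within the warp
    int idx  = warp * 32 + lane;
    out[idx] = common[idx] * 2.0f; // stand-in for independent per-warp work
}
```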
Thanks to both of you; what I didn't understand before was that each block gets its own private instance of any shared data structure declared in .cu code, carved out of its SM's shared memory when the block is scheduled. I now see that one block cannot reuse shared memory initialized by another; it's just that the nature of my computations is such that I can't have more than a very few threads cooperating, and I was looking at adding blocks to increase the parallelism.