If a GPU kernel is composed of many blocks, each with many threads, then is there anything that can be said for sure about how those blocks will map onto the multiprocessors of a GPU? I ask because I’m working on code where each block needs some shared memory, and the shared memory on one multiprocessor is small enough that only a few blocks can fit at any one time. I know that multiple blocks can be active simultaneously on a single SM, and don’t want to overload shared memory.
If a single SM can fit 4 blocks' worth of data in its shared memory, but each thread only needs a few registers, do I have to force the number of blocks to be <= the number of SMs (and give each block 4x as many threads) in order to guarantee that a given SM never needs more than 4 blocks' worth of shared memory active?
The driver and hardware take care of that themselves. A block will only run if there are sufficient free registers, sufficient free shared memory, and the MP scheduler has enough free resources to manage the additional threads/warps. If your blocks are thread- and register-"light" but shared-memory-"heavy", then the scheduler will limit the number of active blocks per MP on the basis of shared memory.
EDIT: there is a nice occupancy calculator spreadsheet in the toolkit that you can play with to get a feeling for how the various resource limits affect occupancy.
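You can also query this at runtime with `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which reports how many blocks of a kernel fit on one SM given its register and shared-memory usage. A minimal sketch (the kernel and the 12 KB figure are made up for illustration):

```cuda
#include <cstdio>

// Hypothetical kernel that is register-light but shared-memory-heavy.
__global__ void heavyKernel(float *out)
{
    __shared__ float tile[12 * 1024 / sizeof(float)]; // 12 KB static shared memory
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of 128 threads can be resident per SM,
    // given this kernel's static shared memory and register usage
    // (last argument = dynamic shared memory per block, here none).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, heavyKernel, 128, 0);
    printf("Active blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

On a device with 48 KB of shared memory per SM, a 12 KB-per-block kernel would be capped at 4 resident blocks, exactly the situation you describe.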
Now to make things more complicated: multiple blocks have a lot of data in common, and I’m parking that common data in shared memory too. If the common data takes up more than half of shared memory, am I correct in assuming that the scheduler will only see enough room for one block at a time on each SM?
(I can’t park the common data in constant memory, because different threads need different bits of it at the same time and the total working set size exceeds the size of the constant cache)
You can't do that: shared memory is block-scoped and has the lifetime of a running block. It can only be shared among the threads within a running block, not between blocks and not between multiprocessors.
No, for the reasons outlined above. Shared memory is block-scoped; it does not exist independently of blocks.
Avid is correct, the block abstraction means you can’t reuse shared memory.
But I’ll elaborate on that:
Sometimes that shared-memory initialization is so common across work chunks, and so expensive, that it's worth the effort to manually schedule multiple work chunks within a single block. This can be done by putting a loop INSIDE the block.
So your blockwise code would do `Init(); Work();`, while the loop method would be `Init(); while (stuff_to_do) Work(loop++);`
This amortizes that shared memory setup time.
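A hedged sketch of that pattern, assuming a made-up kernel where each block strides over several chunks (`numChunks`, `common`, and the body of the work step are all illustrative names, not anything from your code):

```cuda
// One block amortizes its shared-memory init over several work chunks
// instead of launching one block per chunk.
__global__ void chunkedKernel(const float *in, float *out, int numChunks)
{
    __shared__ float common[1024];

    // Pay the shared-memory setup cost once per block... (the Init() step)
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        common[i] = in[i];
    __syncthreads();

    // ...then loop over the chunks this block is responsible for,
    // striding by gridDim.x so the chunks are spread across blocks.
    for (int chunk = blockIdx.x; chunk < numChunks; chunk += gridDim.x)
    {
        int idx = chunk * blockDim.x + threadIdx.x;
        out[idx] = common[threadIdx.x] * in[idx]; // stand-in for Work(chunk)
        __syncthreads(); // keep the block in step before the next chunk
    }
}
```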
Alternatively, you could even use global atomics to "fetch" new work inside the block. This works as well, but is more brittle, since performance becomes sensitive to your exact problem. The extra complexity can often create inefficiencies (especially since you subvert SM scheduling, so you may have a single block delaying the completion of a whole kernel even when you have lots of idle SMs).
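For completeness, a sketch of the atomic work-fetching variant. Everything here is hypothetical: `workCounter` is a single `int` in global memory that you would zero with `cudaMemset` before launch.

```cuda
// Blocks pull chunk indices from a global counter until the work runs out.
__global__ void workFetchKernel(float *out, int *workCounter, int numChunks)
{
    __shared__ int chunk; // all threads in the block work on the same chunk

    while (true)
    {
        // One thread per block claims the next chunk index.
        if (threadIdx.x == 0)
            chunk = atomicAdd(workCounter, 1);
        __syncthreads(); // everyone sees the claimed chunk

        if (chunk >= numChunks)
            break; // all work has been claimed; the whole block exits together

        out[chunk * blockDim.x + threadIdx.x] = (float)chunk; // stand-in for Work()
        __syncthreads(); // don't let thread 0 overwrite `chunk` early
    }
}
```

Because `chunk` is shared, the exit condition is uniform across the block, so the `break` is safe. This is the pattern that can leave one long-running block holding up the kernel while other SMs sit idle.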
Another independent way to do intrablock scheduling, depending on your problem, is to do your init and then assign subproblems per WARP, provided the warps don't need to intercommunicate. This is usually not as efficient, since you can't use syncthreads(), but it can still be an option, especially if your computations are simple and have uniform execution time.
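A minimal sketch of the per-warp variant, again with made-up names; after the last block-wide sync, each warp owns an independent subproblem and never synchronizes with the others:

```cuda
// After a shared init, each warp takes its own independent subproblem.
__global__ void perWarpKernel(const float *in, float *out)
{
    __shared__ float common[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        common[i] = in[i];
    __syncthreads(); // last block-wide sync: the shared init is done

    int warp = threadIdx.x / 32; // which subproblem this warp owns
    int lane = threadIdx.x % 32; // position within the warp
    int idx  = warp * 32 + lane;
    out[idx] = common[idx] * 2.0f; // stand-in for independent per-warp work
}
```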
Thanks to both of you; what I didn't understand before was that each block gets its own private instance of any shared data structure declared in .cu code, carved out of its SM's shared memory when the block is scheduled. I now see that one block cannot reuse shared memory initialized by another; it's just that the nature of my computations is such that I can't have more than a very few threads cooperating, and I was looking at adding blocks to increase the parallelism.