I’m trying to understand how I, as a CUDA programmer, should think about shared memory. I am running a Tesla C1060. When I run the SDK example Device Query, it says that there are 16,384 bytes of shared memory available per block. This, of course, isn’t referring to the number of thread blocks; it’s referring to the amount of shared memory available on each of the 30 streaming multiprocessors (SMs), correct? Each SM can actively schedule at most 8 blocks.
So, when I am programming, should I write my code under the assumption that all 8 cores are occupied across all 30 SMs? Or should I plan for half that occupancy? What if the data causes the occupancy to vary?
With 4 thread blocks scheduled at a time on an SM, each thread block can use, when the memory is split evenly, 16KB/4 = 4KB of shared memory. For full occupancy with 8 blocks, each block can use 2KB of shared memory.
Mainly, I’m wondering if this is the correct understanding of how to divvy up shared memory across thread blocks? Is there something else I should be considering?
You are correct that the listed shared memory size is for the SM, which can be split between 1 thru 8 blocks.
Note that “occupancy” is a measure of how many warps are active on the SM relative to the maximum; it does not directly have to do with the number of blocks. On your C1060 (compute capability 1.3), the maximum number of active warps per SM is 32 (1024 threads), so you can reach 100% occupancy with 4 blocks of 256 threads per block, 8 blocks of 128 threads, and anything in between. You can’t achieve 100% occupancy with 1 block, because the maximum # of threads per block is only 512. Note that 100% occupancy is usually not significantly better than 50% occupancy in terms of performance. Occupancy is not a measure of utilization of the CUDA cores, just the number of active warps.
Another thing is that there is no connection between blocks and CUDA cores. Blocks are mapped to SMs, and warps of 32 threads are executed by the SM’s 8 cores (really just fancy arithmetic/logic/floating point units rather than anything as general as a CPU core), one warp instruction issuing across the cores over several clock cycles. So as long as the number of threads in your block is a multiple of 32, you’ll keep all the cores busy, with the exception of stalls while waiting for global memory reads to complete.
Thank you for the detailed reply. Now, let me see that I fully understand you.
So, am I to understand that one block gets scheduled to an SM, and the 8 cores of that SM work in parallel to execute each warp inside that one block? If that were so, then I wouldn’t have to worry about splitting shared memory among 2-8 blocks, because there’s just one being processed by the SM at a given moment.
I think I need to go back to the programming guide or some other material on this, because this raises other questions, and it doesn’t seem to make sense right now.
Another thing to keep in mind is that the hardware is designed to make switching between warps have essentially zero overhead. (That’s why high occupancy helps. It allows the CUDA cores to keep busy on other warps while some are stalled waiting for global memory reads to finish.) In order to do this, unlike multiprocessing on a CPU, every active block on an SM has all of its resources allocated at the start of block execution. There is no “context switch” like with normal CPU threading. Instead, the scheduler on the SM has a list of up to 32 warps, and it fires off the next instruction into the pipelines of the 8 CUDA cores for whichever warps aren’t stalled. A block, once started, never releases its resources on the SM until it is finished.
The preallocation of resources for any active block (where “active” means that warps from that block are available to the SM scheduler for execution) is why blocks have to divide the shared memory resources. If a block uses 6 KB of shared memory, that immediately limits the number of active blocks per SM to two or fewer.
Thanks again for the information, seibert. I’m working on processing it all.
I am curious about something. I read that for pre-Fermi cards, a given SM will wait until all blocks scheduled on it finish before scheduling more blocks. First of all, is this true? If so, then what happens if all blocks on the SM are waiting on some kind of data that never arrives (poorly written code, I know)? Does the GPU ever release these blocks so the data might be provided later, or does it just hang?
I haven’t kept track of experiments to deduce the exact block scheduling behavior, but NVIDIA employees have said that things are most efficient if blocks do similarly sized workloads.
As to the second question: how can a block wait for data that never arrives? That can’t happen. A global memory read will finish in a bounded amount of time, or abort the entire kernel if it reads an illegal address. A block can go into an infinite loop, for example if one or more threads are spinning on some kind of flag in memory, waiting for it to be set before continuing. However, those kinds of global barrier constructs are discouraged in CUDA for the reason you mention: a block, once started, cannot be swapped off the SM until it completes.
This is not as scary as it sounds if your programs follow the generally encouraged CUDA style of “every thread is doing approximately the same operations to different data elements”.