How do thread blocks reside on the multiprocessors?

Hi, all!

I’m puzzled by the residence of the thread blocks on the multiprocessors.

Suppose we have a Tesla M2090, which has 16 SMs, and we launch a kernel with 16 thread blocks. There are two possible schemes:

(SM: streaming multiprocessor, TB: thread block)

a) The TBs are spread evenly, one per SM.
SM: 0 1 2 … 15
TB: 0 1 2 … 15

b) The first 2 SMs hold all 16 TBs (since the maximum for one SM is 8), and SMs 2…15 stay empty.
SM: 0  1   2 … 15
TB: 0  8
TB: 1  9
TB: 2  10
…   …
TB: 7  15

Could anyone tell me which scheme is correct?

Thanks!

The developer has no control over how the scheduler distributes thread blocks to multiprocessors. It probably spreads the blocks over as many SMs as possible, but we have been given no guarantees from NVIDIA.

Thank you!

This question is somewhat related to my previous post (The Official NVIDIA Forums | NVIDIA).

I had a look at Steve Rennich’s webinar on CUDA C/C++ Streams and Concurrency. On page 18, he mentions “fill 1/2 of the SM resources”. I’m confused about what this means: a) is it the programmer’s responsibility to make sure the kernel only fills 1/2 of the SM resources, or b) is the execution configuration of the kernel simply too small to use up the SM resources, e.g. only 8 thread blocks (1024 threads each) in total for a kernel, while there are 16 SMs available?

Now it seems that b) is the intended meaning of “fill 1/2 of the SM resources”.

Seibert, is my understanding correct?

I’m not sure I understand the two choices, but let me try to answer a different way.

When you launch a kernel, you select the number of threads per block and the amount of shared memory to use per block. The number of threads per block determines the number of registers and the number of warps required to run the block. In order for the block to run at all, the number of warps, registers and amount of shared memory all have to be less than the multiprocessor limit for the CUDA architecture you are using. If you fail to meet that requirement, you will get an error code returned by the next CUDA call you make telling you that your launch configuration is invalid.
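
To make that concrete, here is a minimal sketch (not from this thread; the kernel name and sizes are hypothetical) of how an invalid configuration shows up. The Tesla M2090 is compute capability 2.0, where the per-block limit is 1024 threads, so launching with 2048 threads per block fails and the error is reported by the next runtime call:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main()
{
    // 2048 threads per block exceeds the 1024-thread limit on compute
    // capability 2.0 (e.g. Tesla M2090), so this launch is invalid.
    dummyKernel<<<16, 2048>>>();

    // The launch failure is reported by the next CUDA runtime call.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));

    return 0;
}
```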

However, if the per-block resource usage in your kernel is low enough, the block scheduler can distribute multiple blocks to each multiprocessor for simultaneous execution. This helps the warp scheduler on the multiprocessor have more independent warps to work with, which helps hide instruction latency (for example, if some warps have to wait for many clock cycles on memory reads). This is why you often want to have many more blocks than multiprocessors, and you want to keep the resource usage of each block much lower than the multiprocessor limit.
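
If you want to see how many blocks of a particular kernel can be resident on one SM, newer CUDA releases (later than the ones current when this thread was written) expose an occupancy query in the runtime API. A rough sketch, with a hypothetical kernel and block size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    int blocksPerSM = 0;
    int threadsPerBlock = 256;

    // Ask the runtime how many 256-thread blocks of myKernel can reside
    // on one multiprocessor at once, given its register and shared-memory
    // usage (0 bytes of dynamic shared memory here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  threadsPerBlock, 0);

    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

The higher that number, the more independent warps the warp scheduler has available on each SM to hide latency.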