How do thread blocks reside on the multiprocessors?

Hi, all!

I’m puzzled by the residence of the thread blocks on the multiprocessors.

Suppose we have a Tesla M2090, which has 16 SMs, and we launch a kernel with 16 thread blocks. There are two possible schemes:

(SM: streaming multiprocessor, TB: thread block)

a) The TBs are spread evenly, one per SM.
SM: 0 1 2 … 15
TB: 0 1 2 … 15

b) The first 2 SMs hold all 16 TBs (since the maximum for one SM is 8), and SMs 2…15 stay empty.
SM: 0  1   2 … 15
TB: 0  8
TB: 1  9
TB: 2  10
…   …
TB: 7  15

Could anyone tell me which scheme is correct?

Thanks!

The developer has no control over how the scheduler distributes thread blocks to multiprocessors. It probably spreads the blocks over as many SMs as possible, but we have been given no guarantees from NVIDIA.

Thank you!

This question is somewhat related to my previous post (The Official NVIDIA Forums | NVIDIA).

I had a look at Steve Rennich’s webinar on CUDA C/C++ Streams and Concurrency. On page 18, he mentions “fill 1/2 of the SM resources”. I’m confused about what this means: a) is it the programmer’s responsibility to make sure the kernel only fills 1/2 of the SM resources, or b) is the execution configuration of the kernel simply too small to use up the SM resources, e.g. only 8 thread blocks (1024 threads each) in total for a kernel, while there are 16 SMs available?

Now it seems that b) is the intended meaning of “fill 1/2 of the SM resources”.

Seibert, is my understanding correct?

I’m not sure I understand the two choices, but let me try to answer a different way.

When you launch a kernel, you select the number of threads per block and the amount of shared memory to use per block. The number of threads per block determines the number of registers and the number of warps required to run the block. In order for the block to run at all, the number of warps, registers and amount of shared memory all have to be less than the multiprocessor limit for the CUDA architecture you are using. If you fail to meet that requirement, you will get an error code returned by the next CUDA call you make telling you that your launch configuration is invalid.
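
To make that concrete, here is a minimal sketch (not from this thread; the kernel name and sizes are hypothetical) of how an invalid configuration shows up. The Tesla M2090 is compute capability 2.0, where the per-block limit is 1024 threads, so launching with 2048 threads per block fails and the error is reported by the next runtime call:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main()
{
    // 2048 threads per block exceeds the 1024-thread limit on compute
    // capability 2.0 (e.g. Tesla M2090), so this launch is invalid.
    dummyKernel<<<16, 2048>>>();

    // The launch failure is reported by the next CUDA runtime call.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));

    return 0;
}
```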

However, if the per-block resource usage in your kernel is low enough, the block scheduler can distribute multiple blocks to each multiprocessor for simultaneous execution. This helps the warp scheduler on the multiprocessor have more independent warps to work with, which helps hide instruction latency (for example, if some warps have to wait for many clock cycles on memory reads). This is why you often want to have many more blocks than multiprocessors, and you want to keep the resource usage of each block much lower than the multiprocessor limit.
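
If you want to see how many blocks of a particular kernel can be resident on one SM, newer CUDA releases (later than the ones current when this thread was written) expose an occupancy query in the runtime API. A rough sketch, with a hypothetical kernel and block size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    int blocksPerSM = 0;
    int threadsPerBlock = 256;

    // Ask the runtime how many 256-thread blocks of myKernel can reside
    // on one multiprocessor at once, given its register and shared-memory
    // usage (0 bytes of dynamic shared memory here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  threadsPerBlock, 0);

    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

The higher that number, the more independent warps the warp scheduler has available on each SM to hide latency.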