One possible reason is an attempt to maximize occupancy. Roughly defined, occupancy is the number of threads that are resident and executing on an SM. Occupancy may have a number of limiting factors, and the CUDA toolkit ships with an occupancy calculator spreadsheet to help you calculate the possible occupancy for a particular code.
One limiter to occupancy can be register usage. Each GPU SM has a limited number of registers, in many cases 65536. Let's also keep in mind that the maximum number of threads that an SM can sustain for execution is 2048. Achieving 2048 could be called 100% occupancy, and this can be considered an upper bound. Higher occupancy may lead to higher performance for some codes.
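One consequence of those two numbers: for full occupancy, the register file has to be shared across all 2048 threads, which bounds how many registers each thread can use. A quick back-of-the-envelope check (using the example figures above; actual limits also depend on register allocation granularity on real hardware):

```python
# Example figures from the discussion above, not queried from real hardware.
regs_per_sm = 65536          # registers available per SM
max_threads_per_sm = 2048    # maximum resident threads per SM

# To reach 100% occupancy, each thread can use at most this many registers:
max_regs_for_full_occupancy = regs_per_sm // max_threads_per_sm
print(max_regs_for_full_occupancy)  # 32
```

So a kernel using more than 32 registers per thread (like the 36-register example that follows) cannot reach 100% occupancy on such an SM, regardless of how the threadblocks are sized.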
Suppose I have a code that uses 36 registers per thread. If I have a threadblock of 1024 threads, then 36*1024 = 36864 registers are needed to support the execution of that threadblock. The SM has 65536 registers, so that works. But if I wanted to launch another threadblock on the same SM, I would need another 36864 registers. For that, I don't have enough. So the maximum occupancy in this scenario would be 1024 threads, out of the maximum of 2048, or 50%.
Now suppose the threadblock size is 512, with no other changes. Each threadblock needs 512*36 = 18432 registers. In this situation, I could support 3 threadblocks per SM before running out of registers. This gives me an occupancy of 1536 threads, or 75%. It's not guaranteed to be true, but in many cases this higher occupancy can lead to higher overall performance.
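The arithmetic in the two scenarios above can be captured in a small helper. This is a hypothetical sketch of the register-limited occupancy calculation only (a real occupancy calculation also accounts for shared memory, blocks-per-SM limits, and allocation granularity), using the example figures from this discussion:

```python
def register_limited_occupancy(regs_per_thread, block_size,
                               regs_per_sm=65536, max_threads_per_sm=2048):
    """Resident threads and occupancy fraction when registers (or the
    thread limit) are the constraint. Defaults are the example SM
    figures used above, not queried from real hardware."""
    regs_per_block = regs_per_thread * block_size
    blocks_by_regs = regs_per_sm // regs_per_block        # register limit
    blocks_by_threads = max_threads_per_sm // block_size  # thread limit
    blocks = min(blocks_by_regs, blocks_by_threads)
    threads = blocks * block_size
    return threads, threads / max_threads_per_sm

print(register_limited_occupancy(36, 1024))  # (1024, 0.5)  -> 50% occupancy
print(register_limited_occupancy(36, 512))   # (1536, 0.75) -> 75% occupancy
```

This reproduces the two cases above: one 1024-thread block fits (50%), while three 512-thread blocks fit (75%).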
Shared memory usage (if any) by the kernel code can be another limiting factor on occupancy, and in some cases it can have a similar effect to the one described above for register usage.
Therefore, in some cases, a smaller threadblock size can lead to higher overall performance.