maximum threads per block not always used

Hello everyone,

As far as I can see, on current GPUs the maximum number of threads per block is 1024, but most applications I have seen use either 256 or 128 threads.

What is the reason for not using all the available threads? Isn’t that making the block smaller than it could be?

Why wouldn’t it be better to use all the available threads per block and reduce the number of blocks?

One possible reason is an attempt to maximize occupancy. Roughly defined, occupancy is the number of threads that are resident and executing on an SM. Occupancy can be limited by a number of factors, and the CUDA toolkit ships with an occupancy calculator spreadsheet to help you calculate the possible occupancy for a particular code.

One limiter to occupancy can be register usage. Each GPU SM has a limited number of registers, in many cases 65536. Let’s also keep in mind that the maximum number of threads that an SM can sustain for execution is 2048. Achieving 2048 could be called 100% occupancy, and this can be considered an upper bound. Higher occupancy may lead to higher performance for some codes.

Suppose I have a code that uses 36 registers per thread. If I have a threadblock of 1024 threads, then 36*1024 = 36864 registers are needed to support the execution of that threadblock. The SM has 65536 registers, so that works. But if I wanted to launch another threadblock on the same SM, I would need another 36864 registers, and only 28672 remain. For that, I don’t have enough. So the maximum occupancy in this scenario would be 1024 threads, out of the maximum of 2048, or 50%.

Now suppose the threadblock size is 512, with no other changes. Each threadblock needs 512*36 = 18432 registers. In this situation, I could support 3 threadblocks per SM (55296 registers in total) before running out of registers. This gives me an occupancy of 1536 threads, or 75%. It’s not guaranteed to be true, but in many cases this higher occupancy can lead to higher overall performance.
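
For anyone who wants to check the arithmetic in the two register scenarios above, here is a minimal Python sketch. It is a simplified model, not what the occupancy calculator does: it only considers the register file (65536) and the thread limit (2048) quoted above, and it ignores register allocation granularity and the cap on resident blocks per SM. The function name register_limited_occupancy is my own, made up for illustration.

```python
REGISTERS_PER_SM = 65536    # figure quoted above; varies by GPU
MAX_THREADS_PER_SM = 2048   # figure quoted above; varies by GPU


def register_limited_occupancy(regs_per_thread, threads_per_block):
    """Return (resident blocks, resident threads, occupancy fraction) per SM.

    Simplified model: only the register file and the per-SM thread limit
    are considered; allocation granularity and other limits are ignored.
    """
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = REGISTERS_PER_SM // regs_per_block
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    # Whichever resource runs out first decides how many blocks fit.
    blocks = min(blocks_by_regs, blocks_by_threads)
    threads = blocks * threads_per_block
    return blocks, threads, threads / MAX_THREADS_PER_SM


for block_size in (1024, 512):
    blocks, threads, occ = register_limited_occupancy(36, block_size)
    print(f"{block_size} threads/block: {blocks} block(s), "
          f"{threads} threads, {occ:.0%} occupancy")
```

With 36 registers per thread this reproduces the numbers above: 1 block (50%) at 1024 threads per block, and 3 blocks (75%) at 512.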

Shared memory usage (if any) by the kernel code can be another limiting factor to occupancy, and in some cases it can have an effect similar to the one described above for register usage.
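
The shared memory case can be sketched the same way. The numbers here are assumptions chosen purely for illustration, not taken from any particular GPU or kernel: 48 KiB of shared memory per SM, and a hypothetical kernel whose shared memory use grows with block size at 32 bytes per thread. The function name is also made up.

```python
MAX_THREADS_PER_SM = 2048        # figure quoted above; varies by GPU
SHARED_MEM_PER_SM = 48 * 1024    # ASSUMPTION: 48 KiB, illustrative only


def shared_mem_limited_occupancy(shared_bytes_per_thread, threads_per_block):
    """Return (resident blocks, occupancy fraction) per SM.

    Simplified model: shared memory is reserved per block, and only the
    shared memory capacity and the per-SM thread limit are considered.
    """
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    shared_per_block = shared_bytes_per_thread * threads_per_block
    if shared_per_block == 0:
        blocks = blocks_by_threads  # kernel uses no shared memory
    else:
        blocks = min(SHARED_MEM_PER_SM // shared_per_block, blocks_by_threads)
    return blocks, blocks * threads_per_block / MAX_THREADS_PER_SM


# Hypothetical kernel: 32 bytes of shared memory per thread.
for block_size in (1024, 512):
    blocks, occ = shared_mem_limited_occupancy(32, block_size)
    print(f"{block_size} threads/block: {blocks} block(s), {occ:.0%} occupancy")
```

Under these assumed figures the pattern matches the register example: a 1024-thread block needs 32 KiB, so only one fits, while a 512-thread block needs 16 KiB and three fit.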

Therefore, in some cases, smaller threadblock size can lead to higher overall performance.
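
To tie this back to the 128 and 256 block sizes mentioned in the question, the register-only model can be swept over several block sizes. Again this is only a rough sketch under the same simplifications (65536 registers and 2048 threads per SM, no allocation granularity, no cap on resident blocks), so the real occupancy calculator may report somewhat different values. The function name is hypothetical.

```python
def occupancy_by_block_size(regs_per_thread, block_sizes,
                            regs_per_sm=65536, max_threads_per_sm=2048):
    """Map each block size to its register-limited occupancy fraction."""
    result = {}
    for size in block_sizes:
        # Blocks that fit by registers vs. by the per-SM thread limit.
        blocks = min(regs_per_sm // (regs_per_thread * size),
                     max_threads_per_sm // size)
        result[size] = blocks * size / max_threads_per_sm
    return result


table = occupancy_by_block_size(36, (128, 256, 512, 1024))
for size, occ in table.items():
    print(f"{size:>4} threads/block -> {occ:.1%} occupancy")
```

In this model, 128 and 256 threads per block both reach 1792 resident threads (87.5%), compared with 75% at 512 and 50% at 1024, because smaller blocks leave less of the register file stranded.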

Thank you very much for your detailed explanation! It helped me a lot.