Number of thread blocks vs. threads per block: does it make a difference for performance?

For fastest performance, is it typically better to have many threads per thread block and fewer thread blocks, more thread blocks with fewer threads in each, or are those equivalent as long as registers, shared memory, and other resources are not exhausted?

If those are not equivalent even when registers etc. are not exhausted, what kind of logic should one use to decide the optimal configuration, given that (number of thread blocks) × (threads per block) is held constant?

GPUs being throughput machines, it is usually a good idea to load them up with as much work as possible. For this it is advantageous to use hardware resources with as fine a granularity as is feasible, with the goal of maximizing utilization. A reasonable rule of thumb is to start code design with a block size between 128 and 256 threads that is also a multiple of 32, and adjust this up or down as the use case requires.
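As a minimal sketch of that rule of thumb (the kernel and problem size below are hypothetical, just for illustration): pick a block size that is a multiple of 32, say 256, and derive the grid size from the problem size rather than hard-coding it.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scale each element of an array by a factor.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                      // guard against the partial last block
        data[i] *= factor;
    }
}

int main(void)
{
    const int n = 1 << 20;            // hypothetical problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    const int blockSize = 256;        // multiple of 32, in the 128-256 range
    const int gridSize  = (n + blockSize - 1) / blockSize;  // round up
    scale<<<gridSize, blockSize>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    printf("launched %d blocks of %d threads\n", gridSize, blockSize);
    cudaFree(d_data);
    return 0;
}
```

Starting from a configuration like this, one can then vary the block size (128, 192, 256, ...) and measure, rather than guessing.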

In modern HPC, FLOPS are often “too cheap to meter” while memory bandwidth is a frequent constraint on code performance. Block and grid organization can interact with per-thread data access patterns in ways that are non-obvious at times, and that can improve or diminish usable memory bandwidth. The CUDA profiler is an essential tool for monitoring these effects. I would suggest using it early and often.
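To make the memory-bandwidth point concrete, here is a sketch (hypothetical kernels, not from the original post) contrasting a coalesced access pattern with a strided one; running both under a profiler such as Nsight Compute makes the difference in achieved memory throughput visible.

```cpp
// Adjacent threads read adjacent elements: each warp touches one
// contiguous memory segment (coalesced access).
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Adjacent threads read elements `stride` apart: each warp touches many
// memory segments, wasting bandwidth on data it does not use.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);   // scattered address
        out[i] = in[j];
    }
}
```

Both kernels move the same number of bytes per thread, yet the strided version typically sustains a fraction of the coalesced version's bandwidth, which is exactly the kind of effect the profiler surfaces.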
