Given that we fully use the threads, should we have as many blocks as possible, or should we let each thread do as much work as possible?

Hi everyone, I am new to CUDA programming, and my question is the one in the title.
For example, if we need to sum a 2^100-element vector, should we let each thread sum two numbers and have many blocks, or should we have just enough blocks to fill the GPU and let each thread sum many numbers? Which one is faster?

For a well-crafted sum reduction, a smaller number of threads, while still achieving full occupancy, will generally perform better. The rationale is given here. We want to do as much work as possible in the serial per-thread running-sum phase before we switch to the parallel sweep, and that serial work is maximized by reducing the total number of threads in the grid, as I suggest.

(You probably won’t be able to sum a 2^100-element vector in a single pass on a GPU, but the point is valid for smaller vector sizes.)
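Roughly, the kernel looks like this. This is just a sketch, not a tuned implementation: the names and the 256-thread block size are illustrative, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Stage 1: each thread builds a serial running sum over a grid-stride loop,
// then the block reduces those partial sums with a shared-memory sweep.
__global__ void reduce(const float *in, float *out, size_t n)
{
    __shared__ float sdata[256];            // assumes blockDim.x == 256
    unsigned tid = threadIdx.x;
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    float sum = 0.0f;
    for (; i < n; i += stride)              // serial per-thread running sum
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // parallel sweep within the block
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];         // one partial sum per block
}
```

With a grid sized to just fill the GPU, the grid-stride loop gets longer as n grows, which is exactly the serial-work maximization described above. A second, much smaller launch (or a host-side sum) then reduces the per-block partials in out.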

You generally don’t want to drop far below full occupancy, because doing so eventually hurts achievable memory bandwidth, and the algorithm as a whole is memory-bandwidth bound. So if you’re interested in absolute peak performance, it may be necessary to benchmark several cases. But for this kind of canonical parallel reduction, the desirable grid size will be at or close to maximum occupancy, and no more.
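If it helps, the usual way to pick that grid size is to query occupancy at runtime rather than hard-coding it. A sketch, assuming the reduce kernel above and device buffers d_in and d_partials already allocated:

```cuda
// Size the grid to "just fill" the GPU rather than scaling with n.
int device = 0, numSMs = 0, blocksPerSM = 0;
const int blockSize = 256;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, reduce,
                                              blockSize, 0);
int gridSize = numSMs * blocksPerSM;   // full occupancy, and no more
reduce<<<gridSize, blockSize>>>(d_in, d_partials, n);
```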


Thank you! As a follow-up question, would you say the two-stage reduction explained in https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf is the standard approach for reduction? Are there more involved tricks (multiple stages, for example)?

Yes, I would still say that is the “standard approach”, although the shared-memory sweep may be replaced with a warp-shuffle reduction for better performance in some cases. That is also covered in the link I gave. There are other optimizations that may or may not give a noticeable performance improvement, such as eliminating the second stage via atomics or a threadblock-draining strategy. The first optimization (atomics at the end of the parallel sweep) is covered here, and the second has a CUDA sample code (threadFenceReduction) that provides an example.
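For reference, here is a sketch combining the two ideas: a warp-shuffle sweep plus an atomic finish that eliminates the second kernel launch. It assumes blockDim.x is a multiple of 32 and that *out is zeroed before launch; the names are illustrative.

```cuda
#include <cuda_runtime.h>

// Reduce a value across the 32 lanes of a warp using shuffles.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceAtomic(const float *in, float *out, size_t n)
{
    // serial per-thread running sum over a grid-stride loop
    float sum = 0.0f;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        sum += in[i];

    __shared__ float warpSums[32];   // one slot per warp (<= 32 warps/block)
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    sum = warpReduceSum(sum);        // reduce within each warp
    if (lane == 0)
        warpSums[warp] = sum;
    __syncthreads();

    if (warp == 0) {                 // first warp reduces the per-warp sums
        sum = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        sum = warpReduceSum(sum);
        if (lane == 0)
            atomicAdd(out, sum);     // atomic replaces the second stage
    }
}
```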

Reductions can be used for a variety of purposes. One design pattern does not fit all use cases.
