Given that we fully use the threads, should we have as many blocks as possible, or should we let each thread do as much work as possible?

For a well-crafted sum reduction, a smaller total number of threads, while still achieving full occupancy, will generally perform better. The rationale is as follows: we want to do as much work as possible in the serial per-thread running-sum phase, before we switch to the parallel sweep. The amount of serial work per thread is maximized by reducing the total number of threads in the grid, as I suggest.
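To make the two phases concrete, here is a minimal sketch of that style of reduction (kernel and variable names like `sumReduce` and `acc` are my own, not from any particular codebase). The grid-stride loop is the serial phase: the smaller the grid, the more elements each thread accumulates there. The shared-memory sweep that follows is the parallel phase.

```cuda
#include <cuda_runtime.h>

// Grid-stride sum reduction sketch. Launch with a grid sized to roughly
// fill the GPU (e.g. SM count x resident blocks per SM), not one thread
// per element, and pass blockDim.x * sizeof(float) as shared memory.
__global__ void sumReduce(const float *in, float *out, size_t n)
{
    extern __shared__ float sdata[];
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    // Serial phase: each thread keeps a private running sum over a
    // grid-stride loop. Fewer total threads => more work done here.
    float acc = 0.0f;
    for (size_t i = idx; i < n; i += stride)
        acc += in[i];
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Parallel sweep phase: standard shared-memory tree reduction
    // (assumes blockDim.x is a power of two).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic per block combines the partial sums in a single pass.
    if (threadIdx.x == 0)
        atomicAdd(out, sdata[0]);
}
```

A typical launch might look like `sumReduce<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);` with `numBlocks` chosen to just reach full occupancy, per the reasoning above.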

(You probably won’t be able to sum a 2^100-element vector in a single pass on a GPU, but the point is valid for realistic vector sizes.)

You generally don’t want to drop far below full occupancy, because doing so will eventually reduce achievable memory bandwidth, and the algorithm as a whole is memory-bandwidth bound. So if you’re interested in absolute peak performance, it might be necessary to benchmark several cases. But for this kind of canonical parallel reduction, the desirable grid size will be at or close to maximum occupancy, and no more.
