I call this a grid-striding loop.
It allows a grid of a given size (i.e. a fixed total number of threads) to process a data set larger than the grid itself.
So I can launch, for example, 100,000 threads, but fully process a data set consisting of 1,000,000 elements (or larger). The grid of threads “strides” across the data set.
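The pattern can be sketched as follows. This is a minimal example kernel; the SAXPY-style operation and names are illustrative, not taken from the text above:

```cuda
// Grid-stride loop: each thread starts at its global index and then
// strides forward by the total number of threads in the grid.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
         i < n;                                         // n may exceed grid size
         i += blockDim.x * gridDim.x)                   // stride = total threads
    {
        y[i] = a * x[i] + y[i];
    }
}

// Launch with a grid size chosen independently of n, e.g.:
//   saxpy<<<256, 256>>>(1000000, 2.0f, d_x, d_y);
// 65,536 threads cooperatively cover all 1,000,000 elements.
```

Note that if the grid happens to be at least as large as `n`, each thread executes the loop body at most once, so the kernel degenerates gracefully to the familiar one-thread-per-element pattern.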
It has some advantages:
- The grid size (and grid size calculations) can be decoupled from the data set size.
- The total number of threads can be optimized for the machine architecture.
This second point is a “small” optimization. New CUDA programmers should not pay undue attention to it or draw extensive conclusions from it. However, the basic idea is that once enough parallelism (i.e. threads) has been exposed to fully utilize the machine, exposing additional parallelism (creating more threads) generally does not improve performance, and may actually decrease performance slightly (a few percent?).
Therefore, we launch fewer threads, while still saturating the machine, and give each thread more work to do. For full optimization, you would probably want to tune the grid size to the machine architecture rather than to the problem size, as is more typical. So on a Kepler, you might want to launch more threads in your grid than on a Fermi, for example. More specifically, you might want to read some of the device properties at runtime, and make a decision about grid size based on those properties (number of SMs, max threads per SM, max blocks per SM, etc.)
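A hedged sketch of that runtime sizing, using the CUDA runtime API. The heuristic here (enough blocks to reach the per-SM resident-thread limit on every SM) is one reasonable assumption for illustration, not a prescription:

```cuda
#include <cuda_runtime.h>

// Pick a grid size based on device properties rather than problem size.
// threadsPerBlock is assumed to be chosen already (e.g. 256).
int chooseGridSize(int threadsPerBlock)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // prop.multiProcessorCount:          number of SMs on this device
    // prop.maxThreadsPerMultiProcessor:  max resident threads per SM
    int blocksPerSM = prop.maxThreadsPerMultiProcessor / threadsPerBlock;

    // Enough blocks to fully occupy every SM, independent of data size.
    return prop.multiProcessorCount * blocksPerSM;
}
```

A grid-stride kernel launched with this grid size will then cover any `n`, since the loop, not the launch configuration, determines how many elements get processed.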
Note that at an introductory level, we do not make grid size decisions like this. In particular, we do not set the number of threads equal to the total number of cores, or follow any similar heuristic. This type of optimization can only be fully understood once you have a solid grasp of how latency hiding works on GPUs.