My kernel needs to load and process some data stored in global memory. I try to load the data as early as possible and access it as late as possible in the kernel. However, there is still some latency to cover. As far as I understand, I can add more work/threads that can be executed while a thread is stalled on a global memory access. Now I have several options to add more work: I can add more blocks, I can add more grids, or I can increase the size of both blocks and grids. Is there an advantage of one option over the other - or doesn't it matter what I do to add more work?
If you have syntactically correct code that gives the correct answer, the first objective of any CUDA programmer is to expose enough parallelism. (The second objective is to use memory efficiently.)
Exposing enough parallelism roughly translates into having enough threads in your kernel launch (i.e. in the grid).
The block size should be chosen so that blocks can flexibly fill the GPU. Block sizes of 128, 256, and 512 generally work well for this. Very often there is little performance difference among those three choices, so the exact choice may be non-critical. Some designs that involve collective behavior will benefit from the larger sizes. A block size of 1024 should be used sparingly, ideally only if you know your code will run on a GPU whose threads/SM hardware limit is either 1024 or 2048. cc8.6 Ampere GPUs, for example, have a threads per SM limit of 1536, so only one 1024-thread block fits per SM and a third of that SM's thread capacity goes unused.
The grid should be sized to at least fill the GPU. The total number of threads that can be in flight on a GPU (its thread-carrying capacity) is upper-bounded by the maximum thread-carrying capacity of an SM times the number of SMs. If your grid size (number of blocks launched times threads per block) meets or exceeds this number, you have satisfied this goal.
If you have met those two goals, you have done a good job of giving the GPU the best chance to hide latency (as far as grid sizing is concerned).
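To make the arithmetic behind those two goals concrete, here is a small host-side sketch. The helper names (`residentThreadsPerSM`, `blocksToFillGpu`, `printFillTable`) are made up for illustration, and the calculation only considers the threads-per-SM limit; registers, shared memory, and the blocks-per-SM limit can reduce the real numbers.

```cuda
#include <cstdio>

// Threads that can be resident on one SM for a given block size, counting only
// the threads-per-SM limit. Registers, shared memory, and the blocks-per-SM
// limit can lower the real figure.
constexpr int residentThreadsPerSM(int maxThreadsPerSM, int blockSize) {
    return (maxThreadsPerSM / blockSize) * blockSize;
}

// Smallest block count whose total thread count meets or exceeds the GPU's
// thread-carrying capacity (numSMs * maxThreadsPerSM).
constexpr int blocksToFillGpu(int numSMs, int maxThreadsPerSM, int blockSize) {
    return (numSMs * maxThreadsPerSM + blockSize - 1) / blockSize;
}

// With a 1536 threads/SM limit, only one 1024-thread block fits per SM.
static_assert(residentThreadsPerSM(1536, 1024) == 1024, "one block per SM");
static_assert(residentThreadsPerSM(1536, 512) == 1536, "three blocks per SM");
static_assert(residentThreadsPerSM(2048, 1024) == 2048, "two blocks per SM");

// Prints the fill numbers for the common block sizes (hypothetical helper).
void printFillTable(int numSMs, int maxThreadsPerSM) {
    const int sizes[] = {128, 256, 512, 1024};
    for (int b : sizes) {
        std::printf("block %4d: %4d resident threads/SM, %d blocks to fill\n",
                    b, residentThreadsPerSM(maxThreadsPerSM, b),
                    blocksToFillGpu(numSMs, maxThreadsPerSM, b));
    }
}
```

For example, calling `printFillTable(82, 1536)` for an assumed device with 82 SMs and a 1536 threads/SM limit shows that 128, 256, and 512 all reach 1536 resident threads per SM, while 1024 reaches only 1024.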
Additional tuning could include:
Maximizing occupancy through thread code (i.e. kernel) design, with attention to registers used per thread, shared memory used, and other limiters to occupancy.
Constructing the grid to exactly match a multiple of the GPU thread-carrying capacity (perhaps taking occupancy into account, which may reduce the capacity below the upper bound). For example, you might size your grid to exactly match this number and use a grid-stride loop paradigm. Alternatively, you might allow for multiple waves but reduce the tail effect by making the grid size an integer multiple (greater than 1) of the GPU thread-carrying capacity.
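A minimal sketch of the grid-stride approach, assuming a simple element-wise kernel. The names `scaleKernel`, `gridForWaves`, and `launchScale` are hypothetical, the block size of 256 is just one of the reasonable choices mentioned above, and the blocks-per-SM figure here is the upper bound from the threads/SM limit, not a measured occupancy.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread processes elements i, i + stride, i + 2*stride, ...
// so a fixed-size grid can cover any n.
__global__ void scaleKernel(float* data, size_t n, float factor) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Number of blocks for `waves` full waves, using the threads/SM limit as an
// upper bound on resident blocks per SM (actual occupancy may be lower).
inline int gridForWaves(int numSMs, int maxThreadsPerSM, int blockSize, int waves) {
    int blocksPerSM = maxThreadsPerSM / blockSize;
    return numSMs * blocksPerSM * waves;
}

// Queries the current device and launches scaleKernel with a grid that is an
// integer multiple of the device's thread-carrying capacity.
cudaError_t launchScale(float* d_data, size_t n, float factor, int waves = 1) {
    int device = 0, numSMs = 0, maxThreadsPerSM = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;
    err = cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    if (err != cudaSuccess) return err;
    err = cudaDeviceGetAttribute(&maxThreadsPerSM,
                                 cudaDevAttrMaxThreadsPerMultiProcessor, device);
    if (err != cudaSuccess) return err;

    const int blockSize = 256;  // assumed choice; 128 or 512 would also be reasonable
    int grid = gridForWaves(numSMs, maxThreadsPerSM, blockSize, waves);
    scaleKernel<<<grid, blockSize>>>(d_data, n, factor);
    return cudaGetLastError();
}
```

With `waves = 1` the grid exactly matches the upper-bound capacity and the loop handles the rest of the data. With `waves > 1` the grid remains an integer multiple of that capacity, which is the tail-effect variant.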
The deviceQuery sample code reports relevant data such as the number of SMs and the maximum threads per SM (and shows how to retrieve them programmatically), or you can review the various specifications/limits in the programming guide. CUDA also has an occupancy API to help with occupancy tuning at run time.
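As a rough sketch of how these queries might be combined, the following uses `cudaGetDeviceProperties` for the SM count and threads/SM limit, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for the number of blocks of a specific kernel that can be resident per SM. The kernel `exampleKernel` and the helper `occupancyAwareGrid` are placeholders for illustration, not part of any CUDA sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the occupancy result depends on the resources
// (registers, shared memory) of whatever kernel you pass to the API.
__global__ void exampleKernel(float* data, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] += 1.0f;
    }
}

// Computes a grid size that fills the current device for exampleKernel at the
// given block size, based on the occupancy API rather than the upper bound.
cudaError_t occupancyAwareGrid(int blockSize, int* gridSize) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, exampleKernel, blockSize, 0 /* dynamic shared memory bytes */);
    if (err != cudaSuccess) return err;

    std::printf("%s: %d SMs, %d max threads/SM, %d resident blocks/SM at block size %d\n",
                prop.name, prop.multiProcessorCount,
                prop.maxThreadsPerMultiProcessor, blocksPerSM, blockSize);

    *gridSize = blocksPerSM * prop.multiProcessorCount;
    return cudaSuccess;
}
```

The resulting grid size can then be used directly with a grid-stride kernel, or multiplied by an integer to launch several full waves.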