The theoretical maximum number of blocks which can be executed concurrently by the GPU is SMs * max blocks per SM, correct.
The actual maximum can be lower. It depends on the kernel and its launch configuration.
You can use cudaOccupancyMaxActiveBlocksPerMultiprocessor to find the maximum number of blocks per SM for a specific kernel.
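For example, a minimal sketch (with a placeholder kernel standing in for yours):

```cpp
#include <cstdio>

__global__ void myKernel() { /* placeholder for your actual kernel */ }

int main() {
    // max resident blocks per SM for this kernel at 32 threads/block,
    // with 0 bytes of dynamic shared memory
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, myKernel, 32, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("max concurrent blocks for this kernel: %d\n",
           blocksPerSm * prop.multiProcessorCount);
    return 0;
}
```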
(Whatever the limits may be, launching a kernel with more blocks is perfectly valid.)
You mean:
Assume I have 1096 tasks that can be parallelized and mapped to CUDA blocks, with each block requiring 32 threads. On the AGX Orin in this example, there are 16 SMs, each supporting up to 2048 threads, so the maximum number of blocks that can run in parallel is calculated as
16 SMs × 2048 / 32 = 1024 blocks.
Should I still create 1096 blocks, rather than splitting the workload into two kernels of 1024 and 72 blocks, or reusing threads from an already created thread configuration?
Yes. Try both, if you like. But without any other information, splitting a single kernel of 1096 blocks into two kernels based on the idea of “concurrency” is not a recommended programming practice, in my view.
The GPU does not become less efficient when it has more than enough blocks.
If you are extremely concerned about concurrency, then the typical approach would be to use a grid-stride loop to cover the necessary data set size with the available thread complement, whatever you may decide it to be. But such refactorings rarely result in more than a few percent performance difference, in my experience. Many CUDA codes work just fine, with little performance variation, whether they are launched one-element-per-thread with a grid size equal to the data set size, or with a grid-stride loop (let's say with the grid size matching the achievable occupancy or instantaneous thread-carrying capacity, which is the idea behind your use of "concurrent").
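For illustration, a minimal grid-stride transformation kernel might look like this (the kernel name 'scale' and the operation are placeholders, not your code):

```cpp
__global__ void scale(float* out, const float* in, float a, int n) {
    // grid-stride loop: the grid size can be chosen independently of n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = a * in[i];
}

// one-element-per-thread launch, grid sized to the data set:
//   scale<<<(n + 255) / 256, 256>>>(out, in, a, n);
// grid-stride launch, grid sized to e.g. the occupancy-derived block count:
//   scale<<<blocksPerSm * numSMs, 256>>>(out, in, a, n);
```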
A bigger concern with your particular numbers could be the tail effect, but that is not easy to "rectify" in all cases. There is no magic wand for cases where your data set size is slightly larger than the thread-carrying capacity or occupancy. A grid-stride loop may help somewhat in the area of block starting overhead, but doesn't address the work mismatch/wave effect.
Splitting into two kernels does not help with any of these concerns, and there are good reasons to think it may be worse. It essentially enforces no overlap of waves, unless you launch the two kernels into separate streams. Furthermore, you incur some vestige of launch overhead with each kernel launch. All other things being equal, this alone would be enough to discourage breaking one kernel into two.
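For reference, the two-kernel split in separate streams would look something like this sketch ('work' is a placeholder kernel, and error checking is omitted):

```cpp
// placeholder kernel: 32 threads cooperate on one task per block
__global__ void work(int taskOffset) { /* ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // the two pieces can overlap on the GPU because they are in different streams
    work<<<1024, 32, 0, s1>>>(0);     // tasks    0 .. 1023
    work<<<  72, 32, 0, s2>>>(1024);  // tasks 1024 .. 1095

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```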
This is truly an excellent and valuable insight. I understand that splitting into multiple kernels introduces additional overhead during launch. In fact, I also considered using multiple streams and CUDA graphs to hide this overhead, but from my experiments so far, the benefits of these changes are quite limited. They are far less effective than devising a new algorithm that increases computational intensity and better maps onto the grid.
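For reference, the stream-capture variant of that idea looks roughly like this sketch ('work' and the launch sizes are placeholders):

```cpp
__global__ void work(int taskOffset) { /* placeholder task kernel */ }

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // record the two launches into a graph instead of executing them immediately
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    work<<<1024, 32, 0, s>>>(0);
    work<<<  72, 32, 0, s>>>(1024);
    cudaStreamEndCapture(s, &graph);

    // CUDA 11-style signature; newer toolkits use (&graphExec, graph, 0)
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, s);  // replays both launches with one call
    cudaStreamSynchronize(s);
    return 0;
}
```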
However, I have not yet compared the approaches we mentioned for the same algorithm:
1. Directly generating enough blocks to cover all tasks
2. Splitting into multiple kernels and using CUDA streams and graphs to hide overhead
3. Reusing allocated threads with the grid-stride approach
Can I draw a preliminary conclusion that the first approach should be prioritized among these three methods?
I would probably go with a 1-3-2 priority, but personal preferences, intuition, etc. may be an equally good guide.
I frequently start with case 1 because it is often simplest to code, and acts (for me) as a baseline for future comparative experiments.
I might start with 3 in cases where I know that grid size is an important issue. For example, if I am targeting a cooperative grid design, then starting with item 3 is the right approach (in my view). A basic transformation kernel is really not any more difficult to write using a grid-stride paradigm, so various CUDA coders may choose to use it "always" or as a matter of course, as a replacement for 1. Nothing wrong with that.
TBH, I don’t understand the need to look at item 2 without more information. I don’t generally follow that path when investigating transformation-style kernel design. But do as you wish, of course. At some point or another in CUDA, I find that I learn by doing, not by asking questions. But feel free to ask questions. Everyone has a different style.
At most, 1024 tasks can run in parallel, which is fewer than the 1096 you need.
So you would use 2 so-called waves.
To distribute them evenly:
2 waves × 16 SMs = 32 blocks (one block per SM per wave)
1096 / 2 = 548 tasks per wave
548 / 16 = 34.25 tasks per SM per wave, rounded up to 35
35 × 32 = 1120 threads per block, with each block doing 35 tasks
(That the 32 × 35 = 1120 task slots equal the 1120 threads per block is just a coincidence: 16 SMs × 2 waves is 32, and the number of threads for each task is also 32.)
More waves
However, with 1120 threads per block, only one wave can be active at a time on an SM (2 × 1120 = 2240 > 2048 threads). That means that as the wave progresses and nears its end, it gets more and more inefficient, until it is completely finished and the next wave can start. So you should choose sizes such that at least two waves can overlap (at most 1024 threads per block).
[edit: The maximum number of threads is 1024 per block, so this configuration would not work anyway.]
Let’s do 3 waves.
1096 / 3 / 16 ≈ 22.8 tasks per SM per wave, rounded up to 23
23 × 32 = 736 threads per block for 23 tasks
Now two consecutive waves can overlap (2 × 736 = 1472 ≤ 2048 threads per SM).
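As a sketch, the 3-wave configuration could be launched like this (the per-task work is a placeholder; 48 × 23 = 1104 task slots cover the 1096 tasks, with the guard skipping the excess):

```cpp
// sketch: each block runs 23 tasks of 32 threads; excess task slots are skipped
__global__ void run_tasks(int totalTasks) {
    int tasksPerBlock = blockDim.x / 32;                        // 736 / 32 = 23
    int task = blockIdx.x * tasksPerBlock + threadIdx.x / 32;   // global task index
    int lane = threadIdx.x % 32;                                // thread within the task

    if (task < totalTasks) {
        // ... process task 'task' using threads lane 0..31 ...
        (void)lane;  // placeholder to keep the sketch warning-free
    }
}

// 3 waves * 16 SMs = 48 blocks of 736 threads; 48 * 23 = 1104 slots >= 1096 tasks:
//   run_tasks<<<48, 736>>>(1096);
```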
Grid-stride loops
Compared to a grid-stride loop, this scheme gives the guarantee (by design) that each block does the same amount of work. With grid-stride loops it can instead happen that
- some SMs get stuck with double the number of tasks of other SMs (if using blocks and threads), or
- occupancy is low (if using few threads).
Normally grid-stride loops are good. But perhaps not so much in your case with a small number of tasks.
32 threads per block is not a good idea. You generally cannot achieve full occupancy that way due to the blocks-per-SM limit.
If the processing time per task is roughly constant across tasks, I'm skeptical of any treatment that suggests more waves could be better. But I haven't done a careful analysis. My read of the tail-effect blog suggests trying to reduce the number of waves, not increase it.
Correct, you would need a minimum of 96 threads per block in order to have a possibility of achieving full occupancy (without any consideration of the actual code itself). So in that respect, 32 threads per block may not be a good choice.
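A quick way to query that minimum at runtime (it is max threads per SM divided by max blocks per SM):

```cpp
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // smallest block size that can still reach full occupancy,
    // ignoring register/shared-memory limits of the actual kernel
    int minThreads = prop.maxThreadsPerMultiProcessor / prop.maxBlocksPerMultiProcessor;
    printf("min threads per block for full occupancy: %d\n", minThreads);
    return 0;
}
```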