The theoretical maximum number of blocks which can be executed concurrently by the GPU is SMs * max blocks per SM, correct.
The actual maximum can be lower. It depends on the kernel and its launch configuration.
You can use cudaOccupancyMaxActiveBlocksPerMultiprocessor to find the maximum number of blocks per SM for a specific kernel.
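For example, a minimal sketch (with a placeholder kernel standing in for yours):

```cpp
#include <cstdio>

__global__ void myKernel() { /* placeholder for your actual kernel */ }

int main() {
    // max resident blocks per SM for this kernel at 32 threads/block,
    // with 0 bytes of dynamic shared memory
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, myKernel, 32, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("max concurrent blocks for this kernel: %d\n",
           blocksPerSm * prop.multiProcessorCount);
    return 0;
}
```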
(Whatever the limits may be, launching a kernel with more blocks is perfectly valid.)
You mean:
Assume I have 1096 tasks that can be parallelized and mapped to CUDA blocks, with each block requiring 32 threads. On the AGX Orin in this example, there are 16 SMs, each supporting up to 2048 threads, so the maximum number of blocks that can run in parallel is calculated as
16 SMs × 2048 / 32 = 1024 blocks.
Should I still create 1096 blocks, rather than splitting the workload into two kernels of 1024 and 72 blocks, or reusing threads from an already created thread configuration?
Yes. Try both, if you like. But without any other information, splitting a single kernel of 1096 blocks into two kernels based on the idea of “concurrency” is not a recommended programming practice, in my view.
The GPU does not become less efficient when it has more than enough blocks.
If you are extremely concerned about concurrency, then the typical approach would be to use a grid-stride loop to cover the necessary data set size with the available thread complement, whatever you may decide it to be. But such refactorings rarely result in more than a few percent performance difference, in my experience. Many CUDA codes work just fine, with little performance variation, whether they are launched one-element-per-thread with a grid size equal to the data set size, or with a grid-stride loop (let's say with the grid size matching the achievable occupancy or instantaneous thread-carrying capacity, which is the idea behind your use of "concurrent").
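For illustration, a minimal grid-stride transformation kernel might look like this (the kernel name 'scale' and the operation are placeholders, not your code):

```cpp
__global__ void scale(float* out, const float* in, float a, int n) {
    // grid-stride loop: the grid size can be chosen independently of n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = a * in[i];
}

// one-element-per-thread launch, grid sized to the data set:
//   scale<<<(n + 255) / 256, 256>>>(out, in, a, n);
// grid-stride launch, grid sized to e.g. the occupancy-derived block count:
//   scale<<<blocksPerSm * numSMs, 256>>>(out, in, a, n);
```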
A bigger concern with your particular numbers could be the tail effect, but that is not easy to "rectify" in all cases. There is no magic wand for cases where your data set size is slightly larger than the thread-carrying capacity or occupancy. A grid-stride loop may help somewhat in the area of block starting overhead, but doesn't address the work mismatch/wave effect.
Splitting into two kernels does not help with any of these concerns, and there are good reasons to think it may be worse. It essentially enforces no overlap of waves, unless you launch the two kernels into separate streams. Furthermore, you incur some vestige of launch overhead with each kernel launch. All other things being equal, this alone would be enough to discourage breaking one kernel into two.
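For reference, the two-kernel split in separate streams would look something like this sketch ('work' is a placeholder kernel, and error checking is omitted):

```cpp
// placeholder kernel: 32 threads cooperate on one task per block
__global__ void work(int taskOffset) { /* ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // the two pieces can overlap on the GPU because they are in different streams
    work<<<1024, 32, 0, s1>>>(0);     // tasks    0 .. 1023
    work<<<  72, 32, 0, s2>>>(1024);  // tasks 1024 .. 1095

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```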
This is truly an excellent and valuable insight. I understand that splitting into multiple kernels introduces additional overhead during launch. In fact, I also considered using multiple streams and CUDA graphs to hide this overhead, but from my experiments so far, the benefits of these changes are quite limited. They are far less effective than devising a new algorithm that increases computational intensity and better maps onto the grid.
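For reference, the stream-capture variant of that idea looks roughly like this sketch ('work' and the launch sizes are placeholders):

```cpp
__global__ void work(int taskOffset) { /* placeholder task kernel */ }

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // record the two launches into a graph instead of executing them immediately
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    work<<<1024, 32, 0, s>>>(0);
    work<<<  72, 32, 0, s>>>(1024);
    cudaStreamEndCapture(s, &graph);

    // CUDA 11-style signature; newer toolkits use (&graphExec, graph, 0)
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, s);  // replays both launches with one call
    cudaStreamSynchronize(s);
    return 0;
}
```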
However, I have not yet compared the approaches we mentioned for the same algorithm:
1. Directly generating enough blocks to cover all tasks
2. Splitting into multiple kernels and using CUDA streams and graphs to hide overhead
3. Reusing allocated threads with the grid-stride approach
Can I draw a preliminary conclusion that the first approach should be prioritized among these three methods?
I would probably go with a 1-3-2 priority, but personal preferences, intuition, etc. may be an equally good guide.
I frequently start with case 1 because it is often simplest to code, and acts (for me) as a baseline for future comparative experiments.
I might start with 3 in cases where I know that grid size is an important issue. For example, if I am targeting a cooperative grid design, then starting with item 3 is the right approach (in my view). A basic transformation kernel is really not any more difficult to write using a grid-stride paradigm, so various CUDA coders may choose to use it "always" or as a matter of course, as a replacement for 1. Nothing wrong with that.
TBH, I don’t understand the need to look at item 2 without more information. I don’t generally follow that path when investigating transformation-style kernel design. But do as you wish, of course. At some point or another in CUDA, I find that I learn by doing, not by asking questions. But feel free to ask questions. Everyone has a different style.
At most, 1024 tasks can run in parallel, which is fewer than the 1096 you need.
So you would use 2 so-called waves.
To distribute them evenly:
2 waves × 16 SMs = 32 blocks (one block per SM per wave)
1096 / 2 = 548 tasks per wave
548 / 16 = 34.25 tasks per SM per wave, rounded up to 35
35 × 32 = 1120 threads per block, with each block doing 35 tasks
(That the 32 × 35 = 1120 task slots equal the 1120 threads per block is just a coincidence: 16 SMs × 2 waves is 32, and the number of threads for each task is also 32.)
More waves
However, with 1120 threads per block, only one wave can be active at a time on an SM (2 × 1120 = 2240 > 2048 threads). That means that as the wave progresses and nears its end, it gets more and more inefficient, until it is completely finished and the next wave can start. So you should choose sizes such that at least two waves can overlap (at most 1024 threads per block).
[edit: The maximum number of threads is 1024 per block, so this configuration would not work anyway.]
Let’s do 3 waves.
1096 / 3 / 16 ≈ 22.8 tasks per SM per wave, rounded up to 23
23 × 32 = 736 threads per block for 23 tasks
Now two consecutive waves can overlap (2 × 736 = 1472 ≤ 2048 threads per SM).
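As a sketch, the 3-wave configuration could be launched like this (the per-task work is a placeholder; 48 × 23 = 1104 task slots cover the 1096 tasks, with the guard skipping the excess):

```cpp
// sketch: each block runs 23 tasks of 32 threads; excess task slots are skipped
__global__ void run_tasks(int totalTasks) {
    int tasksPerBlock = blockDim.x / 32;                        // 736 / 32 = 23
    int task = blockIdx.x * tasksPerBlock + threadIdx.x / 32;   // global task index
    int lane = threadIdx.x % 32;                                // thread within the task

    if (task < totalTasks) {
        // ... process task 'task' using threads lane 0..31 ...
        (void)lane;  // placeholder to keep the sketch warning-free
    }
}

// 3 waves * 16 SMs = 48 blocks of 736 threads; 48 * 23 = 1104 slots >= 1096 tasks:
//   run_tasks<<<48, 736>>>(1096);
```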
Grid-stride loops
Compared to a grid-stride loop, this scheme gives the guarantee (by design) that each block does the same amount of work. With grid-stride loops it can instead happen that
- some SMs get stuck with double the number of tasks of other SMs (if using blocks and threads), or
- occupancy is low (if using few threads).
Normally grid-stride loops are good. But perhaps not so much in your case with a small number of tasks.
32 threads per block is not a good idea. You generally cannot achieve full occupancy that way due to the blocks-per-SM limit.
If the processing time per task is roughly constant across tasks, I'm skeptical of any treatment that suggests more waves could be better. But I haven't done a careful analysis. My read of the tail-effect blog suggests trying to reduce the number of waves, not increase it.
Correct, you would need a minimum of 96 threads per block in order to have a possibility of achieving full occupancy (without any consideration of the actual code itself). So in that respect, 32 threads per block may not be a good choice.
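A quick way to query that minimum at runtime (it is max threads per SM divided by max blocks per SM):

```cpp
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // smallest block size that can still reach full occupancy,
    // ignoring register/shared-memory limits of the actual kernel
    int minThreads = prop.maxThreadsPerMultiProcessor / prop.maxBlocksPerMultiProcessor;
    printf("min threads per block for full occupancy: %d\n", minThreads);
    return 0;
}
```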