Grid size limit of concurrent kernels

Hi,

I have a question regarding the launch configuration when running kernels concurrently (CUDA sample concurrentKernels).

When using a grid size of 1 (1x1x1) and a thread block size of 1 (1x1x1), all 8 kernels run concurrently in 8 streams.
However, when I increase the grid size, e.g. to 100, the kernels no longer run concurrently. Executing all 8 kernels now takes 405 µs, whereas with a grid size of 1 it takes only 126 µs.

My question is: What are the limits of grid size (and thread block size) at which kernels can still be executed concurrently?

Table 21 of the CUDA Programming Guide states a “maximum number of resident grids per device (Concurrent Kernel Execution)” of 128. Does this have something to do with my question? If so, how?

Thanks!

This is possibly a complicated question to answer. It is strongly related to the idea of occupancy, which is widely discussed on forums and even in CUDA documentation.

Briefly, occupancy considers how much of a GPU's resources you are using, and this is the key to answering your question. A GPU has various resource limits: the number of SMs, which implies a maximum number of blocks that can be simultaneously resident or “concurrent”, as well as registers, shared memory, and other limits.

To determine occupancy statically or theoretically (as opposed to measuring it) you must assess all resources used by a kernel, and compare those to relevant limits of the GPU you are running on.
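As a rough illustration of that static assessment, here is a minimal host-side C++ sketch that estimates how many blocks of one kernel could be resident on a single SM. The struct names, the function `blocksPerSm`, and all numeric limits used with it are hypothetical placeholders, not values for any particular GPU. The model is also deliberately simplified: it ignores warp, register, and shared memory allocation granularity, so the real Occupancy Calculator may report a lower number.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits. Real values depend on the compute capability
// and should be taken from the programming guide or queried from the device.
struct SmLimits {
    int max_blocks;        // maximum resident blocks per SM
    int max_threads;       // maximum resident threads per SM
    int registers;         // 32-bit registers available per SM
    int shared_mem_bytes;  // shared memory available per SM
};

// Hypothetical per-block resource usage of one kernel.
struct KernelUsage {
    int threads_per_block;
    int regs_per_thread;
    int shared_mem_per_block;  // bytes; 0 if the kernel uses none
};

// Simplified estimate: the most restrictive resource decides how many
// blocks fit on one SM. Allocation granularity is ignored (assumption).
int blocksPerSm(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threads_per_block > 0 && k.regs_per_thread > 0);
    int limit = sm.max_blocks;
    limit = std::min(limit, sm.max_threads / k.threads_per_block);
    limit = std::min(limit, sm.registers / (k.regs_per_thread * k.threads_per_block));
    if (k.shared_mem_per_block > 0) {
        limit = std::min(limit, sm.shared_mem_bytes / k.shared_mem_per_block);
    }
    return limit;
}
```

For example, with placeholder limits `SmLimits{16, 1536, 65536, 102400}`, a kernel described by `KernelUsage{256, 32, 0}` would be limited by threads per SM, while a kernel with a large shared memory footprint would be limited by shared memory instead.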

For a single kernel, we can do this using several methods including the previously linked discussion of the Occupancy Calculator API.

I don’t know of a tool to do it for multiple kernels, but to a first order approximation you would simply aggregate or sum up the requirements of the kernels, in each category, and again compare them to relevant limits. If you stay within all relevant limits, then theoretically that group of kernels could run concurrently.
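The aggregation idea can be sketched in host-side C++ as follows. This is only a first-order check under stated assumptions: the names `KernelLaunch`, `DeviceBudget`, and `fitsConcurrently` are hypothetical, the budget numbers are placeholders, and only blocks and threads are summed here (registers and shared memory would need the same treatment). Passing the check means the kernels could theoretically be fully resident together, not that the GPU will actually run them concurrently.

```cpp
#include <cassert>
#include <vector>

// One kernel launch, described only by its launch configuration.
struct KernelLaunch {
    long long blocks;       // grid size in blocks
    int threads_per_block;  // block size in threads
};

// Hypothetical device-wide budget, e.g. number of SMs multiplied by the
// per-SM limits. The numbers must come from the GPU actually in use.
struct DeviceBudget {
    long long block_slots;   // total resident blocks across all SMs
    long long thread_slots;  // total resident threads across all SMs
};

// Sum the requirements of all launches and compare against the budget.
bool fitsConcurrently(const DeviceBudget& dev,
                      const std::vector<KernelLaunch>& launches) {
    long long blocks = 0;
    long long threads = 0;
    for (const KernelLaunch& k : launches) {
        assert(k.blocks > 0 && k.threads_per_block > 0);
        blocks += k.blocks;
        threads += k.blocks * k.threads_per_block;
    }
    return blocks <= dev.block_slots && threads <= dev.thread_slots;
}
```

With a placeholder budget of 256 block slots, eight launches of 1 block each fit easily, while eight launches of 100 blocks each (800 blocks in total) do not, which is the kind of situation in which full concurrency would not be expected.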

AFAIK CUDA makes no guarantees that two kernels will run concurrently, even if they satisfy relevant limits and various requirements. Therefore, AFAIK, a code design that depends on or requires kernel concurrency for correct behavior is by definition broken. So that is something I would keep in mind before building a large use case around kernel concurrency.

AFAIK, the primary objective of kernel concurrency is to allow, in some situations, increased utilization of a GPU (and therefore higher work efficiency) when the available work is limited (i.e. kernels that are individually “too small” to fill the GPU or to achieve the highest parallel efficiency).

In your example, when you say you increased the grid to 100, I assume this means the number of blocks. Every GPU has a limit on the number of blocks that can be simultaneously resident on its SMs, i.e. executing “concurrently”. There is a theoretical hardware limit, and possibly a lower limit that follows from the kernel code design and its occupancy. I would presume that in your test, considering all the kernel launches in aggregate, you may have exceeded this limit (which also depends on the GPU you are running on) and therefore may not witness full concurrency, or any concurrency at all.

Yes, there is a hardware limit on the number of concurrent kernels (“grids”) that can run. But I think this number is 16 or higher (it may vary by GPU), so it is not likely to explain why you can run only 1 or 2 kernels concurrently.

Thanks for your reply!

Some additional info to make things clearer:
My use case is processing frames of a video stream. I don’t have just 1 video stream, though; I have 4 video streams at the same time. So my CUDA implementation has to process 4 frames “at the same time”, or at least within a certain time constraint. My design does not depend on running kernels concurrently; my idea was to increase performance by running kernels concurrently.

While analyzing the performance of my kernels with NVIDIA Nsight Systems, I saw the value “Theoretical Occupancy” when hovering over a kernel. Can I view this value as the actual utilization of the kernel?

Do you also have other tips on how to optimize such use cases? The platform I am using is the Jetson AGX Orin Industrial (with CUDA 11.4, compute capability 8.7).

Thanks!

This isn’t exactly the same as our previous discussion. A kernel can have a high occupancy and also be very large (e.g. lots of blocks in the grid), preventing concurrency with other kernels, for the reasons already discussed.

Theoretical occupancy is a measure of whether the kernel design could fill the GPU. To get a better handle on “utilization” for a particular kernel, I personally would switch from Nsight Systems to Nsight Compute.

I won’t be able to provide much advice with the information so far. It seems to me that a plausible design could be either one kernel per video stream/frame, or one per 4 frames. Without more info, I wouldn’t have much reason to choose one over the other. You might want to try both approaches.

I don’t think there is anything particularly special about this video frame processing case, so typical analysis/optimization approaches are probably in order.

Hello.

Thank you for your detailed explanation thus far.

According to this table (Table 21 Technical Specifications per Compute Capability), the hardware limit of concurrent kernels for Compute Capability 8.0 is 128, correct? This implies that, if feasible, the GPU can support the concurrent execution of up to 128 kernels.

Is there a theoretical method, such as using the Occupancy Calculator or another tool, to determine the upper bound of concurrent kernels that can run on a GPU based on the grid size and thread block size of a kernel? What other parameters should be considered?

Thank you for your attention.

Yes,

The upper bound is 128 (for that GPU arch). We’ve already seen that if a kernel launch has enough blocks, it may prevent other kernels from running, but this is taking into account scheduler behavior. Two kernels are running concurrently if even one block of each kernel is running on the GPU. This thread is really asking about having the entire kernel (i.e. all its blocks) running concurrently.

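If your question is about how many “small enough” kernels could/should run “completely” concurrently, then I’ve already described the methodology I would use earlier in this thread: sum up the resource requirements of the kernels in each category and compare the totals to the relevant limits.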

If you’re unfamiliar with how to assess occupancy of a single kernel, then you should start there, in my opinion, before tackling this multi-kernel-concurrency case.
