Cuda kernel blocking launch

I was having weird behavior in my code and realized I was witnessing synchronous kernel launches.
I am launching 30 kernels but after a couple of launches, the launch becomes blocking until one of the kernels finishes. If I understand correctly I am probably hitting some resource limitation but I’m having a hard time figuring out which one. I tried to reduce the number of register using maxrregcount but not luck. I have also tried to use nsight compute for information but I don’t know where to look.

for (unsigned int i(0); i < cuda_size; i++) {
    kernels[i]<<<1, 1, 0, compute_streams[i]>>>(cudaActors[i], devStatus, i, cuda_size);

I was wondering what was the list of possible resources that I could be missing.
shared memory ? Heap ? source code?
How could I know which one?

There is a launch queue for kernels (really, for all asynchronous work issued to the GPU from host code) and it behaves the way you describe when it gets full, however I’ve never seen it be “full” after just 30 kernel launches and nothing else.

It’s impossible to be definitive without a complete example to inspect.

If you hitting the queue depth limit, there isn’t much you can do about that. It’s not adjustable, and there are no specifics published about it. With a bit of searching you can find questions on various forums with examples of code hitting the queue depth limit.

I have seen these posts indeed but people seem to have this issue with 1000s of kernels also if my kernels are simple it works. So I was thinking that it was another kind of resource but I couldn’t read anything that should result in a synchronous behavior. Are they other resources besides this queue that might make the call blocking ?

The last time we talked about kernel launch queue length it was determined to be on the order of 1K launches. However, I suspect the depth of the queue is specified in bytes and more complicate launches may involve more information being passed per launched kernel so number of kernels to queue-able may differ. The observed number of 30 kernels doesn’t lime up with anything in my mind.

What platform is being used here? Windows 10/11 with the default WDDM driver? That can give rise to all kind of kernel launch artifacts, but not of the kind described here (in my experience).

Is there any chance that some of these kernels run for a very long time, possibly inadvertently? If the queue is a true FIFO, new queue space might not become available until the oldest kernel in progress completes. That still wouldn’t explain the number 30, though.

Without some self-contained repro code, I don’t think we’ll gain any more insight here.

Yes, there is variability in the queue capacity according to some parameters that I don’t know. Here is a demonstration that the queue capacity can vary greatly (in terms o number of “items”) depending on the mix of “items” that you put into it, demonstrating a number as low as ~100. Maybe there are other ways to witness 30. Maybe as njuffa suggestions, it can be related to bytes, but I don’t know to convert a kernel launch into bytes, precisely, although maybe a first order approximation is the aggregate size of the parameters/arguments. (and it seems reasonable to me to assume that those need to be captured at the launch point and stored somewhere, so if the queue is sized based on that, then “bytes” may be very indicative.)

I’m not aware of any, and have never witnessed any. I don’t at all understand how a detail like local memory usage (for example) could cause a need to switch from async to sync or affect queue depth, but I certainly don’t know everything.