I was seeing weird behavior in my code and realized that my kernel launches were effectively synchronous.
I am launching 30 kernels, but after a couple of launches the launch call becomes blocking until one of the kernels finishes. If I understand correctly, I am probably hitting some resource limitation, but I'm having a hard time figuring out which one. I tried to reduce the number of registers using maxrregcount, but no luck. I have also tried to use Nsight Compute for information, but I don't know where to look.
for (unsigned int i(0); i < cuda_size; i++) {
    kernels[i]<<<1, 1, 0, compute_streams[i]>>>(cudaActors[i], devStatus, i, cuda_size);
    printf("Done\n");
}
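In case it helps, here is a stripped-down sketch of how I am timing the launches (dummy is just a spin-loop stand-in for my real kernels, which take more arguments; only the host-side timing around the launch call matters). The printed launch time jumps by several orders of magnitude once the launches start blocking:

// Stripped-down sketch of how I time each launch call (dummy is a stand-in
// for my real kernels; only the host-side timing matters here).
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummy(int i)
{
    // keep the kernel busy for a while so several of them are in flight at once
    long long t0 = clock64();
    while (clock64() - t0 < 1000000000LL) { }
}

int main()
{
    const unsigned int cuda_size = 30;
    cudaStream_t compute_streams[cuda_size];
    for (unsigned int i = 0; i < cuda_size; i++)
        cudaStreamCreate(&compute_streams[i]);

    for (unsigned int i = 0; i < cuda_size; i++) {
        auto h0 = std::chrono::high_resolution_clock::now();
        dummy<<<1, 1, 0, compute_streams[i]>>>(i);
        auto h1 = std::chrono::high_resolution_clock::now();
        printf("launch %2u took %8.1f us\n", i,
               std::chrono::duration<double, std::micro>(h1 - h0).count());
    }
    cudaDeviceSynchronize();

    for (unsigned int i = 0; i < cuda_size; i++)
        cudaStreamDestroy(compute_streams[i]);
    return 0;
}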
I was wondering what the list of possible resources is that I could be running out of.
Shared memory? Heap? Something in the source code?
How can I tell which one?
Thanks
There is a launch queue for kernels (really, for all asynchronous work issued to the GPU from host code), and it behaves the way you describe when it gets full. However, I've never seen it be "full" after just 30 kernel launches and nothing else.
It’s impossible to be definitive without a complete example to inspect.
If you are hitting the queue depth limit, there isn't much you can do about that. It's not adjustable, and there are no specifics published about it. With a bit of searching you can find questions on various forums with examples of code hitting the queue depth limit.
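As an aside, since the queue holds all asynchronous work and not just kernels, you can also witness the host-side blocking with something other than a kernel launch. Here is a sketch of what I mean (not authoritative; the spin length, iteration cap, and 1 ms threshold are arbitrary): one long spin kernel is put in flight, then small cudaMemcpyAsync calls are queued behind it. If the queue behaves as described above, the copy calls themselves should eventually start blocking on the host:

// Sketch only: one long spin kernel in flight, then many small async copies
// queued behind it in the same (default) stream. If the queue behaves as
// described above, the copy calls should eventually block on the host.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void spin(long long cycles)
{
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
}

int main()
{
    int *d_buf, *h_buf;
    cudaMalloc(&d_buf, sizeof(int));
    cudaMallocHost(&h_buf, sizeof(int));      // pinned, so the copy is truly async

    spin<<<1, 1>>>(10000000000LL);            // several seconds of device work

    for (int i = 0; i < 200000; i++) {        // raise the cap if it never blocks on your setup
        auto t0 = std::chrono::high_resolution_clock::now();
        cudaMemcpyAsync(d_buf, h_buf, sizeof(int), cudaMemcpyHostToDevice);
        auto t1 = std::chrono::high_resolution_clock::now();
        if (std::chrono::duration<double, std::milli>(t1 - t0).count() > 1.0) {
            printf("async copy call started blocking at iteration %d\n", i);
            break;
        }
    }
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}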
I have seen these posts indeed, but people seem to have this issue with 1000s of kernels. Also, if my kernels are simple it works, so I was thinking it was another kind of resource, but I couldn't find anything documented that should result in synchronous behavior. Are there other resources besides this queue that might make the call blocking?
Thanks
The last time we talked about kernel launch queue length, it was determined to be on the order of 1K launches. However, I suspect the depth of the queue is specified in bytes, and more complicated launches may involve more information being passed per launched kernel, so the number of queueable kernels may differ. The observed number of 30 kernels doesn't line up with anything in my mind.
What platform is being used here? Windows 10/11 with the default WDDM driver? That can give rise to all kinds of kernel launch artifacts, but not of the kind described here (in my experience).
Is there any chance that some of these kernels run for a very long time, possibly inadvertently? If the queue is a true FIFO, new queue space might not become available until the oldest kernel in progress completes. That still wouldn’t explain the number 30, though.
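If the runtime of individual kernels is in question, a rough way to check would be to bracket each kernel with events in its stream and read back the elapsed times afterwards. A sketch (dummy_kernel and the stream count are placeholders for the real code):

// Sketch: bracket each kernel with events in its own stream, then print how
// long each one actually ran on the device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(int) { /* placeholder for the real work */ }

int main()
{
    const int n = 30;
    cudaStream_t streams[n];
    cudaEvent_t start[n], stop[n];

    for (int i = 0; i < n; i++) {
        cudaStreamCreate(&streams[i]);
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
    }

    for (int i = 0; i < n; i++) {
        cudaEventRecord(start[i], streams[i]);
        dummy_kernel<<<1, 1, 0, streams[i]>>>(i);
        cudaEventRecord(stop[i], streams[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < n; i++) {
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start[i], stop[i]);
        printf("kernel %2d ran for %10.3f ms\n", i, ms);   // any unexpectedly long runtime shows up here
    }
    return 0;
}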
Without some self-contained repro code, I don’t think we’ll gain any more insight here.
Yes, there is variability in the queue capacity according to some parameters that I don't know. Here is a demonstration that the queue capacity can vary greatly (in terms of number of "items") depending on the mix of "items" that you put into it, demonstrating a number as low as ~100. Maybe there are other ways to witness 30. Maybe, as njuffa suggests, it is related to bytes, but I don't know how to convert a kernel launch into bytes precisely, although maybe a first-order approximation is the aggregate size of the parameters/arguments. (It seems reasonable to me to assume that those need to be captured at the launch point and stored somewhere, so if the queue is sized based on that, then "bytes" may be very indicative.)
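To make the parameter-size idea concrete, here is a sketch of the kind of experiment I have in mind (the payload size, spin length, and 1 ms threshold are arbitrary choices of mine; only the relative counts matter): the same spin kernel launched once with a tiny argument and once with a ~4 KB argument, counting how many launches are accepted before the launch call starts blocking.

// Sketch: count how many launches the queue accepts before the launch call
// itself starts blocking, for a small argument list vs. a ~4 KB argument.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

struct BigArg { char payload[4000]; };      // near the classic 4 KB kernel-parameter limit

__device__ void spin(long long cycles)
{
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
}

__global__ void small_kernel(int)  { spin(20000000LL); }   // roughly 10-20 ms of work each
__global__ void big_kernel(BigArg) { spin(20000000LL); }

template <typename Launch>
int count_until_blocking(Launch launch)
{
    for (int i = 0; i < 8192; i++) {
        auto t0 = std::chrono::high_resolution_clock::now();
        launch(i);
        auto t1 = std::chrono::high_resolution_clock::now();
        if (std::chrono::duration<double, std::milli>(t1 - t0).count() > 1.0)
            return i;                       // this launch had to wait for queue space
    }
    return -1;                              // never blocked within the cap
}

int main()
{
    small_kernel<<<1, 1>>>(0);
    cudaDeviceSynchronize();                // warm-up so the first timed launch isn't skewed

    BigArg a = {};
    int n_small = count_until_blocking([](int i){ small_kernel<<<1, 1>>>(i); });
    cudaDeviceSynchronize();                // drain the queued kernels (takes a while) before the second run
    int n_big   = count_until_blocking([&](int){ big_kernel<<<1, 1>>>(a); });
    cudaDeviceSynchronize();

    printf("blocked after ~%d launches with small args, ~%d with ~4 KB args\n",
           n_small, n_big);
    return 0;
}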
I’m not aware of any, and have never witnessed any. I don’t at all understand how a detail like local memory usage (for example) could cause a need to switch from async to sync or affect queue depth, but I certainly don’t know everything.