Maximum number of operations in a stream

Hello,

I am currently working on an HPC application in which I submit many operations (my own kernels, cublas/cusparse/cusolver routines) into a single stream. I ran into a problem where submission becomes synchronous after a while: kernel launches are supposed to be asynchronous, but I observed that launching a kernel started taking “a lot of time”.

I boiled the problem down to a simple program (see below), where I found that after I submit about 1022 kernels into a stream (no difference between a null and a non-null stream), subsequent kernel launches start taking a long time. My guess is that there is a limit to how many operations can be queued in a stream, and once that limit is exceeded, the launch has to wait for the oldest operation to finish before a new one can be enqueued.

I have not been able to find anything describing this or similar behavior in the documentation or anywhere else on the internet (just an unanswered question here). Table 15 in the programming guide did not help; the closest entry is “Maximum number of resident grids per device (Concurrent Kernel Execution)”, but that is something different, and the numbers don’t match up anyway.

Could anyone please explain/confirm this behavior, or direct me to where I could find some info about this?

Thanks,

Jakub


The program I am using: source.cu (1.7 KB)

Compiling it with nvcc -g -O2 -Xcompiler -fopenmp source.cu -o program.x
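
For reference, this is roughly what the program does (a simplified sketch; the kernel body, names, and iteration counts below are placeholders I am writing out here, the attached source.cu differs in details):

#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Dummy kernel that spins for roughly a fixed number of clock cycles,
// so kernels are enqueued much faster than they can finish.
__global__ void long_kernel(long long min_cycles)
{
    long long start = clock64();
    while (clock64() - start < min_cycles) { }
}

int main()
{
    const long long cycles = 200000000;  // chosen so one kernel runs for roughly 170 ms
    const int num_launches = 1050;

    // First, time a few isolated kernels (launch + synchronize).
    for (int i = 0; i < 5; i++) {
        double t0 = omp_get_wtime();
        long_kernel<<<1, 1>>>(cycles);
        cudaDeviceSynchronize();
        printf("Single kernel takes %f ms to execute\n", (omp_get_wtime() - t0) * 1000.0);
    }

    // Then, time only the submission of each kernel, without synchronizing.
    for (int i = 0; i < num_launches; i++) {
        double t0 = omp_get_wtime();
        long_kernel<<<1, 1>>>(cycles);
        printf("run_long_kernel %d submitted: %f ms\n", i, (omp_get_wtime() - t0) * 1000.0);
    }

    double t0 = omp_get_wtime();
    cudaDeviceSynchronize();
    printf("synchronization took %f ms\n", (omp_get_wtime() - t0) * 1000.0);
    return 0;
}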

The output I observe:

Single kernel takes 171.630663 ms to execute
Single kernel takes 169.587548 ms to execute
Single kernel takes 169.583722 ms to execute
Single kernel takes 169.586577 ms to execute
Single kernel takes 169.582600 ms to execute
run_long_kernel 0 submitted: 0.003076 ms
run_long_kernel 1 submitted: 0.003226 ms
run_long_kernel 2 submitted: 0.002345 ms
run_long_kernel 3 submitted: 0.002465 ms
run_long_kernel 4 submitted: 0.002595 ms
...
run_long_kernel 1019 submitted: 0.002094 ms
run_long_kernel 1020 submitted: 0.002104 ms
run_long_kernel 1021 submitted: 0.002094 ms
run_long_kernel 1022 submitted: 160.850246 ms
run_long_kernel 1023 submitted: 169.563043 ms
run_long_kernel 1024 submitted: 169.564625 ms
...
run_long_kernel 1048 submitted: 169.561680 ms
run_long_kernel 1049 submitted: 169.562552 ms
synchronization took 173304.017861 ms

This was run on an A100-SXM4-40GB (cc8.0), CentOS 7, CUDA 12.0, nvcc V12.0.76. On my laptop (GTX 1650 Ti (cc7.5), Ubuntu 18.04 under WSL on Windows 11, CUDA 11.4, nvcc V11.5.119), it actually takes 4946 kernels before the problem occurs.

I also tested it with two streams, and the limit appears to be per stream. I have also tried submitting cudaMemcpyAsync calls along with the kernels, and it seems that what matters is the total number of operations in a stream, not the number of kernels and memory copies counted separately.

Yes, the internal launch queue has a limited size, and there is no official documentation about it. It is an implementation detail.

This issue has come up multiple times in these forums in the past, as well as on Stack Overflow. A question (with answer) on the latter from early in the life of CUDA is here, for example:

A more recent question (with answer) along the same lines is here:

A recent relevant question (with answers) in these forums is here:

Thanks a lot @njuffa for the links. I was searching for different keywords, so I was not able to find them.

The common denominator I see in these discussions is that this behavior is undocumented. I think it should at least be documented that something like this can happen. Many applications depend on submitting work for asynchronous execution and then doing other work on the CPU while the GPU kernels are running.

BTW, I found a workaround to avoid the “synchronous” submission after ~1000 kernels. Every 1000 kernels, record an event in the current stream, create a new stream, make the new stream wait for that event, and continue submitting into the new stream instead of the old one (see source2.cu (2.4 KB) and the sketch below). However, since none of this is documented, I cannot rely on 1000 being the right number; in fact, I cannot rely on any number. Some documentation about this would be really nice.
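
A rough sketch of the idea (simplified from source2.cu, using the long_kernel from the sketch above; the batch size of 1000 and the helper name are my own choices, nothing official):

// Submit total_kernels launches of long_kernel, rotating to a fresh stream
// every KERNELS_PER_STREAM launches so the submitting thread never blocks.
void submit_with_stream_rotation(int total_kernels, long long cycles)
{
    const int KERNELS_PER_STREAM = 1000;  // assumed safe batch size, not documented anywhere

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < total_kernels; i++) {
        if (i > 0 && i % KERNELS_PER_STREAM == 0) {
            // Record an event in the old stream, make a new stream wait on it,
            // then continue submitting into the new (empty) stream.
            cudaEvent_t ev;
            cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
            cudaEventRecord(ev, stream);

            cudaStream_t next;
            cudaStreamCreate(&next);
            cudaStreamWaitEvent(next, ev, 0);

            cudaEventDestroy(ev);       // the wait is already enqueued, so this is fine
            cudaStreamDestroy(stream);  // pending work in the old stream still runs to completion
            stream = next;
        }
        long_kernel<<<1, 1, 0, stream>>>(cycles);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}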

Since CUDA obviously knows when the internal queue is full, it should be possible to implement this exact procedure internally, removing the blocking behavior altogether and making workarounds unnecessary. That would also remove the need to document the internal queue, since there would no longer be any reason to.

I will try submitting a bug and see what they can do about it.

It is in the nature of internal implementation artifacts that they are not publicly documented. The general arrangement and length of internal queues can and will change at vendor discretion. There can be no general expectation that simply creating a new stream to submit even more work is going to work reliably. There is a limited number of hardware queues underlying these software mechanisms.

In any programming environment, if a program continuously submits work requests at a higher rate than the hardware can process them, one should expect that a stall will eventually occur, as the queues and buffers designed to even out the flow are generally of finite length.

That said, in light of repeated questions both in these NVIDIA forums and on Stack Overflow, it may be beneficial if NVIDIA were to add an explanatory sentence about the consequences of finite queuing to section “6.2.8.1 Concurrent Execution between Host and Device” of the CUDA Programming Guide. Consider filing an enhancement request with NVIDIA regarding such a change.