Hello,
I am currently programming an HPC application where I submit many operations (my own kernels, cublas/cusparse/cusolver routines) into a single stream. I ran into a problem where the submission becomes synchronous after a while. Kernel launches should be asynchronous, but I observed that launching a kernel started taking “a lot of time”.
I boiled the problem down to a simple program (see below), where I found out that after I submit about 1022 kernels into a stream (no difference between a null and a non-null stream), the kernel launches start taking a long time. My guess is that there is a limit to how many operations can be queued in a stream, and once the limit is exceeded, the launch has to wait for the first operation to finish before a new one can be enqueued.
I have not been able to find anything describing this or similar behavior in the documentation, nor anywhere else on the internet (only an unanswered question here). Table 15 in the Programming Guide did not help. The closest entry is “Maximum number of resident grids per device (Concurrent Kernel Execution)”, but that is something different, and the numbers don’t match up anyway.
Could anyone please explain/confirm this behavior, or direct me to where I could find some info about this?
Thanks,
Jakub
The program I am using: source.cu (1.7 KB)
Compiling it with nvcc -g -O2 -Xcompiler -fopenmp source.cu -o program.x
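In essence, the program does something like this (a simplified sketch, not the exact contents of the attached source.cu; the kernel name, spin constant, and launch configuration here are illustrative):

```cpp
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Busy-wait kernel: spins until roughly spin_cycles clock ticks have elapsed,
// so that a single launch runs for on the order of 170 ms on the device.
__global__ void long_kernel(long long spin_cycles)
{
    long long start = clock64();
    while (clock64() - start < spin_cycles) { }
}

int main()
{
    const int num_launches = 1050;
    const long long spin_cycles = 200000000LL; // tune so one kernel takes ~170 ms

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up: time a few kernels including synchronization.
    for (int i = 0; i < 5; i++) {
        double t0 = omp_get_wtime();
        long_kernel<<<1, 1, 0, stream>>>(spin_cycles);
        cudaStreamSynchronize(stream);
        double t1 = omp_get_wtime();
        printf("Single kernel takes %f ms to execute\n", (t1 - t0) * 1000.0);
    }

    // Submit many kernels into the same stream and time only the launch call.
    for (int i = 0; i < num_launches; i++) {
        double t0 = omp_get_wtime();
        long_kernel<<<1, 1, 0, stream>>>(spin_cycles);
        double t1 = omp_get_wtime();
        printf("run_long_kernel %d submitted: %f ms\n", i, (t1 - t0) * 1000.0);
    }

    // Wait for everything that is still queued.
    double t0 = omp_get_wtime();
    cudaStreamSynchronize(stream);
    double t1 = omp_get_wtime();
    printf("synchronization took %f ms\n", (t1 - t0) * 1000.0);

    cudaStreamDestroy(stream);
    return 0;
}
```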
The output I observe:
Single kernel takes 171.630663 ms to execute
Single kernel takes 169.587548 ms to execute
Single kernel takes 169.583722 ms to execute
Single kernel takes 169.586577 ms to execute
Single kernel takes 169.582600 ms to execute
run_long_kernel 0 submitted: 0.003076 ms
run_long_kernel 1 submitted: 0.003226 ms
run_long_kernel 2 submitted: 0.002345 ms
run_long_kernel 3 submitted: 0.002465 ms
run_long_kernel 4 submitted: 0.002595 ms
...
run_long_kernel 1019 submitted: 0.002094 ms
run_long_kernel 1020 submitted: 0.002104 ms
run_long_kernel 1021 submitted: 0.002094 ms
run_long_kernel 1022 submitted: 160.850246 ms
run_long_kernel 1023 submitted: 169.563043 ms
run_long_kernel 1024 submitted: 169.564625 ms
...
run_long_kernel 1048 submitted: 169.561680 ms
run_long_kernel 1049 submitted: 169.562552 ms
synchronization took 173304.017861 ms
This was run on an A100-SXM4-40GB (cc8.0), CentOS 7, CUDA 12.0, nvcc V12.0.76. On my laptop (GTX 1650 Ti (cc7.5), Ubuntu 18.04 under WSL in Windows 11, CUDA 11.4, nvcc V11.5.119) it actually takes 4946 kernels before the problem occurs.
I also tested it with two streams, and the limit appears to be per-stream. I have also tried submitting cudaMemcpyAsync calls along with the kernels, and it seems that what matters is the total number of operations in a stream, not the counts of kernels and memory copies separately.
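For reference, the two-stream / mixed-operation test looked roughly like this (again a simplified fragment, reusing long_kernel and spin_cycles from the sketch above; the buffer size and the kernel/copy pattern are illustrative):

```cpp
// Distribute operations round-robin over two streams, alternating kernel
// launches with cudaMemcpyAsync calls within each stream, and measure only
// the submission time of each call.
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);

char *d_buf, *h_buf;
cudaMalloc(&d_buf, 1 << 20);
cudaMallocHost(&h_buf, 1 << 20);   // pinned host memory so the copy is truly async

for (int i = 0; i < 2200; i++) {
    cudaStream_t str = s[i % 2];   // round-robin between the two streams
    double t0 = omp_get_wtime();
    if ((i / 2) % 2 == 0)
        long_kernel<<<1, 1, 0, str>>>(spin_cycles);                          // kernel launch
    else
        cudaMemcpyAsync(d_buf, h_buf, 1 << 20, cudaMemcpyHostToDevice, str); // async copy
    double t1 = omp_get_wtime();
    printf("op %d into stream %d submitted: %f ms\n", i, i % 2, (t1 - t0) * 1000.0);
}
```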