how long is the kernel launch queue these days?

My colleague, who is learning CUDA, tried to time 1000 launches of a simple image processing kernel:


for (int i = 0; i < 1000; i++)
    kernel<<< >>>(…);


While she was getting 3 ms in release mode, it took 200 ms in debug mode. I explained to her that she needed a cudaThreadSynchronize() before stopping the timer, because kernel launches are asynchronous and the CUDA utility library does implicit synchronization in debug builds only.
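For reference, a minimal sketch of the corrected timing (the kernel body, launch configuration, and buffer size are placeholders; cudaThreadSynchronize() was later deprecated in favour of cudaDeviceSynchronize()):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Placeholder standing in for the actual image processing kernel
__global__ void kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; i++)
        kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaThreadSynchronize();   // wait until every queued launch has finished
    auto t1 = std::chrono::steady_clock::now();

    printf("1000 launches + sync: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    cudaFree(d_data);
    return 0;
}
```

Without the synchronize, the timer only measures how fast the host can enqueue launches, not how long the kernels take to run.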

But my god, how long has the kernel launch queue become? I thought it could only queue 16 launches before starting to block. It’s hard to imagine that 1000 kernel launches can be queued simultaneously, but the timing figures seem to indicate exactly that (3 ms vs. 200 ms!).


I’m not sure I follow the question here… Some darts in the dark:

Unless you set CUDA_LAUNCH_BLOCKING=1, all kernel calls are asynchronous. Note that the 16 figure is the limit on concurrently *executing* kernels on compute capability 2.x devices, not on how many launches can be queued. Under the debugging and profiling tools (cuda-gdb, the CUDA Visual Profiler, Parallel Nsight), kernel launches are synchronous.
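You can see the effect of that environment variable by timing the launch call itself. A sketch (the spin kernel is an arbitrary placeholder that just burns time):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void spin() {             // placeholder kernel that runs for a while
    for (volatile int i = 0; i < 1000000; i++);
}

int main() {
    spin<<<1, 1>>>();                // warm-up launch absorbs one-time init cost
    cudaThreadSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    spin<<<1, 1>>>();                // asynchronous by default; blocks until the
                                     // kernel completes if CUDA_LAUNCH_BLOCKING=1
    auto t1 = std::chrono::steady_clock::now();

    printf("launch call took %.1f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count());
    cudaThreadSynchronize();
    return 0;
}
```

Run it plain and then as CUDA_LAUNCH_BLOCKING=1 ./a.out: the printed time should jump from a few microseconds to roughly the full kernel duration.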

Was she looking for the kernel launch overhead? Tim recently disclosed that these days it’s “way less than 4 us”.

I am puzzled, too, that it is possible to launch one kernel a thousand times without hitting blocking behavior.

The 3-4 ms measured in release mode nicely matches 1000 launches × ~4 us launch overhead.


Me too. Does the kernel actually write results back? Could it be that the kernel is optimized away completely unless optimization is disabled for debugging? Does she check return codes? Maybe, for some mysterious reason, the kernel aborts unless optimization is turned off.
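On the return-code point, the usual checking pattern distinguishes launch errors from execution errors (a sketch; the kernel name and launch configuration are placeholders):

```cuda
kernel<<<grid, block>>>(d_data);

// Errors in the launch itself (bad configuration etc.) show up immediately:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// Errors during execution only surface at the next synchronizing call:
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```

If the kernel is being optimized away or aborting, the second check is where it would show up.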

I don’t know if there’s a fixed size per se, but the number of kernel launches that can be queued is quite large. We try our best not to block unless we are absolutely out of space and there is nothing we can do about it except wait for some work to complete.
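One way to probe that limit empirically (a sketch under assumptions: the spin kernel, the 100 us threshold, and the launch count are all arbitrary choices) is to time each launch call on the host and watch for the first one that suddenly takes much longer — that is roughly where the queue filled up and the launch had to wait:

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void spin() {                     // placeholder long-running kernel
    for (volatile int i = 0; i < 1000000; i++);
}

int main() {
    using clk = std::chrono::steady_clock;
    for (int i = 0; i < 10000; i++) {
        auto t0 = clk::now();
        spin<<<1, 1>>>();
        double us = std::chrono::duration<double, std::micro>(clk::now() - t0).count();
        if (us > 100.0) {                    // heuristic: a slow call means it blocked
            printf("launch %d blocked for %.0f us -> queue depth roughly %d\n",
                   i, us, i);
            break;
        }
    }
    cudaThreadSynchronize();
    return 0;
}
```

The number printed is only an estimate, since the GPU is draining the queue while the host fills it.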