My colleague, who is learning CUDA, tried to time 1000 launches of a simple image processing kernel:
for (int i=0; i < 1000; i++)
She measured 3 ms in release mode but 200 ms in debug mode. I explained to her that she needed a cudaThreadSynchronize() before stopping the timer, because kernel launches are asynchronous and the CUDA utility library synchronizes implicitly in debug builds only.
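To make the point concrete, here is a minimal sketch of the two ways of timing the loop. It is not her actual code: the `invert` kernel, the `time_launches` helper, and the image size are hypothetical stand-ins, and it uses `cudaDeviceSynchronize()`, which replaced the now-deprecated `cudaThreadSynchronize()` in later CUDA releases.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the image processing kernel: inverts 8-bit pixels.
__global__ void invert(unsigned char* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];
}

// Launches the kernel `launches` times and returns the elapsed host time in ms.
// With sync == false the timer stops once the launches have been issued,
// not once the GPU has finished executing them.
double time_launches(unsigned char* d_img, int n, int launches, bool sync) {
    cudaDeviceSynchronize();  // make sure earlier work is not counted
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; i++)
        invert<<<(n + 255) / 256, 256>>>(d_img, n);
    if (sync) cudaDeviceSynchronize();  // wait for all queued kernels
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int n = 1 << 20;  // assumed image size, for illustration only
    unsigned char* d_img = nullptr;
    if (cudaMalloc(&d_img, n) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_img, 0, n);

    double issued   = time_launches(d_img, n, 1000, false);
    double finished = time_launches(d_img, n, 1000, true);
    std::printf("without sync: %.3f ms\n", issued);
    std::printf("with sync:    %.3f ms\n", finished);

    cudaFree(d_img);
    return 0;
}
```

The first number only measures how long the host needed to issue the launches; the second includes the time the GPU needed to run them. The actual values depend on the GPU, the driver, and the kernel.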
But my god, how long has the kernel launch queue become? I thought it could only queue 16 jobs before starting to block. It’s hard to imagine that 1000 kernel launches can be queued at once, but the timing figures seem to indicate exactly that (3 ms vs 200 ms!).