CUDA Graphs Impact

Hi,

I have CUDA code that uses multiple streams to perform operations. The kernel launch time is usually around 5 µs; the largest is around 14 µs and the smallest is 770 ns. I should mention that kernel execution time is longer than the launch time (at least 1.5x longer, and generally 200x longer).

  • Do I have any chance of improving performance and kernel launch time with CUDA Graphs?

  • Can CUDA Graphs increase total execution time, or add overhead in my situation?

  • How can I get a precise analysis of kernel launch time (something automatic, not done by hand)?

Kernel launch times of around 5 µs for null kernels (kernels that do nothing) on PCIe gen3 hardware have been the standard for the past decade. It is my understanding that this is primarily a function of various hardware latencies associated with PCIe, with some minor impact from CPU performance in the form of CUDA driver overhead. There are some additional overhead issues on Windows with the WDDM driver, where the CUDA driver uses launch batching. Are you on Linux or Windows?

Assuming this is Linux: How confident are you about the measurement methodology that found a minimum launch time of 770 ns? I find that number to be improbably low, but I do not have access to a system with the latest PCIe gen4 hardware, so maybe this is a thing now. If you are on Windows with a WDDM driver, the short and long launch times observed are likely artifacts caused by batching. Consider switching to the TCC driver if possible.
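One way to cross-check the profiler numbers is a simple host-side timing loop around a null kernel. The following is only a minimal sketch: the names `null_kernel` and `average_launch_us` and the iteration count are made up for illustration, and it measures the average cost per launch for back-to-back launches in steady state, which is not the same thing as the latency of a single isolated launch.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Null kernel: does no work, so the measured time is dominated by launch overhead.
__global__ void null_kernel() {}

// Hypothetical helper: average wall-clock time per launch, in microseconds,
// over `iters` back-to-back launches into the default stream.
double average_launch_us(int iters) {
    // Warm-up launch so context creation and module loading are not timed.
    null_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        null_kernel<<<1, 1>>>();
    }
    // Wait until every queued launch has completed before stopping the clock.
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main() {
    const int iters = 100000;  // arbitrary choice for illustration
    double us = average_launch_us(iters);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("average time per null-kernel launch: %.2f us\n", us);
    return 0;
}
```

If the per-launch average from a loop like this is far above the minimum reported by the profiler, that would suggest the minimum is an outlier or a measurement artifact rather than a typical launch cost.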

Because of the kernel launch overhead it is, generally speaking, not a good idea to use extremely short-running kernels. A kernel runtime of 10 ms on the fastest GPU models of a GPU generation (which then run around 100 ms on the slowest GPUs of that generation) seems like a good target, which practically eliminates any impact from kernel launch overhead.
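For reference, since the question is about CUDA Graphs: where many short kernels cannot be avoided, a sequence of launches can be captured once from a stream and then replayed as a single graph launch. This is only a rough sketch, assuming CUDA 11.4 or later (for `cudaGraphInstantiateWithFlags`); `short_kernel`, `run_graph_replays`, the `CHECK` macro, and all sizes are invented for illustration, and whether it helps in a given application has to be measured.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Stand-in for a short-running kernel: adds 1 to every element.
__global__ void short_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Captures `kernels_per_graph` launches into one graph, replays the graph
// `replays` times, and returns the first element of the result buffer.
float run_graph_replays(int kernels_per_graph, int replays) {
    const int n = 1 << 16;
    float* d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Record the launches into a graph; nothing executes during capture.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels_per_graph; ++k) {
        short_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    }
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiation is a one-time cost, paid before the first replay.
    cudaGraphExec_t graph_exec;
    CHECK(cudaGraphInstantiateWithFlags(&graph_exec, graph, 0));

    // Each replay submits the whole kernel sequence with one launch call.
    for (int r = 0; r < replays; ++r) {
        CHECK(cudaGraphLaunch(graph_exec, stream));
    }
    CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost));

    CHECK(cudaGraphExecDestroy(graph_exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    return first;
}

int main() {
    const int kernels_per_graph = 20, replays = 100;
    float first = run_graph_replays(kernels_per_graph, replays);
    std::printf("data[0] = %.0f (each element is incremented %d times)\n",
                first, kernels_per_graph * replays);
    return 0;
}
```

Capture and instantiation add a one-time cost, so a graph is only likely to pay off when the same launch sequence is replayed many times.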

Thanks for the fast reply. I am using CUDA 11.4 on Linux with a V100 GPU on PCIe gen3.
To measure the time I profile with nsys-ui, then open the profile and read off the times. I am not sure whether this method is correct.

For example here: