CUDA Graphs Impact

uniadam · September 17, 2021, 5:51pm

Hi,

I have CUDA code that uses multiple streams to perform its operations. The kernel launch time is usually around 5 µs; the largest is around 14 µs and the smallest is 770 ns. I should mention that kernel execution time is longer than kernel launch time (at least 1.5x longer and typically 200x longer).

Do I have any chance of improving performance and kernel launch time with CUDA Graphs?
Can CUDA Graphs increase total execution time, or add overhead in my situation?
How can I get a precise analysis of kernel launch time (something automatic, not done by hand)?

njuffa · September 17, 2021, 6:38pm

Kernel launch times of around 5 µs for null kernels (kernels that do nothing) on PCIe gen3 hardware have been the standard for the past decade. It is my understanding that this is primarily a function of various hardware latencies associated with PCIe, with some minor impact from CPU performance in the form of CUDA driver overhead. There are some additional overhead issues on Windows with the WDDM driver, where the CUDA driver uses launch batching. Are you on Linux or Windows?
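One way to check the null-kernel figure on a given machine is to time it from the host. The following is a minimal sketch, not a rigorous benchmark: the function names, the 1x1 launch configuration, and the iteration counts are illustrative choices, and the numbers it prints depend on the system. It reports two things: the average time the host spends issuing a launch, and the average time for a launch plus a full synchronization.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Null kernel: does no work, so what is measured is essentially launch overhead.
__global__ void null_kernel() {}

using Clock = std::chrono::steady_clock;

static double elapsed_us(Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration<double, std::micro>(b - a).count();
}

// Average host-side time to issue one asynchronous launch.
// Caveat: if n is large enough to fill the launch queue, the host stalls
// and this number drifts toward the GPU-side launch throughput instead.
double avg_issue_us(int n) {
    cudaDeviceSynchronize();
    auto t0 = Clock::now();
    for (int i = 0; i < n; ++i) null_kernel<<<1, 1>>>();
    auto t1 = Clock::now();
    cudaDeviceSynchronize();
    return elapsed_us(t0, t1) / n;
}

// Average time for one launch followed by a synchronization (full round trip).
double avg_round_trip_us(int n) {
    cudaDeviceSynchronize();
    auto t0 = Clock::now();
    for (int i = 0; i < n; ++i) {
        null_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    auto t1 = Clock::now();
    return elapsed_us(t0, t1) / n;
}

int main() {
    // Warm-up: the first launch also pays one-time initialization costs.
    null_kernel<<<1, 1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n",
                     cudaGetErrorString(cudaGetLastError()));
        return 1;
    }
    std::printf("avg issue time:      %.2f us\n", avg_issue_us(1000));
    std::printf("avg launch + sync:   %.2f us\n", avg_round_trip_us(1000));
    return 0;
}
```

Compiled with something like `nvcc -O2 launch_overhead.cu` (the file name is arbitrary), this gives a quick cross-check against the values read off a profiler timeline.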

Assuming this is Linux: How confident are you about the measurement methodology that found a minimum launch time of 770 ns? I find that number to be improbably low, but I do not have access to a system with the latest PCIe gen4 hardware, so maybe this is a thing now. If you are on Windows with a WDDM driver, the short and long launch times observed are likely artifacts caused by batching. Consider switching to the TCC driver if possible.

Because of the kernel launch overhead, it is generally not a good idea to use extremely short-running kernels. A kernel runtime of 10 ms on the fastest GPU models of a GPU generation (which corresponds to around 100 ms on the slowest GPUs of that generation) seems like a good target that practically eliminates any impact from kernel launch overhead.
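Whether CUDA Graphs help for short kernels can be checked empirically by timing the same sequence of kernels launched one by one and replayed as a captured graph. The sketch below assumes a single stream and a made-up workload (`short_kernel`, 20 kernels per step, 200 steps); none of these come from the code discussed in this thread, and a multi-stream application would need its own capture logic. It uses `cudaGraphInstantiateWithFlags`, which requires CUDA 11.4 or newer.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         cudaGetErrorString(err_));                    \
            std::exit(1);                                              \
        }                                                              \
    } while (0)

// Hypothetical short-running kernel standing in for the real workload.
__global__ void short_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

constexpr int kKernelsPerStep = 20;   // illustrative values
constexpr int kSteps = 200;
constexpr int kN = 1 << 12;

// Enqueue one step: a chain of short kernels on stream s.
static void enqueue_step(float* d, cudaStream_t s) {
    for (int k = 0; k < kKernelsPerStep; ++k)
        short_kernel<<<(kN + 255) / 256, 256, 0, s>>>(d, kN);
}

// Wall-clock time for enqueue() plus draining the stream, in milliseconds.
template <typename F>
static double timed_ms(cudaStream_t s, F&& enqueue) {
    CHECK(cudaStreamSynchronize(s));
    auto t0 = std::chrono::steady_clock::now();
    enqueue();
    CHECK(cudaStreamSynchronize(s));
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

struct Result { double plain_ms; double graph_ms; float first; float expected; };

Result compare_plain_vs_graph() {
    float* d = nullptr;
    CHECK(cudaMalloc(&d, kN * sizeof(float)));
    CHECK(cudaMemset(d, 0, kN * sizeof(float)));
    cudaStream_t s;
    CHECK(cudaStreamCreate(&s));

    // Capture one step into a graph; kernels are recorded here, not executed.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    CHECK(cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal));
    enqueue_step(d, s);
    CHECK(cudaStreamEndCapture(s, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Warm up both paths once so one-time costs are not timed.
    enqueue_step(d, s);
    CHECK(cudaGraphLaunch(exec, s));

    Result r{};
    r.plain_ms = timed_ms(s, [&] {
        for (int i = 0; i < kSteps; ++i) enqueue_step(d, s);
    });
    r.graph_ms = timed_ms(s, [&] {
        for (int i = 0; i < kSteps; ++i) CHECK(cudaGraphLaunch(exec, s));
    });
    CHECK(cudaGetLastError());

    // Sanity check: every kernel execution adds 1 to each element.
    CHECK(cudaMemcpy(&r.first, d, sizeof(float), cudaMemcpyDeviceToHost));
    r.expected = static_cast<float>((2 + 2 * kSteps) * kKernelsPerStep);

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(d));
    return r;
}

int main() {
    Result r = compare_plain_vs_graph();
    std::printf("stream launches: %.3f ms\n", r.plain_ms);
    std::printf("graph launches:  %.3f ms\n", r.graph_ms);
    std::printf("x[0] = %.0f (expected %.0f)\n", r.first, r.expected);
    return 0;
}
```

Capture and instantiation are deliberately kept outside the timed region; they are a one-time cost, so a graph that is replayed only a few times may not recover it. As kernel runtime grows relative to launch time, any difference between the two timings should shrink.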

uniadam · September 17, 2021, 6:52pm

Thanks for the fast reply. I am using CUDA 11.4 on Linux with a V100 GPU (PCIe gen3).
To measure time I am using nsys-ui: after profiling, I open the profile and read off the times. I am not sure whether this method is correct.

For example here:
