Came across this issue while porting an existing SYCL algorithm (clusterization) to CUDA. I am a new contributor to traccc, a parallelized track reconstruction library for high-energy physics experiments.
To briefly explain the process: (event) data is read from files, and a few algorithms (combinations of CUDA kernels and sequential CPU code) are run on the loaded data. The main driver code iterates over events and runs the same algorithms on each event's data.
Times recorded with std::chrono for a single event of the clusterization algorithm: SYCL ~0.02 s, CUDA ~0.22 s. However, we expect similar performance from both SYCL and CUDA. When running as the root user, the CUDA time dropped to ~0.02 s [I made this accidental discovery while profiling].
Profiling was done using Nsight Systems as both root and non-root user, because profiling as root fails to capture the high launch latency.
Non-root profiling results for 2 iterations (events) show that the launch latency of the first kernel (find_clusters) takes up most of the time on the first iteration (~0.2 s) and settles down to a few microseconds for the following events.
2_Iterations_non-root.nsys-rep (4.3 MB)
When profiling as root for 2 iterations, the launch latency of this find_clusters kernel in the first iteration is ~3 ms and settles down to a few microseconds for later iterations.
2_iterations_root.nsys-rep (4.2 MB)
I need some help understanding what's causing the very high CUDA kernel launch time, and how running as root reduces the kernel launch time so significantly.
Links to the code in question:
cuda implementation of clusterization - traccc/clusterization_algorithm.cu at clusterization-cuda-common · Chamodya-ka/traccc · GitHub
clusterization kernels - traccc/device/common/include/traccc/clusterization/device at clusterization-cuda-common · Chamodya-ka/traccc · GitHub
driver code - traccc/seq_example_cuda.cpp at clusterization-cuda-common · Chamodya-ka/traccc · GitHub
Device info: GTX 1050 Ti
CUDA : 11.6
OS : Ubuntu 18.04
On my monitor the included images are undecipherable. I am going purely by the above description. What you should be expecting to see in CUDA is launch overhead on the order of 5 microseconds. On the first iteration, you should expect to see CUDA context initialization overhead. CUDA uses lazy initialization triggered by the first access to a CUDA API function. Other than for toy programs, this would typically not involve a kernel launch as such but a preceding call to a CUDA memory allocation function.
CUDA initialization overhead depends on a variety of factors, including the number of GPUs in the system, the total size of all GPU memory, and the size of system memory. On large systems (say, 8 GPUs and 256 GB of system memory), context initialization time could be on the order of seconds. 200 milliseconds would be more typical for a workstation-class system with one GPU.
The higher the single-thread performance of the host CPU, the less time context initialization will take, as this work is heavily dominated by single-threaded, OS-level activity. Generally speaking, high CPU base frequency translates to high single-thread performance; I recommend > 3.5 GHz.
A common technique to trigger CUDA context initialization at a time that is convenient to the application and outside any timed or benchmarked portion of code is to invoke a no-op CUDA API call such as cudaFree(0) early in the program.
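A minimal host-side sketch of this warm-up pattern, assuming cudaFree(0) as the no-op API call (the one mentioned later in this thread) and std::chrono for timing:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    // Trigger lazy CUDA context creation up front, outside any
    // benchmarked region. cudaFree(0) does no useful work, but forces
    // the runtime to initialize the context on the current device.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("context init: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Subsequent kernel launches now pay only the usual
    // few-microsecond launch overhead, not the initialization cost.
    return 0;
}
```

The same effect can be had with any early CUDA runtime call (e.g. a cudaMalloc of the first buffer); the point is simply that it happens before the timed region.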
Note that Linux will typically unload the CUDA driver when not in use, and reloading it will add additional startup overhead. Use the persistence daemon to keep the CUDA driver loaded at all times.
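For reference, persistence can be enabled either through the legacy nvidia-smi flag or by running the daemon itself; a sketch of both (requires root privileges, and the daemon is the mechanism NVIDIA recommends on recent drivers):

```shell
# Legacy method: enable persistence mode on all GPUs
sudo nvidia-smi -pm 1

# Preferred method: start the persistence daemon
sudo nvidia-persistenced
```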
Thanks a lot, njuffa, for addressing my problem and giving a detailed explanation.
I was aware of using a dummy API call like cudaFree(0) to overcome this problem. I tested it some time ago and apparently got wrong results then, which led me into this confusion (a silly mistake on my part). Since you reconfirmed this, I tested it once more and it worked as intended.
Any thoughts on why using sudo eliminates this initialization overhead?
Nothing comes to mind. In my thinking, one should not use sudo or the root account except for specific administrative work, and certainly not for running CUDA applications. Not sure whether the Ubuntu crowd has a different philosophy on that; I try to stay as far away from Ubuntu as possible.
But there are many Ubuntu users who frequent these forums and may have some insight into that.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.