Came across this issue while porting an existing SYCL algorithm (clusterization) to CUDA. I am a new contributor to traccc, a parallelized track reconstruction library for high-energy physics experiments.
To briefly explain the process: (event) data is read from files, and a few algorithms (combinations of CUDA kernels and sequential CPU code) are run on the loaded data. The main driver code iterates over events and runs the same algorithms on each event's data.
Times recorded with std::chrono for a single event of the clusterization algorithm: SYCL ~0.02 s, CUDA ~0.22 s. However, we expect similar performance from both SYCL and CUDA. When running as the root user, the CUDA time dropped to ~0.02 s [I made this accidental discovery while profiling].
Profiling was done using Nsight Systems as both root and non-root user, because profiling as root fails to capture the high launch latency.
Non-root profiling results for 2 iterations (events) show that the launch latency of the first kernel (find_clusters) takes up most of the time on the first iteration (~0.2 s) and settles down to a few microseconds for the following events.
2_Iterations_non-root.nsys-rep (4.3 MB)
When profiling as root for 2 iterations, the launch latency of this find_clusters kernel in the first iteration is ~3 ms and settles down to a few microseconds for later iterations.
2_iterations_root.nsys-rep (4.2 MB)
I need some help understanding what's causing the very high CUDA kernel launch time, and how running as root reduces the kernel launch time so significantly.
Links to the code in question:
cuda implementation of clusterization - traccc/clusterization_algorithm.cu at clusterization-cuda-common · Chamodya-ka/traccc · GitHub
clusterization kernels - traccc/device/common/include/traccc/clusterization/device at clusterization-cuda-common · Chamodya-ka/traccc · GitHub
driver code - traccc/seq_example_cuda.cpp at clusterization-cuda-common · Chamodya-ka/traccc · GitHub
Device info: GTX 1050 Ti
CUDA : 11.6
OS : Ubuntu 18.04
On my monitor the included images are undecipherable. I am going purely by the above description. What you should be expecting to see in CUDA is launch overhead on the order of 5 microseconds. On the first iteration, you should expect to see CUDA context initialization overhead. CUDA uses lazy initialization triggered by the first access to a CUDA API function. Other than for toy programs, this would typically not involve a kernel launch as such but a preceding call to a CUDA memory allocation function.
CUDA initialization overhead depends on a variety of factors, including the number of GPUs in the system, the total size of all GPU memory, and the size of system memory. On large systems (say, 8 GPUs and 256 GB of system memory), context initialization time could be on the order of seconds. 200 milliseconds would be more typical for a workstation-class system with one GPU.
The higher the single-thread performance of the host CPU, the less time context initialization will take, as this work is heavily dominated by single-threaded, OS-level activity. Generally speaking, high CPU base frequency translates to high single-thread performance; I recommend > 3.5 GHz.
A common technique to trigger CUDA context initialization at a time that is convenient to the application and outside any timed or benchmarked portion of code is to invoke a no-op CUDA API call such as cudaFree(0) early in the program.
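A minimal host-side sketch of this warm-up pattern, assuming cudaFree(0) as the no-op API call (the one mentioned later in this thread) and std::chrono for timing:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    // Trigger lazy CUDA context creation up front, outside any
    // benchmarked region. cudaFree(0) does no useful work, but forces
    // the runtime to initialize the context on the current device.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("context init: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Subsequent kernel launches now pay only the usual
    // few-microsecond launch overhead, not the initialization cost.
    return 0;
}
```

The same effect can be had with any early CUDA runtime call (e.g. a cudaMalloc of the first buffer); the point is simply that it happens before the timed region.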
Note that Linux will typically unload the CUDA driver when not in use, and reloading it will add additional startup overhead. Use the persistence daemon to keep the CUDA driver loaded at all times.
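For reference, persistence can be enabled either through the legacy nvidia-smi flag or by running the daemon itself; a sketch of both (requires root privileges, and the daemon is the mechanism NVIDIA recommends on recent drivers):

```shell
# Legacy method: enable persistence mode on all GPUs
sudo nvidia-smi -pm 1

# Preferred method: start the persistence daemon
sudo nvidia-persistenced
```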
Thanks a lot, njuffa, for addressing my problem and giving a detailed explanation.
I was aware of using a dummy API call like cudaFree(0) to overcome this problem. I tested it some time ago and apparently got wrong results then, which led me into this confusion (a silly mistake on my part). Since you reconfirmed this, I tested it once more and it worked as intended.
Any thoughts on why using sudo eliminates this initialization overhead?
Nothing comes to mind. In my thinking, one should not use sudo or the root account except for specific administrative work, and certainly not for running CUDA applications. Not sure whether the Ubuntu crowd has a different philosophy on that; I try to stay as far away from Ubuntu as possible.
But there are many Ubuntu users who frequent these forums and may have some insight into that.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.