How to avoid the overhead at the beginning of every CUDA application?

Hi!
I observe that every CUDA application suffers a large overhead (1 to 2 seconds) on its first CUDA API call. It happens regardless of what that first operation is: a kernel launch, a cudaMalloc, or even a cudaGetDeviceProperties call. Additionally, when many different CUDA applications are launched on the same GPU (with or without MPS), this overhead grows.
I have generally seen that benchmarks, in order to measure the true kernel or allocation latency, warm up the NVIDIA driver at the beginning of the application with a dummy kernel that does nothing. But what is the point of that, if under real conditions the first cudaMalloc or the first CUDA kernel always pays this overhead?
In my case I am interested in the performance of the whole CUDA application, so this overhead is a big problem when the latencies of most kernels, memory copies or allocations are in the range of 10 ms to 100 ms.

Is there any way to avoid this overhead from outside the CUDA application (not the warm-up dummy kernel inside the app)? I used the CUDA MPS runtime so that the GPU and the runtime stay active, and it reduced the overhead, but it is still large (500 ms instead of 2000 ms). Also, why did MPS reduce the overhead? And if it did so because it warmed up the CUDA driver, why didn't it eliminate the overhead entirely?
Any ideas please?
Thank you!

Do you have any reproducer code? 10ms to 100ms seems a little excessive. Do you have any profile outputs? Kernel launches usually occur in microseconds.
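
Even something as simple as the sketch below would help pin down where the time goes (a minimal sketch; the timing helper, names and sizes are arbitrary):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Host-side timer: the very first CUDA call includes context creation,
// which device-side (event) timers would not capture.
template <typename F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // First runtime API call: pays the one-time initialization cost.
    printf("first cudaFree(0): %.1f ms\n", time_ms([] { cudaFree(0); }));

    // Later calls run against the already-created context.
    void* p = nullptr;
    printf("cudaMalloc(1 MB):  %.3f ms\n",
           time_ms([&] { cudaMalloc(&p, 1 << 20); }));
    cudaFree(p);
    return 0;
}
```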

Most articles suggest warm-up kernels when you are benchmarking your code. The same thing is done on CPU benchmarks as well.
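
A typical warm-up looks roughly like this (a minimal sketch; the kernel name and launch configuration are arbitrary):

```cpp
#include <cuda_runtime.h>

// Empty kernel used only to pay the one-time initialization cost up front.
__global__ void warmup_kernel() {}

void warm_up_gpu() {
    cudaFree(0);                // first runtime call forces context creation
    warmup_kernel<<<1, 1>>>();  // the first launch also carries extra latency
    cudaDeviceSynchronize();    // make sure it finished before timing anything
}
```

As you point out, though, this only moves the cost to the start of the process; it does not remove it.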

If you’re using Tesla cards, you might try changing your persistence mode (e.g. nvidia-smi -pm 1 as root, or the nvidia-persistenced daemon). https://docs.nvidia.com/deploy/driver-persistence/index.html

As far as cudaGetDeviceProperties goes, try https://devblogs.nvidia.com/cuda-pro-tip-the-fast-way-to-query-device-properties/
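
If I remember right, the gist of that post is to query only the individual attributes you need with cudaDeviceGetAttribute instead of filling the whole cudaDeviceProp struct; a rough sketch (the attribute choice is just an example):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query only the attributes you actually need; filling the full
    // cudaDeviceProp struct with cudaGetDeviceProperties is much slower.
    int smCount = 0, maxThreads = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
    cudaDeviceGetAttribute(&maxThreads, cudaDevAttrMaxThreadsPerBlock, 0);
    printf("SMs: %d, max threads per block: %d\n", smCount, maxThreads);
    return 0;
}
```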

I’m not sure what your application is doing, but MPS allows kernel and memory copy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times. It’s not warming up the GPU.