How to avoid the overhead at the beginning of every CUDA application?

Hi!
I observe that every CUDA application suffers a large overhead (1 to 2 seconds) on its first CUDA API call. It happens regardless of what that first operation is: a kernel launch, a cudaMalloc, or even a cudaGetDeviceProperties call. Additionally, when many different CUDA applications are launched on the same GPU (with or without MPS), this overhead grows.
I have generally seen that benchmarks, in order to measure the true kernel or allocation latency, warm up the NVIDIA driver at the beginning of the application with a dummy kernel that does nothing. But what is the point of that, if under real conditions the first cudaMalloc or the first CUDA kernel always pays this overhead?
In my case I am interested in the performance of the whole CUDA application, so this overhead is a big problem when the latencies of most kernels, memory copies or allocations are in the range of 10 ms to 100 ms.

Is there any way to avoid this overhead from outside the CUDA application (not the warm-up dummy kernel inside the app)? I used the CUDA MPS runtime so that the GPU and the runtime stay active, and it reduced the overhead, but it is still large (500 ms instead of 2000 ms). Also, why did MPS reduce the overhead? And if it did so because it warmed up the CUDA driver, why didn't it eliminate the overhead entirely?
Any ideas please?
Thank you!

Do you have any reproducer code? 10ms to 100ms seems a little excessive. Do you have any profile outputs? Kernel launches usually occur in microseconds.
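
Even something as simple as the sketch below would help pin down where the time goes (a minimal sketch; the timing helper, names and sizes are arbitrary):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Host-side timer: the very first CUDA call includes context creation,
// which device-side (event) timers would not capture.
template <typename F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // First runtime API call: pays the one-time initialization cost.
    printf("first cudaFree(0): %.1f ms\n", time_ms([] { cudaFree(0); }));

    // Later calls run against the already-created context.
    void* p = nullptr;
    printf("cudaMalloc(1 MB):  %.3f ms\n",
           time_ms([&] { cudaMalloc(&p, 1 << 20); }));
    cudaFree(p);
    return 0;
}
```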

Most articles suggest warm-up kernels when you are benchmarking your code. The same thing is done on CPU benchmarks as well.
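
A typical warm-up looks roughly like this (a minimal sketch; the kernel name and launch configuration are arbitrary):

```cpp
#include <cuda_runtime.h>

// Empty kernel used only to pay the one-time initialization cost up front.
__global__ void warmup_kernel() {}

void warm_up_gpu() {
    cudaFree(0);                // first runtime call forces context creation
    warmup_kernel<<<1, 1>>>();  // the first launch also carries extra latency
    cudaDeviceSynchronize();    // make sure it finished before timing anything
}
```

As you point out, though, this only moves the cost to the start of the process; it does not remove it.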

If you’re using Tesla cards, you might try changing your persistence mode (e.g. nvidia-smi -pm 1 as root, or the nvidia-persistenced daemon). https://docs.nvidia.com/deploy/driver-persistence/index.html

As far as cudaGetDeviceProperties goes, try https://devblogs.nvidia.com/cuda-pro-tip-the-fast-way-to-query-device-properties/
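
If I remember right, the gist of that post is to query only the individual attributes you need with cudaDeviceGetAttribute instead of filling the whole cudaDeviceProp struct; a rough sketch (the attribute choice is just an example):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query only the attributes you actually need; filling the full
    // cudaDeviceProp struct with cudaGetDeviceProperties is much slower.
    int smCount = 0, maxThreads = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0);
    cudaDeviceGetAttribute(&maxThreads, cudaDevAttrMaxThreadsPerBlock, 0);
    printf("SMs: %d, max threads per block: %d\n", smCount, maxThreads);
    return 0;
}
```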

I’m not sure what your application is doing, but MPS allows kernel and memory copy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times. It’s not warming up the GPU.