I am new to CUDA programming, but I am attempting to use a Tesla V100 on a headless Linux machine to compute large FFTs. To get started, I wrote a simple program that transfers data to the GPU, takes an FFT, and transfers the result back to the host. The visual profiler shows that my computations take about 200 ms to run. However, the profiler also shows a call to cudaFree that takes 1 second at the start of my program. I don’t have an explicit call to cudaFree, but the first thing I do is set up the FFT plan. Perhaps that function calls cudaFree, which causes the OS to load the NVIDIA driver. Is 1 second a longer-than-normal initialization time? I tried using nvidia-smi -pm 1 (https://docs.nvidia.com/deploy/driver-persistence/index.html), but it did not have any effect on the runtime of my program.
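For context, a stripped-down version of my test program looks roughly like this (the transform size is just a placeholder, error checking omitted; built with nvcc test_fft.cu -lcufft):

    #include <vector>
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main(void)
    {
        const int N = 1 << 20;                       // placeholder transform size
        std::vector<cufftComplex> h_data(N, cufftComplex{1.0f, 0.0f});

        // First CUDA work in the program: this is where the profiler
        // shows the long cudaFree at startup.
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        cufftComplex *d_data = nullptr;
        cudaMalloc(&d_data, N * sizeof(cufftComplex));
        cudaMemcpy(d_data, h_data.data(), N * sizeof(cufftComplex),
                   cudaMemcpyHostToDevice);

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward FFT

        cudaMemcpy(h_data.data(), d_data, N * sizeof(cufftComplex),
                   cudaMemcpyDeviceToHost);

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }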
I don’t completely understand the documentation I linked to above. Do I need to start the persistence daemon and enable persistence mode, or is nvidia-smi an alternate way to start the daemon?
Is there any other way to mitigate the slow initialization time?
CUDA requires a context, which comprises all the GPU state CUDA needs to keep track of. The context is initialized lazily at first use rather than explicitly, i.e. there is no cudaContextInit() API call. The first CUDA API call triggers context initialization, which includes the creation of a memory map for the unified address space via a largish number of operating system calls.
This mapping process maps all GPU memory and all host memory into a single virtual address map, and the time it takes is therefore roughly proportional to the total amount of memory that needs to be mapped. If this takes on the order of one second, I am guessing you have a system with a large amount of system memory, maybe around 256 GB, and/or multiple GPUs.
In order to control where this context-creation delay is incurred, it is a standard “trick” of CUDA programmers to issue a cudaFree(0) at an opportune moment during application startup.
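A minimal sketch of that trick (the timing printout is only there to make the context-creation cost visible):

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    int main(void)
    {
        auto t0 = std::chrono::steady_clock::now();

        // Dummy call whose only purpose is to force lazy context creation
        // here, instead of inside the first "real" CUDA call
        // (e.g. the cuFFT plan setup).
        cudaFree(0);

        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("CUDA context initialization: %.1f ms\n", ms);

        // ... rest of the application: plan creation, transfers, FFTs ...
        return 0;
    }

This does not make the initialization any faster; it only moves the cost to a point in the program where it is acceptable (and keeps it out of the sections you are profiling).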
As you apparently are aware, this initialization time can be even longer if the CUDA driver has been unloaded due to GPU inactivity, which can be prevented with the help of the persistence daemon, whose task it is to keep the driver resident.
If the system contains multiple GPUs that aren’t all needed, they can be excluded from CUDA use, and thus from the mapping process, with the environment variable CUDA_VISIBLE_DEVICES (see the documentation). This will reduce CUDA context initialization time.
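The usual way is to set the variable in the shell before launching, e.g. CUDA_VISIBLE_DEVICES=0 ./my_app. If you would rather control it from within the program, something along these lines should work, provided it runs before any CUDA API call (a sketch; device "0" is just an example):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main(void)
    {
        // Must happen before the first CUDA call: CUDA_VISIBLE_DEVICES is
        // read when the context is created, so only the listed device(s)
        // get mapped. "0" restricts this process to device 0.
        setenv("CUDA_VISIBLE_DEVICES", "0", 1);

        cudaFree(0);   // force context creation for the visible device(s) only

        // ... rest of the application ...
        return 0;
    }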
The speed of the mapping via OS calls is primarily limited by single-thread CPU performance and secondarily by system memory throughput. For best results, use a CPU with a high base frequency (I suggest > 3.5 GHz for CPUs with up to eight cores) and DDR4 memory with as fast a speed grade and as many channels as you can afford.