I am new to CUDA programming, but I am attempting to use a Tesla V100 on a headless Linux machine to compute large FFTs. To get started, I wrote a simple program that transfers data to the GPU, takes an FFT, and transfers the result back to the host. The visual profiler shows that my computations take about 200 ms to run. However, the profiler also shows a call to cudaFree that takes 1 second at the start of my program. I don’t have an explicit call to cudaFree, but the first thing I do is set up the FFT plan. Perhaps that function calls cudaFree, which causes the OS to load the NVIDIA driver. Is 1 second a longer-than-normal initialization time? I tried using nvidia-smi -pm 1 (https://docs.nvidia.com/deploy/driver-persistence/index.html), but it did not have any effect on the runtime of my program.
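For context, a stripped-down version of my test program looks roughly like this (the transform size is just a placeholder, error checking omitted; built with nvcc test_fft.cu -lcufft):

    #include <vector>
    #include <cuda_runtime.h>
    #include <cufft.h>

    int main(void)
    {
        const int N = 1 << 20;                       // placeholder transform size
        std::vector<cufftComplex> h_data(N, cufftComplex{1.0f, 0.0f});

        // First CUDA work in the program: this is where the profiler
        // shows the long cudaFree at startup.
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);

        cufftComplex *d_data = nullptr;
        cudaMalloc(&d_data, N * sizeof(cufftComplex));
        cudaMemcpy(d_data, h_data.data(), N * sizeof(cufftComplex),
                   cudaMemcpyHostToDevice);

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward FFT

        cudaMemcpy(h_data.data(), d_data, N * sizeof(cufftComplex),
                   cudaMemcpyDeviceToHost);

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }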
I don’t completely understand the documentation I linked to above. Do I need to start the persistence daemon and enable persistence mode, or is nvidia-smi an alternate way to start the daemon?
Is there any other way to mitigate the slow initialization time?
CUDA requires a context, which comprises all the GPU state CUDA needs to keep track of. The context is initialized lazily at first use rather than explicitly, i.e. there is no cudaContextInit() API call. The first CUDA API call triggers context initialization, which includes the creation of a memory map for the unified address space via a largish number of operating system calls.
This mapping process maps all GPU memory and all host memory into a single virtual address map, and the time it takes is therefore roughly proportional to the total amount of memory that needs to be mapped. If this takes on the order of one second, I am guessing you have a system with a large amount of system memory, maybe around 256 GB, and/or multiple GPUs.
In order to control where this context-creation delay is incurred, it is a standard “trick” of CUDA programmers to issue a cudaFree(0) at an opportune moment during application startup.
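A minimal sketch of that trick (the timing printout is only there to make the context-creation cost visible):

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    int main(void)
    {
        auto t0 = std::chrono::steady_clock::now();

        // Dummy call whose only purpose is to force lazy context creation
        // here, instead of inside the first "real" CUDA call
        // (e.g. the cuFFT plan setup).
        cudaFree(0);

        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("CUDA context initialization: %.1f ms\n", ms);

        // ... rest of the application: plan creation, transfers, FFTs ...
        return 0;
    }

This does not make the initialization any faster; it only moves the cost to a point in the program where it is acceptable (and keeps it out of the sections you are profiling).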
As you apparently are aware, this initialization time can be even longer if the CUDA driver has been unloaded due to GPU inactivity, which can be prevented with the help of the persistence daemon, whose task it is to keep the driver resident.
If the system contains multiple GPUs that aren’t all needed, they can be excluded from CUDA use, and thus from the mapping process, with the environment variable CUDA_VISIBLE_DEVICES (see the documentation). This will reduce CUDA context initialization time.
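The usual way is to set the variable in the shell before launching, e.g. CUDA_VISIBLE_DEVICES=0 ./my_app. If you would rather control it from within the program, something along these lines should work, provided it runs before any CUDA API call (a sketch; device "0" is just an example):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main(void)
    {
        // Must happen before the first CUDA call: CUDA_VISIBLE_DEVICES is
        // read when the context is created, so only the listed device(s)
        // get mapped. "0" restricts this process to device 0.
        setenv("CUDA_VISIBLE_DEVICES", "0", 1);

        cudaFree(0);   // force context creation for the visible device(s) only

        // ... rest of the application ...
        return 0;
    }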
The speed of the mapping via OS calls is primarily limited by single-thread CPU performance and secondarily by system memory throughput. For best results, use a CPU with a high base frequency (I suggest > 3.5 GHz for CPUs with up to eight cores) and DDR4 memory with as fast a speed grade and as many channels as you can afford.