I am new to CUDA programming, but I am attempting to use a Tesla V100 on a headless Linux machine to compute large FFTs. To get started, I wrote a simple program that transfers data to the GPU, takes an FFT, and transfers the result back to the host. The visual profiler shows that my computation takes about 200 ms to run, but it also shows a call to cudaFree that takes 1 second at the start of my program. I don’t have an explicit call to cudaFree; the first thing I do is set up the FFT plan, so perhaps that function calls cudaFree internally, which in turn triggers driver/context initialization. Is 1 second a longer-than-normal initialization time? I tried enabling persistence mode with nvidia-smi -pm 1 (https://docs.nvidia.com/deploy/driver-persistence/index.html), but it had no effect on the runtime of my program.
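For reference, here is a minimal sketch of the kind of program I’m describing (the size, plan type, and the explicit cudaFree(0) warm-up are illustrative, not my exact code). The warm-up call forces context creation up front, so the one-time initialization cost shows up separately from the plan setup and the FFT itself:

```cpp
// Sketch: time context creation separately from cuFFT plan setup and execution.
#include <cufft.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;  // large 1-D FFT (illustrative size)
    using clk = std::chrono::steady_clock;

    // Warm-up: a no-op cudaFree(0) forces CUDA context creation here,
    // so the one-time init cost is not attributed to cufftPlan1d below.
    auto t0 = clk::now();
    cudaFree(0);
    auto t1 = clk::now();
    printf("context init: %.0f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    std::vector<cufftComplex> h(N, cufftComplex{1.0f, 0.0f});
    cufftComplex* d = nullptr;
    cudaMalloc(&d, N * sizeof(cufftComplex));
    cudaMemcpy(d, h.data(), N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);   // in-place forward transform
    cudaMemcpy(h.data(), d, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d);
    return 0;
}
```

With this warm-up in place, the 1-second cost moves to the cudaFree(0) line, which matches my suspicion that it is context/driver initialization rather than anything cuFFT-specific.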
I don’t completely understand the documentation I linked above. Do I need to start the persistence daemon (nvidia-persistenced) in addition to enabling persistence mode, or is nvidia-smi -pm 1 an alternate way of accomplishing the same thing?
Is there any other way to mitigate the slow initialization time?