Device initialization takes 60 seconds

Hi,

I’m not sure what additional info would be useful up front, so please let me know what else you need to resolve this.

I have a system with 4x RTX 3090 inside a GIGABYTE MZ52-G41-00. The first time I call any CUDA function, it is very slow: calling cudaSetDevice(0) for the first time always takes 60 seconds (with no variance). Everything after that runs at the expected speed.
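
For reference, this is roughly how the latency shows up (a minimal sketch, not my actual application; it assumes a standard CUDA toolkit install and is built with nvcc):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    // First CUDA runtime call: triggers context creation/initialization.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaSetDevice(0);
    auto t1 = std::chrono::steady_clock::now();
    printf("first  cudaSetDevice(0): %s, %.2f s\n",
           cudaGetErrorString(err),
           std::chrono::duration<double>(t1 - t0).count());

    // Subsequent calls run at normal speed.
    t0 = std::chrono::steady_clock::now();
    err = cudaSetDevice(0);
    t1 = std::chrono::steady_clock::now();
    printf("second cudaSetDevice(0): %s, %.4f s\n",
           cudaGetErrorString(err),
           std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```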

Any suggestions what could be wrong, or is this to be expected (but 60s seems excessive) ?

Thanks!

Best,
Matthias

Are you running the persistence daemon?

https://docs.nvidia.com/deploy/driver-persistence/index.html

How much system memory is there? What is the CPU? Does this system have dual CPU sockets?


Persistence mode seems to be enabled (though I don’t have permission to change it, as I do not have sudo rights).
Would disabling this feature make things faster? If so, I could ask my system admin to do it.

The system has 252 GB of RAM and uses dual CPU sockets with two AMD EPYC 7313 16-core processors.

You want persistence turned on to prevent the driver from unloading when not in use. It seems that is already in place.

CUDA initializes lazily, triggered by the first API call. A long-standing “trick” is to call cudaFree(0) at a point in the program where it is convenient to trigger CUDA initialization.
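
As a sketch only (error handling kept minimal), placing the call at the very start of main() confines the one-time initialization cost to a spot where it is expected:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Force lazy CUDA context creation up front; cudaFree(0) frees nothing
    // but triggers the full runtime/driver initialization.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... rest of the application; later CUDA calls no longer pay the
    // initialization cost ...
    return 0;
}
```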

Part of CUDA context creation and initialization involves mapping all system memory and all GPU memory into one unified virtual address space. In terms of duration, this is often by far the longest portion of the initialization process. The more total memory, the longer the mapping takes. Given the amount of memory in this system, 60 seconds for CUDA initialization does not strike me as extraordinary.

The mapping process consists mostly of operating system calls, and most of this work is single-threaded. Therefore single-threaded CPU performance will have the most impact on the speed of the mapping process, with some minor impact from system memory performance. I see that EPYC 7313 has a base frequency of 3.0 GHz, which is not too bad; for GPU-accelerated systems I usually recommend CPUs with a base frequency >= 3.5 GHz. With the GPU taking care of the part of the app that is parallelizable, CPU performance is crucial for the serial portion.


Thanks for your help.

We actually have a server with exactly the same specs, and there device initialization takes < 100 ms, which suggests the problem is not the CPU being too slow. Any suggestion where this massive slowdown could come from?

Computers are quite deterministic systems. If there is a significant difference in initialization time, there has to be a difference in hardware or software configuration somewhere. In other words, there has to be a logical explanation, and the two machines are not exactly the same in all aspects.

You will need to become a detective to find the salient difference. I realize that this can be a significant challenge, and you will have to make a judgment call as to how many resources to commit. Cross-check the hardware, make sure all software components have identical versions, and pore over system logs to see whether any differences show up. One thing that is important in such investigations is that no assumptions should be made: the most harmless-looking difference could turn out to be the culprit. This is the idea behind the “copy exactly” philosophy in manufacturing.

Thanks, I actually just found the difference, which was persistence mode being enabled/disabled.

Good that this fixed it, but I thought we had discussed persistence mode at the very start of this thread?