Device initialization takes 60 seconds

Hi,

I’m not sure what additional info would be useful up front, so please let me know what else you need to resolve this.

I have a system with 4x RTX 3090 inside a GIGABYTE MZ52-G41-00. The first time I call any CUDA function, it is very slow: calling cudaSetDevice(0) for the first time always takes 60 seconds (with no variance). Everything after that runs at the expected speed.
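
For reference, this is roughly how the latency shows up (a minimal sketch, not my actual application; it assumes a standard CUDA toolkit install and is built with nvcc):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    // First CUDA runtime call: triggers context creation/initialization.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaSetDevice(0);
    auto t1 = std::chrono::steady_clock::now();
    printf("first  cudaSetDevice(0): %s, %.2f s\n",
           cudaGetErrorString(err),
           std::chrono::duration<double>(t1 - t0).count());

    // Subsequent calls run at normal speed.
    t0 = std::chrono::steady_clock::now();
    err = cudaSetDevice(0);
    t1 = std::chrono::steady_clock::now();
    printf("second cudaSetDevice(0): %s, %.4f s\n",
           cudaGetErrorString(err),
           std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```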

Any suggestions what could be wrong, or is this to be expected (but 60s seems excessive) ?

Thanks!

Best,
Matthias

Are you running the persistence daemon?

https://docs.nvidia.com/deploy/driver-persistence/index.html

How much system memory is there? What is the CPU? Does this system have dual CPU sockets?


Persistence mode seems to be enabled (though I don’t have permission to change it, as I do not have sudo rights).
Would disabling this feature make things faster? If so, I could ask my system admin to do it.

The system has 252 GB of RAM and uses dual CPU sockets with two AMD EPYC 7313 16-core processors.

You want persistence turned on to prevent the driver from unloading when not in use. It seems that is already in place.

CUDA initializes lazily, triggered by the first API call. A long-standing “trick” is to call cudaFree(0) at a point in the program where it is convenient to trigger CUDA initialization.
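
As a sketch only (error handling kept minimal), placing the call at the very start of main() confines the one-time initialization cost to a spot where it is expected:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Force lazy CUDA context creation up front; cudaFree(0) frees nothing
    // but triggers the full runtime/driver initialization.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... rest of the application; later CUDA calls no longer pay the
    // initialization cost ...
    return 0;
}
```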

Part of CUDA context creation and initialization involves mapping all system memory and all GPU memory into one unified virtual address space. In terms of duration, this is often by far the longest portion of the initialization process. The more total memory, the longer the mapping takes. Given the amount of memory in this system, 60 seconds for CUDA initialization does not strike me as extraordinary.

The mapping process consists mostly of operating system calls, and most of this work is single-threaded. Therefore single-threaded CPU performance will have the most impact on the speed of the mapping process, with some minor impact from system memory performance. I see that EPYC 7313 has a base frequency of 3.0 GHz, which is not too bad; for GPU-accelerated systems I usually recommend CPUs with a base frequency >= 3.5 GHz. With the GPU taking care of the part of the app that is parallelizable, CPU performance is crucial for the serial portion.


Thanks for your help.

We actually have a server with exactly the same specs, and there device initialization takes < 100 ms, which suggests the problem is not the CPU being too slow. Any suggestion where this massive slowdown could come from?

Computers are quite deterministic systems. If there is a significant difference in initialization time, there has to be a difference in hardware or software configuration somewhere. In other words, there has to be a logical explanation, and the two machines are not exactly the same in all aspects.

You will need to become a detective to find the salient difference. I realize that this can be a significant challenge, and you will have to make a judgment call as to how many resources to commit. Cross-check the hardware, make sure all software components have identical versions, and pore over system logs to see whether any differences show up. One thing that is important in such investigations is that no assumptions should be made: the most harmless-looking difference could turn out to be the culprit. This is the idea behind the “copy exactly” philosophy in manufacturing.

Thanks, I actually just found the difference, which was persistence mode being enabled/disabled.

Good that this fixed it, but I thought we had discussed persistence mode at the very start of this thread?