Exponential initialization timings with multiple GPUs


I notice an extreme difference in initialization timings between driver versions when using multiple GPUs.
The initialization of the GPUs typically takes some time on the first run; with driver version 375.10 we were able to work around this using the persistence daemon (the second run is fast).
However, newer driver versions (e.g. 384.69) appear to have a load time that grows exponentially with the number of GPUs attached to a single host, and these timings are no longer acceptable.

The same exponential behavior seems to be present in both driver versions (375.10 & 384.69), but it is exaggerated in the latter. With 16 NVIDIA GTX 980 GPUs in a single system, we used to see loading times of up to ~30 seconds (375.10); these have now grown to ~175 seconds (384.69). After the 13th GPU, the initialization time for each additional GPU follows an exponential curve. The first driver version where this behavior occurs seems to be 375.20, and the delays are magnified further in newer driver versions.
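To sanity-check that the per-GPU delays really grow geometrically (rather than, say, quadratically), I fit a log-linear least-squares line over the per-GPU times. The numbers below are illustrative placeholders shaped like the post-13th-GPU blow-up described above (the real measurements are in the attached graphs); the fit returns the factor by which each additional GPU multiplies the initialization time.

```python
import math

# Hypothetical per-GPU init times (seconds) for GPUs 12..16; illustrative
# values only -- the real numbers are in the attached graphs.
times = {12: 2.0, 13: 4.1, 14: 8.3, 15: 16.0, 16: 33.0}

def fit_growth_factor(samples):
    """Least-squares fit of log(t) = a + n*log(b) over (gpu_index, time)
    pairs; returns b, the multiplicative growth factor per extra GPU."""
    n = len(samples)
    xs = list(samples)
    ys = [math.log(samples[x]) for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(slope)

print(f"per-GPU growth factor ~ {fit_growth_factor(times):.2f}")
```

A growth factor clearly above 1 (here roughly 2, i.e. each extra GPU doubles the delay) is what distinguishes exponential from polynomial growth.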

As a test application we run nvidia-smi without any additional parameters. The nvidia-smi process uses 100% of the CPU for most of the initialization period, and the system is unresponsive until initialization has finished.

Our systems run in a PCIe-switched environment; our hypothesis is that the problem is caused either by GPU P2P communication that times out (interrupt timeout, polling the GPU in a busy-wait loop) or by resource exhaustion (not being able to map enough IOMMU memory).

Our base OS is Fedora; we have performed extensive tests (all recent driver versions: 375, 378, 381, 384) on two different Fedora versions, FC23 and FC26, using different kernel versions (4.2.3 & 4.11.8). The same problem occurs on Ubuntu 16.04 (stock kernel) and Debian (kernel 3.16.0).

I’ve profiled and straced the latest driver (384.69); most of the time seems to be spent in the recursive _nv028404rm() function while opening an NVIDIA device, e.g. /dev/nvidia15.

Is it possible, for example, to disable the P2P communication/peer discovery between GPUs, as this is not required for our application? I’ve tried all the driver options (NVreg) but none of them seem to make a difference to the initialization timings.

Could you explain the differences in loading times, and in the initialization routines, between the driver versions that seem to cause this behavior?
Or could you address the cause of this issue (i.e. provide a solution)?

We can provide more details/debug reports you might need, preferably through email.


I’ve compiled two graphs based on the per-GPU initialization times recorded in dmesg: one for the 375.10 driver and one for the 384.69 driver.

Note the difference in loading times…

Can we get a sample application to reproduce this issue? Also, please attach the nvidia-bug-report log file to your existing post. What is the minimum number of GPUs with which we can reproduce this issue: 2, 3, or more?

The sample application is nvidia-smi, which is included with the driver.

I’ve sent a bug report to ‘linux-bugs@nvidia.com’ with the subject “Bugreport: Exponential initialization timings with multiple GPUs” (dated 7 September); you can find it using my email address.

The problem really becomes noticeable above 11 GTX 980 GPUs. It also occurs with Tesla M2090 GPUs, only with smaller delays between GPUs (yet still exponential).

I’m happy to help with any debugging / troubleshooting efforts.


May I know how you are testing and calculating the load and initialization times? Is persistence mode enabled? Please provide, in detail, the reproduction steps you are following to trigger this issue.

I measure the initialization time by loading the driver with the “NVreg_ResmanDebugLevel=0” option, which enables debug output. The initialization times are taken from the driver’s debug kernel messages (dmesg), where the message “NVRM: opening device bearing minor number” is printed as the driver accesses each GPU during initialization (loading the kernel onto the GPU?).
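For reference, this is roughly how I turn those dmesg lines into per-GPU deltas. The sample lines and the `[seconds]` timestamp format are assumptions for illustration; only the “NVRM: opening device bearing minor number” message text is taken from the actual log.

```python
import re

# Sample dmesg excerpt (timestamps are made up for illustration; real
# output comes from loading the driver with NVreg_ResmanDebugLevel=0).
SAMPLE_DMESG = """\
[  100.10] NVRM: opening device bearing minor number 0
[  101.30] NVRM: opening device bearing minor number 1
[  103.90] NVRM: opening device bearing minor number 2
[  109.00] NVRM: opening device bearing minor number 3
"""

PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\] NVRM: opening device bearing minor number (\d+)")

def per_gpu_init_times(dmesg_text):
    """Return (minor, seconds-since-previous-open) pairs, i.e. how long
    each GPU after the first took to start opening."""
    events = [(int(m.group(2)), float(m.group(1)))
              for m in PATTERN.finditer(dmesg_text)]
    return [(minor, ts - prev_ts)
            for (_, prev_ts), (minor, ts) in zip(events, events[1:])]

if __name__ == "__main__":
    for minor, delta in per_gpu_init_times(SAMPLE_DMESG):
        print(f"GPU {minor}: {delta:.2f} s")
```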

The reproduction steps are simple: on a clean Linux installation, install the driver and run nvidia-smi. The call to nvidia-smi takes significantly longer with the 384.69 driver than with the 375.10 driver.

Persistence mode is not enabled in my current tests. When I use the persistence daemon, the initialization is done by the daemon and yields the same timing results.

Can you suggest a different application to test the initialization timings, apart from nvidia-smi or the persistence-daemon?


Are there any additional tests we can do to help you reproduce/debug this problem? This also relates to the question in my previous post about using a different testing application.


How many Tesla GPUs are you running? In the Tesla case, at what point do you start noticing the problem?

These are the figures for the Tesla GPUs:

As I said, the problem of exponential timings still occurs.


We tried to reproduce this on two configs with the r384 driver but were not able to reproduce the issue.

Config 1.
OS: Ubuntu 14.04.5
GPU: K80 ×8
Driver: 384
Machine: Supermicro SYS-4028GR-TR2
Time: 28 seconds (pm=0) / 0.5 seconds (pm=1)

Config 2.
OS: Ubuntu 16.04.3
GPU: GTX 1080 Ti ×4 + Titan Xp ×4
Driver: 384
Machine: Supermicro SYS-4028GR-TR
Time: 14 seconds (pm=0) / 0.25 seconds (pm=1)

It looks like this issue is specific to your config/setup or GPUs.

Thank you for the effort in trying to reproduce our issue; it is much appreciated. However, the fact that you are unable to reproduce it comes down to two factors:

  1. The number of GPUs used is not enough to make the problem visible; as you can see in the graphs above, the bug only becomes noticeable when more than 11 GPUs are used.
  2. Our systems depend heavily on a PCIe infrastructure with multiple layers of PCIe switches, whereas the Supermicro SYS-4028GR-TR2 system appears to include none.

My questions:

  1. Is it in any way possible to disable GPUDirect / RDMA / all P2P communication between the GPUs, as this might resolve the problem?
  2. Do you have any switched PCIe system available which can take 11 GPUs or more (e.g. Dell C410x or One Stop Systems CA16007)?
  3. Could you run the p2pBandwidthLatencyTest from the CUDA toolkit samples on your setup? It would be nice if I could compare your latency results with mine.

I will try to reproduce this issue on a system with GTX 1080 GPUs and on a system with fewer PCIe switches. I will also try the latest 384.90 driver.


I’ve been able to reproduce this issue both on a system containing GTX 1080 GPUs and on a system using fewer PCIe switches. The issue still persists.

A system containing 18 GTX 1080 GPUs takes 39 seconds to load nvidia-smi with the 375.10 driver, but 19 minutes with the 384.69 driver.
Attached you can find the P2P bandwidth and latency test results.

Is it possible to disable P2P communication when initializing the driver?


p2pBandwidthLatencyTest_18GPU_GTX1080_debian.txt (29.4 KB)

I’ve tested the setup again using the latest driver 387.12.

The latest driver seems to exacerbate the problem even further: the last GPU takes 500 seconds to initialize, as opposed to 300 seconds with 384.69 and 5 seconds with 375.10…

Could you please answer my previous questions related to p2p and switches?


The problem still persists: initialization takes 1290.18 seconds with the 390.12 driver and 783.69 seconds with the 384.111 driver.

Any pointers to the possible underlying problem or a solution are appreciated.


In the meantime I’ve been testing different drivers. I’ve also tested 387.22, 387.34, and 390.25, where the problem still exists.

However, from 390.42 up to the latest driver, the problem is fixed :)!
I’ve also attached my test results for the 390.48, 390.59, 396.18, and 396.24 drivers.

Typical cumulative loading times are now well under a minute (approximately 20 seconds), as opposed to more than 20 hours. Good job, and thank you very much for your efforts!