I notice an extreme difference in initialization timings between driver versions when using multiple GPUs.
GPU initialization typically takes some time on the first run. With driver version 375.10 we were able to address this using the persistence daemon (the second run is fast).
However, newer driver versions, e.g. 384.69, appear to have a load time that is exponential in the number of GPUs attached to a single host, and these timings are no longer acceptable.
The same exponential behavior seems to be present in both driver versions (375.10 & 384.69), but it is exaggerated in the latter. When using 16 Nvidia GTX 980 GPUs in a single system, we used to experience loading times of up to ~30 seconds (375.10); these have now grown to ~175 seconds (384.69). After the 13th GPU, the initialization time for each additional GPU follows an exponential function. The first driver where this behavior occurs seems to be 375.20, and it is magnified again in the newer driver versions.
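For anyone who wants to check the "exponential" claim against their own measurements, here is a minimal sketch that fits t = A * exp(b * n) to (GPU count, seconds) pairs using a least-squares fit on ln(t). The function name `fit_exponential` is my own, and no measurements are included; you would pass in timings collected on your own host.

```python
import math

def fit_exponential(samples):
    """Least-squares fit of t = A * exp(b * n) via a linear fit of ln(t) against n.

    samples: iterable of (gpu_count, seconds) pairs with seconds > 0 and
    at least two distinct GPU counts.
    Returns (A, b); exp(b) is the factor by which the time grows per extra GPU.
    """
    points = [(n, math.log(t)) for n, t in samples]
    count = len(points)
    mean_n = sum(n for n, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    # Standard simple linear regression of y = ln(t) on n.
    sxx = sum((n - mean_n) ** 2 for n, _ in points)
    sxy = sum((n - mean_n) * (y - mean_y) for n, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_n
    return math.exp(a), b
```

If the per-GPU growth factor exp(b) stays roughly constant when you fit different sub-ranges (for example GPUs 13 to 16 only), that supports exponential rather than polynomial growth.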
As a test application we run nvidia-smi without any additional parameters. The nvidia-smi process uses 100% of the CPU for most of the initialization period, and the system is unresponsive for a while, until initialization has ended.
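A minimal sketch of how the timing can be reproduced, assuming nvidia-smi is on the PATH. The helper name `time_command` is my own; it only measures the wall-clock time of the whole process and does not break the time down per GPU.

```python
import subprocess
import time

def time_command(argv):
    """Run argv once and return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode

if __name__ == "__main__":
    # The first run includes GPU initialization; repeated runs show whether
    # the persistence daemon keeps the driver state loaded.
    for attempt in range(1, 4):
        seconds, code = time_command(["nvidia-smi"])
        print(f"run {attempt}: {seconds:.1f} s (exit code {code})")
```

Running this once per driver version, with the same number of GPUs attached, gives directly comparable numbers.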
Our system runs a PCI-e switched environment. Our hypothesis is that the problem is caused either by GPU p2p communication that times out (interrupt timeout, polling the GPU in a busy-wait loop) or by resource exhaustion (not being able to address enough IOMMU memory).
Our base OS is Fedora. We have performed extensive tests (all recent driver versions: 375, 378, 381, 384) on two different Fedora versions, FC23 and FC26, using different kernel versions (4.2.3 & 4.11.8). The same problem occurs on Ubuntu 16.04 (stock kernel) and Debian (kernel 3.16.0).
I’ve profiled and straced the latest driver (384.69), and most of the time seems to be spent in the recursive _nv028404rm() function while opening an Nvidia device, e.g. /dev/nvidia15.
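A rough sketch of how the strace output can be summarized per device, assuming the trace was captured with something like `strace -f -T -o trace.txt nvidia-smi` (`-T` appends each call's duration in angle brackets). The function name `time_per_device` is my own, and it only counts calls that name a /dev/nvidia* path in quotes, i.e. the open/openat calls, not later ioctls on the file descriptor.

```python
import re
from collections import defaultdict

# Matches a quoted NVIDIA device path, e.g. "/dev/nvidia15" or "/dev/nvidiactl".
DEVICE_RE = re.compile(r'"(/dev/nvidia[^"]*)"')
# Matches the per-call duration that `strace -T` appends, e.g. <11.234567>.
DURATION_RE = re.compile(r'<(\d+\.\d+)>\s*$')

def time_per_device(lines):
    """Sum the syscall durations of all trace lines that name a device path."""
    totals = defaultdict(float)
    for line in lines:
        device = DEVICE_RE.search(line)
        duration = DURATION_RE.search(line)
        if device and duration:
            totals[device.group(1)] += float(duration.group(1))
    return dict(totals)

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as trace:
        totals = time_per_device(trace)
    # Print the slowest devices first.
    for path, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{seconds:10.3f} s  {path}")
```

Calls that strace splits into "unfinished" and "resumed" lines are skipped, since the resumed line no longer carries the path, so treat the totals as a lower bound.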
Is it possible, for example, to disable the p2p communication/peer discovery between GPUs, as this is not required for our application? I’ve tried all the driver options (NVreg), but these do not seem to make a difference to the initialization timings.
Could you provide an explanation of the differences in loading times, and in the initialization routines, between driver versions that seem to cause this behavior?
Or could you address the cause of this issue (provide a solution)?
We can provide more details/debug reports you might need, preferably through email.