Painfully long driver initialization with many GPUs -- affects ALL drivers (Nvidia, please do something)

EDIT 2018-09-01: This bug was fixed in 390.x but then was reintroduced in 396.x Beta :( … if you suffer from this bug, stick to 390.x (I’m using 390.48).

Ubuntu 16.04 here, tried many kernels too, but it happened with other distros. I’m taking the time to post this after many months of frustration.

I have motherboards with as many as 13 GPUs but also have mobos with 7 or 8 GPUs.

The nvidia driver seems to load the GPUs sequentially and it adds an exponentially increasing delay between each card. For 13 GPUs it used to take about 40 seconds on the v384 drivers and older, which is already very long! Now with the release of v387 and 390 drivers it takes a whopping 81 seconds!

Here’s a kernel log:

Jan 21 07:51:52 m8 kernel: [197026.537238] nvidia-modeset: Allocated GPU:0 (GPU-177a47dc-f3f8-b480-0f2a-e223c6874e91) @ PCI:0000:01:00.0
Jan 21 07:51:53 m8 kernel: [197027.475759] nvidia-modeset: Allocated GPU:1 (GPU-943a5e69-78c3-51a5-c0ed-d7f314655bab) @ PCI:0000:02:00.0
Jan 21 07:51:54 m8 kernel: [197028.379958] nvidia-modeset: Allocated GPU:2 (GPU-fb4e4bae-5192-590f-3151-b793f2aaaec6) @ PCI:0000:03:00.0
Jan 21 07:51:55 m8 kernel: [197029.338908] nvidia-modeset: Allocated GPU:3 (GPU-e7a64c57-e302-52d7-908a-3c10b3d33544) @ PCI:0000:04:00.0
Jan 21 07:51:56 m8 kernel: [197030.292160] nvidia-modeset: Allocated GPU:4 (GPU-af125524-a38d-aa30-849a-210db7f73f2f) @ PCI:0000:05:00.0
Jan 21 07:51:57 m8 kernel: [197031.325380] nvidia-modeset: Allocated GPU:5 (GPU-8b7a4614-8f29-9c2f-0f1c-aa15d87bb934) @ PCI:0000:06:00.0
Jan 21 07:51:59 m8 kernel: [197032.623103] nvidia-modeset: Allocated GPU:6 (GPU-5e72c389-98bc-9af3-c875-dda1baa09120) @ PCI:0000:09:00.0
Jan 21 07:52:00 m8 kernel: [197034.333221] nvidia-modeset: Allocated GPU:7 (GPU-095a5f5c-a8a9-f5a3-d55a-0153f8ed9e1f) @ PCI:0000:0a:00.0
Jan 21 07:52:03 m8 kernel: [197036.970865] nvidia-modeset: Allocated GPU:8 (GPU-d2c07689-13c9-fd1f-d3b3-f1c52e419114) @ PCI:0000:0b:00.0
Jan 21 07:52:08 m8 kernel: [197041.947771] nvidia-modeset: Allocated GPU:9 (GPU-b77a08ed-2317-f267-7555-cb2fcce58f81) @ PCI:0000:0c:00.0
Jan 21 07:52:17 m8 kernel: [197051.463534] nvidia-modeset: Allocated GPU:10 (GPU-4fd72b20-d503-e310-827c-3e5b5873e162) @ PCI:0000:0d:00.0
Jan 21 07:52:36 m8 kernel: [197070.371853] nvidia-modeset: Allocated GPU:11 (GPU-3143f61f-0f5d-1fde-17bf-9949bc461857) @ PCI:0000:0e:00.0
Jan 21 07:53:13 m8 kernel: [197107.221568] nvidia-modeset: Allocated GPU:12 (GPU-335ea0b6-06b6-29b7-4a67-da5d47324df7) @ PCI:0000:0f:00.0

Note the increasing delay between successive GPU allocations: roughly 1, 1, 1, 1, 1, 1, 2, 3, 5, 9, 19, and 37 seconds … about 81 seconds between allocating GPU 0 and GPU 12.
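For reference, the per-GPU delays can be extracted directly from the bracketed kernel timestamps (seconds since boot). A small Python sketch using the log lines from this post:

```python
import re

# Kernel timestamps copied from the "Allocated GPU:N" lines above.
log = """
[197026.537238] nvidia-modeset: Allocated GPU:0
[197027.475759] nvidia-modeset: Allocated GPU:1
[197028.379958] nvidia-modeset: Allocated GPU:2
[197029.338908] nvidia-modeset: Allocated GPU:3
[197030.292160] nvidia-modeset: Allocated GPU:4
[197031.325380] nvidia-modeset: Allocated GPU:5
[197032.623103] nvidia-modeset: Allocated GPU:6
[197034.333221] nvidia-modeset: Allocated GPU:7
[197036.970865] nvidia-modeset: Allocated GPU:8
[197041.947771] nvidia-modeset: Allocated GPU:9
[197051.463534] nvidia-modeset: Allocated GPU:10
[197070.371853] nvidia-modeset: Allocated GPU:11
[197107.221568] nvidia-modeset: Allocated GPU:12
"""

# Pull the [seconds] timestamps and compute the gap between consecutive GPUs.
ts = [float(m.group(1)) for m in re.finditer(r"\[(\d+\.\d+)\]", log)]
deltas = [round(b - a, 1) for a, b in zip(ts, ts[1:])]
print("per-GPU delays:", deltas)
print("total:", round(ts[-1] - ts[0], 1), "seconds")
```

Running this on the log above gives delays of roughly 0.9, 0.9, 1.0, 1.0, 1.0, 1.3, 1.7, 2.6, 5.0, 9.5, 18.9, 36.8 and a total of about 80.7 seconds.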

This is very painful because you can’t do anything with any of the GPUs until all of them are loaded. You always have to wait those ~80 seconds: X is delayed, nvidia-smi is delayed, and so on. Worse, if you launch any GPU app while the driver is still loading the GPUs, it takes twice as long: the app doesn’t see the driver as initialized and triggers the whole initialization sequence a second time.

nvidia-persistenced doesn’t help either … it just triggers the same process, which takes the same amount of time (while all other GPU apps are blocked). The machine boots in under 10 seconds, but then I have to wait 1.5 minutes before doing anything GPU related, including starting X.

Can Nvidia devs PLEASE do something about this? I was hoping newer drivers would fix it, but they made it worse!

Also, could the driver allocate the GPUs in parallel, and without delays?



You’re unlikely to receive an answer without providing the nvidia-bug-report.log

I added the report generated by nvidia-bug-report.sh at the end of the 1st post.

Bump? Can a very kind Nvidia dev look at this? I’ll provide beer/chocolate/whatever would keep you going.

I wonder if it’s possible to have multiple copies of the driver running at the same time, each loading one GPU via the NVreg_AssignGpus module option … I’m not an expert though; could someone shed some light on whether this is possible, and how? Would it need inserting multiple kernel modules, different in name but otherwise identical? How? An example I could try would be great. I’ll experiment when I get to the machine.
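For anyone who wants to experiment along these lines, here is a sketch of what I have in mind. The first command just verifies the option exists on your driver; the modprobe.d line is only my guess at the format (a comma-separated list of PCI bus IDs), so treat it as an assumption and check `modinfo nvidia` on your own install first:

```shell
# Confirm the module parameter exists on the installed driver:
modinfo nvidia | grep -i assign

# Hypothetical /etc/modprobe.d/nvidia.conf entry restricting this module
# instance to two GPUs by PCI bus ID (format assumed, verify via modinfo):
# options nvidia NVreg_AssignGpus="0000:01:00.0,0000:02:00.0"
```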

This is being looked into and there are some changes going in to mitigate at least some of the delay. It’s being tracked in bug 2010268.

Hi Aaron, that’s good to hear, thanks! Note that 387 and 390 drivers have doubled the waiting time from 40 seconds to 81. Halving it back to 40 means it’ll be as bad as it used to be, but 40 sec is still way too much.

The driver adds an exponentially growing delay when allocating each new GPU: the delay roughly doubles with every allocation (you can time this); the first post shows about 2, 3, 5, 9, 19, and 37 seconds for the last 6 of the 13 GPUs.
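A quick back-of-the-envelope check of the doubling claim (a rough model of the observed numbers, not of driver internals): assume the gap is about 1 s for the first six allocations and then doubles with each further allocation starting from about 1.25 s. The total lands close to the observed ~81 seconds:

```python
# Rough model of the measured allocation gaps for 13 GPUs (12 gaps):
# roughly constant at first, then doubling with every further allocation.
flat = [1.0] * 6                             # first six gaps: ~1 s each
doubling = [1.25 * 2**k for k in range(6)]   # then 1.25, 2.5, 5, 10, 20, 40
total = sum(flat) + sum(doubling)
print(total)  # -> 84.75, in the ballpark of the observed ~81 s
```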

Could you advise on my question just above: multiple driver instances, with a single GPU assigned per instance? I found documentation about it in the old 331 notes, but not in 384, 387, or 390. However, the NVreg_AssignGpus option is still there in 390. I’d be grateful for some guidance.

I also remember the installer option to generate enumerated driver copies for that kind of use case; it seems to be gone in newer drivers, or left undocumented.

Indeed, but the NVreg_AssignGpus module option is still there (try “modinfo nvidia”). I would very much like the ability to restart a single GPU without affecting the others. Right now nvidia-smi requires stopping all GPU apps on all GPUs … it’s complete overkill to have to stop 8+ GPUs and lose all the work so far just because one of them crapped out for some reason. The NVreg_AssignGpus module option was looking promising. Can anyone from Nvidia please assist?

I’ve taken a look at an older driver version, and I think this was removed because multiple driver instances break UMA, which breaks CUDA, so it’s useless.

Did you mean to say NUMA or CUDA? Any link to the page explaining why it was removed?

I don’t care about NUMA for my use case, and I don’t see why it would break CUDA. CUDA would only see the GPUs that the respective driver instance exposes, which is controlled by the NVreg_AssignGpus module option. I might be missing something.

Nvidia UMA (Unified Memory Access):
I simply downloaded an old 331 driver and started the installer with the -A option; the old multiple-drivers option was there, and its explanation stated that it doesn’t work with UMA. Newer CUDA versions rely on this AFAIK, so the option was removed, I suspect, as it’s useless now.

Thanks. Still sounds a bit speculative though: any links where it’s stated that newer CUDA 8/9 requires UMA? The 331 drivers worked with CUDA 6, when UMA was already available.

The driver option NVreg_AssignGpus is still there, suggesting it’s still possible to assign a subset of GPUs to a driver instance.

Perhaps someone from Nvidia can confirm?

NVreg_AssignGpus is also there for other use cases like PCI passthrough.

Hi @aplattner - any news on this? You mentioned it’s tracked in bug 2010268. Is that something we can see?

@aplattner OK, this WAS fixed in 390.x, but then you guys reintroduced the bug in 396.x and it’s now even worse … up to 3 full minutes! 396.x is still beta; hopefully you’ll fix it again.

I can confirm that this is probably down to the driver version :)
Please fix it

Could you confirm exactly which 396.x you are referring to and also attach nvidia-bug-report.log.gz?

We tested both drivers and shipped them in a flashable image.

Both show very slow responses from nvidia-settings and nvidia-smi while computation is taking place.
We tested this with a GTX 1070 and a GTX 1060, with the same result.
So the 4.17.19 kernel with either 396.54 or 390.87 has the same problem.

On kernel 4.13.10 + nvidia 387.22 it works fine (the same PC with the same GPUs).

If you want, you can download our operating system image, based on Ubuntu, with our precompiled kernel and the Nvidia and AMD drivers.
Thanks for answering :)