Nvidia driver-container does not work after restart

hi,

I am trying to use driver containers so that I don't need to worry about installing the GPU driver on the host machine. I have successfully followed the instructions in this installation guide.

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Wed Jan 27 11:14:30 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207…    On   | 00000000:01:00.0 Off |                  N/A |
| 60%   32C    P8    15W / 215W |      1MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But when I restart the host machine or restart the driver container, it stops working.

Executing sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi gives this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
ERRO[0000] error waiting for container: context canceled

And executing sudo docker exec nvidia-driver nvidia-smi gives this error:

Error response from daemon: Container 428761d47496f170a3661660f9551b0286c0cdf359371a9f8eca39951e31fca1 is restarting, wait until the container is running

If I check the running containers using sudo docker container ps, the driver container shows up in the list.
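Even while the container is stuck in its restart loop, its output can still be inspected with the standard Docker logs command (assuming the container is still named nvidia-driver, as below) — presumably the same output as the driver-container logs shown further down:

sudo docker logs nvidia-driver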

Stopping the container, removing it, and running it again from the driver-container image as follows makes it work (I need to do this every time the host machine is restarted):

sudo docker container stop nvidia-driver

sudo docker container rm nvidia-driver

sudo docker run --name nvidia-driver -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  -v /var/log:/var/log \
  --restart=unless-stopped \
  nvidia/driver:450.80.02-ubuntu18.04
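Until there is a proper fix, these three steps could also be wrapped in a small script and run after each boot — a rough sketch only; the script name is hypothetical, and how it is triggered at boot (e.g. a cron @reboot entry or a systemd oneshot unit) is up to you:

#!/bin/sh
# recreate-nvidia-driver.sh (hypothetical name): recreate the driver container
# after a reboot instead of letting Docker restart the stale one.
set -e

# Remove the old container if it exists; ignore errors if it does not.
docker container stop nvidia-driver 2>/dev/null || true
docker container rm nvidia-driver 2>/dev/null || true

# Start a fresh driver container, same options as above.
docker run --name nvidia-driver -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  -v /var/log:/var/log \
  --restart=unless-stopped \
  nvidia/driver:450.80.02-ubuntu18.04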

Driver-container logs the first time it is run:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.4.0-64-generic

Stopping NVIDIA persistence daemon…
Unloading NVIDIA driver kernel modules…
Unmounting NVIDIA driver rootfs…
Checking NVIDIA driver packages…
Updating the package cache…
Resolving Linux kernel version…
Proceeding with Linux kernel version 5.4.0-64-generic
Installing Linux kernel headers…
Installing Linux kernel module files…
Generating Linux kernel version string…
Compiling NVIDIA driver kernel modules…
/usr/src/nvidia-450.80.02/kernel/nvidia/nv-mmap.c: In function ‘nv_encode_caching’:
/usr/src/nvidia-450.80.02/kernel/nvidia/nv-mmap.c:334:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (NV_ALLOW_CACHING(memory_type))
^
/usr/src/nvidia-450.80.02/kernel/nvidia/nv-mmap.c:336:9: note: here
default:
^~~~~~~
Relinking NVIDIA driver kernel modules…
Building NVIDIA driver package nvidia-modules-5.4.0…
Cleaning up the package cache…
Installing NVIDIA driver kernel modules…

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

ERROR: Unable to open ‘kernel/dkms.conf’ for copying (No such file or directory)

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 12 CPUs online; setting concurrency level to 12.
Installing NVIDIA driver version 450.80.02.
A precompiled kernel interface for kernel ‘5.4.0-64-generic’ has been found here: ./kernel/precompiled/nvidia-modules-5.4.0.
Kernel module linked successfully.
Kernel module linked successfully.
Kernel module unpacked successfully.
Kernel messages:
[ 1132.420665] pcieport 0000:00:1d.7: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1132.420672] pcieport 0000:00:1d.7: AER: device [8086:a337] error status/mask=00001000/00002000
[ 1132.420678] pcieport 0000:00:1d.7: AER: [12] Timeout
[ 1132.429880] pcieport 0000:00:1d.7: AER: Corrected error received: 0000:00:1d.7
[ 1132.429901] pcieport 0000:00:1d.7: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1132.429909] pcieport 0000:00:1d.7: AER: device [8086:a337] error status/mask=00001000/00002000
[ 1132.429914] pcieport 0000:00:1d.7: AER: [12] Timeout
[ 1132.450789] pcieport 0000:00:1d.7: AER: Corrected error received: 0000:00:1d.7
[ 1132.450809] pcieport 0000:00:1d.7: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1132.450817] pcieport 0000:00:1d.7: AER: device [8086:a337] error status/mask=00001000/00002000
[ 1132.450823] pcieport 0000:00:1d.7: AER: [12] Timeout
[ 1156.240465] IPMI message handler: version 39.2
[ 1156.241157] ipmi device interface
[ 1156.248526] nvidia: loading out-of-tree module taints kernel.
[ 1156.248531] nvidia: module license ‘NVIDIA’ taints kernel.
[ 1156.248532] Disabling lock debugging due to kernel taint
[ 1156.254469] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1156.260695] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 1156.261219] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1156.304176] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 1156.307380] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 1156.308578] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 1156.309967] nvidia-modeset: Unloading
[ 1156.541147] nvidia-uvm: Unloaded the UVM driver.
[ 1156.569918] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
Installing ‘NVIDIA Accelerated Graphics Driver for Linux-x86_64’ (450.80.02):
Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
Checking: [##############################] 100%
Post-install sanity check passed.
Running runtime sanity check:
Checking: [##############################] 100%
Runtime sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 450.80.02) is now complete.

Loading ipmi and i2c_core kernel modules…
Loading NVIDIA driver kernel modules…
Starting NVIDIA persistence daemon…
Mounting NVIDIA driver rootfs…
Done, now waiting for signal

Driver-container logs after a reboot or after stopping and starting the container:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.4.0-64-generic

Stopping NVIDIA persistence daemon…
Unloading NVIDIA driver kernel modules…
Unmounting NVIDIA driver rootfs…
Checking NVIDIA driver packages…
Found NVIDIA driver package nvidia-modules-5.4.0
Installing NVIDIA driver kernel modules…

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

ERROR: Unable to open ‘kernel/dkms.conf’ for copying (No such file or directory)

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 12 CPUs online; setting concurrency level to 12.
Installing NVIDIA driver version 450.80.02.
A precompiled kernel interface for kernel ‘5.4.0-64-generic’ has been found here: ./kernel/precompiled/nvidia-modules-5.4.0.
Kernel module linked successfully.
Kernel module linked successfully.
Kernel module unpacked successfully.
Kernel messages:
[ 1866.707275] nvidia-modeset: Unloading
[ 1866.936920] nvidia-uvm: Unloaded the UVM driver.
[ 1866.966199] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
[ 1867.378962] docker0: port 1(veth45e93da) entered disabled state
[ 1867.379018] vethc21530f: renamed from eth0
[ 1867.472738] docker0: port 1(veth45e93da) entered disabled state
[ 1867.477967] device veth45e93da left promiscuous mode
[ 1867.477969] docker0: port 1(veth45e93da) entered disabled state
[ 1867.587483] docker0: port 1(veth1baabd5) entered blocking state
[ 1867.587486] docker0: port 1(veth1baabd5) entered disabled state
[ 1867.587623] device veth1baabd5 entered promiscuous mode
[ 1867.587908] docker0: port 1(veth1baabd5) entered blocking state
[ 1867.587911] docker0: port 1(veth1baabd5) entered forwarding state
[ 1867.876683] eth0: renamed from veth881020b
[ 1867.901027] IPv6: ADDRCONF(NETDEV_CHANGE): veth1baabd5: link becomes ready
[ 1868.287640] IPMI message handler: version 39.2
[ 1868.288364] ipmi device interface
[ 1868.306826] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 1868.307567] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[ 1868.350560] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 1868.353675] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 1868.354894] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 1868.356270] nvidia-modeset: Unloading
[ 1868.572817] nvidia-uvm: Unloaded the UVM driver.
[ 1868.610280] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
Parsing log file:
Parsing: [##############################] 100%

ERROR: The file ‘/lib/modules/5.4.0-64-generic/kernel/drivers/video/nvidia.ko’ already exists as part of this driver installation.

ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Stopping NVIDIA persistence daemon…
Unloading NVIDIA driver kernel modules…
Unmounting NVIDIA driver rootfs…


This is affecting me also. Did you make any progress?

OK, I resolved this issue. The source code for the NVIDIA image with the init bug in it is available in NVIDIA's GitLab repository; you can fix the bug in the nvidia-driver script and then repackage your own bug-free Docker image.
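Roughly, the repackaging can look like this (a sketch only: it assumes you have cloned the driver image sources from NVIDIA's GitLab, edited the nvidia-driver script locally, and that the script is installed at /usr/local/bin/nvidia-driver inside the image, which may differ between driver image versions):

# Dockerfile: overlay the locally patched nvidia-driver script on the stock image
FROM nvidia/driver:450.80.02-ubuntu18.04
COPY nvidia-driver /usr/local/bin/nvidia-driver
RUN chmod +x /usr/local/bin/nvidia-driver

Then build it and use the resulting image in place of the stock one:

sudo docker build -t nvidia/driver:450.80.02-ubuntu18.04-patched .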

Hi, can you please provide details? I have the same issue.

Anyone? Is there really no solution on this forum?

Having the same issue. @anon60771820, could you please share your solution?

@anon60771820, could you possibly share your fix? I am looking at the Git repository, but I'm not sure which part of install.sh is wrong, or whether I'm even looking in the right place.

@nikolay.kalachev @michael.thomas @shaul6q8an

Sorry guys, I didn't finish it. I can still get the driver container working, and once the nvidia-driver container is running you can bring up NVIDIA containers that use GPU acceleration with no worries.

I hit a deeper, more fundamental problem. The NVIDIA driver in a container is mostly useful in a production-quality multi-container system. Common sense would dictate that you use Docker Compose or a similar orchestration system to start the nvidia-driver container first, and only afterwards start your other Docker-based services that depend on the nvidia-driver container being up and operational.

Now, when you run the nvidia-driver container you are running it in privileged mode, and the NVIDIA container runtime uses these elevated privileges to load the kernel modules for the CUDA driver into the host's kernel after compiling them each time. They need to be recompiled each time because there is no mechanism to know which kernel version the modules were compiled against the last time the nvidia-driver container was run. If you program your CUDA container to start after the nvidia-driver container, it will do exactly that, but the kernel modules will not yet be compiled and you will get the same "OCI runtime create failed" error as if the driver were not installed. A minute later the nvidia-driver container will be up and the kernel modules hooked in.
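As a sketch of that ordering problem (not a tested configuration — the service names are made up, compose file format 2.3 is assumed so that the runtime option is available, and the CUDA service simply restarts until the driver container has finished building and loading the modules):

version: "2.3"
services:
  nvidia-driver:
    image: nvidia/driver:450.80.02-ubuntu18.04
    privileged: true
    pid: host
    volumes:
      - /run/nvidia:/run/nvidia:shared
      - /var/log:/var/log
    restart: unless-stopped
  cuda-app:
    image: nvidia/cuda:11.0-base
    runtime: nvidia          # requires the NVIDIA container runtime on the host
    command: nvidia-smi
    depends_on:
      - nvidia-driver
    # Fails with the OCI error until the modules are loaded, then succeeds on retry.
    restart: on-failure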

I will try to put this work up somewhere so someone else can finish it. The root problem is the unclear policy for how drivers should be handled in containers (Docker etc.), as well as the NVIDIA developers not having thought through how someone might actually make use of the nvidia-driver container.