Nvidia driver-container does not work after restart

Hi,

I am trying to use driver containers so that I don't need to worry about installing the GPU driver on the host machine. I have successfully followed the instructions in this installation guide.

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Wed Jan 27 11:14:30 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207…    On   | 00000000:01:00.0 Off |                  N/A |
| 60%   32C    P8    15W / 215W |      1MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But when I restart the host machine or restart the driver container, it stops working.

Running sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi then gives this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
ERRO[0000] error waiting for container: context canceled

And running sudo docker exec nvidia-driver nvidia-smi gives this error:

Error response from daemon: Container 428761d47496f170a3661660f9551b0286c0cdf359371a9f8eca39951e31fca1 is restarting, wait until the container is running

If I check the running containers with sudo docker container ps, the driver container still shows in the list.
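
Because the container appears in the docker container ps list even while it is stuck restarting, the list alone does not show whether the driver is usable. A minimal sketch of a more direct check follows, assuming the container name nvidia-driver used in this post. The function names and the DOCKER override variable are hypothetical helpers for illustration, not part of any NVIDIA tooling.

```shell
#!/bin/sh
# DOCKER can be overridden (for example in tests); by default it matches
# the "sudo docker" invocations used in this post.
DOCKER="${DOCKER:-sudo docker}"

# Print the container state reported by `docker inspect`
# (for example "running", "restarting" or "exited").
# The container name defaults to nvidia-driver (assumption from this post).
driver_container_status() {
    $DOCKER inspect --format '{{.State.Status}}' "${1:-nvidia-driver}"
}

# Succeed only when the driver container reports "running".
driver_container_is_running() {
    [ "$(driver_container_status "$1")" = "running" ]
}
```

It could be used as driver_container_is_running || echo "driver container is not ready". A container in a restart loop can briefly report running between attempts, so a single check is only a hint and may need to be repeated.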

Stopping the container, removing it, and running it again from the driver-container image as follows makes it work (I need to do this every time the host machine is restarted):

sudo docker container stop nvidia-driver

sudo docker container rm nvidia-driver

sudo docker run --name nvidia-driver -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  -v /var/log:/var/log \
  --restart=unless-stopped \
  nvidia/driver:450.80.02-ubuntu18.04
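
The manual steps above could be wrapped in a small script so they do not have to be retyped after every reboot. This is only a sketch of the workaround, not a fix for the underlying problem. The function name and the DOCKER, NAME and IMAGE variables are hypothetical; the container name, image tag and flags are taken from the commands in this post.

```shell
#!/bin/sh
# DOCKER can be overridden, e.g. DOCKER=echo for a dry run that only
# prints the arguments instead of calling docker.
DOCKER="${DOCKER:-sudo docker}"
NAME="nvidia-driver"
IMAGE="nvidia/driver:450.80.02-ubuntu18.04"

recreate_driver_container() {
    # Failures are ignored here: the container may already be stopped or absent.
    $DOCKER container stop "$NAME" || true
    $DOCKER container rm "$NAME" || true

    # Start a fresh container with the same options as the manual command.
    $DOCKER run --name "$NAME" -d --privileged --pid=host \
        -v /run/nvidia:/run/nvidia:shared \
        -v /var/log:/var/log \
        --restart=unless-stopped \
        "$IMAGE"
}
```

After sourcing the file, calling recreate_driver_container performs the stop, remove and run steps in order.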

Driver-container logs the first time it is run:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.4.0-64-generic

Stopping NVIDIA persistence daemon…
Unloading NVIDIA driver kernel modules…
Unmounting NVIDIA driver rootfs…
Checking NVIDIA driver packages…
Updating the package cache…
Resolving Linux kernel version…
Proceeding with Linux kernel version 5.4.0-64-generic
Installing Linux kernel headers…
Installing Linux kernel module files…
Generating Linux kernel version string…
Compiling NVIDIA driver kernel modules…
/usr/src/nvidia-450.80.02/kernel/nvidia/nv-mmap.c: In function ‘nv_encode_caching’:
/usr/src/nvidia-450.80.02/kernel/nvidia/nv-mmap.c:334:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
if (NV_ALLOW_CACHING(memory_type))
^
/usr/src/nvidia-450.80.02/kernel/nvidia/nv-mmap.c:336:9: note: here
default:
^~~~~~~
Relinking NVIDIA driver kernel modules…
Building NVIDIA driver package nvidia-modules-5.4.0…
Cleaning up the package cache…
Installing NVIDIA driver kernel modules…

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

ERROR: Unable to open ‘kernel/dkms.conf’ for copying (No such file or directory)

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 12 CPUs online; setting concurrency level to 12.
Installing NVIDIA driver version 450.80.02.
A precompiled kernel interface for kernel ‘5.4.0-64-generic’ has been found here: ./kernel/precompiled/nvidia-modules-5.4.0.
Kernel module linked successfully.
Kernel module linked successfully.
Kernel module unpacked successfully.
Kernel messages:
[ 1132.420665] pcieport 0000:00:1d.7: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1132.420672] pcieport 0000:00:1d.7: AER: device [8086:a337] error status/mask=00001000/00002000
[ 1132.420678] pcieport 0000:00:1d.7: AER: [12] Timeout
[ 1132.429880] pcieport 0000:00:1d.7: AER: Corrected error received: 0000:00:1d.7
[ 1132.429901] pcieport 0000:00:1d.7: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1132.429909] pcieport 0000:00:1d.7: AER: device [8086:a337] error status/mask=00001000/00002000
[ 1132.429914] pcieport 0000:00:1d.7: AER: [12] Timeout
[ 1132.450789] pcieport 0000:00:1d.7: AER: Corrected error received: 0000:00:1d.7
[ 1132.450809] pcieport 0000:00:1d.7: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 1132.450817] pcieport 0000:00:1d.7: AER: device [8086:a337] error status/mask=00001000/00002000
[ 1132.450823] pcieport 0000:00:1d.7: AER: [12] Timeout
[ 1156.240465] IPMI message handler: version 39.2
[ 1156.241157] ipmi device interface
[ 1156.248526] nvidia: loading out-of-tree module taints kernel.
[ 1156.248531] nvidia: module license ‘NVIDIA’ taints kernel.
[ 1156.248532] Disabling lock debugging due to kernel taint
[ 1156.254469] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1156.260695] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 1156.261219] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1156.304176] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 1156.307380] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 1156.308578] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 1156.309967] nvidia-modeset: Unloading
[ 1156.541147] nvidia-uvm: Unloaded the UVM driver.
[ 1156.569918] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
Installing ‘NVIDIA Accelerated Graphics Driver for Linux-x86_64’ (450.80.02):
Installing: [##############################] 100%
Driver file installation is complete.
Running post-install sanity check:
Checking: [##############################] 100%
Post-install sanity check passed.
Running runtime sanity check:
Checking: [##############################] 100%
Runtime sanity check passed.

Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version 450.80.02) is now complete.

Loading ipmi and i2c_core kernel modules…
Loading NVIDIA driver kernel modules…
Starting NVIDIA persistence daemon…
Mounting NVIDIA driver rootfs…
Done, now waiting for signal

Driver-container logs after a reboot, or after stopping and starting the container:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.4.0-64-generic

Stopping NVIDIA persistence daemon…
Unloading NVIDIA driver kernel modules…
Unmounting NVIDIA driver rootfs…
Checking NVIDIA driver packages…
Found NVIDIA driver package nvidia-modules-5.4.0
Installing NVIDIA driver kernel modules…

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

ERROR: Unable to open ‘kernel/dkms.conf’ for copying (No such file or directory)

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 12 CPUs online; setting concurrency level to 12.
Installing NVIDIA driver version 450.80.02.
A precompiled kernel interface for kernel ‘5.4.0-64-generic’ has been found here: ./kernel/precompiled/nvidia-modules-5.4.0.
Kernel module linked successfully.
Kernel module linked successfully.
Kernel module unpacked successfully.
Kernel messages:
[ 1866.707275] nvidia-modeset: Unloading
[ 1866.936920] nvidia-uvm: Unloaded the UVM driver.
[ 1866.966199] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
[ 1867.378962] docker0: port 1(veth45e93da) entered disabled state
[ 1867.379018] vethc21530f: renamed from eth0
[ 1867.472738] docker0: port 1(veth45e93da) entered disabled state
[ 1867.477967] device veth45e93da left promiscuous mode
[ 1867.477969] docker0: port 1(veth45e93da) entered disabled state
[ 1867.587483] docker0: port 1(veth1baabd5) entered blocking state
[ 1867.587486] docker0: port 1(veth1baabd5) entered disabled state
[ 1867.587623] device veth1baabd5 entered promiscuous mode
[ 1867.587908] docker0: port 1(veth1baabd5) entered blocking state
[ 1867.587911] docker0: port 1(veth1baabd5) entered forwarding state
[ 1867.876683] eth0: renamed from veth881020b
[ 1867.901027] IPv6: ADDRCONF(NETDEV_CHANGE): veth1baabd5: link becomes ready
[ 1868.287640] IPMI message handler: version 39.2
[ 1868.288364] ipmi device interface
[ 1868.306826] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 1868.307567] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[ 1868.350560] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 1868.353675] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 1868.354894] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 1868.356270] nvidia-modeset: Unloading
[ 1868.572817] nvidia-uvm: Unloaded the UVM driver.
[ 1868.610280] nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
Parsing log file:
Parsing: [##############################] 100%

ERROR: The file ‘/lib/modules/5.4.0-64-generic/kernel/drivers/video/nvidia.ko’ already exists as part of this driver installation.

ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Stopping NVIDIA persistence daemon…
Unloading NVIDIA driver kernel modules…
Unmounting NVIDIA driver rootfs…
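
The two logs diverge at one point: the first run builds the nvidia-modules-5.4.0 package, while the second run finds that package already present and the installer then aborts because nvidia.ko already exists. A small sketch for spotting that signature in the container logs follows, assuming the message text stays as shown above. The function name is hypothetical.

```shell
#!/bin/sh
# Succeeds when the log read from stdin contains the "already exists"
# installer error shown in the second log of this post.
has_existing_module_error() {
    grep -q 'already exists as part of this driver installation'
}
```

For example: sudo docker logs nvidia-driver 2>&1 | has_existing_module_error && echo "installer failed on an existing nvidia.ko"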