"driver/library version mismatch: unknown"

OS: Ubuntu 20.04.1 LTS
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
GPUs: 4 * 4090
Last week, Docker was able to build without any issues. Two days later, however, an error appeared: “driver/library version mismatch: unknown”. Many people online suggest that a reboot solves the problem, but we need our application to be stable, so I would like to know what exactly happened and how to avoid such incidents entirely.
Here’s more information:

Attaching to test-lora-sd-webui1, train-lora1
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown

Additionally, here’s nvidia-bug-report.sh output:
nvidia-bug-report.log.gz (1.3 MB)
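
One thing that may be worth checking on the host (an assumption, since nothing in this thread confirms it) is whether a package upgrade, for example from unattended-upgrades, replaced the NVIDIA user-space libraries while the older kernel module stayed loaded. Below is a minimal sketch that lists NVIDIA package upgrades recorded by dpkg. The helper name nvidia_upgrades is made up for illustration, and /var/log/dpkg.log is the default dpkg log location on Ubuntu.

```shell
# Sketch: list NVIDIA package upgrades recorded in a dpkg log.
# dpkg log lines look like:
#   <date> <time> upgrade <package>:<arch> <old-version> <new-version>

# Hypothetical helper. $@: one or more dpkg log files.
nvidia_upgrades() {
    awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $2, $4, $5, "->", $6 }' "$@"
}

# Run on the host, not inside a container. Prints nothing if the log is
# missing or contains no NVIDIA upgrades.
nvidia_upgrades /var/log/dpkg.log 2>/dev/null || true
```

If an upgrade timestamp falls between the last successful container start and the first failure, that would be consistent with this explanation. Rotated logs such as /var/log/dpkg.log.1 can be passed as extra arguments.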

Not to hijack your topic, but I’d like to let you know I’m following this because I have the exact same issue.

Our server happily runs several Deepstream containers day and night. We start and stop containers all the time; some run for weeks, others for minutes. Everything seems fine, but sometimes it suddenly becomes impossible to start new containers, and Docker gives the exact error you describe.

All we can do to fix it is a hard reboot, which takes all our running analysis containers offline for a few minutes.

Sometimes we have to do this on a daily basis, sometimes we don’t have to for weeks. It’s a really strange problem. I’d like to know how to further debug this issue if anyone knows.
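
A possible way to debug this further, sketched under the assumption that the error means the loaded kernel module and the installed NVML library (libnvidia-ml) report different driver versions: compare the version in /proc/driver/nvidia/version with the version in the file name that libnvidia-ml.so.1 resolves to. The function names are hypothetical, and the check has to run on the host, not inside a container.

```shell
# Sketch: compare the loaded NVIDIA kernel module version with the
# installed user-space NVML library version. All function names here
# are hypothetical helpers, not part of any NVIDIA tool.

# Print the version of the currently loaded kernel module.
# $1: version file (defaults to /proc/driver/nvidia/version).
loaded_module_version() {
    grep -m1 '^NVRM version:' "${1:-/proc/driver/nvidia/version}" 2>/dev/null \
        | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Print the version embedded in a libnvidia-ml.so.<version> file name.
# $1: fully resolved library path.
library_version_from_path() {
    basename "$1" | sed -n 's/^libnvidia-ml\.so\.\([0-9][0-9.]*\)$/\1/p'
}

# Locate libnvidia-ml.so.1 through the linker cache and resolve its symlink.
installed_library_version() {
    link=$(ldconfig -p 2>/dev/null | awk '/libnvidia-ml\.so\.1 /{print $NF; exit}')
    [ -n "$link" ] && library_version_from_path "$(readlink -f "$link")"
}

# Print "match", "mismatch", or "unknown" if either version is empty.
compare_versions() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

kmod=$(loaded_module_version)
lib=$(installed_library_version)
echo "kernel module: ${kmod:-not found}"
echo "NVML library:  ${lib:-not found}"
echo "status: $(compare_versions "$kmod" "$lib")"
```

A "mismatch" status would suggest that the libraries on disk were replaced while the old module was still loaded. Running this periodically, for example from cron, might catch the change before the next container start fails.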


So this question is important: everyone is eager to understand the underlying cause, not just be told to reboot.


Has anyone made any progress on this issue? For me, the containers run successfully the first time I run a process in them, but when I try to re-run a process I get this error message. This suggests that something during the first run causes the “version mismatch”, and that it only surfaces on the second run.

I managed to trigger the error by running the container, making a few NVIDIA-related calls, exiting the container, and then trying to run it again:

root@fep3-compute-ghpc-0:/home/nharrison/mdynamics# docker run --gpus all --entrypoint /bin/bash -it --rm -v /home/nharrison:/mount us-central1-docker.pkg.dev/gcp-fep-experiment-001/bssdocker/bssdocker_image:v0.9
(base) root@d86bec40378a:/app# lsmod | grep nvidia
nvidia_uvm           1454080  0
nvidia_drm             77824  3
nvidia_modeset       1273856  3 nvidia_drm
nvidia              55734272  168 nvidia_uvm,nvidia_modeset
drm_kms_helper        307200  1 nvidia_drm
drm                   618496  7 drm_kms_helper,nvidia,nvidia_drm
(base) root@d86bec40378a:/app# sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
(base) root@d86bec40378a:/app# sudo lsof /dev/nvidia*
sudo: lsof: command not found
(base) root@d86bec40378a:/app# sudo init 3
Couldn't find an alternative telinit implementation to spawn.
(base) root@d86bec40378a:/app# exit
exit
root@fep3-compute-ghpc-0:/home/nharrison/mdynamics# docker run --entrypoint /bin/bash -it --rm -v /home/nharrison:/mount us-central1-docker.pkg.dev/gcp-fep-experiment-001/bssdocker/bssdocker_image:v0.9
(base) root@f52aff3db7c1:/app# ^C
(base) root@f52aff3db7c1:/app# lsmod | grep nvidia
nvidia_uvm           1454080  0
nvidia_drm             77824  3
nvidia_modeset       1273856  3 nvidia_drm
nvidia              55734272  165 nvidia_uvm,nvidia_modeset
drm_kms_helper        307200  1 nvidia_drm
drm                   618496  7 drm_kms_helper,nvidia,nvidia_drm
(base) root@f52aff3db7c1:/app# sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
(base) root@f52aff3db7c1:/app# lsof
bash: lsof: command not found
(base) root@f52aff3db7c1:/app# sudo lsof
sudo: lsof: command not found
(base) root@f52aff3db7c1:/app# /usr/sbin/lsof
bash: /usr/sbin/lsof: No such file or directory
(base) root@f52aff3db7c1:/app# 
(base) root@f52aff3db7c1:/app# ls /usr/bin/lsof^C
(base) root@f52aff3db7c1:/app# /usr/sbin/lso^C
(base) root@f52aff3db7c1:/app# fusur
bash: fusur: command not found
(base) root@f52aff3db7c1:/app# fuser
bash: fuser: command not found
(base) root@f52aff3db7c1:/app# dmesg | grep NVRM
dmesg: read kernel buffer failed: Operation not permitted
(base) root@f52aff3db7c1:/app# exit
exit
root@fep3-compute-ghpc-0:/home/nharrison/mdynamics# docker run --gpus all --entrypoint /bin/bash -it --rm -v /home/nharrison:/mount us-central1-docker.pkg.dev/gcp-fep-experiment-001/bssdocker/bssdocker_image:v0.9
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.