NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

WSL kernel: 4.19.121
Ubuntu: 18.04
CUDA toolkit: 11.0
NVIDIA driver: 460.15

I’ve followed the steps in this tutorial

But when I finished and called ‘nvidia-smi’, I got the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

That’s one of the known limitations (nvidia-smi isn’t supported in WSL yet).

And it’s also a feature already planned for an upcoming driver release: Preview for CUDA on WSL Updated for Performance
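In the meantime, CUDA itself can still be exercised without nvidia-smi. A minimal sketch, assuming the CUDA samples were installed alongside the toolkit (the path under /usr/local/cuda is typical but may differ on your setup):

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery   # should list the GPU and end with "Result = PASS" if CUDA works inside WSL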


Got it, thank you!
But when I run PyTorch’s torch.cuda.is_available(), it returns False.
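For reference, a quick check of which PyTorch build is installed, since a CPU-only wheel also returns False regardless of the driver (a sketch, run inside WSL):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# torch.version.cuda prints None for a CPU-only build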

Do any of the other tutorials, like the Docker nbody sample or TensorFlow, work?

You should also double check that you didn’t accidentally install a native Linux driver in your WSL system. The fact that you have the nvidia-smi binary is a bit suspicious in that regard.

If you install a native driver (either directly, or indirectly via a toolkit package that pulls the native driver in as a dependency), you will shadow the real WSL driver libraries and your apps will pick up the wrong driver. There are a couple of posts on that topic.
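A couple of quick checks along those lines (a sketch; on a clean WSL 2 preview install there should be no nvidia-smi binary at all, and the driver libraries are mounted in from the Windows driver):

which nvidia-smi                                # finding a binary here points to a native Linux driver
dpkg -l | grep -E 'nvidia-driver|nvidia-[0-9]'  # any hits are native driver packages
ls /usr/lib/wsl/lib                             # the libraries the WSL preview actually uses (from the Windows driver)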

Thanks,

I tried to uninstall all PyTorch-related packages and install the CPU-only build instead, since the GPU build bundles its own CUDA toolkit.
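Roughly like this (a sketch; the version pins are just an example of the CPU-only wheels available at the time):

pip uninstall -y torch torchvision
pip install torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html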
After that I just tried this example:

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

And got the following error:

Unable to find image 'nvcr.io/nvidia/k8s/cuda-sample:nbody' locally
nbody: Pulling from nvidia/k8s/cuda-sample
22dc81ace0ea: Pull complete
1a8b3c87dba3: Pull complete
91390a1c435a: Pull complete
07844b14977e: Pull complete
b78396653dae: Pull complete
95e837069dfa: Pull complete
fef4aadda783: Pull complete
343234bd5cf3: Pull complete
d1e57bfda6f0: Pull complete
c67b413dfc79: Pull complete
529d6d22ae9f: Pull complete
d3a7632db2b3: Pull complete
4a28a573fcc2: Pull complete
71a88f11fc6a: Pull complete
11019d591d86: Pull complete
10f906646436: Pull complete
9b617b771963: Pull complete
6515364916d7: Pull complete
Digest: sha256:aaca690913e7c35073df08519f437fa32d4df59a89ef1e012360fbec46524ec8
Status: Downloaded newer image for nvcr.io/nvidia/k8s/cuda-sample:nbody
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0023] error waiting for container: context canceled
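For errors like that prestart-hook failure, the container CLI named in the message can usually be asked for more detail (a diagnostic sketch; flags as documented for libnvidia-container):

sudo nvidia-container-cli -k -d /dev/tty info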

What’s the output of dpkg -l | grep -i nvidia?

$ dpkg -l | grep -i nvidia
ii  cuda-nsight-compute-10-1       10.1.243-1                         amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-10-1       10.1.243-1                         amd64        NVIDIA Nsight Systems
ii  cuda-nvtx-10-1                 10.1.243-1                         amd64        NVIDIA Tools Extension
ii  libnvidia-container-tools      1.3.0~rc.1-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64     1.3.0~rc.1-1                       amd64        NVIDIA container runtime library
ii  nsight-compute-2020.1.2        2020.1.2.4-1                       amd64        NVIDIA Nsight Compute
ii  nvidia-container-runtime       3.3.0-1                            amd64        NVIDIA container runtime
ii  nvidia-container-toolkit       1.2.1-1                            amd64        NVIDIA container runtime hook
ii  nvidia-docker2                 2.4.0-1                            all          nvidia-docker CLI wrapper

Is that because I tried CUDA 10.1? Though I had the same problem with CUDA 11.0.

I have both toolkits, 10.1 and 11, installed, so that’s not the problem. As rboissel said, you probably have the native Linux driver installed, although it doesn’t look like it was installed via apt install. Either way, you will need to uninstall it manually with sudo nvidia-uninstall or similar.
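The two usual removal paths, depending on how the driver got there (a sketch; run only whichever matches your case):

sudo nvidia-uninstall                           # for a driver installed from a .run file (ships its own uninstaller)
sudo apt-get remove --purge 'nvidia-driver-*'   # for a driver installed as an apt package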

I just tried on a brand new Ubuntu 18.04 install in WSL 2 and the Docker nbody sample works just fine.

I previously installed the NVIDIA driver for Windows, and I can call nvidia-smi in my Windows terminal. Do I need to uninstall this? Otherwise, I really can’t recall ever installing a Linux driver.

You don’t need to uninstall the Windows driver. The problem is inside your Ubuntu install, since nvidia-smi is installed with the NVIDIA Linux driver.
What’s the output of dpkg -S $(which nvidia-smi)?

About the docker error, do you happen to have Docker for Desktop installed in Windows?

dpkg-query: error: --search needs at least one file name pattern argument

I don’t think I’ve installed Docker on Windows.

Could that be related to TensorFlow or PyTorch?
There is a tutorial for TensorFlow in WSL using CUDA:

When I import TensorFlow, it seems to open CUDA but doesn’t recognize any GPU device:

>>> import tensorflow.compat.v1 as tf
2020-09-11 16:58:28.679657: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.config.list_physical_devices('GPU')
[]

That tutorial is not for CUDA but for DirectML, and it should work in its context (conda activate directml):

>>> tf.config.experimental.list_physical_devices('GPU')
2020-09-11 23:11:15.973596: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
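For reference, the DirectML flow from that tutorial is roughly the following (a sketch based on the Microsoft guide; the environment name and Python version are assumptions):

conda create -n directml python=3.7 -y
conda activate directml
pip install tensorflow-directml
python -c "import tensorflow.compat.v1 as tf; print(tf.config.experimental.list_physical_devices('GPU'))"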

Your Ubuntu install is in an unpredictable state right now. If you want to see CUDA running in WSL 2, you should uninstall Ubuntu and reinstall it.

I see. Thanks for the clarification.

The way I’m downloading Ubuntu is through this link: Manual installation steps for older versions of WSL | Microsoft Learn
Since I don’t want to install it on the system drive, I changed the file extension from ‘.appx’ to ‘.zip’ and unzipped it to my working drive (D:). From there, I could just click the ‘ubuntu1804.exe’ file. After that I just installed everything like in the CUDA WSL tutorial.

Could there be anything wrong in this process?

PS: My WSL kernel version is now:

C:\Users\chest>wsl cat /proc/version
Linux version 4.19.128-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020

That’s just fine; I also have my distro installed on another partition. If you don’t mind deleting everything inside the Ubuntu distro, you only need to do this:

Open cmd.exe and run wsl.exe --unregister Ubuntu-18.04

That will delete the ext4.vhdx file that contains the distro. The next time you double-click ubuntu1804.exe, it will install anew.

Yes… that’s exactly what I did before. I don’t think it will make a difference, but I’ll try again.

This time don’t install the CUDA toolkit; just do sudo apt update && sudo apt upgrade -y and start following the tutorial from the Docker section.

Still not working…

~$ sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Unable to find image 'nvcr.io/nvidia/k8s/cuda-sample:nbody' locally
nbody: Pulling from nvidia/k8s/cuda-sample
22dc81ace0ea: Pull complete
1a8b3c87dba3: Pull complete
91390a1c435a: Pull complete
07844b14977e: Pull complete
b78396653dae: Pull complete
95e837069dfa: Pull complete
fef4aadda783: Pull complete
343234bd5cf3: Pull complete
d1e57bfda6f0: Pull complete
c67b413dfc79: Pull complete
529d6d22ae9f: Pull complete
d3a7632db2b3: Pull complete
4a28a573fcc2: Pull complete
71a88f11fc6a: Pull complete
11019d591d86: Pull complete
10f906646436: Pull complete
9b617b771963: Pull complete
6515364916d7: Pull complete
Digest: sha256:aaca690913e7c35073df08519f437fa32d4df59a89ef1e012360fbec46524ec8
Status: Downloaded newer image for nvcr.io/nvidia/k8s/cuda-sample:nbody
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0065] error waiting for container: context canceled

Well, now at least you can tell me exactly what steps you took to get there. Use history and post all the commands you entered on a brand-new Ubuntu 18.04 install, and I’ll try to reproduce it.

    1  sudo apt-get update && sudo apt-get upgrade
    2  clear
    3  ls /mnt/d
    4  sh /mnt/d/Miniconda3-4.5.4-Linux-x86_64.sh
    5  conda install python=3.7.3
    6  clear
    7  python3
    8  clear
    9  pip install tensorflow --user
   10  clear
   11  python3
   12  clear
   13  curl https://get.docker.com | sh
   14  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
   15  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
   16  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
   17  curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container-experimental.list | sudo tee /etc/apt/sources.list.d/libnvidia-container-experimental.list
   18  sudo apt-get update
   19  sudo apt-get install -y nvidia-docker2
   20  clear
   21  sudo service docker stop
   22  sudo service docker start
   23  sudo service docker status
   24  sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
   25  clear
   26  history