NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

WSL kernel: 4.19.121
Ubuntu: 18.04
CUDA toolkit: 11.0
NVIDIA driver: 460.15

I’ve followed the steps in this tutorial

But when I finished and called ‘nvidia-smi’, I got the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

That’s one of the known limitations (nvidia-smi isn’t supported in WSL yet).

And it’s also a feature already planned for an upcoming driver release: Preview for CUDA on WSL Updated for Performance
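In the meantime, CUDA itself can still be exercised without nvidia-smi. A minimal sketch, assuming the CUDA samples were installed alongside the toolkit (the path under /usr/local/cuda is typical but may differ on your setup):

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery   # should list the GPU and end with "Result = PASS" if CUDA works inside WSL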


Got it, thank you!
But when I run PyTorch’s torch.cuda.is_available(), it returns False.
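For reference, a quick check of which PyTorch build is installed, since a CPU-only wheel also returns False regardless of the driver (a sketch, run inside WSL):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# torch.version.cuda prints None for a CPU-only build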

Do any of the other tutorials, like the Docker nbody sample or TensorFlow, work?

You should also double check that you didn’t accidentally install a native Linux driver in your WSL system. The fact that you have the nvidia-smi binary is a bit suspicious in that regard.

If you install a native driver (either directly, or indirectly via a toolkit package that pulls the native driver in as a dependency), you will shadow the real WSL driver libraries and your apps will pick up the wrong driver. There are a couple of posts on that topic.
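A couple of quick checks along those lines (a sketch; on a clean WSL 2 preview install there should be no nvidia-smi binary at all, and the driver libraries are mounted in from the Windows driver):

which nvidia-smi                                # finding a binary here points to a native Linux driver
dpkg -l | grep -E 'nvidia-driver|nvidia-[0-9]'  # any hits are native driver packages
ls /usr/lib/wsl/lib                             # the libraries the WSL preview actually uses (from the Windows driver)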

Thanks,

I tried to uninstall all PyTorch-related packages and install the CPU-only build instead, since the GPU build bundles its own CUDA toolkit.
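Roughly like this (a sketch; the version pins are just an example of the CPU-only wheels available at the time):

pip uninstall -y torch torchvision
pip install torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html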
After that I just tried this example:

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

And got the following error:

Unable to find image 'nvcr.io/nvidia/k8s/cuda-sample:nbody' locally
nbody: Pulling from nvidia/k8s/cuda-sample
22dc81ace0ea: Pull complete
1a8b3c87dba3: Pull complete
91390a1c435a: Pull complete
07844b14977e: Pull complete
b78396653dae: Pull complete
95e837069dfa: Pull complete
fef4aadda783: Pull complete
343234bd5cf3: Pull complete
d1e57bfda6f0: Pull complete
c67b413dfc79: Pull complete
529d6d22ae9f: Pull complete
d3a7632db2b3: Pull complete
4a28a573fcc2: Pull complete
71a88f11fc6a: Pull complete
11019d591d86: Pull complete
10f906646436: Pull complete
9b617b771963: Pull complete
6515364916d7: Pull complete
Digest: sha256:aaca690913e7c35073df08519f437fa32d4df59a89ef1e012360fbec46524ec8
Status: Downloaded newer image for nvcr.io/nvidia/k8s/cuda-sample:nbody
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0023] error waiting for container: context canceled
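For errors like that prestart-hook failure, the container CLI named in the message can usually be asked for more detail (a diagnostic sketch; flags as documented for libnvidia-container):

sudo nvidia-container-cli -k -d /dev/tty info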

What’s the output of dpkg -l | grep -i nvidia?

$ dpkg -l | grep -i nvidia
ii  cuda-nsight-compute-10-1       10.1.243-1                         amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-10-1       10.1.243-1                         amd64        NVIDIA Nsight Systems
ii  cuda-nvtx-10-1                 10.1.243-1                         amd64        NVIDIA Tools Extension
ii  libnvidia-container-tools      1.3.0~rc.1-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64     1.3.0~rc.1-1                       amd64        NVIDIA container runtime library
ii  nsight-compute-2020.1.2        2020.1.2.4-1                       amd64        NVIDIA Nsight Compute
ii  nvidia-container-runtime       3.3.0-1                            amd64        NVIDIA container runtime
ii  nvidia-container-toolkit       1.2.1-1                            amd64        NVIDIA container runtime hook
ii  nvidia-docker2                 2.4.0-1                            all          nvidia-docker CLI wrapper

Is that because I tried CUDA 10.1? Though I had the same problem with CUDA 11.0.

I have both toolkits, 10.1 and 11, installed, so that’s not the problem. As rboissel said, you probably have the native Linux driver installed, although it doesn’t look like it was installed via apt install. Either way, you will need to uninstall it manually with sudo nvidia-uninstall or similar.
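The two usual removal paths, depending on how the driver got there (a sketch; run only whichever matches your case):

sudo nvidia-uninstall                           # for a driver installed from a .run file (ships its own uninstaller)
sudo apt-get remove --purge 'nvidia-driver-*'   # for a driver installed as an apt package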

I just tried on a brand new Ubuntu 18.04 install in WSL 2 and the Docker nbody sample works just fine.

I previously installed the NVIDIA driver for Windows, and I can call nvidia-smi in my Windows terminal. Do I need to uninstall this? Otherwise, I really can’t recall ever installing a Linux driver.

You don’t need to uninstall the Windows driver. The problem is inside your Ubuntu install, since nvidia-smi is installed with the NVIDIA Linux driver.
What’s the output of dpkg -S $(which nvidia-smi)?

About the docker error, do you happen to have Docker for Desktop installed in Windows?

dpkg-query: error: --search needs at least one file name pattern argument

I don’t think I’ve installed Docker on Windows.

Could that be related to TensorFlow or PyTorch?
There is a tutorial for TensorFlow in WSL using CUDA:

When I import TensorFlow, it seems to open CUDA but doesn’t recognize any GPU device:

>>> import tensorflow.compat.v1 as tf
2020-09-11 16:58:28.679657: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
>>> tf.config.list_physical_devices('GPU')
[]

That tutorial is not for CUDA but for DirectML, and it should work in its context (conda activate directml):

>>> tf.config.experimental.list_physical_devices('GPU')
2020-09-11 23:11:15.973596: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters.
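For reference, the DirectML flow from that tutorial is roughly the following (a sketch based on the Microsoft guide; the environment name and Python version are assumptions):

conda create -n directml python=3.7 -y
conda activate directml
pip install tensorflow-directml
python -c "import tensorflow.compat.v1 as tf; print(tf.config.experimental.list_physical_devices('GPU'))"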

Your Ubuntu install is in an unpredictable state right now. If you want to see CUDA running in WSL 2, you should uninstall Ubuntu and reinstall it.

I see. Thanks for the clarification.

The way I’m downloading Ubuntu is through this link: Manual installation steps for older versions of WSL | Microsoft Learn
Since I don’t want to install it on the system drive, I changed the file extension from ‘.appx’ to ‘.zip’ and unzipped it to my working drive (D:). From there, I could just click the ‘ubuntu1804.exe’ file. After that I just installed everything like in the CUDA WSL tutorial.

Could there be anything wrong in this process?

PS: My WSL kernel version is now:

C:\Users\chest>wsl cat /proc/version
Linux version 4.19.128-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020

That’s just fine; I also have my distro installed on another partition. If you don’t mind deleting everything inside the Ubuntu distro, you only need to do this:

Open cmd.exe and run wsl.exe --unregister Ubuntu-18.04

That will delete the ext4.vhdx file that contains the distro. The next time you double-click ubuntu1804.exe, it will install anew.

Yes… that’s exactly what I did before. I don’t think it will make a difference, but I’ll try again.

This time don’t install the CUDA toolkit; just do sudo apt update && sudo apt upgrade -y and start following the tutorial from the Docker section.

Still not working…

~$ sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Unable to find image 'nvcr.io/nvidia/k8s/cuda-sample:nbody' locally
nbody: Pulling from nvidia/k8s/cuda-sample
22dc81ace0ea: Pull complete
1a8b3c87dba3: Pull complete
91390a1c435a: Pull complete
07844b14977e: Pull complete
b78396653dae: Pull complete
95e837069dfa: Pull complete
fef4aadda783: Pull complete
343234bd5cf3: Pull complete
d1e57bfda6f0: Pull complete
c67b413dfc79: Pull complete
529d6d22ae9f: Pull complete
d3a7632db2b3: Pull complete
4a28a573fcc2: Pull complete
71a88f11fc6a: Pull complete
11019d591d86: Pull complete
10f906646436: Pull complete
9b617b771963: Pull complete
6515364916d7: Pull complete
Digest: sha256:aaca690913e7c35073df08519f437fa32d4df59a89ef1e012360fbec46524ec8
Status: Downloaded newer image for nvcr.io/nvidia/k8s/cuda-sample:nbody
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0065] error waiting for container: context canceled

Well, now at least you can tell me exactly what steps you took to get there. Use history and post all the commands you entered on a brand-new Ubuntu 18.04 install, and I’ll try to reproduce it.

    1  sudo apt-get update && sudo apt-get upgrade
    2  clear
    3  ls /mnt/d
    4  sh /mnt/d/Miniconda3-4.5.4-Linux-x86_64.sh
    5  conda install python=3.7.3
    6  clear
    7  python3
    8  clear
    9  pip install tensorflow --user
   10  clear
   11  python3
   12  clear
   13  curl https://get.docker.com | sh
   14  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
   15  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
   16  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
   17  curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container-experimental.list | sudo tee /etc/apt/sources.list.d/libnvidia-container-experimental.list
   18  sudo apt-get update
   19  sudo apt-get install -y nvidia-docker2
   20  clear
   21  sudo service docker stop
   22  sudo service docker start
   23  sudo service docker status
   24  sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
   25  clear
   26  history