Hi,
I had CUDA 11.8 running on a GeForce RTX 3090 for a long time, but at some point it stopped working. I am not sure when exactly; it may correspond with an upgrade to Ubuntu 22.04.
I am using the 535 NVIDIA driver and CUDA 11.8 (this is necessary). nvidia-smi and nvcc --version run fine on the host, but when I spin up a Docker container with this command:
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi
I get the infamous “Failed to initialize NVML: Unknown Error” error.
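For reference, a quick way to double-check that Docker even knows about the NVIDIA runtime (assuming the toolkit's default setup; nvidia-ctk ships with nvidia-container-toolkit) would be something like:
docker info | grep -i runtimes        # "nvidia" should show up here
sudo nvidia-ctk runtime configure --runtime=docker   # (re)writes the runtime entry into /etc/docker/daemon.json
sudo systemctl restart docker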
Interestingly, if I spin up the container like this:
sudo docker run --rm --gpus all --device=/dev/nvidiactl --device=/dev/nvidia0 nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi
everything works.
Unfortunately, I use premade software and I don’t think I can change its environment. I don’t really understand what’s happening here. If I attach to both of the above containers, the /dev/ folder looks identical… I guess Docker somehow does not attach my GPU (there is only one RTX 3090) automatically and correctly, but I can attach it manually. I also need both device entries above; the reason for that also eludes me.
Any help is greatly appreciated; I have been dealing with this problem for hours already 🤯
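From what I could find so far, this kind of error is often traced back to the container losing cgroup-based access to the GPU device nodes, and the usual manual workaround seems to be disabling cgroup handling in the container CLI and passing the devices in explicitly. Just a sketch based on that reading; the config path is the toolkit default, and the extra /dev/nvidia-uvm* nodes may or may not be needed on a given setup:
# in /etc/nvidia-container-runtime/config.toml, section [nvidia-container-cli]:
#   no-cgroups = true
sudo docker run --rm --gpus all \
  --device=/dev/nvidiactl --device=/dev/nvidia0 \
  --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools \
  nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi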
Best
Jan
PS: My nvidia-container-toolkit version is 1.14.5.
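(Version as reported by the toolkit itself; in case it helps, this is roughly how the installed bits can be listed, assuming the apt packages:)
nvidia-ctk --version
dpkg -l | grep nvidia-container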
Thanks @Robert for moving the thread to the right forum.
Unbelievable, but after more than two days of trying everything, it now works. I installed a previous version of nvidia-container-toolkit, which did not solve the problem, but after updating it again via apt-get, everything worked.
I have no idea why, as I reinstalled the NVIDIA driver, CUDA, and the toolkit multiple times in various combinations over the last few days 🤷♂️ So the problem is solved for me. I don’t know how, and I hope it never comes back.
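For anyone hitting the same thing later: the update itself was just the usual apt route, roughly like this (a sketch from memory; the Docker restart is there so the updated runtime gets picked up):
sudo apt-get update
sudo apt-get install --only-upgrade nvidia-container-toolkit
sudo systemctl restart docker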
Thanks to anyone who read this far…
I’m sadly getting this error now:
If I run nvidia-smi in the terminal I get:
henry@valhalla:/etc/docker$ nvidia-smi
Wed May 1 17:26:09 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 On | 00000000:01:00.0 Off | N/A |
| 0% 42C P8 N/A / 115W | 1MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce GTX 1080 On | 00000000:81:00.0 Off | N/A |
| 24% 42C P8 8W / 180W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
But in Docker I get:
henry@valhalla:/etc/docker$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L
Failed to initialize NVML: Unknown Error
Help!
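For context, the --device workaround mentioned earlier in this thread would presumably look like this on my box (two GPUs, so I am assuming /dev/nvidia0 and /dev/nvidia1; untested on my side):
docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  --device=/dev/nvidiactl --device=/dev/nvidia0 --device=/dev/nvidia1 \
  ubuntu nvidia-smi -L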