[solved - somehow] CUDA in Docker gives "Failed to initialize NVML: Unknown Error"

Hi,

I had CUDA 11.8 running on a GeForce RTX 3090 for a long time, but at some point it stopped working. I am not sure when; it may have coincided with an upgrade to Ubuntu 22.04.

I am using the NVIDIA 535 driver and CUDA 11.8 (this version is required). Both nvidia-smi and nvcc --version run fine on the host, but when I spin up a Docker container with this command:

sudo docker run --rm --gpus all nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi
I get the infamous “Failed to initialize NVML: Unknown Error” error.

Interestingly, if I spin up the container like this:
sudo docker run --rm --gpus all --device=/dev/nvidiactl --device=/dev/nvidia0 nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi
everything works.

Unfortunately, I use premade software and I don’t think I can change its environments. I don’t really understand what’s happening here. If I attach to both of the above containers, the /dev/ folder looks identical. I guess Docker somehow does not attach my GPU correctly on its own (there is only one RTX 3090), but I can attach it manually. I also need both device entries above; the reason for that eludes me as well.
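
For anyone who wants to script the manual workaround above, here is a minimal sketch in POSIX shell. It collects the control node and the per-GPU nodes under /dev and turns them into --device flags. The function name nvidia_device_flags is made up for illustration and is not part of Docker or the NVIDIA toolkit. It only covers nvidiactl and nvidiaN, since those are the two entries that were needed here; other setups may need more.

```shell
# Hypothetical helper: print one --device flag per NVIDIA device node.
# Takes an optional directory argument (defaults to /dev).
nvidia_device_flags() {
    dev_dir="${1:-/dev}"
    flags=""
    # nvidiactl is the control node; nvidia0, nvidia1, ... are the GPUs.
    for node in "$dev_dir"/nvidiactl "$dev_dir"/nvidia[0-9]*; do
        # A glob that matched nothing stays literal, so skip missing paths.
        [ -e "$node" ] || continue
        flags="${flags:+$flags }--device=$node"
    done
    printf '%s\n' "$flags"
}

# Example (relies on word splitting, so it assumes no spaces in the paths):
#   sudo docker run --rm --gpus all $(nvidia_device_flags) \
#       nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi
```

On a machine with a single GPU this should expand to the same two --device entries as the command above. It does not explain why they are needed, it only saves typing.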

Any help is greatly appreciated. I have been dealing with this problem for hours 🤯

Best
Jan

PS: My nvidia-container-toolkit version is 1.14.5.

Thanks @Robert for moving the thread to the right forum.

Unbelievable, but after more than 2 days of trying everything, it now works. I installed a previous version of nvidia-container-toolkit, which did not solve the problem. But after updating it again via apt-get, everything worked.
I have no idea why, as I had reinstalled the NVIDIA driver, CUDA and the toolkit multiple times in various combinations over the last few days 🤷‍♂️ So the problem is solved for me. I don’t know how, and I hope it never comes back.
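
In case it helps someone retrace the step that finally worked, here is a rough sketch of the update via apt-get. It assumes the toolkit was installed from an apt repository under the package name nvidia-container-toolkit. The Docker restart at the end is an extra step assumed here, not something confirmed above as necessary. The helper names run_or_print and upgrade_toolkit are made up for this example; setting DRY_RUN=1 only prints the commands.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run_or_print() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Refresh the package lists, upgrade only the toolkit package,
# then restart Docker (the restart is an assumption, see above).
upgrade_toolkit() {
    run_or_print sudo apt-get update &&
    run_or_print sudo apt-get install --only-upgrade nvidia-container-toolkit &&
    run_or_print sudo systemctl restart docker
}

# Example:
#   DRY_RUN=1 upgrade_toolkit   # show what would be run
#   upgrade_toolkit             # actually run it
```

Since it is unclear which of the many reinstalls actually fixed things, treat this as a record of the last step rather than a guaranteed fix.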

Thanks to anyone who has read this far…

Sadly, I’m getting this error now.

If I run nvidia-smi in the terminal I get:

henry@valhalla:/etc/docker$ nvidia-smi
Wed May  1 17:26:09 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8             N/A /  115W |       1MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080        On  |   00000000:81:00.0 Off |                  N/A |
| 24%   42C    P8              8W /  180W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

But in Docker I get:

henry@valhalla:/etc/docker$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu nvidia-smi -L
Failed to initialize NVML: Unknown Error

Help!