nvidia-container-cli not showing MIG devices

After installing the CUDA and NVIDIA drivers (Driver Version: 530.30.02, CUDA Version: 12.1), installing the nvidia-container-toolkit, and setting up MIG devices, the nvidia-container-cli list command does not show the MIG devices under /dev/nvidia-caps/:

/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/lib/nvidia/current/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nv-fabricmanager
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opencl.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-allocator.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opticalflow.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-fbc.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvoptix.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.530.30.02
/run/nvidia-persistenced/socket
/lib/firmware/nvidia/530.30.02/gsp_ga10x.bin
/lib/firmware/nvidia/530.30.02/gsp_tu10x.bin
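
(For anyone reproducing this: the listing above is the output of nvidia-container-cli list. The CLI also accepts a debug-log option, which can help show why the capability devices are being skipped; the flag syntax below is from memory and may differ between toolkit versions.)

# plain listing, as used above
nvidia-container-cli list
# same listing, but with the library's debug output written to stderr
nvidia-container-cli --debug=/dev/stderr list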

Although the device files do exist on the host:

root@gpu1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jun 14 09:47 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 14 09:47 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Jun 14 09:47 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508,   0 Jun 14 09:47 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508,   1 Jun 14 09:47 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drw-rw-rw-  2 root root      320 Jun 14 09:48 .
drwxr-xr-x 21 root root     4520 Jun 14 09:47 ..
cr--------  1 root root 238,   1 Jun 14 09:47 nvidia-cap1
cr--r--r--  1 root root 238, 102 Jun 14 09:48 nvidia-cap102
cr--r--r--  1 root root 238, 103 Jun 14 09:48 nvidia-cap103
cr--r--r--  1 root root 238, 111 Jun 14 09:48 nvidia-cap111
cr--r--r--  1 root root 238, 112 Jun 14 09:48 nvidia-cap112
cr--r--r--  1 root root 238, 120 Jun 14 09:48 nvidia-cap120
cr--r--r--  1 root root 238, 121 Jun 14 09:48 nvidia-cap121
cr--r--r--  1 root root 238, 129 Jun 14 09:48 nvidia-cap129
cr--r--r--  1 root root 238, 130 Jun 14 09:48 nvidia-cap130
cr--r--r--  1 root root 238,   2 Jun 14 09:47 nvidia-cap2
cr--r--r--  1 root root 238,  30 Jun 14 09:48 nvidia-cap30
cr--r--r--  1 root root 238,  31 Jun 14 09:48 nvidia-cap31
cr--r--r--  1 root root 238,  39 Jun 14 09:48 nvidia-cap39
cr--r--r--  1 root root 238,  40 Jun 14 09:48 nvidia-cap40
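
(Side note, in case it helps with debugging: as far as I understand, each nvidia-capN minor number corresponds to a MIG capability file the driver exposes under /proc, and the access files there list the matching minor. The paths below assume GPU 0 and GPU instance ID 3 from the nvidia-smi output further down; they are an illustration, not something I have verified on every driver version.)

# which /dev/nvidia-caps minor belongs to GPU instance 3 on GPU 0
cat /proc/driver/nvidia/capabilities/gpu0/mig/gi3/access
# and to its compute instance 0
cat /proc/driver/nvidia/capabilities/gpu0/mig/gi3/ci0/access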

Whereas nvidia-smi lists the MIG devices as expected:

root@gpu1:~# nvidia-smi 
Wed Jun 14 10:46:54 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           On | 00000000:CA:00.0 Off |                   On |
| N/A   35C    P0               43W / 300W|     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   4  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   14   0   5  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
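
(The MIG instances and their UUIDs can also be listed with nvidia-smi -L, which is useful later when passing a specific instance to a container:)

# list GPUs and MIG instances together with their MIG-<UUID> identifiers
nvidia-smi -L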

In contrast, the nvidia-ctk cdi generate command outputs the MIG devices correctly (see the attached spec):
nvidia.yml (8.9 KB)
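
(For completeness, the spec was generated roughly like this; the output path is the one from the NVIDIA docs rather than exactly what I used, so adjust to taste:)

# generate a CDI specification that includes the MIG devices
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# list the device names defined in the generated spec
nvidia-ctk cdi list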

Context:
Using the GPU works fine, and it can also be used inside containers. However, for LXC containers the devices currently have to be mounted manually and the container made privileged, which poses a number of security problems. Because nvidia-container-cli does not recognize the MIG devices, I cannot use the NVIDIA_VISIBLE_DEVICES environment variable, which would mount all the necessary files via the LXC hook (see the configuration sketch after this section).
I believe I have missed something in the setup, but I could not find any intermediate steps after creating the MIG devices or installing nvidia-container-cli, other than configuring it for Docker (there are no guides for this in the documentation, and the blog post is a bit outdated and does not mention MIG devices).
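
(For reference, this is the kind of LXC configuration I am aiming for once nvidia-container-cli recognizes the MIG devices. The hook path matches the one shipped with LXC and used in the blog post below; the config path and the device selector value are hypothetical examples based on how NVIDIA_VISIBLE_DEVICES is documented for the container toolkit, not something I have verified here.)

# excerpt from a container config, e.g. /var/lib/lxc/<container>/config (hypothetical)
# run LXC's nvidia hook, which calls nvidia-container-cli at container start
lxc.hook.mount = /usr/share/lxc/hooks/nvidia
# expose everything, or a single MIG instance via its UUID from nvidia-smi -L
lxc.environment = NVIDIA_VISIBLE_DEVICES=all
lxc.environment = NVIDIA_DRIVER_CAPABILITIES=compute,utility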

System Info:
Operating System: Debian 11 / Proxmox
Container Architecture: LXC

References:
Blog post about setting up NVIDIA with LXC: developer.nvidia.com/blog/gpu-containers-runtime/
Similar post I already opened on the LXC forum: discuss.linuxcontainers.org/t/how-to-pass-nvidia-mig-devices-using-lxc-config/17175
Post on the Proxmox forum: forum.proxmox.com/conversations/gpu.2323/ (I had some conversation with the OP about his setup, but it did not resolve the issue)

I found my issue; see the GitHub issue. Following a guide I found somewhere on the internet, I had set up the following udev rule:

KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia* && /usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

After removing that rule and simply restarting, it worked!
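
(To double-check the fix, the capability devices should now show up in the listing, e.g.:)

# after removing the udev rule and rebooting
nvidia-container-cli list | grep nvidia-caps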

So the takeaway is: do not fiddle with the NVIDIA device nodes and do not run nvidia-modprobe yourself.
