After installing the NVIDIA driver and CUDA (Driver Version: 530.30.02, CUDA Version: 12.1), installing the nvidia-container-toolkit, and setting up MIG devices, the nvidia-container-cli list command does not show the MIG devices under /dev/nvidia-caps/:
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/lib/nvidia/current/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nv-fabricmanager
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opencl.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-allocator.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opticalflow.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-fbc.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvoptix.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.530.30.02
/run/nvidia-persistenced/socket
/lib/firmware/nvidia/530.30.02/gsp_ga10x.bin
/lib/firmware/nvidia/530.30.02/gsp_tu10x.bin
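(For reference, the listing above is, as far as I remember, just the plain invocation run as root on the host, with no extra options:)
root@gpu1:~# nvidia-container-cli list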
Although the device nodes do exist:
root@gpu1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Jun 14 09:47 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 14 09:47 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Jun 14 09:47 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508, 0 Jun 14 09:47 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508, 1 Jun 14 09:47 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
drw-rw-rw- 2 root root 320 Jun 14 09:48 .
drwxr-xr-x 21 root root 4520 Jun 14 09:47 ..
cr-------- 1 root root 238, 1 Jun 14 09:47 nvidia-cap1
cr--r--r-- 1 root root 238, 102 Jun 14 09:48 nvidia-cap102
cr--r--r-- 1 root root 238, 103 Jun 14 09:48 nvidia-cap103
cr--r--r-- 1 root root 238, 111 Jun 14 09:48 nvidia-cap111
cr--r--r-- 1 root root 238, 112 Jun 14 09:48 nvidia-cap112
cr--r--r-- 1 root root 238, 120 Jun 14 09:48 nvidia-cap120
cr--r--r-- 1 root root 238, 121 Jun 14 09:48 nvidia-cap121
cr--r--r-- 1 root root 238, 129 Jun 14 09:48 nvidia-cap129
cr--r--r-- 1 root root 238, 130 Jun 14 09:48 nvidia-cap130
cr--r--r-- 1 root root 238, 2 Jun 14 09:47 nvidia-cap2
cr--r--r-- 1 root root 238, 30 Jun 14 09:48 nvidia-cap30
cr--r--r-- 1 root root 238, 31 Jun 14 09:48 nvidia-cap31
cr--r--r-- 1 root root 238, 39 Jun 14 09:48 nvidia-cap39
cr--r--r-- 1 root root 238, 40 Jun 14 09:48 nvidia-cap40
nvidia-smi, on the other hand, lists the MIG devices as expected:
root@gpu1:~# nvidia-smi
Wed Jun 14 10:46:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:CA:00.0 Off | On |
| N/A 35C P0 43W / 300W| 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+================================+===========+=======================|
| 0 3 0 0 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 4 0 1 | 12MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 11 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 12 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 13 0 4 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 14 0 5 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
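The MIG instances also get their own UUIDs, which (as far as I understand) is what NVIDIA_VISIBLE_DEVICES would normally be set to; they can be listed with:
root@gpu1:~# nvidia-smi -L
# lists the GPU and each MIG instance with its UUID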
In contrast, the nvidia-ctk cdi generate command outputs the MIG devices correctly:
nvidia.yml (8.9 KB)
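For completeness, the attached CDI spec was generated with something along these lines (the output path is just where I chose to write the spec):
root@gpu1:~# nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml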
Context:
Using the GPU works fine, and it can also be used inside containers. However, for LXC containers the devices currently have to be mounted manually and the container made privileged, which poses many security problems. Because nvidia-container-cli does not recognize the MIG devices, it is not possible to use the NVIDIA_VISIBLE_DEVICES environment variable, which would mount all the necessary files via the LXC hook (roughly the configuration sketched below).
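For reference, this is the kind of configuration I am aiming for, based on the blog post linked under References. The hook path assumes the standard nvidia hook shipped with the lxc package, and NVIDIA_VISIBLE_DEVICES=all is only a placeholder here, since selecting individual MIG instances is exactly the part that does not work yet:
# LXC container config (on Proxmox: /etc/pve/lxc/<ctid>.conf)
lxc.environment.NVIDIA_VISIBLE_DEVICES=all
lxc.environment.NVIDIA_DRIVER_CAPABILITIES=compute,utility
lxc.hook.mount=/usr/share/lxc/hooks/nvidia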
I believe I have missed something in the setup, but I could not find any further steps after creating the MIG devices or installing nvidia-container-cli, other than configuring it for Docker (there is no guide for this in the documentation, and the blog post referenced below is a bit outdated and does not mention MIG devices).
System Info:
Operating System: Debian 11 / Proxmox
Container Architecture: LXC
References (sorry for the broken links, I can only include one per post…):
Blog post about setting up nvidia with LXC: developer.nvidia .com/blog/gpu-containers-runtime/
Similar post I opened already at LXC Forum: discuss.linuxcontainers .org/t/how-to-pass-nvidia-mig-devices-using-lxc-config/17175
Post on Proxmox forum: forum.proxmox .com/conversations/gpu.2323/ (I had some conversation with the OP about their setup, but it did not resolve the issue)