A problem using NVIDIA graphics cards

Hi everyone
We have run into a problem using NVIDIA graphics cards, which we explain in detail below.

We installed Debian 11 on servers that have NVIDIA graphics cards, and then installed OpenNebula on top of it as the virtualization hypervisor.
Then, to use the graphics cards inside the virtual machines created by OpenNebula, we followed the documentation and passed the GPUs through to the virtual machines.
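For reference, the passthrough attachment in the VM template looked roughly like this (a sketch based on the OpenNebula PCI passthrough documentation; the vendor/device IDs are the ones from the lspci output below):

```
PCI = [
  VENDOR = "10de",   # NVIDIA
  DEVICE = "2204",   # GA102 [GeForce RTX 3090]
  CLASS  = "0300"    # VGA compatible controller
]
```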

Debian server CPU information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper 3960X 24-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2199.662
CPU max MHz: 6635.1558
CPU min MHz: 2200.0000
BogoMIPS: 7600.17
Virtualization: AMD-V

CPU information of the VM on OpenNebula:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper 3960X 24-Core Processor
Stepping: 0
CPU MHz: 3797.790
BogoMIPS: 7595.58
Virtualization: AMD-V
Hypervisor vendor: KVM

The result of lspci on the Debian server:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
21:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)

The result of lspci in the VM:
01:01.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
01:02.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
01:03.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
01:04.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)

To use the GPU in the virtual machine (Ubuntu 20.04), we first installed the following drivers, and then installed the NVIDIA Docker runtime as well.

nvidia-headless-510 nvidia-utils-510 cuda-toolkit-11-6

The result of nvidia-smi in the VM:

root@localhost:~# nvidia-smi
Mon Dec 26 11:24:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:01.0 Off |                  N/A |
|  0%   31C    P8    12W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:01:03.0 Off |                  N/A |
|  0%   36C    P8    28W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The result of nvidia-smi -L in the VM:

root@localhost:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-9ba68e69-e2ce-5d2c-ad15-5884706fd049)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-a71db27c-bec8-5f4b-cba3-b4c5a91cf19f)

Docker runtime: nvidia-docker2

We then verified GPU access through Docker with the following command:

root@localhost:~# docker run --rm --gpus all nvidia/cuda:11.2.0-runtime-ubuntu20.04 nvidia-smi

Mon Dec 26 11:06:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:01.0 Off |                  N/A |
|  0%   32C    P8    12W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:01:03.0 Off |                  N/A |
|  0%   36C    P8    27W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The problem is here:
When we access the GPU with id 0 from Docker, the deviceQuery command hangs, and no further GPU access works until we reboot the VM. After we reboot the VM and instead access the GPU with id 1, deviceQuery works!
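While testing each GPU, a hard time limit keeps a hung deviceQuery from wedging the shell. A minimal sketch using coreutils `timeout` (the docker command in the comment mirrors the one used later in this post; the 60-second cap is an arbitrary choice):

```shell
#!/bin/sh
# run_with_timeout SECS CMD...
# Runs CMD under coreutils `timeout`; exit status 124 means CMD was
# killed because it exceeded the time limit (i.e. it probably hung).
run_with_timeout() {
  secs="$1"; shift
  timeout "$secs" "$@"
}

# Hypothetical usage: probe GPU 0 with a 60-second cap so a hung
# deviceQuery returns control to the shell instead of blocking it:
#   run_with_timeout 60 docker run --rm --runtime=nvidia \
#     -e NVIDIA_VISIBLE_DEVICES=0 \
#     nvcr.io/nvidia/tensorflow:21.09-tf2-py3 deviceQuery \
#     || echo "GPU 0 probe failed or hung"
```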

In more detail:

First, we run deviceQuery on the GPU with id 1:
root@localhost:~# docker run -it --entrypoint /bin/bash --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 nvcr.io/nvidia/tensorflow:21.09-tf2-py3
root@47d6d53c6cde:/workspace# deviceQuery
deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3090"
CUDA Driver Version / Runtime Version 11.6 / 9.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24268 MBytes (25447170048 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
(82) Multiprocessors, ( 64) CUDA Cores/MP: 5248 CUDA Cores
GPU Max Clock rate: 1695 MHz (1.70 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

And when we run with id 0:

root@localhost:~# docker run -it --entrypoint /bin/bash --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/tensorflow:21.09-tf2-py3
root@77e5a4e8e1ad:/workspace# deviceQuery
deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

At this step, deviceQuery hangs; until we reboot the VM, we cannot access the GPU with id 1 either!
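When a GPU wedges like this, the NVIDIA driver usually records an `NVRM: Xid` error in the kernel log at the moment of the hang, which can narrow down the cause. A small sketch for scanning log text for those lines (on a live system you would pipe `dmesg` into it, as the comment shows):

```shell
#!/bin/sh
# count_nvrm_errors: count NVIDIA driver (NVRM) Xid error lines in
# kernel-log text read from stdin. A non-zero count after a hang is
# worth correlating with the Xid error tables in NVIDIA's docs.
count_nvrm_errors() {
  grep -c 'NVRM: Xid'
}

# On the affected VM (run as root):
#   dmesg | count_nvrm_errors
```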

Virtualization (GPU passthrough) on GeForce cards is only supported when the VM OS is Windows; see here.