A problem using NVIDIA graphics cards

Hi everyone
We have run into a problem using NVIDIA graphics cards, which we explain in detail below.

We installed Debian 11 on servers that have NVIDIA graphics cards, and then installed OpenNebula on top of it as the virtualization hypervisor.
Then, to use the graphics cards inside the virtual machines created by OpenNebula, we followed the documentation and passed the GPUs through to the virtual machines.
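For reference, the passthrough attachment in the VM template looked roughly like this (a sketch based on the OpenNebula PCI passthrough documentation; the vendor/device IDs are the ones from the lspci output below):

```
PCI = [
  VENDOR = "10de",   # NVIDIA
  DEVICE = "2204",   # GA102 [GeForce RTX 3090]
  CLASS  = "0300"    # VGA compatible controller
]
```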

Debian server CPU information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper 3960X 24-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2199.662
CPU max MHz: 6635.1558
CPU min MHz: 2200.0000
BogoMIPS: 7600.17
Virtualization: AMD-V

CPU information of the VM on OpenNebula:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 24
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper 3960X 24-Core Processor
Stepping: 0
CPU MHz: 3797.790
BogoMIPS: 7595.58
Virtualization: AMD-V
Hypervisor vendor: KVM

The result of lspci on the Debian server:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
21:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)

The result of lspci in the VM:
01:01.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
01:02.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)
01:03.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
01:04.0 Audio device [0403]: NVIDIA Corporation Device [10de:1aef] (rev a1)

To use the GPU in the virtual machine (Ubuntu 20.04), we first installed the following drivers, and then installed the NVIDIA Docker runtime as well.

nvidia-headless-510 nvidia-utils-510 cuda-toolkit-11-6

The result of nvidia-smi in the VM:

root@localhost:~# nvidia-smi
Mon Dec 26 11:24:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:01.0 Off |                  N/A |
|  0%   31C    P8    12W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:01:03.0 Off |                  N/A |
|  0%   36C    P8    28W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The result of nvidia-smi -L in the VM:

root@localhost:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-9ba68e69-e2ce-5d2c-ad15-5884706fd049)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-a71db27c-bec8-5f4b-cba3-b4c5a91cf19f)

Docker runtime: nvidia-docker2

We then verified GPU access through Docker with the following command:

root@localhost:~# docker run --rm --gpus all nvidia/cuda:11.2.0-runtime-ubuntu20.04 nvidia-smi

Mon Dec 26 11:06:43 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:01.0 Off |                  N/A |
|  0%   32C    P8    12W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:01:03.0 Off |                  N/A |
|  0%   36C    P8    27W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The problem is here:
When we access the GPU with id 0 from Docker, the deviceQuery command hangs, and no further GPU access works until we reboot the VM. After we reboot the VM and instead access the GPU with id 1, deviceQuery works!
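While testing each GPU, a hard time limit keeps a hung deviceQuery from wedging the shell. A minimal sketch using coreutils `timeout` (the docker command in the comment mirrors the one used later in this post; the 60-second cap is an arbitrary choice):

```shell
#!/bin/sh
# run_with_timeout SECS CMD...
# Runs CMD under coreutils `timeout`; exit status 124 means CMD was
# killed because it exceeded the time limit (i.e. it probably hung).
run_with_timeout() {
  secs="$1"; shift
  timeout "$secs" "$@"
}

# Hypothetical usage: probe GPU 0 with a 60-second cap so a hung
# deviceQuery returns control to the shell instead of blocking it:
#   run_with_timeout 60 docker run --rm --runtime=nvidia \
#     -e NVIDIA_VISIBLE_DEVICES=0 \
#     nvcr.io/nvidia/tensorflow:21.09-tf2-py3 deviceQuery \
#     || echo "GPU 0 probe failed or hung"
```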

In more detail:

First, we run deviceQuery on the GPU with id 1:
root@localhost:~# docker run -it --entrypoint /bin/bash --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 nvcr.io/nvidia/tensorflow:21.09-tf2-py3
root@47d6d53c6cde:/workspace# deviceQuery
deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3090"
CUDA Driver Version / Runtime Version 11.6 / 9.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24268 MBytes (25447170048 bytes)
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
MapSMtoCores for SM 8.6 is undefined. Default to use 64 Cores/SM
(82) Multiprocessors, ( 64) CUDA Cores/MP: 5248 CUDA Cores
GPU Max Clock rate: 1695 MHz (1.70 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

And when we run with id 0:

root@localhost:~# docker run -it --entrypoint /bin/bash --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/tensorflow:21.09-tf2-py3
root@77e5a4e8e1ad:/workspace# deviceQuery
deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

At this step, deviceQuery hangs; until we reboot the VM, we cannot access the GPU with id 1 either!
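When a GPU wedges like this, the NVIDIA driver usually records an `NVRM: Xid` error in the kernel log at the moment of the hang, which can narrow down the cause. A small sketch for scanning log text for those lines (on a live system you would pipe `dmesg` into it, as the comment shows):

```shell
#!/bin/sh
# count_nvrm_errors: count NVIDIA driver (NVRM) Xid error lines in
# kernel-log text read from stdin. A non-zero count after a hang is
# worth correlating with the Xid error tables in NVIDIA's docs.
count_nvrm_errors() {
  grep -c 'NVRM: Xid'
}

# On the affected VM (run as root):
#   dmesg | count_nvrm_errors
```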

Virtualization (GPU passthrough) on GeForce cards is only supported when the VM OS is Windows; see here.