We are currently troubleshooting an issue where we are passing the physical GPU(s) of our compute nodes through to virtual machines via PCIe passthrough. The virtual machine sees the GPU, but we are unable to actually use it as 1) the primary display adapter or 2) the primary OpenGL renderer.
We are using the latest drivers on the VMs (418.74).
The current version of OpenStack we are running does not support vGPU, hence the PCI passthrough.
The Tesla P4 identifies itself as a 3D controller rather than a VGA compatible controller, so I don't believe we can use it as our primary display adapter (is this true?). We have attempted to install some of the NVIDIA tools (nvidia-gpumodeswitch, etc.), but those don't seem to be applicable to our device. Does this PCI device subclass actually matter?
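For anyone comparing outputs: the bracketed class code in `lspci -nn` (e.g. `[0302]` below) is what distinguishes the two device types, and as far as I understand it, only class 0300 devices can own the legacy VGA/boot console. A quick sketch of decoding it (the helper function is mine, not an NVIDIA tool):

```shell
# Decode the 4-digit PCI class code that lspci -nn prints in brackets.
# 0300 = VGA compatible controller (can be the boot/primary display),
# 0302 = 3D controller (headless, no VGA legacy support).
decode_pci_class() {
  case "$1" in
    0300) echo "VGA compatible controller" ;;
    0302) echo "3D controller" ;;
    *)    echo "other display class ($1)" ;;
  esac
}

decode_pci_class 0302   # -> 3D controller
decode_pci_class 0300   # -> VGA compatible controller
```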
GPU benchmarking tools such as Unigine, FurMark, and glxgears all report that there is no GPU on the system (or simply do not use it), even though the device is definitely "seen" by the OS. The Windows VM's Device Manager reports it as a display adapter after driver install, and the RHEL VM output is below:
[root@rhel-gpu-1 ~]# lspci -nnk
…
00:05.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
The display devices in RHEL:
[root@rhel-gpu-1 ~]# lshw -c display
*-display:0
description: VGA compatible controller
product: GD 5446
vendor: Cirrus Logic
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller rom
configuration: driver=cirrus latency=0
resources: irq:0 memory:f0000000-f1ffffff memory:fe050000-fe050fff memory:fe040000-fe04ffff
*-display:1
description: 3D controller
product: GP104GL [Tesla P4]
vendor: NVIDIA Corporation
physical id: 5
bus info: pci@0000:00:05.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: irq:11 memory:fd000000-fdffffff memory:e0000000-efffffff memory:f2000000-f3ffffff
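One avenue we are exploring: since the P4 reports class 0302 it cannot take over the console, but it should still be usable for offscreen/remote rendering under X. The snippet below is a sketch of the xorg.conf Device section we would try to force the NVIDIA driver onto the passed-through card (the BusID matches our VM's 00:05.0 above; AllowEmptyInitialConfiguration is the option that nvidia-xconfig --allow-empty-initial-configuration generates for display-less cards):

```
Section "Device"
    Identifier "TeslaP4"
    Driver     "nvidia"
    BusID      "PCI:0:5:0"
    Option     "AllowEmptyInitialConfiguration" "True"
EndSection
```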
On the compute host, where the physical GPUs are located, lspci shows the devices bound to the vfio-pci driver (see below), so I am not sure what else we might be missing.
3b:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
d8:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
af:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
This feels like something simple that we are missing, but has anyone else encountered similar issues?
nvidia-bug-report.log.gz (1.06 MB)