Tesla P4 PCI pass through from RHOSP 13 to RHEL 7.6/Windows VMs issues

We are currently troubleshooting an issue where we pass the physical GPUs of our compute nodes through to the virtual machines via PCIe passthrough. The virtual machine sees the GPU, but we are unable to use it as 1) the primary display adapter or 2) the primary OpenGL renderer.

We are using the latest drivers on the VMs (418.74).

The current version of OpenStack we are running does not support vGPU, hence the PCI passthrough.

The Tesla P4 identifies itself as a 3D controller and not a VGA compatible controller, so I don’t believe that we can use it as our primary display adapter (is this true?). We have attempted to install some of the NVIDIA tools (nvidia-gpumodeswitch, etc.), but those don’t seem to be applicable to our device. Does this PCI device subclass “actually” matter?

Running any of the GPU benchmarking tools, like Unigine, FurMark, or glxgears, basically reports that there is no GPU on the system or that it is not being used, but it is definitely “seen” by the OS. The Windows VM Device Manager reports it as a display adapter after driver install, and the RHEL VM output is below:

[root@rhel-gpu-1 ~]# lspci -nnk

00:05.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
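The class code asked about above is the first bracketed four-digit field in the `lspci -nn` line: 0300 is a VGA compatible controller and 0302 is a 3D controller. As a rough sketch (the function name `pci_display_class` is made up for illustration, and it assumes the `lspci -nn` line layout shown above), the class of every NVIDIA function can be pulled out like this:

```shell
# Hypothetical helper: read `lspci -nn` (or `lspci -nnk`) output on stdin and
# print "<slot> <class-code> <label>" for every NVIDIA (vendor 10de) device.
# 0300 = VGA compatible controller, 0302 = 3D controller.
pci_display_class() {
    awk '
        /\[10de:[0-9a-f]+\]/ {
            # The first bracketed field made of exactly four hex digits
            # is the PCI class code, e.g. [0302].
            if (match($0, /\[[0-9a-f][0-9a-f][0-9a-f][0-9a-f]\]/)) {
                code = substr($0, RSTART + 1, 4)
                label = "other"
                if (code == "0300") label = "vga"
                if (code == "0302") label = "3d"
                print $1, code, label
            }
        }
    '
}
```

Usage would be `lspci -nn | pci_display_class`. This only reports the class; it does not by itself say whether the device can drive an X screen.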

the display options in RHEL:
[root@rhel-gpu-1 ~]# lshw -c display
*-display:0
description: VGA compatible controller
product: GD 5446
vendor: Cirrus Logic
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller rom
configuration: driver=cirrus latency=0
resources: irq:0 memory:f0000000-f1ffffff memory:fe050000-fe050fff memory:fe040000-fe04ffff
*-display:1
description: 3D controller
product: GP104GL [Tesla P4]
vendor: NVIDIA Corporation
physical id: 5
bus info: pci@0000:00:05.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: irq:11 memory:fd000000-fdffffff memory:e0000000-efffffff memory:f2000000-f3ffffff

The lspci output on the compute host, where the physical GPUs are located, shows that they are set up to use the vfio-pci driver (see below), so I am not sure what else we might be missing.

3b:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau

d8:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau

af:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
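With several GPUs per host, the same check can be scripted. The sketch below assumes the `lspci -nnk` layout shown above; the function name `nvidia_not_on_vfio` is hypothetical. It prints any NVIDIA device whose driver in use is something other than vfio-pci and prints nothing when all of them are bound correctly:

```shell
# Hypothetical helper: read `lspci -nnk` output on stdin and print
# "<slot> <driver>" for every NVIDIA (vendor 10de) device that is NOT bound
# to vfio-pci. A device with no driver line is reported as "none".
nvidia_not_on_vfio() {
    awk '
        function report() {
            if (slot != "" && driver != "vfio-pci")
                print slot, (driver == "" ? "none" : driver)
        }
        # A new device record starts with a slot such as "3b:00.0".
        /^[0-9a-f:.]+ / {
            report()
            slot = ""; driver = ""
            if ($0 ~ /\[10de:/) slot = $1
        }
        /Kernel driver in use:/ { if (slot != "") driver = $NF }
        END { report() }
    '
}
```

On the compute host this would be run as `lspci -nnk | nvidia_not_on_vfio`; empty output matches the state shown above.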

This seems to be something simple/stupid, but has anyone else encountered similar issues?

nvidia-bug-report.log.gz (1.06 MB)

The kernel driver loads fine, just an xserver configuration issue. Please remove your xorg.conf and replace it with just this:

Section "Device"
    Identifier     "nvidia"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:0:5:0"
    Option         "UseDisplayDevice" "none"
    Option         "AllowEmptyInitialConfiguration"
EndSection

@generix

I made the change to /etc/X11/xorg.conf and rebooted. At this point I am not seeing any indication that I am using the gpu. I am attempting to test/prove this with glxgears and if I run nvidia-smi while that process is running, I get the following output:

Fri Jun 7 08:56:17 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74       Driver Version: 418.74       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   28C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With my initial configuration, which was generated with nvidia-xconfig, I was originally able to start the X server (X -config <path to nvidia-xconfig file>), and that X server process was present in nvidia-smi. Running glxgears at that time still did not appear to leverage the GPU. With the new config I am unable to start the X server. Is this an OpenGL issue? Is there any way that I can prove a process is leveraging the GPU for 3D rendering?
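One way to check this is the Processes table at the bottom of the nvidia-smi output, where graphics clients are listed with type G (or C+G). The sketch below is an assumption-laden illustration: the function name `gpu_graphics_processes` is made up, and it parses the table layout printed by the 418.xx driver in this thread, which may differ in other driver versions. Only the first word of the process name is printed.

```shell
# Hypothetical helper: read plain `nvidia-smi` output on stdin and print
# "<pid> <process-name>" for every row of the Processes table whose type is
# G (graphics) or C+G. Assumes the 418.xx table layout:
#   |  GPU  PID  Type  Process name  Usage |
gpu_graphics_processes() {
    awk '
        $1 == "|" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ &&
        ($4 == "G" || $4 == "C+G") {
            print $3, $5
        }
    '
}
```

Running `nvidia-smi | gpu_graphics_processes` while a benchmark is active should list that benchmark next to the X server if it is rendering on the GPU; empty output corresponds to "No running processes found".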

glxinfo when run from an SSH session

name of display: localhost:10.0
display: localhost:10 screen: 0
direct rendering: No (If you want to find out why, try setting LIBGL_DEBUG=verbose)
server glx vendor string: SGI
server glx version string: 1.2
server glx extensions:
GLX_ARB_multisample, GLX_EXT_import_context, GLX_EXT_visual_info,
GLX_EXT_visual_rating, GLX_OML_swap_method, GLX_SGIS_multisample,
GLX_SGIX_fbconfig, GLX_SGIX_hyperpipe, GLX_SGIX_swap_barrier,
GLX_SGI_make_current_read
client glx vendor string: NVIDIA Corporation
client glx version string: 1.4
client glx extensions:
GLX_ARB_context_flush_control, GLX_ARB_create_context,
GLX_ARB_create_context_no_error, GLX_ARB_create_context_profile,
GLX_ARB_create_context_robustness, GLX_ARB_fbconfig_float,
GLX_ARB_get_proc_address, GLX_ARB_multisample, GLX_EXT_buffer_age,
GLX_EXT_create_context_es2_profile, GLX_EXT_create_context_es_profile,
GLX_EXT_fbconfig_packed_float, GLX_EXT_framebuffer_sRGB,
GLX_EXT_import_context, GLX_EXT_stereo_tree, GLX_EXT_swap_control,
GLX_EXT_swap_control_tear, GLX_EXT_texture_from_pixmap,
GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_NV_copy_buffer,
GLX_NV_copy_image, GLX_NV_delay_before_swap, GLX_NV_float_buffer,
GLX_NV_multisample_coverage, GLX_NV_present_video,
GLX_NV_robustness_video_memory_purge, GLX_NV_swap_group,
GLX_NV_video_capture, GLX_NV_video_out, GLX_SGIX_fbconfig,
GLX_SGIX_pbuffer, GLX_SGI_swap_control, GLX_SGI_video_sync
GLX version: 1.2
GLX extensions:
GLX_ARB_get_proc_address, GLX_ARB_multisample, GLX_EXT_import_context,
GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_SGIX_fbconfig
OpenGL vendor string: Intel
OpenGL renderer string: Intel® HD Graphics 630
OpenGL version string: 1.2 (4.4.0 - Build 22.20.16.4691)
OpenGL extensions:
GL_ARB_depth_texture, GL_ARB_multitexture, GL_ARB_point_parameters,
GL_ARB_point_sprite, GL_ARB_shadow, GL_ARB_texture_border_clamp,
GL_ARB_texture_cube_map, GL_ARB_texture_env_add,
GL_ARB_texture_env_combine, GL_ARB_texture_env_crossbar,
GL_ARB_texture_env_dot3, GL_ARB_texture_mirrored_repeat,
GL_ARB_transpose_matrix, GL_ARB_window_pos, GL_EXT_abgr, GL_EXT_bgra,
GL_EXT_blend_color, GL_EXT_blend_func_separate, GL_EXT_blend_minmax,
GL_EXT_blend_subtract, GL_EXT_draw_range_elements, GL_EXT_fog_coord,
GL_EXT_multi_draw_arrays, GL_EXT_packed_pixels, GL_EXT_rescale_normal,
GL_EXT_secondary_color, GL_EXT_separate_specular_color,
GL_EXT_shadow_funcs, GL_EXT_stencil_two_side, GL_EXT_stencil_wrap,
GL_EXT_texture3D, GL_EXT_texture_edge_clamp, GL_EXT_texture_env_add,
GL_EXT_texture_env_combine, GL_EXT_texture_lod_bias,
GL_IBM_texture_mirrored_repeat, GL_NV_blend_square,
GL_NV_texgen_reflection, GL_SGIS_generate_mipmap, GL_SGIS_texture_lod

Thanks!

nvidia-bug-report0607.log.gz (1.02 MB)

Please provide a new nvidia-bug-report.log with the config from my post in place.
The output of glxinfo via ssh is using indirect glx, i.e. it’s from your local machine.

Attaching now. Thanks!

You renamed the old xorg.conf to xorg.conf.manualbackup and since it begins with ‘xorg.conf’ it’s used anyway. Please remove it completely so it uses the xorg.conf from my post, then create a new nvidia-bug-report.log.
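To spot leftover files like that before restarting X, something along these lines may help. It is a minimal sketch: the function name `stray_xorg_confs` is hypothetical, and it only lists candidates by name. It does not determine which file the X server actually loads; the "Using config file" line in the Xorg log remains the authoritative answer.

```shell
# Hypothetical helper: list regular files in a directory (default /etc/X11)
# whose names start with "xorg.conf" but are not exactly "xorg.conf",
# e.g. xorg.conf.backup or xorg.conf.manualbackup.
stray_xorg_confs() {
    dir="${1:-/etc/X11}"
    for f in "$dir"/xorg.conf?*; do
        # -f skips the unexpanded pattern (no match) and the xorg.conf.d
        # directory, which is a legitimate part of the X configuration.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Calling `stray_xorg_confs` with no argument checks /etc/X11. Moving the listed files to a different directory keeps a backup without leaving them next to the active xorg.conf.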

Hmm, interesting, there were multiple backup files in there, so I deleted all of them.

Uploading to this post now.

nvidia-bug-report0607-2.log.gz (1010 KB)

Ok, looks better now. The error now is

[  3250.013] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[  3250.013] (EE) NVIDIA(GPU-0):     displayless
[  3250.013] (EE) NVIDIA(GPU-0): Failed to select a display subsystem.

which is kind of funny since you’re not using vGPU, or are you using the grid driver?
Please comment out that option in the xorg.conf and create a new nvidia-bug-report.log.
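In xorg.conf, a line is commented out by putting a `#` in front of it. As a hedged sketch of doing that non-interactively (the function name `comment_out_usedisplaydevice` is made up, and it only matches the exact `Option "UseDisplayDevice"` spelling used in the config above):

```shell
# Hypothetical helper: print a copy of the given xorg.conf to stdout with any
# active `Option "UseDisplayDevice" ...` line commented out with "# ".
# The input file is not modified; lines already commented are left alone.
comment_out_usedisplaydevice() {
    sed 's/^\([[:space:]]*\)\(Option[[:space:]][[:space:]]*"UseDisplayDevice".*\)$/\1# \2/' "$1"
}
```

For example, `comment_out_usedisplaydevice /etc/X11/xorg.conf > /tmp/xorg.conf.new` writes the edited copy outside /etc/X11 so it can be reviewed before it replaces the original.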

Agreed, this is something I have been struggling with. We are not using (or intending to use) GRID, since our version of RHOSP does not currently support vGPUs. We are just downloading the drivers directly from NVIDIA:

https://www.nvidia.com/download/index.aspx

Option 1: Manually find drivers for my NVIDIA products.
Product Type: Tesla
Product Series: P-Series
Product: Tesla P4
Operating System: Linux 64-bit RHEL 7
CUDA Toolkit: 10.1

TESLA DRIVER FOR LINUX RHEL 7

Version: 418.67
Release Date: 2019.5.7
Operating System: Linux 64-bit RHEL7
CUDA Toolkit: 10.1
Language: English (US)
File Size: 154.4 MB

Is this maybe something that we need to communicate to the BIOS of the card? Attaching a new bug report.

nvidia-bug-report0607-3.log.gz (974 KB)

I was not sure if the comment worked, so I just deleted the line entirely and am submitting a new bug report here.
nvidia-bug-report0607-4.log.gz (942 KB)

I don’t really know what’s happening now, since no new Xorg log has been created. I suspect gdm has started, so please run
sudo journalctl -b0 --no-pager _COMM=gdm-x-session >xorg.log
and attach the output file.

The only thing that was output to that log is

-- No entries --

However, if I manually start X I am back in my previous state:

sudo X

X.Org X Server 1.20.1
X Protocol Version 11, Revision 0
Build Operating System: 2.6.32-754.2.1.el6.x86_64
Current Operating System: Linux rhel-gpu-1 3.10.0-957.10.1.el7.x86_64 #1 SMP Thu Feb 7 07:12:53 UTC 2019 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-957.10.1.el7.x86_64 root=UUID=6c248666-70f5-4037-8b24-17100c2f5c1e ro console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check net.ifnames=0 modprobe.blacklist=nouveau
Build Date: 13 February 2019 01:35:02PM
Build ID: xorg-x11-server 1.20.1-5.3.el7_6
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Jun 7 12:48:28 2019
(==) Using config file: "/etc/X11/xorg.conf"

nvidia-smi

Fri Jun  7 12:48:50 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74       Driver Version: 418.74       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   33C    P8     6W /  75W |     22MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9756      G   X                                             22MiB |
+-----------------------------------------------------------------------------+

Is there a way to launch glxgears or another program so that it uses that X session and renders on the GPU?

I am also attaching another bug report with the new X session running.

Thanks again for the help, really appreciate it.

nvidia-bug-report-0607-5.log.gz (918 KB)

Looks good now

[ 11881.958] (II) NVIDIA(0): Virtual screen size determined to be 2560 x 1600

To run something on it over ssh, you’ll have to prepend the DISPLAY, e.g.
DISPLAY=:0 glxgears
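The same DISPLAY prefix works for glxinfo, which gives a quick yes/no check on where rendering happens. The sketch below is an illustration only: the function name `glx_is_nvidia_direct` is hypothetical, and it relies on the "direct rendering:" and "OpenGL vendor string:" lines that glxinfo prints, as in the output quoted earlier in this thread.

```shell
# Hypothetical helper: read `glxinfo` output on stdin and exit with status 0
# only if direct rendering is enabled AND the OpenGL vendor is NVIDIA.
glx_is_nvidia_direct() {
    awk '
        /^direct rendering: Yes/          { direct = 1 }
        /^OpenGL vendor string: NVIDIA/   { nvidia = 1 }
        END { exit !(direct && nvidia) }
    '
}
```

From an ssh session this could be used as `DISPLAY=:0 glxinfo | glx_is_nvidia_direct && echo "rendering on the NVIDIA GPU"`. Without the DISPLAY prefix, the forwarded display is used and the check is expected to fail, as in the indirect GLX output above.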

THANK YOU! I was getting something like 1200 FPS, but now I am getting 40000 FPS. That satisfies my requirement to show that we can actually use PCI passthrough of the GPU.

Thanks again!

Glad it works now. Just as a note, this forum uses Super-AI (aka really psychotic) spam protection so your post #9 wasn’t visible until now. AFAIK, the Tesla drivers have some drawbacks regarding graphics use, e.g. maybe no 32bit application compatibility, so YMMV. Check out the general graphics drivers if you hit problems.

generix, thank you so much for this advice! I’ve spent almost the whole week trying to solve a similar issue!

In my case, I was getting error “pci id for fd 15: 1013:00b8, driver (null)
EGL_MESA_drm_image required”.