Tesla P4 PCI pass through from RHOSP 13 to RHEL 7.6/Windows VMs issues

kkadak · June 6, 2019, 4:41pm

We are troubleshooting an issue passing the physical GPU(s) of our compute nodes through to virtual machines via PCIe passthrough. The virtual machine sees the GPU, but we are unable to use it as 1) the primary display adapter or 2) the primary OpenGL renderer.

We are using the latest drivers on the VMs (418.74).

The current version of OpenStack we are running does not support vGPU, hence the PCI passthrough.

The Tesla P4 identifies itself as a 3D controller rather than a VGA compatible controller, so I don’t believe we can use it as our primary display adapter (is this true?). We have attempted to install some of the NVIDIA tools (nvidia-gpumodeswitch, etc.), but those don’t seem to be applicable to our device. Does this PCI device subclass “actually” matter?

Running any of the GPU benchmarking tools, like Unigine, FurMark, or glxgears, basically reports that there is no GPU on the system or that it is not being used, but the GPU is definitely “seen” by the OS. Device Manager in the Windows VM reports it as a display adapter after driver install, and the RHEL VM output is below:

[root@rhel-gpu-1 ~]# lspci -nnk
…
00:05.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia

the display options in RHEL:
[root@rhel-gpu-1 ~]# lshw -c display
*-display:0
description: VGA compatible controller
product: GD 5446
vendor: Cirrus Logic
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller rom
configuration: driver=cirrus latency=0
resources: irq:0 memory:f0000000-f1ffffff memory:fe050000-fe050fff memory:fe040000-fe04ffff
*-display:1
description: 3D controller
product: GP104GL [Tesla P4]
vendor: NVIDIA Corporation
physical id: 5
bus info: pci@0000:00:05.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: irq:11 memory:fd000000-fdffffff memory:e0000000-efffffff memory:f2000000-f3ffffff

The lspci output on the compute host, where the physical GPUs are located, shows that they are set up to use the vfio-pci driver (see below), so I am not sure what else we might be missing.

3b:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau

d8:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau

af:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d8]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
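
To confirm at a glance which driver each device is bound to, the `lspci -nnk` output can be condensed to one line per device. The following is a minimal sketch, assuming the default lspci address format without a PCI domain prefix; the helper name `pci_driver_summary` is made up for illustration.

```shell
# Summarize "lspci -nnk" output read from stdin: one line per device with
# its PCI address, class code, and the kernel driver currently bound to it.
# Devices without a "Kernel driver in use" line are not printed.
pci_driver_summary() {
    awk '
        # Device header lines start with a bus address such as 3b:00.0
        /^[0-9a-f]+:[0-9a-f]+\.[0-9a-f]/ {
            addr = $1
            class = "????"
            # The class code is the first bracketed 4-digit hex value
            # followed by a colon, e.g. [0302]:
            if (match($0, /\[[0-9a-f][0-9a-f][0-9a-f][0-9a-f]\]:/))
                class = substr($0, RSTART + 1, 4)
        }
        /Kernel driver in use:/ {
            print addr, class, $NF
        }
    '
}
```

With the host output above, `lspci -nnk | pci_driver_summary` would print lines such as `3b:00.0 0302 vfio-pci` for the passed-through cards; inside the guest the same device should show `nvidia` instead.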

This seems to be something simple/stupid, but has anyone else encountered similar issues?

nvidia-bug-report.log.gz (1.06 MB)

generix · June 7, 2019, 8:37am

The kernel driver loads fine, just an xserver configuration issue. Please remove your xorg.conf and replace it with just this:

Section "Device"
    Identifier     "nvidia"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:0:5:0"
    Option         "UseDisplayDevice" "none"
    Option         "AllowEmptyInitialConfiguration"
EndSection

kkadak · June 7, 2019, 1:07pm

@generix

I made the change to /etc/X11/xorg.conf and rebooted. At this point I am not seeing any indication that I am using the gpu. I am attempting to test/prove this with glxgears and if I run nvidia-smi while that process is running, I get the following output:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With my initial configuration, which was generated with nvidia-xconfig, I was originally able to start the X server (X -config <path to nvidia-xconfig file>), and that X server process was present in nvidia-smi. Running glxgears at that time still did not appear to leverage the GPU. With the new config I am unable to start the X server. Is this an OpenGL issue? Is there any way that I can prove a process is leveraging the GPU for 3D rendering?

glxinfo when run from an SSH session

name of display: localhost:10.0
display: localhost:10 screen: 0
direct rendering: No (If you want to find out why, try setting LIBGL_DEBUG=verbose)
server glx vendor string: SGI
server glx version string: 1.2
server glx extensions:
GLX_ARB_multisample, GLX_EXT_import_context, GLX_EXT_visual_info,
GLX_EXT_visual_rating, GLX_OML_swap_method, GLX_SGIS_multisample,
GLX_SGIX_fbconfig, GLX_SGIX_hyperpipe, GLX_SGIX_swap_barrier,
GLX_SGI_make_current_read
client glx vendor string: NVIDIA Corporation
client glx version string: 1.4
client glx extensions:
GLX_ARB_context_flush_control, GLX_ARB_create_context,
GLX_ARB_create_context_no_error, GLX_ARB_create_context_profile,
GLX_ARB_create_context_robustness, GLX_ARB_fbconfig_float,
GLX_ARB_get_proc_address, GLX_ARB_multisample, GLX_EXT_buffer_age,
GLX_EXT_create_context_es2_profile, GLX_EXT_create_context_es_profile,
GLX_EXT_fbconfig_packed_float, GLX_EXT_framebuffer_sRGB,
GLX_EXT_import_context, GLX_EXT_stereo_tree, GLX_EXT_swap_control,
GLX_EXT_swap_control_tear, GLX_EXT_texture_from_pixmap,
GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_NV_copy_buffer,
GLX_NV_copy_image, GLX_NV_delay_before_swap, GLX_NV_float_buffer,
GLX_NV_multisample_coverage, GLX_NV_present_video,
GLX_NV_robustness_video_memory_purge, GLX_NV_swap_group,
GLX_NV_video_capture, GLX_NV_video_out, GLX_SGIX_fbconfig,
GLX_SGIX_pbuffer, GLX_SGI_swap_control, GLX_SGI_video_sync
GLX version: 1.2
GLX extensions:
GLX_ARB_get_proc_address, GLX_ARB_multisample, GLX_EXT_import_context,
GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_SGIX_fbconfig
OpenGL vendor string: Intel
OpenGL renderer string: Intel(R) HD Graphics 630
OpenGL version string: 1.2 (4.4.0 - Build 22.20.16.4691)
OpenGL extensions:
GL_ARB_depth_texture, GL_ARB_multitexture, GL_ARB_point_parameters,
GL_ARB_point_sprite, GL_ARB_shadow, GL_ARB_texture_border_clamp,
GL_ARB_texture_cube_map, GL_ARB_texture_env_add,
GL_ARB_texture_env_combine, GL_ARB_texture_env_crossbar,
GL_ARB_texture_env_dot3, GL_ARB_texture_mirrored_repeat,
GL_ARB_transpose_matrix, GL_ARB_window_pos, GL_EXT_abgr, GL_EXT_bgra,
GL_EXT_blend_color, GL_EXT_blend_func_separate, GL_EXT_blend_minmax,
GL_EXT_blend_subtract, GL_EXT_draw_range_elements, GL_EXT_fog_coord,
GL_EXT_multi_draw_arrays, GL_EXT_packed_pixels, GL_EXT_rescale_normal,
GL_EXT_secondary_color, GL_EXT_separate_specular_color,
GL_EXT_shadow_funcs, GL_EXT_stencil_two_side, GL_EXT_stencil_wrap,
GL_EXT_texture3D, GL_EXT_texture_edge_clamp, GL_EXT_texture_env_add,
GL_EXT_texture_env_combine, GL_EXT_texture_lod_bias,
GL_IBM_texture_mirrored_repeat, GL_NV_blend_square,
GL_NV_texgen_reflection, GL_SGIS_generate_mipmap, GL_SGIS_texture_lod

Thanks!

nvidia-bug-report0607.log.gz (1.02 MB)

generix · June 7, 2019, 1:29pm

Please provide a new nvidia-bug-report.log with the config from my post in place.
The output of glxinfo via ssh is using indirect glx, i.e. it’s from your local machine.
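
One quick way to tell which case applies is to look at `DISPLAY`: under ssh X forwarding it carries a host part (here `localhost:10.0`), while a display served by the VM's own X server starts with a colon (for example `:0`). A small sketch of that check follows; `display_kind` is a hypothetical helper, and `unix:0` style values are ignored for simplicity.

```shell
# Classify an X display string as "local", "remote", or "unset".
# With no argument, the current $DISPLAY is inspected.
display_kind() {
    case "${1:-${DISPLAY:-}}" in
        "")  echo unset ;;
        :*)  echo local ;;    # e.g. ":0", an X server on this machine
        *)   echo remote ;;   # e.g. "localhost:10.0", typical for ssh forwarding
    esac
}
```

Running `display_kind` in the ssh session that produced the glxinfo output above would report `remote`, which matches the Intel renderer string coming from the local workstation rather than from the VM.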

kkadak · June 7, 2019, 1:34pm

Attaching now. Thanks!

generix · June 7, 2019, 2:22pm

You renamed the old xorg.conf to xorg.conf.manualbackup and since it begins with ‘xorg.conf’ it’s used anyway. Please remove it completely so it uses the xorg.conf from my post, then create a new nvidia-bug-report.log.
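
A quick way to spot leftover copies is to list every regular file in /etc/X11 whose name starts with xorg.conf but is not exactly xorg.conf. This is only a minimal sketch; the function name `stray_xorg_confs` is illustrative, and the directory is a parameter so it can be tried on a scratch directory first.

```shell
# List regular files in a directory (default /etc/X11) whose names begin
# with "xorg.conf" but are not exactly "xorg.conf".
stray_xorg_confs() {
    dir=${1:-/etc/X11}
    for f in "$dir"/xorg.conf*; do
        # Skip directories such as xorg.conf.d and an unmatched glob pattern.
        if [ ! -f "$f" ]; then
            continue
        fi
        if [ "$(basename "$f")" != "xorg.conf" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Anything it prints (for example a `xorg.conf.manualbackup`) is a candidate to move out of /etc/X11 before generating the next bug report.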

kkadak · June 7, 2019, 2:28pm

Hmm, interesting, there were multiple in there, so I deleted all of them.

Uploading to this post now.

nvidia-bug-report0607-2.log.gz (1010 KB)

generix · June 7, 2019, 2:46pm

Ok, looks better now. The error now is

[  3250.013] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[  3250.013] (EE) NVIDIA(GPU-0):     displayless
[  3250.013] (EE) NVIDIA(GPU-0): Failed to select a display subsystem.

which is kind of funny since you’re not using vGPU, or are you using the grid driver?
Please comment out that option in the xorg.conf and create a new nvidia-bug-report.log.
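
In xorg.conf a `#` starts a comment, so commenting out the option means prefixing its line with `#`. Below is a sketch of doing that with sed, writing the result to stdout rather than editing the file in place; `comment_out_option` is a hypothetical helper name.

```shell
# Read xorg.conf text on stdin and comment out every Option line whose
# option name equals $1. The modified text is written to stdout.
comment_out_option() {
    sed "/^[[:space:]]*Option[[:space:]][[:space:]]*\"$1\"/s/^/#/"
}
```

For example, `comment_out_option UseDisplayDevice < /etc/X11/xorg.conf` prints the config with only the `UseDisplayDevice` line disabled; the output can be reviewed before it replaces the original file.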

kkadak · June 7, 2019, 4:03pm

Agreed, this is something I have been struggling with. We are not using (or intending to use) GRID, since our version of RHOSP does not currently support vGPUs. We are just downloading the drivers directly from NVIDIA:

Option 1: Manually find drivers for my NVIDIA products.
Product Type: Tesla
Product Series: P-Series
Product: Tesla P4
Operating System: Linux 64-bit RHEL 7
CUDA Toolkit: 10.1

TESLA DRIVER FOR LINUX RHEL 7

Version: 418.67
Release Date: 2019.5.7
Operating System: Linux 64-bit RHEL7
CUDA Toolkit: 10.1
Language: English (US)
File Size: 154.4 MB

Is this maybe something that we need to communicate to the BIOS of the card? Attaching a new bug report.

nvidia-bug-report0607-3.log.gz (974 KB)

kkadak · June 7, 2019, 4:06pm

Not sure if the comment worked, so I just deleted the line entirely and am submitting a new bug report here.
nvidia-bug-report0607-4.log.gz (942 KB)

generix · June 7, 2019, 4:44pm

No new Xorg log has been created, so I don’t really know what’s happening now; I suspect gdm has started. Please run
sudo journalctl -b0 --no-pager _COMM=gdm-x-session >xorg.log
and attach the output file.

kkadak · June 7, 2019, 4:54pm

The only thing output to that log is

-- No entries --

However, if I manually start X I am back in my previous state:

sudo X

X.Org X Server 1.20.1
X Protocol Version 11, Revision 0
Build Operating System: 2.6.32-754.2.1.el6.x86_64
Current Operating System: Linux rhel-gpu-1 3.10.0-957.10.1.el7.x86_64 #1 SMP Thu Feb 7 07:12:53 UTC 2019 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-957.10.1.el7.x86_64 root=UUID=6c248666-70f5-4037-8b24-17100c2f5c1e ro console=tty0 crashkernel=auto console=ttyS0,115200n8 no_timer_check net.ifnames=0 modprobe.blacklist=nouveau
Build Date: 13 February 2019 01:35:02PM
Build ID: xorg-x11-server 1.20.1-5.3.el7_6
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Jun 7 12:48:28 2019
(==) Using config file: "/etc/X11/xorg.conf"

nvidia-smi

Fri Jun  7 12:48:50 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74       Driver Version: 418.74       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   33C    P8     6W /  75W |     22MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9756      G   X                                             22MiB |
+-----------------------------------------------------------------------------+
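
In the process table above, the `G` in the Type column marks a graphics context (as opposed to `C` for compute), so the X server now holds a graphics context on the Tesla P4. The sketch below pulls the graphics processes out of such a table, assuming the 418-series column layout shown above (other driver versions may arrange the columns differently); `gpu_graphics_procs` is an illustrative name.

```shell
# Read nvidia-smi output on stdin and print "PID name" for every process
# row whose Type column contains G (graphics). Process names containing
# spaces are cut at the first space.
gpu_graphics_procs() {
    awk '
        # Process rows look like: |    0      9756      G   X      22MiB |
        $1 == "|" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $4 ~ /G/ {
            print $3, $5
        }
    '
}
```

Running `nvidia-smi | gpu_graphics_procs` while a 3D application is active on that X server should list the application next to X if it is rendering on the GPU.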

Is there a way to launch glxgears or another program so that it takes advantage of that X session and uses the GPU for 3D rendering?

I am also attaching another bug report with the new X session running.

Thanks again for the help, really appreciate it.

nvidia-bug-report-0607-5.log.gz (918 KB)

generix · June 7, 2019, 5:16pm

Looks good now

[ 11881.958] (II) NVIDIA(0): Virtual screen size determined to be 2560 x 1600

To run something on it over ssh, you’ll have to prepend the DISPLAY, e.g.
DISPLAY=:0 glxgears
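
To check the renderer the same way, glxinfo can be pointed at that display and its output inspected for direct rendering and an NVIDIA vendor string. A minimal sketch, in which `glx_uses_nvidia` is a hypothetical helper that only reads glxinfo text from stdin:

```shell
# Succeed (exit status 0) only if the glxinfo text on stdin reports direct
# rendering and an OpenGL vendor string beginning with NVIDIA.
glx_uses_nvidia() {
    awk '
        /^direct rendering: Yes/        { direct = 1 }
        /^OpenGL vendor string: NVIDIA/ { nvidia = 1 }
        END { exit !(direct && nvidia) }
    '
}
```

Usage would look like `DISPLAY=:0 glxinfo | glx_uses_nvidia && echo "rendering on the NVIDIA GPU"`. Fed the earlier glxinfo output from the forwarded ssh display, the check fails, since that output shows indirect rendering with an Intel vendor string.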

kkadak · June 7, 2019, 5:23pm

THANK YOU! I was getting something like 1200 FPS, but am now getting 40000 FPS. That satisfies my requirement to show that we can actually use PCI pass through of the GPU.

Thanks again!

generix · June 7, 2019, 6:13pm

Glad it works now. Just as a note, this forum uses Super-AI (aka really psychotic) spam protection so your post #9 wasn’t visible until now. AFAIK, the Tesla drivers have some drawbacks regarding graphics use, e.g. maybe no 32bit application compatibility, so YMMV. Check out the general graphics drivers if you hit problems.

zero.module · April 30, 2021, 1:09pm

generix, thank you so much for this advice! I’ve spent almost the whole week trying to solve a similar issue!

In my case, I was getting error “pci id for fd 15: 1013:00b8, driver (null)
EGL_MESA_drm_image required”.
