Hi,
I have 4 Quadro Plex 7000 units, so 8 GPUs in total, and I want to install them on a CentOS 7.6 system.
But so far I have failed to do so. I am working on a system with 2 Quadro Plex units connected via PCIe, so 4 GPUs “onboard”.
I can see the GPUs using lspci:
[root@localhost ~]# lspci | grep -i nvidia
01:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
02:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
02:01.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
02:02.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
02:03.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
05:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a2)
06:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a2)
06:02.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a2)
07:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 7000] (rev a1)
07:00.1 Audio device: NVIDIA Corporation GF110 High Definition Audio Controller (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 7000] (rev a1)
08:00.1 Audio device: NVIDIA Corporation GF110 High Definition Audio Controller (rev a1)
0a:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
0b:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
0b:01.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
0b:02.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
0b:03.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a3)
0e:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a2)
0f:00.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a2)
0f:02.0 PCI bridge: NVIDIA Corporation NF200 PCIe 2.0 switch for Quadro Plex S4 / Tesla S870 / Tesla S1070 / Tesla S2050 (rev a2)
10:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 7000] (rev a1)
11:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 7000] (rev a1)
[root@localhost ~]#
First I tried the official driver from the NVIDIA website for the Quadro Plex 7000: NVIDIA-Linux-x86_64-410.73.run.
The Quadro Plex 7000 is listed as supported by this driver.
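For reference, I launched the installer the standard way, roughly like this (run from the directory I downloaded it to):
[root@localhost ~]# sh NVIDIA-Linux-x86_64-410.73.run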
But the installer printed this error:
WARNING: The NVIDIA Quadro 7000 GPU installed in this system is supported through the NVIDIA 390.xx legacy Linux graphics drivers. Please visit
http://www.nvidia.com/object/unix.html for more information. The 410.73 NVIDIA Linux graphics driver will ignore this GPU.
WARNING: You do not appear to have an NVIDIA GPU supported by the 410.73 NVIDIA Linux graphics driver installed in this system. For further details,
please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in the README available on the Linux driver download page at www.nvidia.com.
So I tried the 390.xx driver listed for the Quadro 6000: NVIDIA-Linux-x86_64-390.87.run.
It installs and builds fine, even with DKMS.
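The install command was roughly the following; the --dkms option registers the kernel module with DKMS so it gets rebuilt on kernel updates:
[root@localhost ~]# sh NVIDIA-Linux-x86_64-390.87.run --dkms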
But when running any CUDA program, or even just nvidia-smi, I hit a serious issue:
[ 373.505109] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[ 373.514195] vgaarb: device changed decodes: PCI:0000:07:00.0,olddecodes=none,decodes=none:owns=none
[ 373.523698] vgaarb: device changed decodes: PCI:0000:08:00.0,olddecodes=none,decodes=none:owns=none
[ 373.533108] vgaarb: device changed decodes: PCI:0000:10:00.0,olddecodes=none,decodes=none:owns=none
[ 373.542478] vgaarb: device changed decodes: PCI:0000:11:00.0,olddecodes=none,decodes=none:owns=none
[ 373.551954] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 390.87 Tue Aug 21 12:33:05 PDT 2018 (using threaded interrupts)
[ 397.937609] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [nvidia-smi:42308]
[ 397.945465] Modules linked in: nvidia(POE) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm snd_hda_codec_hdmi irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel snd_hda_codec ipmi_ssif snd_hda_core snd_hwdep pcspkr snd_seq snd_seq_device joydev snd_pcm sg snd_timer snd soundcore lpc_ich i2c_i801 ipmi_si mei_me mei ioatdma ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ixgbe mgag200 i2c_algo_bit mdio drm_kms_helper crct10dif_pclmul ptp crct10dif_common syscopyarea pps_core mlx4_core sysfillrect sysimgblt ttm fb_sys_fops devlink crc32c_intel dca drm megaraid_sas ahci ipmi_devintf ipmi_msghandler libahci libata drm_panel_orientation_quirks [last unloaded: nvidia]
[ 398.055310] CPU: 24 PID: 42308 Comm: nvidia-smi Kdump: loaded Tainted: P OE ------------ 3.10.0-957.10.1.el7.x86_64 #1
[ 398.067121] Hardware name: Bull SAS bullx/X9QR7-TF+/X9QRi-F+, BIOS R28E3X32 11/08/2012
[ 398.075041] task: ffff965e77c08000 ti: ffff965e7dc54000 task.ti: ffff965e7dc54000
[ 398.082544] RIP: 0010:[<ffffffffc210f793>] [<ffffffffc210f793>] _nv029836rm+0x13/0x30 [nvidia]
[ 398.091494] RSP: 0018:ffff965e7dc57828 EFLAGS: 00000246
[ 398.096801] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 398.103926] RDX: ffffb737e6000000 RSI: ffff969e2ffd8008 RDI: ffff969e64316008
[ 398.111067] RBP: ffff969e30ea2de0 R08: 0000000000000020 R09: ffff969e30ea2dec
[ 398.118199] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000020
[ 398.125331] R13: ffff969e30ea2dec R14: 0000000000000000 R15: ffffffffc1df30a0
[ 398.132456] FS: 00007f20417c0740(0000) GS:ffff969e7f200000(0000) knlGS:0000000000000000
[ 398.140541] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 398.146279] CR2: 00007f20417cb000 CR3: 000000ff36328000 CR4: 00000000000607e0
[ 398.153437] Call Trace:
[ 398.156111] [<ffffffffc1e029d4>] ? _nv020968rm+0xe4/0x110 [nvidia]
[ 398.162574] [<ffffffffc1df317f>] ? _nv020928rm+0xdf/0x190 [nvidia]
[ 398.169027] [<ffffffffc1df3196>] ? _nv020928rm+0xf6/0x190 [nvidia]
[ 398.175517] [<ffffffffc1f72c00>] ? _nv024529rm+0x70/0xb0 [nvidia]
[ 398.181907] [<ffffffffc1f72139>] ? _nv024740rm+0xc9/0x160 [nvidia]
[ 398.188390] [<ffffffffc1f7295d>] ? _nv024759rm+0x6d/0x280 [nvidia]
[ 398.194871] [<ffffffffc1f55eed>] ? _nv024763rm+0x23d/0x260 [nvidia]
[ 398.201430] [<ffffffffc1f5bcde>] ? _nv012843rm+0x17e/0x1f0 [nvidia]
[ 398.207960] [<ffffffffc207112e>] ? _nv024408rm+0xbe/0x2f0 [nvidia]
[ 398.214388] [<ffffffffc2070dfd>] ? _nv024409rm+0x28d/0x500 [nvidia]
[ 398.220891] [<ffffffffc2088c41>] ? _nv029348rm+0x4d1/0x590 [nvidia]
[ 398.227390] [<ffffffffc2088daa>] ? _nv029380rm+0xaa/0x1e0 [nvidia]
[ 398.233812] [<ffffffffc2088f20>] ? _nv029347rm+0x40/0x50 [nvidia]
[ 398.240108] [<ffffffffc21211a4>] ? _nv001159rm+0x244/0x680 [nvidia]
[ 398.246589] [<ffffffffc2115ada>] ? rm_init_adapter+0x11a/0x130 [nvidia]
[ 398.253302] [<ffffffff8fcd6701>] ? try_to_wake_up+0x361/0x390
[ 398.259205] [<ffffffffc1aa42e0>] ? nv_open_device+0x380/0x760 [nvidia]
[ 398.265849] [<ffffffff8fe1c1c5>] ? kmem_cache_alloc+0x35/0x1f0
[ 398.271840] [<ffffffffc1aa4adc>] ? nvidia_open+0x14c/0x300 [nvidia]
[ 398.278262] [<ffffffffc1aa2388>] ? nvidia_frontend_open+0x58/0xb0 [nvidia]
[ 398.285230] [<ffffffff8fe46e35>] ? chrdev_open+0xb5/0x1b0
[ 398.290718] [<ffffffff8fe3eeca>] ? do_dentry_open+0x1aa/0x2e0
[ 398.296576] [<ffffffff8fef9252>] ? security_inode_permission+0x22/0x30
[ 398.303183] [<ffffffff8fe46d80>] ? cdev_put+0x30/0x30
[ 398.308323] [<ffffffff8fe3f09a>] ? vfs_open+0x5a/0xb0
[ 398.313496] [<ffffffff8fe4d5a8>] ? may_open+0x68/0x120
[ 398.318722] [<ffffffff8fe4fbad>] ? do_last+0x1ed/0x12a0
[ 398.324043] [<ffffffff8fe52a67>] ? path_openat+0xd7/0x640
[ 398.329585] [<ffffffff8ff04624>] ? selinux_inode_setattr+0x104/0x110
[ 398.336020] [<ffffffff8fe5446d>] ? do_filp_open+0x4d/0xb0
[ 398.341562] [<ffffffff8fe61af7>] ? __alloc_fd+0x47/0x170
[ 398.346959] [<ffffffff8fe40597>] ? do_sys_open+0x137/0x240
[ 398.352596] [<ffffffff90375d15>] ? system_call_after_swapgs+0xa2/0x146
[ 398.359202] [<ffffffff8fe406be>] ? SyS_open+0x1e/0x20
[ 398.364343] [<ffffffff90375ddb>] ? system_call_fastpath+0x22/0x27
[ 398.370561] [<ffffffff90375d21>] ? system_call_after_swapgs+0xae/0x146
[ 398.377166] Code: 31 ff e8 91 14 00 00 48 89 c7 e8 69 a1 f9 ff 0f b7 c3 5b c3 0f 1f 40 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 62 14 00 00 48 89 c7 e8 3a a1 f9 ff 89 d8
Message from syslogd@localhost at Apr 26 14:06:47 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [nvidia-smi:42308]
And of course, the system is then hung. I have to power-cycle it through IPMI to bring it back up.
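For the record, the power cycle is issued from another machine with something like this (BMC address and credentials are placeholders):
ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power cycle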
My kernel is 3.10.0-957.10.1.el7.x86_64.
Here is the dmesg output filtered for nvidia:
[root@localhost ~]# dmesg | grep -i nvidia
[ 10.474866] nvidia: loading out-of-tree module taints kernel.
[ 10.480874] nvidia: module license 'NVIDIA' taints kernel.
[ 10.490912] nvidia: module license 'NVIDIA' taints kernel.
[ 10.600378] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 10.726364] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[ 10.734935] nvidia 0000:07:00.0: enabling device (0000 -> 0003)
[ 10.751007] nvidia 0000:08:00.0: enabling device (0000 -> 0003)
[ 10.781349] nvidia 0000:10:00.0: enabling device (0000 -> 0003)
[ 10.808070] nvidia 0000:11:00.0: enabling device (0000 -> 0003)
[ 10.832814] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 390.87 Tue Aug 21 12:33:05 PDT 2018 (using threaded interrupts)
[ 10.913963] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 390.87 Tue Aug 21 16:16:14 PDT 2018
[ 10.933198] [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[ 10.945858] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:07:00.0 on minor 1
[ 10.962975] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[ 10.969078] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 2
[ 10.977775] [drm] [nvidia-drm] [GPU ID 0x00001000] Loading driver
[ 10.983876] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:10:00.0 on minor 3
[ 10.993050] [drm] [nvidia-drm] [GPU ID 0x00001100] Loading driver
[ 10.999155] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:11:00.0 on minor 4
[ 19.895837] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:02.0/0000:08:00.1/sound/card1/input7
[ 19.911157] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:02.0/0000:08:00.1/sound/card1/input8
[ 19.928875] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:02.0/0000:08:00.1/sound/card1/input9
[ 19.954701] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/sound/card0/input11
[ 19.954937] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/sound/card0/input12
[ 19.955035] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/sound/card0/input13
[ 19.955086] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:02.0/0000:08:00.1/sound/card1/input10
[ 19.955208] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/sound/card0/input14
[root@localhost ~]#
Does anyone have an idea how to make these devices work? I need to configure these 2 servers to get all 8 cards running.
Kind regards
Ox