Can I insert a P4 when I already have an M5000 in my Linux desktop?

I was using my M5000 in my desktop. Then, I bought a P4 and added it in another slot. Now, I cannot run my code on either one. I can see the two NVidia cards using lspci.

I executed: rpm -i nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm.
I received a NOKEY. Does this mean it installed?

I want to attach my logs to this topic but I don’t see how. Here is the interesting part of the log file. Notice that the ncurses v6 user interface is unable to load. How do I load it?

[root@MSSDR17041103 tcurry3]# more /var/log/nvidia-installer.log
nvidia-installer log file ‘/var/log/nvidia-installer.log’
creation time: Tue Jun 11 14:36:06 2019
installer version: 418.67

PATH: /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda-10.1/bin:/home/tcurry3/julia-1.1.1/bin

nvidia-installer command line:
nvidia-installer

Unable to load: nvidia-installer ncurses v6 user interface

Using: nvidia-installer ncurses user interface
-> Detected 32 CPUs online; setting concurrency level to 32.
-> Tagging shared libraries with chcon -t textrel_shlib_t.
ERROR: No package found for installation. Please run this utility with the ‘--help’ option for usage information.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

nvidia-bug-report.log.gz (76.9 KB)

Since you already had a Quadro installed, how did you previously install the driver?
The NOKEY message means that you tried to install that rpm from a repo for which the public key is not installed, so it failed.
The ncurses message is a red herring; the v6 interface only provides a fancier menu than the plain text interface. Please stay away from the .run installer, as it might break your system.
Let’s see if there’s already a driver installed:
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Thanks, generix for your quick reply.

I switched from Internet Explorer to Google Chrome and now I can clearly see the attach file paperclip. Please see attachment for log file.

I personally did not install the older M5000 driver so we will have to look at the logs for those driver attributes.

I will have to find the public key for the installer that I tried to use. I typed:
rpm -i nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64 when I received the NOKEY error.

I cannot find nvidia’s public key where I downloaded the linux driver:
nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm

I have several public keys for NASA’s repositories but I downloaded the above rpm from NVidia. Maybe I am installing this wrong.

You can use the --nosignature switch to skip the key check.

I added the --nosignature to my rpm command. Thanks. I no longer get the NOKEY nag. The rpm command states that the nvidia-diag-driver-local-repo-rhel7-418.67-1.0-1.x86_64.rpm was already installed.

I still think there is a conflict between the 5 year old NVidia M5000 and the new NVidia P4 hardware. Do you think the M5000 can use the new driver? Does the posted nvidia bug report file explain the issues?

I have been getting the following gpu code error ever since I installed the P4:
Error: magma_dmalloc( &d_A, ldda*N*batchCount )
failed at testing/testing_dgesv_batched.cpp:86: error -113: cannot allocate memory on GPU device

I see my two NVIDIA cards using lspci:
[root@MSSDR17041103 testing]# lspci | grep -i nvidia
03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev ff)
04:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Quadro M5000] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
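
For anyone repeating this check, here is a minimal sketch that reduces lspci output to one line per GPU. gpu_names is a hypothetical helper name, not an existing tool, and the sample lines are copied from the lspci output above.

```shell
# Hypothetical helper: keep only NVIDIA display devices from lspci output
# (read on stdin) and print "<bus address> <name in brackets>".
gpu_names() {
    grep -E '(3D|VGA compatible) controller: NVIDIA' |
        sed -E 's/^([0-9a-f:.]+) .*\[(.*)\].*$/\1 \2/'
}

# Sample input copied from the lspci output above.
lspci_sample='03:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev ff)
04:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Quadro M5000] (rev a1)
04:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)'

printf '%s\n' "$lspci_sample" | gpu_names
```

With the sample this prints one line for the Tesla P4 and one for the Quadro M5000; the audio function is filtered out. On a live system the call would be lspci | gpu_names.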

The Quadro M5000 is a Maxwell-generation GPU; it is well supported by the latest driver and CUDA.
After you added the local repo rpm, you still have to install the driver:
ii) yum clean all
iii) yum install cuda-drivers
iv) reboot
Afterwards, please create a new nvidia-bug-report.log so I can see the current state.

Wow, I did not know that. I typed ‘yum clean all’ and it seemed to work. Then, I typed ‘yum install cuda-drivers’ and I received several ‘will be installed’ messages but then I received:

--> Processing Conflict: nvidia-x11-drv-390.67-1.el7_5.elrepo.x86_64 conflicts dkms-nvidia
--> Processing Conflict: 3:nvidia-driver-418.67-4.el7.x86_64 conflicts nvidia-x11-drv
--> Processing Conflict: 3:dkms-nvidia-418.67-1.el7.x86_64 conflicts nvidia-kmod
--> Finished Dependency Resolution
Error: nvidia-driver conflicts with nvidia-x11-drv-390.67-1.el7_5.elrepo.x86_64
Error: nvidia-x11-drv conflicts with 3:dkms-nvidia-418.67-1.el7.x86_64
Error: dkms-nvidia conflicts with kmod-nvidia-390.67-1.el7_5.elrepo.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

Should I try to skip and add --skip-broken to the yum install cuda-drivers?

I will attach the log file.
nvidia-bug-report.log.gz (75.8 KB)
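
The conflict messages above are easier to read once they are reduced to package pairs. A minimal sketch, assuming the yum output has been saved or piped in; list_conflicts is a hypothetical helper name, not a yum feature, and the sample lines are copied from the error output above.

```shell
# Hypothetical helper: read yum output on stdin and print each
# "Error: A conflicts with B" line as "A <-> B".
list_conflicts() {
    sed -n 's/^Error: \(.*\) conflicts with \(.*\)$/\1 <-> \2/p'
}

# Sample input copied from the yum error output above.
yum_errors='Error: nvidia-driver conflicts with nvidia-x11-drv-390.67-1.el7_5.elrepo.x86_64
Error: nvidia-x11-drv conflicts with 3:dkms-nvidia-418.67-1.el7.x86_64
Error: dkms-nvidia conflicts with kmod-nvidia-390.67-1.el7_5.elrepo.x86_64'

printf '%s\n' "$yum_errors" | list_conflicts
```

Every pair has a 390.67 elrepo package on one side and a 418.67 package on the other, which suggests that two different driver packagings are competing for the same files.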

No, the previous driver was installed from Elrepo but without adding the repo itself. So it’s better to put this into a sane state so the driver will be updated alongside the system.
Please run

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum install kmod-nvidia nvidia-x11-drv

OK, Done. I rebooted but my gpu is still not accessible. Should I go back and type:

ii) yum clean all
iii) yum install cuda-drivers
iv) reboot

I will attach my new bug report to this dialog.

nvidia-bug-report.log.gz (75.9 KB)

Under no circumstances force-install the 418 driver: it uses a different file layout than the 390 driver, so forcing it would overwrite some files and leave a mix of versions.
Please post the output of
yum list installed |grep nvidia

OK. Here is the output you requested.

[root@MSSDR17041103 testing]# yum list installed |grep nvidia
Loaded plugins: langpacks, nvidia, product-id, rhnplugin, search-disabled-repos,
kmod-nvidia.x86_64 390.67-1.el7_5.elrepo installed
nvidia-diag-driver-local-repo-rhel7-418.67.x86_64
nvidia-x11-drv.x86_64 390.67-1.el7_5.elrepo installed
yum-plugin-nvidia.noarch 1.0.2-1.el7.elrepo installed

Is there a better way to test each GPU besides using my compiled C++ code? Does nvidia have a utility to acquire GPU attributes? I am not running the X server.

Do I need to execute nvidia-modprobe once after installing the new drivers?
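
On the question of a utility for GPU attributes: the driver normally comes with nvidia-smi, which talks to the kernel module directly and does not need an X server. A minimal sketch, assuming the driver is installed and loaded; gpu_report is a hypothetical wrapper name, while --query-gpu and --format are standard nvidia-smi options.

```shell
# Hypothetical wrapper: print index, name, driver version and total memory
# for every GPU the loaded driver can see, or explain why it cannot.
gpu_report() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; the driver userspace is probably not installed" >&2
        return 1
    fi
    nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv
}
```

Calling gpu_report should list one CSV row per GPU. If nvidia-smi is present but reports that it cannot communicate with the driver, the kernel module is most likely not loaded.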

Very interesting. Did you get any errors when you added the Elrepo repo?
Please remove the driver, then reinstall it:

yum remove "*nvidia*"
yum --disablerepo=\* --enablerepo=elrepo install kmod-nvidia nvidia-x11-drv

I removed the nvidia packages successfully with no errors.

I received several errors with the install. I copied the output to a file and attached it to this dialog.
nvidia_install_errors_Jun13.txt (58.6 KB)

Ok, you’re still running CentOS 7.5, but ELRepo seems to have dropped support for it and only provides packages for CentOS 7.6 and up.
Since you have now removed the old driver that was blocking everything, you have two options:

  • Upgrade to CentOS 7.6, then install the driver from ELRepo
    or
  • Install the driver you downloaded

The second option works as a short-term solution, but in the long run, e.g. on a CentOS upgrade, you will likely run into problems, so you would then have to remove the driver again and use option 1.

I finally got RH 7.6 installed per generix’s request.
I can see both NVidia cards in the PCI bus (GPU0 is the M5000 and GPU1 is the P4).

I need to install cuda in /usr/local/cuda10.1 but I am still having many installation conflicts. I have an HP Z840.

Should I use 418, 390, or 410 nvidia drivers with RedHat7.6?
I ask this because I am getting nvidia conflicts between 410 and 418.

Is there a work around to installing Cuda? I think my nvidia drivers are installed properly but I just need cuda.

Below is the output of yum list installed | grep nvidia:

Loaded plugins: langpacks, nvidia, product-id, rhnplugin, search-disabled-repos,
kmod-nvidia.x86_64 410.93-1.el7_6.elrepo @aces_elrepo_rhel7.6_64_prod
nvidia-x11-drv.x86_64 410.93-1.el7_6.elrepo @aces_elrepo_rhel7.6_64_prod
nvidia-x11-drv-libs.x86_64 410.93-1.el7_6.elrepo @aces_elrepo_rhel7.6_64_prod
yum-plugin-nvidia.noarch 1.0.2-1.el7.elrepo installed

See attached for the latest bug report.
nvidia-bug-report_June25.log.gz (1.51 MB)
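
The package list above can be boiled down to the driver version each piece carries. A minimal sketch, assuming the two-package ELRepo layout seen in this thread (kmod-nvidia plus nvidia-x11-drv); driver_versions is a hypothetical helper name, and the sample lines are copied from the yum output above.

```shell
# Hypothetical helper: read "yum list installed" lines on stdin and print
# "<package> <driver version>" for the kernel module and the userspace driver.
driver_versions() {
    awk '$1 ~ /^(kmod-nvidia|nvidia-x11-drv)\./ { split($2, v, "-"); print $1, v[1] }'
}

# Sample input copied from the yum list output above.
pkg_sample='kmod-nvidia.x86_64 410.93-1.el7_6.elrepo @aces_elrepo_rhel7.6_64_prod
nvidia-x11-drv.x86_64 410.93-1.el7_6.elrepo @aces_elrepo_rhel7.6_64_prod
nvidia-x11-drv-libs.x86_64 410.93-1.el7_6.elrepo @aces_elrepo_rhel7.6_64_prod
yum-plugin-nvidia.noarch 1.0.2-1.el7.elrepo installed'

printf '%s\n' "$pkg_sample" | driver_versions
```

Both lines should show the same version. Here the kernel module and the userspace driver are both at 410.93, so the installed driver is at least consistent with itself.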

CUDA 10.1 needs at least driver 418. How did you end up with the 410 driver installed?

yum --disablerepo=\* --enablerepo=elrepo install kmod-nvidia nvidia-x11-drv

should install the latest 430 driver.
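
The 418 minimum can be turned into a quick check before installing the toolkit. A minimal sketch, assuming that comparing the major version is enough; driver_ok_for_cuda101 is a hypothetical helper name, and the threshold is the one quoted above.

```shell
# Hypothetical helper: succeed if the given driver version meets the
# 418 minimum quoted above for CUDA 10.1 (major version only).
driver_ok_for_cuda101() {
    major=${1%%.*}          # "410.93" -> "410"
    [ "$major" -ge 418 ]
}

# Versions that have appeared in this thread.
for v in 390.67 410.93 418.67; do
    if driver_ok_for_cuda101 "$v"; then
        echo "$v: new enough for CUDA 10.1"
    else
        echo "$v: too old for CUDA 10.1"
    fi
done
```

On a running system the version string could come from nvidia-smi --query-gpu=driver_version --format=csv,noheader instead of a hard-coded list.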

When the correct driver is running, you can download the rpm at
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=RHEL&target_version=7&target_type=rpmnetwork
and then install cuda-toolkit rather than cuda, since the cuda metapackage would also pull in NVIDIA’s own driver packages:

sudo rpm -i cuda-repo-rhel7-10.1.168-1.x86_64.rpm
sudo yum clean all
sudo yum install cuda-toolkit-10-1

I tried:

yum --disablerepo=* --enablerepo=elrepo install kmod-nvidia nvidia-x11-drv

but I got the error:

[root@MSSDR17041103 tcurry3]# yum --disablerepo=* --enablerepo=elrepo install kmod-nvidia nvidia-x11-drv
Loaded plugins: langpacks, nvidia, product-id, rhnplugin, search-disabled-repos,
: subscription-manager, versionlock
This system is receiving updates from RHN Classic or Red Hat Satellite.

Error getting repository data for elrepo, repository not found

Please re-add it by using the commands from post #9.