Driver 410.57 for NVIDIA GeForce RTX 2080 Ti causes reboots; the .run installer reports no driver installed

We have an NVIDIA GeForce RTX 2080 Ti A1 on AlmaLinux 8.9 (a variant of Red Hat 8.9),

after installing NVIDIA-Linux-x86_64-410.57.run and CUDA from /etc/yum.repos.d/cuda-rhel8.repo with
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
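(For reference, the repo file contents are roughly as follows; the exact gpgkey file name is my assumption and may differ:)
[cuda-rhel8-x86_64]
name=cuda-rhel8-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub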

We see some errors in the logs, and any attempt to run applications on the GPU causes a system reboot.
I have tried the uninstall option to remove it, but it reports: There is no NVIDIA driver currently installed.
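(For reference, the uninstall attempt was along these lines; the exact invocation is an assumption:)
sudo nvidia-uninstall
# or, equivalently, via the original runfile:
sudo sh NVIDIA-Linux-x86_64-410.57.run --uninstall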

How can I remove this driver (I noticed one topic with the response that 410.57 is outdated and won’t work with new kernels)? Which driver will work with this HW?
thank you,
Roozbeh

(a variant of Red Hat 8.9)

Red Hat Enterprise Linux 8.9 is distributed with the kernel version 4.18.0-513.5.1

The 4.18 kernel is from 12 August 2018.

NVIDIA-Linux-x86_64-410.57

from the Linux AMD64 Display Driver Archive | NVIDIA, dated September 19, 2018,

so it should work, if the driver installed correctly.

and any attempt to run applications on the GPU causes a system reboot.

If you try to run a “new” CUDA app on old drivers, it will not work in most cases.
But maybe in your case it is just an installation error.
I have no idea why you need this old distro and drivers.

Which driver will work with this HW?

With the RTX 2080 Ti A1, the latest 550+ driver will work.
But I have no idea if the latest driver will work with kernel 4.18.

The obvious easy way is to just use the latest Ubuntu and install the latest driver there.
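(As a rough sketch of that route, assuming a recent Ubuntu release; package names may differ:)
sudo ubuntu-drivers autoinstall      # picks the recommended NVIDIA driver
# or explicitly, for example:
sudo apt install nvidia-driver-550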

Thanks for the suggestions, but we can’t use Ubuntu for CAD tools. It suits applications that require the latest kernel; with CAD tools, we need a more stable environment (and older kernels).
I was able to remove the CUDA packages that were installed from the repo. After installing the latest package from the NVIDIA site (550.78), I still get system reboots every now and then, and it is not clear what the reason is.
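(Roughly what I did; the package globs are approximate:)
sudo dnf remove "cuda*" "*nvidia*"         # remove the repo-installed CUDA/driver packages
sudo sh ./NVIDIA-Linux-x86_64-550.78.run   # then install the 550.78 runfile driver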

We have one GPU in the system, but lspci reports 2. I also see error logs about the 2nd one, which probably does not exist (or the nvidia driver does not see it), such as the following:
[nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00001a00] Failed to allocate NvKms KapiDevice

nvidia-smi does not show this device (0x00001a00), but it reports the device 0x00006800, which, according to the log messages, gets loaded with no errors.
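(If it helps, the GPU ID in the nvidia-drm message appears to encode the PCI address, i.e. 0x00001a00 would be 1a:00.0 and 0x00006800 would be 68:00.0; this can be cross-checked roughly like this:)
lspci -nn -d 10de:                 # list all NVIDIA PCI devices
nvidia-smi -L                      # list the GPUs the driver actually sees
nvidia-smi -q | grep -i "bus id"   # PCI bus IDs of the GPUs the driver manages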

The following is the debug report (/usr/bin/nvidia-bug-report.sh --safe-mode --output-file nvidia-bug-report.log):
nvidia-bug-report.log.gz (141.7 KB)

If you need more information, please let me know.
P.S. I had to use --safe-mode since, without it, the system crashed and rebooted. Again, it is not clear why.

The RHEL 4.18 kernel contains a lot of backports, so it resembles more a 5.19 kernel, and the 410 driver won’t compile.
That is also not necessary, as there’s already a 550 driver installed; using the runfile installer would rather break it.
The GPU at PCI 1a:00.0 fails to initialize; I suspect it’s broken. Please try reseating it in its slot and check whether it works in another system.
Furthermore, I don’t know why you used the CUDA 8 repo; Turing-based GPUs need CUDA 10 minimum.
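(To narrow that down, a couple of quick checks on the failing board, assuming it is still at 1a:00.0:)
lspci -vv -s 1a:00.0                        # does it enumerate, and at what link speed/width?
journalctl -k | grep -iE 'xid|nvrm|1a:00'   # any NVRM/Xid errors for that address in the kernel log?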

410 was not installed, since CUDA from cuda-rhel8.repo was installed, which included the driver and CUDA 12.4.
I had to remove it to install the 550 driver, hoping that would fix the problem, since it also comes with CUDA 12.4. However, we still get reboots when the GPUs are accessed; even running nvidia-smi causes it.

I am not sure I understand you right. By the CUDA 8 repo, do you mean cuda-rhel8.repo? In that case, I don’t see how I could install it from cuda-rhel10.repo?!
More likely, it would not meet the dependency requirements on an RHEL 8 type of OS (AlmaLinux 8.9). The current repo does have CUDA 12.4, which is also included in the NVIDIA 550 driver that I installed. If you have any suggestions on how to address the rebooting problem, I would be grateful.
I saw a discussion on PyTorch.org that might be useful, using nvidia-smi commands:
nvidia-smi -pm 1
nvidia-smi -lgc 1400 etc.
Random reboot during training - #7 by Saurabh_Bagalkar - PyTorch Forums
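For what it’s worth, those commands enable persistence mode and lock the GPU clocks, which can help rule out boost/power transients as the trigger. A fuller sketch (the clock values are just an example):
sudo nvidia-smi -pm 1                   # persistence mode: keep the driver initialized between jobs
sudo nvidia-smi -lgc 300,1400           # lock GPU clocks to a conservative min,max range
nvidia-smi -q -d POWER,TEMPERATURE      # monitor power draw and temperatures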

FYI, swapping the GPU cards and checking the power connections caused the 2nd one to show up in the nvidia-smi report, while the 1st one, which was active before, now reports errors in the logs.
We are going to verify the PCIe slots and perhaps try another pair.

We tried putting the boards in different PCIe slots, and the same problem was observed: one is not detected (errors reported by the driver in the logs), and the other one (or the driver) still causes the machine to reboot.

This machine used to work and ran CUDA applications on its previous OS (Ubuntu) without rebooting. Now, on AlmaLinux 8, after installing the NVIDIA driver/CUDA 12.4, it reboots even when it is not running GPU applications.
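(One thing that might reveal the trigger: the kernel log from the boot just before a crash, assuming journald keeps persistent logs:)
journalctl -k -b -1 | grep -iE 'xid|nvrm|nvidia'   # NVRM/Xid messages leading up to the reboot
journalctl -k -b -1 | tail -n 50                   # last kernel messages of the previous boot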

Any suggestions on how to remedy this?

Forget about me mentioning CUDA 8; I misread your post.
I suspect there’s a bug in the 550 driver you currently have installed when used with dual-GPU setups; please try a 535 driver.

Thank you for your follow-up. Is the NVIDIA driver included in the CUDA repo (e.g., the 535 driver)?
https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64

Or does it have to be installed separately?

The RHEL 8 repo has module streams (https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/),
so you should be able to use the 535-dkms stream.
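A rough sketch of switching to that stream (per the blog post above; remove any runfile-installed driver with nvidia-uninstall first):
sudo dnf module reset nvidia-driver               # clear the currently enabled stream
sudo dnf module install nvidia-driver:535-dkms    # install the 535 DKMS stream, then reboot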