Driver 410.57 for NVIDIA GeForce RTX 2080 Ti causes reboots; the .run installer reports no driver installed

We have an NVIDIA GeForce RTX 2080 Ti A1 on AlmaLinux 8.9 (a variant of Red Hat 8.9),

after installing NVIDIA-Linux-x86_64-410.57.run and CUDA from /etc/yum.repos.d/cuda-rhel8.repo with
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
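(For reference, the repo file contents are roughly as follows; the exact gpgkey file name is my assumption and may differ:)
[cuda-rhel8-x86_64]
name=cuda-rhel8-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub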

We see some errors in the logs, and any attempt to run applications on the GPU causes a system reboot.
I have tried the uninstall option to remove it, but it reports: There is no NVIDIA driver currently installed.
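(For reference, the uninstall attempt was along these lines; the exact invocation is an assumption:)
sudo nvidia-uninstall
# or, equivalently, via the original runfile:
sudo sh NVIDIA-Linux-x86_64-410.57.run --uninstall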

How can I remove this driver (I noticed one topic with the response that 410.57 is outdated and won’t work with new kernels)? Which driver will work with this HW?
thank you,
Roozbeh

(a variant of Red Hat 8.9)

Red Hat Enterprise Linux 8.9 is distributed with the kernel version 4.18.0-513.5.1

The 4.18 kernel is from 12 August 2018.

NVIDIA-Linux-x86_64-410.57

from the Linux AMD64 Display Driver Archive | NVIDIA, dated September 19, 2018,

so it should work, if the driver installed correctly.

and any attempt to run applications on the GPU causes a system reboot.

If you try to run a “new” CUDA app on old drivers, it will not work in most cases.
But maybe in your case it is just an installation error.
I have no idea why you need this old distro and drivers.

Which driver will work with this HW?

With the RTX 2080 Ti A1, the latest 550+ driver will work.
But I have no idea if the latest driver will work with kernel 4.18.

The obvious easy way is to just use the latest Ubuntu and install the latest driver there.
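(As a rough sketch of that route, assuming a recent Ubuntu release; package names may differ:)
sudo ubuntu-drivers autoinstall      # picks the recommended NVIDIA driver
# or explicitly, for example:
sudo apt install nvidia-driver-550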

Thanks for the suggestions, but we can’t use Ubuntu for CAD tools. It suits applications that require the latest kernel; with CAD tools, we need a more stable environment (and older kernels).
I was able to remove the CUDA packages that were installed from the repo. After installing the latest package from the NVIDIA site (550.78), I still get system reboots every now and then, and it is not clear what the reason is.
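(Roughly what I did; the package globs are approximate:)
sudo dnf remove "cuda*" "*nvidia*"         # remove the repo-installed CUDA/driver packages
sudo sh ./NVIDIA-Linux-x86_64-550.78.run   # then install the 550.78 runfile driver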

We have one GPU in the system, but lspci reports 2. I also see error logs about the 2nd one, which probably does not exist (or the nvidia driver does not see it), such as the following:
[nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00001a00] Failed to allocate NvKms KapiDevice

nvidia-smi does not show this device (0x00001a00), but it reports the device 0x00006800, which, according to the log messages, gets loaded with no errors.
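(If it helps, the GPU ID in the nvidia-drm message appears to encode the PCI address, i.e. 0x00001a00 would be 1a:00.0 and 0x00006800 would be 68:00.0; this can be cross-checked roughly like this:)
lspci -nn -d 10de:                 # list all NVIDIA PCI devices
nvidia-smi -L                      # list the GPUs the driver actually sees
nvidia-smi -q | grep -i "bus id"   # PCI bus IDs of the GPUs the driver manages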

The following is the debug report (/usr/bin/nvidia-bug-report.sh --safe-mode --output-file nvidia-bug-report.log):
nvidia-bug-report.log.gz (141.7 KB)

If you need more information, please let me know.
P.S. I had to use --safe-mode since, without it, the system crashed and rebooted. Again, it is not clear why.

The RHEL 4.18 kernel contains a lot of backports, so it resembles more a 5.19 kernel, and the 410 driver won’t compile.
That is also not necessary, as there’s already a 550 driver installed; using the runfile installer would rather break it.
The GPU at PCI 1a:00.0 fails to initialize; I suspect it’s broken. Please try reseating it in its slot and check whether it works in another system.
Furthermore, I don’t know why you used the CUDA 8 repo; Turing-based GPUs need CUDA 10 minimum.
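(To narrow that down, a couple of quick checks on the failing board, assuming it is still at 1a:00.0:)
lspci -vv -s 1a:00.0                        # does it enumerate, and at what link speed/width?
journalctl -k | grep -iE 'xid|nvrm|1a:00'   # any NVRM/Xid errors for that address in the kernel log?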

410 was not installed, since CUDA from cuda-rhel8.repo was installed, which included the driver and CUDA 12.4.
I had to remove it to install the 550 driver, hoping that would fix the problem, since it also comes with CUDA 12.4. However, we still get reboots when the GPUs are accessed; even running nvidia-smi causes it.

I am not sure I understand you right. By the CUDA 8 repo, do you mean cuda-rhel8.repo? In that case, I don’t see how I could install it from cuda-rhel10.repo?!
More likely, it would not meet the dependency requirements on an RHEL 8 type of OS (AlmaLinux 8.9). The current repo does have CUDA 12.4, which is also included in the NVIDIA 550 driver that I installed. If you have any suggestions on how to address the rebooting problem, I would be grateful.
I saw a discussion on PyTorch.org that might be useful, using nvidia-smi commands:
nvidia-smi -pm 1
nvidia-smi -lgc 1400 etc.
Random reboot during training - #7 by Saurabh_Bagalkar - PyTorch Forums
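For what it’s worth, those commands enable persistence mode and lock the GPU clocks, which can help rule out boost/power transients as the trigger. A fuller sketch (the clock values are just an example):
sudo nvidia-smi -pm 1                   # persistence mode: keep the driver initialized between jobs
sudo nvidia-smi -lgc 300,1400           # lock GPU clocks to a conservative min,max range
nvidia-smi -q -d POWER,TEMPERATURE      # monitor power draw and temperatures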

FYI, swapping the GPU cards and checking the power connections caused the 2nd one to show up in the nvidia-smi report, while the 1st one, which was active before, now reports errors in the logs.
We are going to verify the PCIe slots and perhaps try another pair.

We tried putting the boards in different PCIe slots, and the same problem was observed: one is not detected (errors reported by the driver in the logs), and the other one (or the driver) still causes the machine to reboot.

This machine used to work and ran CUDA applications on its previous OS (Ubuntu) without rebooting. Now, on AlmaLinux 8, after installing the NVIDIA driver/CUDA 12.4, it reboots even when it is not running GPU applications.
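(One thing that might reveal the trigger: the kernel log from the boot just before a crash, assuming journald keeps persistent logs:)
journalctl -k -b -1 | grep -iE 'xid|nvrm|nvidia'   # NVRM/Xid messages leading up to the reboot
journalctl -k -b -1 | tail -n 50                   # last kernel messages of the previous boot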

Any suggestions on how to remedy this?

Forget about me mentioning CUDA 8; I misread your post.
I suspect there’s a bug in the 550 driver you currently have installed when used with dual-GPU setups; please try a 535 driver.

Thank you for your follow-up. Is the NVIDIA driver included in the CUDA repo (e.g., the 535 driver)?
https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64

Or does it have to be installed separately?

The RHEL 8 repo has module streams (https://developer.nvidia.com/blog/streamlining-nvidia-driver-deployment-on-rhel-8-with-modularity-streams/),
so you should be able to use the 535-dkms stream.
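A rough sketch of switching to that stream (per the blog post above; remove any runfile-installed driver with nvidia-uninstall first):
sudo dnf module reset nvidia-driver               # clear the currently enabled stream
sudo dnf module install nvidia-driver:535-dkms    # install the 535 DKMS stream, then reboot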