Problem with 550.54.14 + NVIDIA RTX A6000

We have two researchers with PCs running RHEL 9. Both systems stop working with the 550.54.14 drivers from NVIDIA’s dnf repo at:

https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/

When we downgrade to 545.23.08, everything starts working again. Both PCs have NVIDIA RTX A6000 cards, which appear to still be supported. Other systems we have here with different NVIDIA GPUs are not affected; only machines with one or more NVIDIA RTX A6000 cards are. On one of the affected machines I ran nvidia-bug-report.sh under both 545.23.08 and 550.54.14. When I ran it on 550.54.14, it crashed the machine, which then restarted, but it must have produced a partial log first.
nvidia-bug-report-545.23.08.log.gz (604.8 KB)
nvidia-bug-report-550.54.14.log.gz (172.7 KB)
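
For anyone who needs to inventory machines before upgrading, here is a minimal Python sketch that flags the combination reported above (an RTX A6000 on driver 550.54.14). It relies on `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`; the helper names (`parse_gpu_rows`, `is_affected`, `query_gpus`) are made up for illustration, and matching on the substring "RTX A6000" is an assumption about how the card names itself. Note that on a machine where the driver is already misbehaving, `nvidia-smi` itself may hang or fail, so this is meant to be run while a working driver is loaded.

```python
import csv
import io
import subprocess

# Values taken from this thread; adjust if other versions turn out to be affected.
AFFECTED_DRIVER = "550.54.14"
AFFECTED_GPU = "RTX A6000"  # assumed substring of the name nvidia-smi reports


def parse_gpu_rows(csv_text):
    """Turn 'name, driver_version' CSV lines into (name, driver) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) >= 2:  # skip blank or malformed lines
            rows.append((record[0].strip(), record[1].strip()))
    return rows


def is_affected(rows):
    """True if any GPU matches the card/driver pair reported in this thread."""
    return any(
        AFFECTED_GPU in name and driver == AFFECTED_DRIVER
        for name, driver in rows
    )


def query_gpus():
    """Ask nvidia-smi for the name and driver version of each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_rows(result.stdout)
```

On a live system, `is_affected(query_gpus())` gives a yes/no answer; the parsing is kept separate so it can also be run against output collected from other hosts.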

That’s a pretty critical bug. Please also mail it to linux-bugs[at]nvidia.com.

We just had another user with this issue when using two A6000s.

Please try to fix this soonish!

Thanks for highlighting this issue to us. I have filed bug 4566319 internally for tracking purposes.
@ptr1337
Can you please share an NVIDIA bug report?

I can also attest to issues with the latest drivers and A6000s. The problems show up on machines with both Intel and AMD processors; we tested both, each with A6000s.

In addition to the original poster’s issues, we’re also seeing problems extending into the OS display, with icons not showing up and words only half displaying. In KDE Plasma on Wayland, which implemented sync, we are experiencing constant backtracking and flickering in many apps.
nvidia-bug-report.log.gz (1.0 MB) ← this report was generated under X11, which has fewer flickering issues but still shows the main driver issues mentioned by the original poster.

This is a bug report generated in Wayland for comparison:
nvidia-bug-report.log.gz (1.0 MB)

Hi @gsgatlin
I was able to reproduce the issue locally, so we will be able to debug it now.

Hi,

Is there any ETA for getting a fix into the 550 driver?

We’re currently in a really bad situation with these enterprise cards, since the 535 driver only supports CUDA 12.2 and our distribution ships CUDA 12.4 by default.
The 545 driver (the last working one) currently lacks support for the 6.8 kernel and needs to be patched.

This really should be fixed as fast as possible, since these cards are mainly intended for CUDA work.

Is there any news on whether this has been fixed? Thanks.

The issue has been root-caused, and the fix is incorporated in the latest released driver:
https://us.download.nvidia.com/XFree86/Linux-x86_64/550.78/NVIDIA-Linux-x86_64-550.78.run
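
To check whether an installed driver is at or beyond the release named above, a small comparison like the following Python sketch may help. It assumes that NVIDIA version strings can be compared component by component as integers, and that releases after 550.78 keep the fix; this thread only confirms 550.78 itself. The function names are hypothetical.

```python
def parse_version(text):
    """Split a dotted driver version such as '550.54.14' into integers."""
    return tuple(int(part) for part in text.strip().split("."))


# Release that this thread names as containing the fix.
FIXED_IN = parse_version("550.78")


def is_at_least_fixed_release(driver_version):
    """True if the version is 550.78 or newer (component-wise comparison).

    This does not say whether an older branch was affected; for example,
    545.23.08 returns False here even though it was reported as working.
    """
    return parse_version(driver_version) >= FIXED_IN
```

For example, `is_at_least_fixed_release("550.54.14")` is false and `is_at_least_fixed_release("550.78")` is true. The version string can be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.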