Computer intermittently freezing due to nvidia driver

Hello,

I have a Lenovo P350 desktop that came with an RTX A5000 GPU. It is running the latest version of Ubuntu 22.04 LTS. Since I bought it a few months ago, I have been experiencing intermittent freezes. I would leave the computer executing a task or idling (this does not seem to matter) and it would freeze: the display would stay on but stuck on the last image, the mouse light would stay on but the cursor would not move, and the keyboard would stay powered but not respond. The freezes seem to be random: they are not tied to any particular task (the PC could be just idling) and they happen at irregular intervals (sometimes a week passes with no freezes, sometimes one happens within a day). After several troubleshooting steps (memory test, different kernel versions, reinstalling the OS), I decided to remove the NVIDIA driver, since it has been causing me trouble since I bought the PC (more on this below). That fixed the issue: I have not experienced any freezes in a month.
As for the trouble I had installing drivers since I bought the PC: most drivers would not work. After installing one and restarting the PC, I would be stuck on an error at boot or simply on a black screen. To boot the computer again I would go into recovery mode and purge everything.

Some extra background:
My last two attempts went as follows:
I had the 530 driver installed and it worked (apart from the freezing). When I checked the system logs from just before a freeze, I found a “failed to grab modeset ownership” error. I am not sure it is related: the freeze happened overnight, so the error could be irrelevant, and this was the only time I noticed it.
I then tried switching to the driver recommended by the Ubuntu Additional Drivers tool. I installed the 535 server driver and was stuck on the Lenovo logo for a while (if it matters, my loglevel is 3 in the ACPI boot options). After around 20 minutes I forced a shutdown and powered the PC on again; it got past the Lenovo logo but was stuck on a black screen showing “-”. I had to purge everything in recovery mode and try another driver. I tried the 535 non-server driver and it was stuck on a black screen with an error saying “hdaudiocod2 unable to configure disabling”; this was the first time I had seen this error. I purged everything again, and moving to a different driver was problematic (I would get an X error or an unmet-dependencies error). I ended up successfully installing the driver from the official website: Linux x64 (AMD64/EM64T) Display Driver | 535.54.03 | Linux 64-bit | NVIDIA. This is working, but the freezes still happen.

Your help is highly appreciated.

Hello @thoumisergio and welcome to the NVIDIA developer forums!

If I interpret your explanation correctly your PC was running correctly with driver v530 but then suddenly started to exhibit the freezes? Can you somehow run the GPU in a different machine and maybe Windows to check if it is possibly a Hardware defect?

In any case, could you attach the output of nvidia-bug-report.sh here so we can check how the driver installation looks and if there is suspicious log behavior?

Thanks!

Hello,

Sorry for the confusion. My PC was never running correctly. This is the case since I bought it brand new. In the best case, it would run with intermittent freezing. In the worst case, the driver would not work at all. Here is my bug report as requested.
nvidia-bug-report.log (2.5 MB)

Thank you!

@MarkusHoHo As an important update, my PC froze during the weekend. Here is a screenshot of the important system logs before the freeze as well as the bug report after the freeze. There does not seem to be anything of relevance in the system logs.
Note that this time the screen went black. This happens, but less frequently; in the more common kind of freeze the screen stays stuck on its last image. The other “symptoms” are the same.

Thank you!

nvidia-bug-report.log (4.5 MB)

Hi again,

Sorry that I didn’t reply earlier, I was away for some time.

After reading your further details and the logs, I think you have the wrong driver installed.

There are several indicators in the nvidia-bug-report.log:

  • you seem to be using the Open Kernel module, which is not officially supported for this GPU
      nvidia-uvm: Loaded the UVM driver, major device number 506.
      NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
      NVRM: Open nvidia.ko is only ready for use on Data Center GPUs.
      NVRM: To force use of Open nvidia.ko on other GPUs, see the
      NVRM: 'OpenRmEnableUnsupportedGpus' kernel module parameter described
      NVRM: in the README.
  • the kernel module looks like it is not authenticated correctly and taints certain libraries
      nvidia: loading out-of-tree module taints kernel.
      nvidia: module license 'NVIDIA' taints kernel.
      Disabling lock debugging due to kernel taint
      Creating 1 MTD partitions on "0000:00:1f.5":
      0x000000000000-0x000002000000 : "BIOS"
      nvidia: module verification failed: signature and/or required key missing - tainting kernel
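
For anyone who wants to look for the same indicators in their own log, here is a minimal sketch. It assumes an uncompressed copy of the log is available, and `scan_nvidia_log` is a hypothetical helper name, not part of any NVIDIA tool. It only counts lines containing the messages quoted above; it does not diagnose anything by itself.

```shell
# Sketch: count how often the indicator messages quoted above appear in a log.
# scan_nvidia_log is a hypothetical helper, not an NVIDIA-provided command.
scan_nvidia_log() {
    logfile=$1
    # Fixed strings copied from the log excerpts in this post.
    for pattern in \
        "Open nvidia.ko is only ready for use on Data Center GPUs" \
        "module verification failed" \
        "taints kernel"
    do
        # grep -c prints the number of matching lines; -F matches literally.
        # "|| true" keeps the function going when a pattern has no matches.
        count=$(grep -c -F -- "$pattern" "$logfile" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

Usage would look like `scan_nvidia_log nvidia-bug-report.log`. A non-zero count on the first pattern suggests the open kernel module was loaded at some point.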

I recommend you purge all NVIDIA drivers and install the correct one which you can download from Official Drivers | NVIDIA

Thanks!

Hello Markus,

Thank you for your answer. I already did that multiple times. The first bug report I uploaded was generated after purging all drivers and installing the following: Linux x64 (AMD64/EM64T) Display Driver | 535.54.03 | Linux 64-bit | NVIDIA

I am sorry, but the Open kernel module seems to say otherwise.

Did you unload all kernel modules before uninstalling? Did you reboot after the purge?
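
As a rough way to check the first question, the sketch below lists any loaded kernel modules whose name starts with "nvidia". `list_nvidia_modules` is a hypothetical helper name; it reads `lsmod`-style text on standard input, so it also works on saved output.

```shell
# Sketch: print the names of loaded kernel modules that start with "nvidia".
# list_nvidia_modules is a hypothetical helper; it expects lsmod-style input on stdin.
list_nvidia_modules() {
    # Skip the header row, then keep rows whose first column starts with "nvidia".
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}
```

Run it as `lsmod | list_nvidia_modules`. If it prints nothing after the purge and a reboot, no NVIDIA kernel module is loaded any more.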

Was Linux pre-installed on the Desktop? If so, I would at this point even recommend an RMA with the manufacturer because it could also be a Hardware defect.

Beyond that, I am running out of ideas.

I could repeat that just in case.

Did you unload all kernel modules before uninstalling? No.

Did you reboot after the purge? No.

All I did was run the command "sudo apt-get remove --purge '^nvidia-.*'", then go to the website, download the driver, and install it. If this is the wrong way to do things, could you kindly let me know the exact steps I should take?

Was Linux pre-installed on the Desktop? Yes. However, I reinstalled it in an attempt to fix this issue, if that matters. After the reinstallation, I installed the driver from Ubuntu Additional Drivers rather than from the website (I had to go through several, as some would not install or would leave the PC stuck on a black screen at boot). The driver worked, but the intermittent freezing continued.

Moreover, when installing a driver from the website I have to use "--dkms", or the driver will be gone after restarting my PC. Not sure if that matters.

I recommend you return the Desktop and have the original vendor fix this for you. Given how expensive this GPU is, it has to work as advertised. If it does not, you should not try to fix it yourself while it is still this new.

At this stage, from all the information you have given and the steps you have described, the whole system is in a state that would take much more than a couple of forum posts to fix, unless you start from scratch, meaning formatting all drives and re-installing Ubuntu. That might open up all kinds of new issues, since the original installation could have included settings specific to your pre-built system.

Please talk to your vendor.

Hi @thomisergio!
I have the same problem as you: the system freezes and neither the mouse nor Num Lock responds.
I thought it was related to the RAM or the hard drive, but now I think it is the graphics card.
My computer is an HP Z8 G4 with an RTX A4000 card running Windows 10 64-bit. The error occurs randomly on the 552.86 driver, sometimes after 2 or 3 days and sometimes after 2 or 3 hours.
I am currently downgrading to the lower version 552.55 and monitoring it.
At the moment I cannot tell exactly which part is at fault, so I cannot make a warranty claim.
Thanks for any help in finding out exactly where the problem is.

The manufacturer replaced my GPU and motherboard and the error is still the same, even with newer driver versions.

This problem is really dangerous.
So replacing the GPU and mainboard did not solve the problem?
Then try removing some of the RAM and check again.
Please let me know the result. I am really confused and do not understand what is going on. For now I can only think of downgrading the drivers one version at a time to see how it behaves.