PC intermittently freezes and crashes due to NVIDIA drivers

Hello

I am a student at an engineering college. We recently got two Dell Precision 3660 desktops with 13th Gen Intel i9 CPUs, NVIDIA RTX A5000 GPUs, and 32 GB of RAM.

Initially, both PCs were crashing frequently. After some updates, one of them stopped crashing while the other continued to crash. We thought the issue was with our installation of Windows Server. However, even after switching the crashing PC from Windows to Linux (Ubuntu 22.04), the crashing continued.

My initial hunch was that the issue was with the RAM or the CPU, so I stress-tested both, but neither test turned up any issues. Then, after another random crash, I checked the system logs with dmesg and found an error related to the NVIDIA drivers. More specifically, these lines:

[ 18.024606] [drm:nv_drm_master_set [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership
[ 21.493118] pcieport 0000:00:06.0: AER: Uncorrectable (Non-Fatal) error message received from 0000:02:00.0
[ 21.493143] nvme 0000:02:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 21.493152] nvme 0000:02:00.0: device [1344:5415] error status/mask=00004000/00400000

So, this suggests the issue is coming either from the NVIDIA driver (the nvidia-drm line) or from my NVMe SSD (the AER and PCIe Bus Error lines, which all name device 0000:02:00.0). I am attaching the bug report log that I got after running the bug report generation tool on Linux.
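To make sense of the "error status/mask" line, I put together a small Python sketch that decodes the two hex values. It assumes the field follows the standard PCIe AER Uncorrectable Error Status bit layout; the bit table is from my own reading of the spec, and the names `AER_LINE`, `bit_names` and `decode_aer_line` are just ones I made up for this sketch, so please correct me if I got something wrong.

```python
import re

# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Assumption: taken from my reading of the PCIe spec, not from NVIDIA or Dell docs.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}

# Matches lines such as:
#   nvme 0000:02:00.0: device [1344:5415] error status/mask=00004000/00400000
AER_LINE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r".*error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def bit_names(value):
    """Return the names of all known error bits that are set in value."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


def decode_aer_line(line):
    """Decode one dmesg 'error status/mask' line, or return None if it does not match."""
    match = AER_LINE.search(line)
    if match is None:
        return None
    return {
        "device": match["bdf"],
        "driver": match["driver"],
        "reported": bit_names(int(match["status"], 16)),
        "masked": bit_names(int(match["mask"], 16)),
    }


if __name__ == "__main__":
    sample = (
        "[ 21.493152] nvme 0000:02:00.0: device [1344:5415] "
        "error status/mask=00004000/00400000"
    )
    print(decode_aer_line(sample))
```

If the bit table is right, the status value 00004000 is a Completion Timeout reported by the NVMe device at 0000:02:00.0, and the mask value 00400000 only means that Uncorrectable Internal Error reporting is masked. That would point more towards the SSD or its PCIe link than towards the GPU, but I am not sure how much weight to give it.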

Side note: the PC that is not crashing is running NVIDIA driver 560.94, while the PC that is crashing is running 550.120. However, Ubuntu's driver tool says that the crashing PC's drivers are fully up to date. I am going to try installing the 560 drivers on the Linux PC as well.

Our issue is very similar to this one: the symptoms after a crash match, and the GPU is the same.

nvidia-bug-report.log (1.7 MB)