Nvidia driver error on Ubuntu 16.04

Hi,
I was running CUDA 9.1 with NVIDIA driver v396. Everything was running fine until my application suddenly crashed (the syslog from the crash is at the bottom of this post). Ever since that crash, nvidia-smi reports “No devices were found”. I’ve included some key system information below.

Aug 19 17:33:09 edge00010 kernel: [  255.375582] NVRM: RmInitAdapter failed! (0x25:0x65:1101)
Aug 19 17:33:09 edge00010 kernel: [  255.375650] NVRM: rm_init_adapter failed for device bearing minor number 0
dekkio@edge00010 ~ $ dpkg -l| grep nvidia
ii  nvidia-396                                  396.51-0ubuntu0~gpu16.04.1                 amd64        NVIDIA binary driver - version 396.51
dekkio@edge00010 ~ $ nvidia-smi
No devices were found
dekkio@edge00010 ~ $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
dekkio@edge00010 ~ $ lspci | grep nvidia
dekkio@edge00010 ~ $ uname -a
Linux edge00010 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

OS : Ubuntu 16.04 (Linux Mint 18.2 Sonya)

I have tried the following…

  1. I purged nvidia* and reinstalled the drivers. The drivers did seem to install, but after a reboot I get the same problem.

  2. I tried going back to driver release 384, same effect.

I read something about changing boot options, but I’m not sure how to do it from the command line (this is a server I am ssh-ing into).
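
For reference, on a GRUB-based Ubuntu install, kernel boot options can usually be changed over SSH by editing the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub and regenerating the GRUB config. Below is a minimal sketch, assuming the stock Ubuntu file layout and GNU sed; the helper name add_kernel_param is hypothetical, and pcie_aspm=off is only an example value.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM   (hypothetical helper, not part of any package)
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE,
# unless PARAM is already present on that line.
add_kernel_param() {
    file=$1
    param=$2

    # Nothing to do if the parameter is already set.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi

    # Keep a backup so the change is easy to revert.
    cp "$file" "$file.bak"

    # Insert the parameter just before the closing quote of the value.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" "$file"
}

# Typical use (as root), followed by regenerating the config and rebooting:
#   add_kernel_param /etc/default/grub pcie_aspm=off
#   update-grub
#   reboot
# After the reboot, `cat /proc/cmdline` shows the options the kernel actually booted with.
```

Working on a backup-first basis matters here because a malformed /etc/default/grub on a remote machine is awkward to recover from without console access.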

Error message during the crash

Aug 18 13:43:43 edge00010 video_main[20644]: RuntimeError:
Aug 18 13:43:43 edge00010 video_main[20644]: Error:
Aug 18 13:43:43 edge00010 video_main[20644]: Cuda check failed (6 vs. 0): the launch timed out and was terminated
Aug 18 13:43:43 edge00010 video_main[20644]: Coming from:
Aug 18 13:43:47 edge00010 kernel: [344077.152249] NVRM: GPU at PCI:0000:01:00: GPU-9a3b10de-8f2e-2bfb-7119-f7c30441b247
Aug 18 13:43:47 edge00010 kernel: [344077.152255] NVRM: GPU Board Serial Number: 
Aug 18 13:43:47 edge00010 kernel: [344077.152256] NVRM: Xid (PCI:0000:01:00): 8, Channel 0000001b
Aug 18 13:43:47 edge00010 kernel: [344079.157888] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 18 13:43:47 edge00010 kernel: [344081.163574] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Update: I have tried setting the boot option pcie_aspm=off as suggested in this thread (https://stackoverflow.com/questions/46107222/nvrm-rminitadapter-failed). No effect. Same problem.

More data.

I purged the nvidia and cuda drivers and verified that I don’t get the RmInitAdapter failed messages while they are removed.

I then installed just the nvidia drivers. I have tried both 396 and 384. In both cases, I see the adapter failure messages again, and nvidia-smi says “No devices were found”.

Is this just a hardware failure then? Is there some other way to confirm?
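
One check that may help narrow this down is whether the card is still enumerated on the PCI bus at all. Note that the lspci | grep nvidia call above is case-sensitive, while lspci normally prints the vendor as "NVIDIA Corporation", so its empty output is not conclusive by itself. The sketch below matches on NVIDIA's PCI vendor ID (10de) instead; the function name has_nvidia_device is hypothetical.

```shell
#!/bin/sh
# has_nvidia_device   (hypothetical helper)
# Reads `lspci -nn` output on stdin and succeeds if any line carries
# a [vendor:device] tag with NVIDIA's PCI vendor ID, 10de.
has_nvidia_device() {
    grep -qi '\[10de:[0-9a-f]\{4\}\]'
}

# Typical use:
#   lspci -nn | has_nvidia_device && echo "GPU visible on PCI bus" \
#                                 || echo "GPU not enumerated"
# The crash log reports the GPU at PCI:0000:01:00, so that slot can also be
# inspected directly with:
#   lspci -nn -s 01:00
```

If the device does not show up on the bus, reseating the card or testing it in another machine would be the next thing to try; if it does show up but RmInitAdapter still fails after a clean driver install, a hardware fault becomes more plausible, though this check alone does not prove it.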