Hi,
I was running CUDA 9.1 with NVIDIA driver v396. Everything was running just fine until my application suddenly crashed (the syslog from the crash is at the bottom of this post). Ever since that crash, nvidia-smi reports "No devices were found". I've included a few key system details below.
Aug 19 17:33:09 edge00010 kernel: [ 255.375582] NVRM: RmInitAdapter failed! (0x25:0x65:1101)
Aug 19 17:33:09 edge00010 kernel: [ 255.375650] NVRM: rm_init_adapter failed for device bearing minor number 0
dekkio@edge00010 ~ $ dpkg -l| grep nvidia
ii nvidia-396 396.51-0ubuntu0~gpu16.04.1 amd64 NVIDIA binary driver - version 396.51
dekkio@edge00010 ~ $ nvidia-smi
No devices were found
dekkio@edge00010 ~ $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
dekkio@edge00010 ~ $ lspci | grep nvidia
dekkio@edge00010 ~ $ uname -a
Linux edge00010 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
OS : Ubuntu 16.04 (Linux Mint 18.2 Sonya)
I have tried the following…
- I purged nvidia* and tried reinstalling the drivers. The drivers did seem to install, but after rebooting I get the same problem.
- I tried going back to driver release 384, with the same result.
I read something about changing boot options, but I'm not sure how to do that from the command line (this is a server I am ssh-ing into).
Error messages during the crash:
Aug 18 13:43:43 edge00010 video_main[20644]: RuntimeError:
Aug 18 13:43:43 edge00010 video_main[20644]: Error:
Aug 18 13:43:43 edge00010 video_main[20644]: Cuda check failed (6 vs. 0): the launch timed out and was terminated
Aug 18 13:43:43 edge00010 video_main[20644]: Coming from:
Aug 18 13:43:47 edge00010 kernel: [344077.152249] NVRM: GPU at PCI:0000:01:00: GPU-9a3b10de-8f2e-2bfb-7119-f7c30441b247
Aug 18 13:43:47 edge00010 kernel: [344077.152255] NVRM: GPU Board Serial Number:
Aug 18 13:43:47 edge00010 kernel: [344077.152256] NVRM: Xid (PCI:0000:01:00): 8, Channel 0000001b
Aug 18 13:43:47 edge00010 kernel: [344079.157888] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 18 13:43:47 edge00010 kernel: [344081.163574] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
…