Previous working scenario: the Ubuntu boot screen and login screen worked, and CUDA and the NVIDIA driver reported no errors.
Suddenly, while training a deep learning model, the script crashed (P.S.: the same script had worked before without any issues). When I rebooted the machine, the boot screen and login screen did not appear. However, I am able to log in via SSH using PuTTY. I had to reinstall CUDA, cuDNN and the NVIDIA driver to start the training process again.
Only when I purge the NVIDIA drivers does the login screen sometimes appear. But as soon as I install the new driver version (nvidia-430), the login screen no longer appears on boot.
Earlier this was not the case. Even with the NVIDIA driver installed (same version), I was able to see the login screen, but something has now gone wrong and I have not been able to figure out the cause.
In addition, I have tried reinstalling lightdm, xorg-xserver*, gdm, etc., but nothing worked.
I am attaching nvidia-bug-report.log.gz and journalctl.log.
Just to add to the above:
My training was running, and at step 2170 I got this error in my training session (PuTTY login):
2019-09-27 15:17:41.007879: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-09-27 15:17:41.007910: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
The Linux screen also displayed the error shown in the attached linux_screen_error.jpg.
nvidia-bug-report.log.gz (33.5 MB)
journalctl.log (6.61 MB)
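
Since journalctl.log is fairly large, here is a minimal Python sketch that could be used to pull out only the GPU-related lines from it. The search patterns are assumptions on my part: "NVRM: Xid" is the prefix the NVIDIA kernel driver normally uses when it reports a GPU fault, and the other two strings are taken from the TensorFlow errors quoted above. I have not confirmed that the attached log actually contains Xid lines, and the function name `find_gpu_errors` is just a hypothetical helper.

```python
import re

# Assumed patterns: "NVRM: Xid" is the usual NVIDIA kernel driver fault prefix;
# the CUDA_ERROR_* and "Unexpected Event status" strings come from the
# TensorFlow messages quoted in this post.
GPU_ERROR_PATTERN = re.compile(r"NVRM: Xid|CUDA_ERROR_\w+|Unexpected Event status")


def find_gpu_errors(lines):
    """Return (line_number, text) pairs for lines that match a GPU error pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if GPU_ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_gpu_errors(handle)


# Example use (assumes journalctl.log is in the current directory):
#     for number, text in scan_file("journalctl.log"):
#         print(f"{number}: {text}")
```

If Xid lines do show up around 15:17:41 on 2019-09-27, the Xid number in them would be the first thing to look up, since it indicates what kind of fault the driver saw when the training run crashed.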