Previous working scenario: the Ubuntu boot screen and login screen appeared normally, and CUDA and the NVIDIA driver ran without errors.
Suddenly, while training a deep learning model, the script crashed (the same script had worked before without any issues). When I rebooted the machine, the boot screen and login screen no longer appeared on Linux. However, I am able to log in via SSH using PuTTY, although I had to reinstall CUDA, cuDNN, and the NVIDIA driver to restart the training process.
Only when I purge the NVIDIA drivers does the login screen sometimes appear. But as soon as I install the new driver version (nvidia-430), the login screen no longer appears on boot.
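For reference, the purge/reinstall cycle described above looks roughly like this (a sketch only; the exact package names depend on how the driver was installed, and the old-style nvidia-430 package name on Ubuntu 16.04 is an assumption):

    # Remove all installed NVIDIA driver packages
    sudo apt-get purge 'nvidia-*'
    sudo apt-get autoremove
    # Reinstall the 430 driver
    sudo apt-get install nvidia-430
    sudo reboot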
Earlier this was not the case: even with the same driver version installed, I was able to see the login screen. Something has now gone wrong that I am not able to understand, and I cannot figure out the cause.
In addition, I have tried reinstalling lightdm, xserver-xorg*, gdm, etc., but nothing worked.
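Those reinstall attempts were along these lines (again just a sketch; I am not certain these are the exact packages involved):

    # Reinstall the display manager and the X server core
    sudo apt-get install --reinstall lightdm xserver-xorg-core
    sudo apt-get install --reinstall gdm3
    sudo systemctl restart lightdm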
Attaching nvidia-bug-report.log.gz and the journalctl log.
Just to add to the above:
My training was running, and at step 2170 I got this error on my training screen (PuTTY session):
2019-09-27 15:17:41.007879: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-09-27 15:17:41.007910: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
The Linux screen also displayed the error shown in the linux_screen_error.jpg attachment. Attachments: nvidia-bug-report.log.gz (33.5 MB), journalctl.log (6.61 MB).
I did as mentioned in the link.
I haven't been able to check the desktop to see if the screen is up (I am logging in remotely from home), but nvidia-smi is not responding and my deep learning model also hangs during training.
Do you know what could be going wrong?
Also, what would cause the PPA driver package to break?
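For what it's worth, when nvidia-smi hangs like this the kernel log usually records what the driver was complaining about (NVRM/Xid messages). A quick check over SSH, assuming standard tooling:

    # Look for NVIDIA driver messages in the kernel ring buffer
    dmesg | grep -iE 'nvrm|xid'
    # Or search the journal for the current boot
    journalctl -b | grep -i nvidia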
Please create a new nvidia-bug-report.log after applying the workaround.
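(The report is generated by the script that ships with the driver; run as root, it writes nvidia-bug-report.log.gz into the current directory.)

    sudo nvidia-bug-report.sh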
Ubuntu 16.04 is not using glvnd, so the non-glvnd compat libs were used instead; those are known to be broken with 430 and were removed in 435.
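The workaround itself is not spelled out in this post, but a hypothetical illustration of two obvious checks/escape routes would look roughly like this (the PPA path and the 435 package name are assumptions, not confirmed from the thread):

    # Inspect which libGL variants the 430 driver actually installed
    ls -l /usr/lib/x86_64-linux-gnu/libGL*
    # One possible route: move to a 435 driver build, where the compat libs were removed
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt-get update
    sudo apt-get install nvidia-driver-435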
Please find attached the nvidia-bug-report.log after the workaround. (The system freezes while running nvidia-bug-report.sh, but I have attached the report that was generated.)
Thank you for helping. Looking forward to your reply.
I checked my screen and there is still no login screen; the display is completely blank with a dark blue color.
I can only SSH into it, and neither nvidia-smi nor the TensorFlow model is working yet.
Requesting immediate help from NVIDIA or anyone who is familiar with the situation.
I am not sure what happened. I repeated the same steps mentioned earlier by you (generix), and now not only can I see the screen, but nvidia-smi and the training have both started working. I am going to wait through some more training steps and see if it breaks at any point. If it doesn't, then it looks like the fix mentioned by @generix worked (and I will be more than happy to accept that as the answer).