I have been experiencing the following issue for several months with my RTX 2080 Ti cards on Ubuntu 18.04. I would really appreciate some help ASAP.
The problem is that when I train a deep learning model, the whole system freezes at some point. I can neither ssh into it nor use REISUB to get any log information; I guess things happen too quickly for anything to be logged? Can anyone give me some suggestions on what I should or can do? Thanks a lot!
Here is my configuration:
- CPU : Intel i7 8700
- RAM : 64 GB
- DISK : 1 TB SSD
- Cooling : Fan Cooling
Current driver: 415.27 (I have tried several other drivers)
CUDA: 10.0 with cuDNN
The problem can be reproduced while running image segmentation in PyTorch.
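
Since nothing gets logged before the freeze, one thing I could try is recording the GPU state every second in a separate process, so the last readings before the hang are on disk. This is only a minimal sketch, not something I have verified on this machine: the log file name `gpu_state.log`, the one-second interval, and the helper names are my own assumptions, and `os.fsync` only improves the chance that the last lines survive a hard freeze; it does not guarantee it.

```python
import os
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu`; the selection is an assumption
# about what is useful to see right before a freeze.
QUERY_FIELDS = "timestamp,index,temperature.gpu,power.draw,utilization.gpu,memory.used"


def read_gpu_state():
    """Return one CSV line per GPU, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY_FIELDS, "--format=csv,noheader"],
        universal_newlines=True,
    )
    return out.strip().splitlines()


def append_durably(path, lines):
    """Append lines and ask the OS to push them to disk immediately."""
    with open(path, "a") as f:
        for line in lines:
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def monitor(path="gpu_state.log", interval_s=1.0, max_samples=None, reader=read_gpu_state):
    """Poll the GPUs until interrupted (or until max_samples readings are taken)."""
    taken = 0
    while max_samples is None or taken < max_samples:
        append_durably(path, reader())
        taken += 1
        time.sleep(interval_s)
```

The idea would be to call `monitor()` in a separate terminal before starting training, then check the tail of the log after rebooting to see whether temperature or power draw was climbing just before the freeze.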