Dear all,
I have been having problems for several months with my RTX 2080 Ti GPUs on Ubuntu 18.04. I would really appreciate your help as soon as possible.
Here is my configuration:
CPU : Intel i7 8700
RAM : 64 GB
DISK : 1 TB SSD
Cooling : Fan Cooling
Driver : 415.27 currently (I have tried many other versions)
CUDA : 10.0 with cuDNN
Ubuntu 18.04
Problem: Ubuntu freezes during AI training. I can’t even ssh into the machine; ssh reports that the host is down.
So I can only run the bug report after a reboot, and I’m not sure that is ideal.
Thanks for the reply! The weird thing is that both monitors are connected to gpu:0, and for training I set the OS environment so that only gpu:1 is visible to the training process. Even so, the screen froze as well. Do you have any thoughts on that? Thanks!
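For reference, restricting a training process to one GPU is usually done through the `CUDA_VISIBLE_DEVICES` environment variable. A minimal sketch, assuming that is the mechanism used here (the post does not name it); the helper name `restrict_to_gpu` is hypothetical:

```python
import os

def restrict_to_gpu(index: int) -> str:
    """Expose only one physical GPU to CUDA code started from this process."""
    # Enumerate GPUs in PCI bus order so the index matches nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Call this before the framework initializes CUDA; doing it before
# importing torch / tensorflow is the safe choice.
restrict_to_gpu(1)
```

This only hides gpu:0 from the CUDA process. Both cards are still handled by the same kernel driver, so it does not isolate the display from a system-wide hang.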
Thanks for the suggestion. I tested both GPUs yesterday with gpu-burn for 60 seconds, and both came back OK.
Is there a way I can generate an error log? Meanwhile I will try all the options on that list, but I don’t think they will produce any error logs.
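One possible way to leave a trace before a hard freeze is to poll `nvidia-smi` during training and force every sample to disk, so the last readings survive the reset. This is only a rough sketch, not a guaranteed way to catch the cause; the file name `gpu_stats.csv`, the one-second interval, and the function names are assumptions:

```python
import os
import subprocess
import time

# Fields accepted by nvidia-smi --query-gpu
QUERY = "timestamp,index,temperature.gpu,power.draw,utilization.gpu"

def read_gpu_stats() -> str:
    """Return one CSV line per GPU as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def append_durably(path: str, text: str) -> None:
    """Append text and fsync it so it is on disk if the machine freezes."""
    with open(path, "a") as log:
        log.write(text)
        log.flush()
        os.fsync(log.fileno())

def monitor(path: str = "gpu_stats.csv", interval_s: float = 1.0) -> None:
    """Poll until interrupted with Ctrl+C or until the machine freezes."""
    while True:
        append_durably(path, read_gpu_stats())
        time.sleep(interval_s)
```

Running `monitor()` in a second terminal while the training job runs would leave the temperature and power draw from the last second or so before the freeze in the file.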
I tried the Magic SysRq REISUB sequence, but it didn’t work; everything seems to be frozen. I cleaned out some dust and applied fresh thermal paste to the CPU to bring its temperature down, and checked that the core temperatures are around 80 °C under full load.
I tried to reproduce the problem and it still happens, with the GPU at 73 °C and the CPU cores at 80 °C.
I’m attaching all the logs from before and after the reboot (done by pressing the button). There seems to be something in them; could you please take a look? Thanks!
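To skim large logs like these for driver or hardware messages, a small filter can help. This is a sketch only; the marker list is an assumption about what is worth looking for (the NVIDIA driver logs with the `NVRM` prefix and reports GPU faults as `Xid` events), and `suspicious_lines` is a hypothetical helper name:

```python
from typing import Iterable, List

# Substrings that often accompany GPU or hardware faults in kernel logs.
# Assumed, not exhaustive.
MARKERS = ("NVRM", "Xid", "Machine Check", "mce:", "thermal", "PCIe Bus Error")

def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Keep log lines containing any marker, compared case-insensitively."""
    lowered = [marker.lower() for marker in MARKERS]
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line.lower() for marker in lowered)
    ]

# Example use on the kernel log (path as found on a default Ubuntu install):
# with open("/var/log/kern.log") as fh:
#     print(*suspicious_lines(fh), sep="\n")
```

An empty result does not prove the hardware is healthy; a machine that dies instantly may not get the chance to write anything.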
Nothing at all in the logs. Problems due to overheating or power would be quite easy to spot in most cases. Since the machine falls completely dead in an instant, I’d rather suspect there’s something wrong with the mainboard. Your BIOS is really old; you should upgrade it.
I updated CUDA from 10.0 to 10.1 with the newest cuDNN and also updated to the newest driver, 430.50.
I also changed Python to 3.7. Still no luck.
I’m still puzzled whether it’s a hardware problem or a software problem…
A software problem leading to that kind of hard freeze is unlikely. I’d rather strip the system down: run only one GPU and no X, then check whether it works, and also try swapping slots and GPUs.
Thanks. I already took one GPU out and am running with only one. Swapping GPUs doesn’t change how often the freeze appears, which makes me think it’s not a GPU problem.
Can you give me some guidance on what you mean by “no X”?
I’m wondering what the cause could be… CPU? Memory? Motherboard? SSD?
Is there a reliable way to test these easily?