Ubuntu 18.04 with two RTX 2080 Ti: system freezes when training deep learning models

Dear all,
I have been experiencing problems for several months with my RTX 2080 Ti cards on Ubuntu 18.04. I would really appreciate your help ASAP.
Here is my configuration:

  • CPU : Intel i7 8700
  • RAM : 64 GB
  • DISK : 1 TB SSD
  • Cooling : Fan Cooling

Current driver: 415.27 (I have tried various other drivers)
CUDA: 10.0 with cuDNN
Ubuntu 18.04

Problem: Ubuntu freezes during AI calculations. I can’t even ssh into it; the host is reported as down.
Thus I have to run the bug report after a reboot, and I’m not sure that’s ideal.

Attached is my bug report, generated after the reboot.
nvidia-bug-report.log.gz (1.64 MB)

Unfortunately, no errors are visible in the logs. Maybe check this first:
https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x

Hi generix,

Thanks for the reply! The weird thing is that both monitors are connected to gpu:0, and for training I set the environment so that only gpu:1 is visible to the training process. Even so, the screen still froze. Do you have any thoughts on that? Thanks!

Best,
Yiang
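
For readers following along, restricting a training process to gpu:1 as described above can be sketched roughly like this. The helper name `restrict_visible_gpus` is made up for illustration; `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER` are the standard CUDA environment variables. This is only a sketch and assumes the variables are set before the deep learning framework initializes CUDA.

```python
import os

def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA in this process (hypothetical helper)."""
    value = ",".join(str(i) for i in indices)
    # Use PCI bus order so the indices match what nvidia-smi prints,
    # instead of CUDA's default "fastest first" ordering.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

# Must run before the framework creates a CUDA context.
restrict_visible_gpus([1])
```

Inside the process the selected card then shows up as device 0. Note that this only hides gpu:0 from CUDA; it does not isolate the display GPU from a system-wide hang.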

It’s rather a guessing game without an error message, but I think you should use gpu-burn to rule out a broken GPU.

Thanks for the suggestion. I ran gpu-burn yesterday for 60 seconds, and both GPUs tested OK.
Is there a way I can generate an error log? Meanwhile I will try all the options on that list, but I don’t think that will generate error logs.

Thanks,
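
One possible way to leave a trail when the machine freezes without writing anything to the system logs is to record GPU telemetry yourself and force every sample to disk. The sketch below is an assumption-laden illustration, not something suggested in this thread: the function names and the `gpu_watch.log` file name are made up, while the `nvidia-smi --query-gpu` fields are standard ones.

```python
import os
import subprocess
import time

QUERY = "timestamp,index,temperature.gpu,power.draw,utilization.gpu,memory.used"

def read_gpu_stats():
    """Return one CSV line per GPU as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()

def append_durably(path, lines):
    """Append lines and fsync so they survive a hard freeze or power cut."""
    with open(path, "a") as f:
        for line in lines:
            f.write(line + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write it to the disk

def monitor(path="gpu_watch.log", interval_s=1.0, reader=read_gpu_stats):
    """Sample forever; stop with Ctrl+C."""
    while True:
        append_durably(path, reader())
        time.sleep(interval_s)
```

Running `monitor()` in a second terminal during training would leave the last readings before the freeze in the file. It cannot say why the machine froze, only what the GPUs looked like just beforehand.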

Does REISUB work to reboot, so that the logs at least get synced to disk?
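
For anyone unfamiliar with it, REISUB is the Alt+SysRq key sequence R, E, I, S, U, B, and whether each key is honoured depends on the bitmask in /proc/sys/kernel/sysrq. The following is a minimal sketch for checking that mask; the bit values are taken from the kernel's sysrq documentation, and the function names are made up for illustration.

```python
SYSRQ_PATH = "/proc/sys/kernel/sysrq"

# Bit that must be set in the mask for each REISUB key to be allowed.
REISUB_BITS = {
    "R": 4,    # unraw: keyboard control
    "E": 64,   # SIGTERM to all processes: signalling of processes
    "I": 64,   # SIGKILL to all processes: signalling of processes
    "S": 16,   # sync filesystems
    "U": 32,   # remount filesystems read-only
    "B": 128,  # reboot
}

def reisub_allowed(mask):
    """Map each REISUB key to whether the given sysrq mask permits it."""
    if mask == 1:  # 1 means "all functions enabled"
        return {key: True for key in REISUB_BITS}
    return {key: bool(mask & bit) for key, bit in REISUB_BITS.items()}

def read_sysrq_mask(path=SYSRQ_PATH):
    """Read the current mask from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

With a mask of 176 (16 + 32 + 128), for example, only S, U and B are permitted. If the kernel itself has stopped responding, no SysRq key will work regardless of the mask.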

Thanks, I’ll google and try that tonight.

Hi generix,

I tried REISUB, but it didn’t work; everything seems to be frozen. I cleaned out some dust and applied fresh thermal paste to the CPU to bring its temperature down, and checked that the cores are at around 80 °C under full load.

I tried to reproduce the problem and it still happens, with the GPU at 73 °C and the CPU at 80 °C.

I’m attaching all the logs from before and after the reboot (done by pressing the button). It seems like there is something in them; could you have a look? Thanks!

log_before_restart.txt (36.2 KB)
syslog.txt (722 KB)

log messages.txt (6.16 KB)

Nothing at all in the logs. Problems due to overheating/power would be quite easy to discover in most cases. Since it completely falls dead in an instant, I’d rather suspect there’s something wrong with the mainboard. Your BIOS is really old; you should upgrade it.

Thanks for the help! When I run sudo dmidecode --string bios-release-date, it shows 09/28/2018. Is that too old?

That was from when the board was released; MSI released five updates afterwards.

Hi generix,

I updated CUDA from 10.0 to 10.1 with the newest cuDNN, and also updated to the newest driver, 430.50.
I also changed Python to 3.7; still no luck.
I’m still puzzled whether it’s a hardware problem or a software problem…

A software problem leading to that kind of hard freeze is unlikely. I’d rather strip the system down, running only one GPU and no X, then check whether it works, also swapping slots and GPUs.

Thanks. I already took one GPU out and am running with only one; switching GPUs doesn’t make much difference to how often the freeze appears. That makes me think it’s not a GPU problem.
Can you give me some guidance on what you mean by no X?
I wonder what the cause could be… CPU? Memory? Motherboard? SSD?
Is there a reliable way I can test these easily?

“No X” means stopping the X server.
I’d check the PSU first by swapping in a different model.

It turns out that one of the RAM chips is defective, which was causing the hard freeze. Thanks for your help!
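
For anyone landing here with similar symptoms, the idea behind a memory test can be illustrated with a tiny write/read-back pattern check. This is only a sketch with a made-up function name: a user-space script sees just the memory the OS hands it, and CPU caches can mask faults, so it is not a substitute for a boot-time tester such as memtest86+.

```python
def pattern_test(size_mb=64, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each byte pattern into a buffer, read it back, and
    return a list of (pattern, offset) pairs that did not match."""
    size = size_mb * 1024 * 1024
    buf = bytearray(size)
    failures = []
    for pattern in patterns:
        expected = bytes([pattern]) * size
        buf[:] = expected          # write the pattern over the whole buffer
        if buf != expected:        # fast whole-buffer comparison first
            # Slow path: locate the individual offsets that differ.
            failures.extend(
                (pattern, i) for i in range(size) if buf[i] != pattern
            )
    return failures
```

An empty list from `pattern_test()` means no mismatch was seen in that particular buffer, nothing more. Swapping modules one at a time remains a simple way to confirm which one is bad.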