Ubuntu 18.04 with 2 RTX 2080 Ti screen frozen when training deep learning models

Yiang · September 23, 2019, 7:12am

Dear all,
I have been experiencing system freezes for several months with my two RTX 2080 Ti cards on Ubuntu 18.04. I would really appreciate your help as soon as possible.
Here is my configuration:

CPU : Intel i7 8700
RAM : 64 GB
DISK : 1 TB SSD
Cooling : Fan Cooling

Current driver: 415.27 (I have tried several other driver versions)
CUDA: 10.0 with cuDNN
Ubuntu 18.04

Problem: Ubuntu freezes during deep learning training. I can’t even ssh into the machine; ssh reports that the host is down.
So I can only run the bug report after a reboot, and I’m not sure that’s ideal.

Attached is the bug report generated after the reboot.
nvidia-bug-report.log.gz (1.64 MB)

generix · September 23, 2019, 8:53am

Unfortunately, no errors are visible in the logs. Maybe check this first:
USING CUDA AND X | NVIDIA

Yiang · September 23, 2019, 5:02pm

Hi generix,

Thanks for the reply! The odd thing is that both monitors are connected to GPU 0, and for training I set the environment so that only GPU 1 is visible to the training process. Even so, the screen still freezes. Do you have any thoughts on that? Thanks!

Best,
Yiang
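
Note: hiding a card from a training process is usually done with the `CUDA_VISIBLE_DEVICES` environment variable. The thread does not show how it was set here, so the following is only a minimal Python sketch of one common way to do it. The helper name `restrict_visible_gpus` is made up for illustration, and the sketch assumes it runs before the framework initializes CUDA.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA code started from this process."""
    # Enumerate GPUs in PCI bus order so the indices match what nvidia-smi prints;
    # CUDA's default ordering ("fastest first") can differ from that.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this ahead of importing torch or tensorflow, because the variable
# is only read when CUDA is initialized.
restrict_visible_gpus([1])
```

This only controls which GPUs the CUDA process can see. It does not protect the display GPU from a freeze that takes down the whole machine, which is consistent with what is described above.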

generix · September 23, 2019, 5:06pm

It’s rather a guessing game without any error messages, but I think you should run gpu-burn to rule out a broken GPU.

Yiang · September 23, 2019, 5:30pm

Thanks for the suggestion. I ran gpu-burn yesterday for 60 seconds, and both GPUs passed.
Is there a way I can generate an error log? Meanwhile I will try all the options on that list, but I don’t think that will produce error logs.

Thanks,
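
Note: when a machine hard-freezes, buffered log data is often lost. One possible workaround, sketched below under assumptions, is to sample GPU state once per second during training and force every sample to disk. The function names and the log path are hypothetical; the `nvidia-smi` query fields are standard ones, but check `nvidia-smi --help-query-gpu` on your driver version.

```python
import os
import subprocess
import time

# Standard nvidia-smi --query-gpu properties.
QUERY = "timestamp,index,temperature.gpu,power.draw,utilization.gpu"


def read_gpu_stats():
    """Return one CSV line per GPU as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()


def log_samples(path, sampler=read_gpu_stats, count=None, interval=1.0):
    """Append samples to `path`, forcing each one to disk before sleeping.

    With count=None the loop runs until the process is stopped.
    """
    written = 0
    with open(path, "a") as f:
        while count is None or written < count:
            for line in sampler():
                f.write(line + "\n")
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to write it out to the disk
            written += 1
            time.sleep(interval)
    return written
```

Running `log_samples("gpu_freeze.log")` in a second terminal while training would leave the last recorded temperature and power draw in the file after a forced reboot. A freeze caused by faulty hardware may still leave nothing unusual in it.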

generix · September 23, 2019, 5:39pm

Does REISUB (the Magic SysRq key sequence) work to reboot, so that the logs get synced to disk?
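
Note: whether each key of REISUB is honoured depends on the `kernel.sysrq` setting. The sketch below decodes that bitmask using the bit values from the kernel's SysRq documentation; the function names are hypothetical. Ubuntu has commonly shipped 176 as the default, but that is an assumption here, so read the actual value on your machine.

```python
# Bit values documented for the kernel.sysrq sysctl.
SYSRQ_BITS = {
    "R": 4,    # unRaw: keyboard control
    "E": 64,   # tErm: signalling of processes
    "I": 64,   # kIll: signalling of processes
    "S": 16,   # Sync
    "U": 32,   # remount read-only (Unmount)
    "B": 128,  # reBoot
}


def allowed_reisub_keys(sysrq_value):
    """Map each REISUB key to whether the given kernel.sysrq value permits it."""
    if sysrq_value == 1:  # 1 enables every SysRq function
        return {key: True for key in SYSRQ_BITS}
    return {key: bool(sysrq_value & bit) for key, bit in SYSRQ_BITS.items()}


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs."""
    with open(path) as f:
        return int(f.read().strip())
```

With a value of 176, `allowed_reisub_keys` reports that only S, U and B are permitted. If the kernel itself has stopped responding, as in a hardware-level freeze, none of the keys will work regardless of this setting.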

Yiang · September 23, 2019, 6:07pm

Thanks, I’ll google and try that tonight.

Yiang · September 24, 2019, 4:22am

Hi generix,

I tried REISUB, but everything seems to be frozen and it didn’t work. I cleaned out some dust and applied fresh thermal paste to the CPU to bring its temperature down; the cores now sit at around 80 °C under full load.

I tried to reproduce the problem and it still happens, with the GPU at 73 °C and the CPU at 80 °C.

I’m attaching all the logs from before and after the reboot (done by pressing the button). It looks like there is something in them; could you have a look? Thanks!

log_before_restart.txt (36.2 KB)
syslog.txt (722 KB)

log messages.txt (6.16 KB)

generix · September 26, 2019, 9:40am

Nothing at all in the logs. Problems due to overheating or power would be quite easy to spot in most cases. Since the system drops dead in an instant, I’d rather suspect there’s something wrong with the mainboard. Your BIOS is really old; you should update it.

Yiang · September 26, 2019, 5:57pm

Thanks for the help! When I run sudo dmidecode --string bios-release-date, it shows 09/28/2018. Is that too old?

generix · September 27, 2019, 7:36am

That is from when the board was released; MSI has released five updates since then.
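
Note: the date string printed by dmidecode can be turned into an age with a few lines of Python. This is only a sketch with hypothetical function names, and it assumes the MM/DD/YYYY format shown in this thread.

```python
from datetime import date, datetime


def parse_bios_date(text):
    """Parse the MM/DD/YYYY string printed for bios-release-date."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def bios_age_days(text, today=None):
    """Days between the BIOS release date and `today` (defaults to the current date)."""
    today = today or date.today()
    return (today - parse_bios_date(text)).days
```

The age alone does not say whether newer firmware exists. The meaningful check is to compare the installed version against the board vendor's download page.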

Yiang · September 30, 2019, 3:25am

Hi generix,

I updated CUDA from 10.0 to 10.1 with the newest cuDNN, and also updated to the newest driver, 430.50.
I also changed Python to 3.7. Still no luck.
I’m still puzzled whether it’s a hardware problem or a software problem…

generix · September 30, 2019, 9:55am

A software problem leading to that kind of hard freeze is unlikely. I’d rather strip the system down, running only one GPU and no X, then check whether it works, and also try swapping slots and GPUs.

Yiang · September 30, 2019, 6:23pm

Thanks. I already took one GPU out and am running with only one; switching GPUs doesn’t make much difference to how often the freeze appears. That makes me think it’s not a GPU problem.
Can you give me some guidance on what you mean by “no X”?
I wonder what the cause could be… CPU? memory? motherboard? SSD?
Is there a reliable and easy way to test these?

generix · September 30, 2019, 8:57pm

“No X” means stopping the X server.
I’d check the PSU first by swapping in a different model.

Yiang · October 4, 2019, 5:27pm

It turned out that one of the RAM chips was defective, which was causing the hard freezes. Thanks for your help!
