Ubuntu 18.04 with 2 RTX 2080 Ti system frozen when training deep learning models using cuda

Yiang · September 26, 2019, 6:27pm

Hi,
I have been experiencing the following issue for several months with my two RTX 2080 Ti cards on Ubuntu 18.04, and I would really appreciate some help ASAP.
The problem is that when I train a deep learning model, the whole system freezes at some point. I can neither ssh into it nor use REISUB to get any log info. I guess things happen too quickly for anything to be logged? Can anyone give me some suggestions on what I should or can do? Thanks a lot!

Here is my configuration:

CPU : Intel i7 8700
RAM : 64 GB
DISK : 1 TB SSD
Cooling : Fan Cooling

Current Driver 415.27 (have tried some different drivers)
CUDA: 10.0 with cuDNN
Ubuntu 18.04

The problem can be reproduced by running image segmentation models in PyTorch.

Thanks

Robert_Crovella · September 26, 2019, 7:32pm

One common issue in these situations is insufficient power supply. That would be the main concern I would have on the hardware side.

On the software side, try the latest driver you can download for your GPU. It should be compatible with whatever software stack you are using. (Currently: Linux x64 (AMD64/EM64T) Display Driver | 430.50 | Linux 64-bit | NVIDIA)

njuffa · September 26, 2019, 8:03pm

How many RTX 2080Ti are in this system? Based on the stated system specs, you would want to size the power supply unit (PSU) roughly as follows to provide rock-solid operation under continuous high load:

1x RTX 2080Ti >= 650W
2x RTX 2080Ti >= 1050W
3x RTX 2080Ti >= 1500W
4x RTX 2080Ti >= 1900W

The wattages stated refer to the nominal wattage of the PSU. Note that in the US, you will likely be limited to 3 RTX 2080Ti, as standard home/office wall outlets won’t allow you to run more than a 1600W PSU from a single outlet. An 80 PLUS Platinum rated PSU would be advisable for a high-end workstation class machine from a reliability and efficiency perspective.

Each RTX 2080 Ti requires two 8-pin PCIe auxiliary power connectors. Do not use Y-splitters, converters, or daisy chaining for the auxiliary power cables.

Robert_Crovella · September 26, 2019, 8:07pm

The thread title appears to suggest 2.

njuffa · September 26, 2019, 8:12pm

Duh. Missed that :-)

Yiang · September 26, 2019, 8:17pm

Thanks for the suggestions. I have been using a 1000W PSU. As far as I remember, the same issue happened when I was using one GPU, but less frequently. To reproduce the issue, I was fully loading only one GPU (~270W) while the other one was idle, and the same freeze could be reproduced consistently. I guess I could try removing one of the GPUs tonight and see if I can still reproduce it.

Another thing I noticed is that it happens more frequently when I use a small batch size, meaning the GPU finishes each batch faster and receives batches at shorter intervals.

I will also try a newer driver.
One question: should I also switch from CUDA 10.0 → CUDA 10.1? Would that make a difference?

njuffa · September 26, 2019, 8:28pm

I listed rough numbers for the power supply. A 1000W PSU should be OK for a system with two GPUs. Make sure the power connectors are plugged in correctly and that the GPUs are firmly seated in the PCIe slots (they should be secured with screws or a push-down bar at the bracket so they cannot wiggle out of the PCIe slot).

You could try operating each GPU separately and also cycle the GPUs through the PCIe slots to check if the issue is related to a particular GPU or a particular PCIe slot.
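A minimal sketch of how each GPU could be exercised separately from software, assuming a Python training script. It builds an environment in which CUDA sees only one device; `single_gpu_env` and `train.py` are hypothetical names used for illustration. This hides the other GPU from CUDA but does not remove it electrically, so it complements physically swapping cards and slots rather than replacing it.

```python
import os


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate devices in PCI bus order so the index matches nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the listed device is visible to CUDA applications.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Example use (train.py is a hypothetical script name):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=single_gpu_env(1))
```

Running the same workload once per index, and again after moving the cards between slots, should show whether the freeze follows a particular card or a particular slot.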

Instability issues are notoriously hard to diagnose remotely. As Robert Crovella stated, the most frequent hardware-related issue is insufficient power supply. The second most frequent hardware-related issue is over-heating. The open-fan design NVIDIA chose for the RTX line typically causes lower GPU temperature in systems with a single GPU, but can cause issues with high GPU temperatures in systems where multiple GPUs operate in close vicinity, as air flow is impeded. What GPU temperatures are reported by nvidia-smi?
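Since the machine freezes hard, the last readings before a freeze are the interesting ones. The following is a rough sketch, not a definitive tool, that polls `nvidia-smi` and forces every sample to disk so it has a chance of surviving a hard reset. The function names, the log path, and the one-second interval are assumptions chosen for illustration.

```python
import os
import subprocess
import time

# Query fields supported by nvidia-smi --query-gpu.
QUERY = "timestamp,index,temperature.gpu,power.draw,utilization.gpu"


def nvidia_smi_command(fields=QUERY):
    """Build the nvidia-smi command line for one CSV sample per GPU."""
    return ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"]


def append_durably(path, text):
    """Append text and fsync so the data is on disk before returning."""
    with open(path, "a") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())


def log_gpu_state(path, interval_s=1.0, samples=None):
    """Sample all GPUs every interval_s seconds; run forever if samples is None."""
    taken = 0
    while samples is None or taken < samples:
        result = subprocess.run(
            nvidia_smi_command(), capture_output=True, text=True, check=True
        )
        append_durably(path, result.stdout)
        taken += 1
        time.sleep(interval_s)


# Example use (hypothetical log path), started before the training run:
#   log_gpu_state("gpu_state.log", interval_s=1.0)
```

After a freeze and reboot, the tail of the log file shows the temperature, power draw, and utilization recorded shortly before the system stopped responding.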

The Intel i7 8700 has only 16 PCIe lanes. I assume they are configured here as 2 x 8 to operate the two GPUs?

You would also want to address any potential issues on the software side, as Robert Crovella pointed out above.

Robert_Crovella · September 26, 2019, 8:30pm

You can put a newer driver on the machine without changing CUDA version or any other part of your software stack.

That’s what I meant when I said that the latest driver should be compatible with whatever software stack you are using.

Yiang · September 26, 2019, 8:45pm

Thanks!

Temperatures of the GPU and CPU when the freeze occurs:
one fully running GPU ~75-80 C
CPU cores 75-82 C

I’ll make sure everything is firmly connected to the PSU tonight and try to reproduce it.

For the drivers, I guess I’ll try the newest version tonight.

I’ll let you guys know about any updates. Thanks for the help. It’s just a bit tough for me to debug this without any log info, and I was almost going to RMA the cards, but 4-minute gpu-burn tests show both are working okay, and when browsing websites everything looks fine.

Robert_Crovella · September 26, 2019, 9:02pm

Sorting out these problems both on the HW and SW side is often a matter of careful trial.

You can put “if possible” after each of these:
Try a different PSU.
Try a different GPU.
Try a different motherboard.

Try a different GPU driver.
Try a different CUDA version.
Try a different PyTorch version.
Try a different cuDNN version.

Or if new enough, RMA.

As a test, you might also try providing a lot of additional cooling via a powerful fan directed at the system and/or GPUs. It’s possible that the freeze has nothing to do with the GPUs.

njuffa · September 26, 2019, 10:12pm

The temperatures reported look OK.

Yiang · September 27, 2019, 9:51pm

Thanks guys for help, just some updates here.

I kind of reassembled some of the components last night, basically unplugging and replugging some cables. I didn’t see any obviously loose wires, and I ran the same code to reproduce the problem. Surprisingly, it did not freeze. I don’t know if it’s a coincidence or if I actually solved the issue, but I left the computer running the same segmentation model this morning, so we will see how it goes.
I kind of doubt it’s 100% solved, maybe because I’ve thought that too many times and been wrong.

Meanwhile, I did a more thorough test on the CPU cores. On a very intensive task, the temperature could go up to 90+ C, which I think in general is not too good. I bought a Corsair Air 540 case on Amazon and hope that will help with cooling airflow for both the CPU and GPU.

I guess I’ll just follow this list if the freeze happens again:
Try a different GPU driver.
Try a different CUDA version.
Try a different PyTorch version.
Try a different cuDNN version.

Or maybe then
Try a different PSU.
Try a different GPU.
Try a different motherboard.
which seems to be more expensive to try :)

njuffa · September 27, 2019, 9:59pm

Actually, unplugging and replugging parts can sometimes fix weird issues for good (or at least for a lengthy period of time). The basic mechanism behind that is that issues with poor electrical contact may be remedied, e.g. an oxidation layer or dirt may be sufficiently removed from contacts in the process.

Before you install a new fan on your CPU I suggest checking for dust accumulation on the CPU heat sink and fan. Dust accumulation is a common problem that can easily raise CPU temperatures by 10 degrees compared to a clean cooling combo. The same issue can affect GPUs.

Yiang · September 28, 2019, 5:14am

Sadly, it happened again tonight. It was not during the long training process, though; it happened suddenly while performing a short inference. So far I haven’t been able to reproduce it easily.
Tough game…

njuffa · September 28, 2019, 9:44pm

You stated that the system “freezes” when it fails. How long did you wait before you decided that the system had frozen for good (rather than being temporarily inaccessible)? Did the system remain frozen or did it eventually reboot by itself?

If you use dmesg or similar syslog facilities, are there any indications of trouble in the run-up to these freezes?
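Because `dmesg` only covers the current boot, looking at the run-up to a freeze means reading the previous boot's kernel log after restarting. Below is a hedged sketch that does this with `journalctl -k -b -1` and filters for lines that often accompany GPU or hardware trouble. It assumes systemd with a persistent journal; the keyword list and function names are illustrative assumptions, and a hard freeze may leave nothing in the log at all.

```python
import subprocess

# Illustrative, non-exhaustive keywords: "NVRM" and "Xid" appear in NVIDIA
# driver messages; the others relate to CPU and PCIe error reports.
KEYWORDS = ("nvrm", "xid", "machine check", "pcie bus error")


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot (needs a persistent journal)."""
    cmd = ["journalctl", "-k", "-b", "-1", "--no-pager"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def suspicious_lines(log_text, keywords=KEYWORDS):
    """Keep only the lines that contain one of the keywords, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [
        line
        for line in log_text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


# Example use after rebooting from a freeze:
#   for line in suspicious_lines(previous_boot_kernel_log()):
#       print(line)
```

If the filter returns nothing, the last few unfiltered lines of the previous boot's log are still worth reading, since they show what the kernel was doing just before logging stopped.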

Yiang · September 28, 2019, 10:37pm

When the freeze happens, the system becomes unresponsive. The screen freezes at a certain point and still displays the last image, while the mouse can’t move and the keyboard is disconnected (even pressing CapsLock gets no response). If I unplug and replug the DP cable, the same frozen image appears again. I waited for 10 minutes or so, and nothing changed.

Yiang · September 29, 2019, 1:04am

I guess I’m not expert enough to understand all the logs, so I’ll just attach them here in case anything in them is useful. Thanks!

The logs were collected after a reboot.
nvidia-bug-report.log.gz (1.11 MB)
log_messages.txt (7.55 KB)

Yiang · September 30, 2019, 3:22am

Hi guys,

I updated CUDA 10.0 → 10.1 with the newest cuDNN, updated to the newest driver (430.50), and changed Python → 3.7.3.
Still no luck. I’m still puzzled whether it’s a hardware problem or a software problem … lol
Can you guys give me some suggestions? Thanks a lot!

tera · September 30, 2019, 2:59pm

I don’t think this is going to help you in any way, but I just had the same problem with an RTX 2070 in an Ubuntu 16.04 system.
The problem got resolved when I switched the CUDA code to use an older GPU that was also present in the machine.

njuffa · September 30, 2019, 7:12pm

I am out of ideas, but based on the information presented thus far I see no indications that the problem is caused by the GPUs.
