Ubuntu 18.04 system with 2 RTX 2080 Ti freezes when training deep learning models using CUDA

Hi,
I have experienced the following issue for several months with my RTX 2080 Ti cards on Ubuntu 18.04, and I would really appreciate some help ASAP.
The problem is that when I train a deep learning model, the whole system freezes at some point. I can neither ssh into it nor use REISUB to recover any log info; I guess things happen too quickly for anything to be logged? Can anyone give me some suggestions on what I should or can do? Thanks a lot!

Here is my configuration:

  • CPU : Intel i7 8700
  • RAM : 64 GB
  • DISK : 1 TB SSD
  • Cooling : Fan Cooling

Current driver: 415.27 (I have tried several different drivers)
CUDA: 10.0 with cuDNN
Ubuntu 18.04

The problem can be reproduced while running image segmentation in PyTorch.

Thanks

One common issue in these situations is insufficient power supply. That would be the main concern I would have on the hardware side.

On the software side try the latest driver you can download for your GPU. It should be compatible with whatever software stack you are using. (Currently: https://www.nvidia.com/Download/driverResults.aspx/151568/en-us )

How many RTX 2080Ti are in this system? Based on the stated system specs, you would want to size the power supply unit (PSU) roughly as follows to provide rock-solid operation under continuous high load:

1x RTX 2080Ti >= 650W
2x RTX 2080Ti >= 1050W
3x RTX 2080Ti >= 1500W
4x RTX 2080Ti >= 1900W

The wattages stated refer to the nominal wattage of the PSU. Note that in the US, you will likely be limited to 3 RTX 2080Ti, as standard home/office wall outlets won’t allow you to run more than a 1600W PSU from a single outlet. An 80 PLUS Platinum rated PSU would be advisable for a high-end workstation class machine from a reliability and efficiency perspective.

Each RTX 2080 Ti requires two 8-pin PCIe auxiliary power connectors. Do not use Y-splitters, converters, or daisy chaining for the auxiliary power cables.

The thread title appears to suggest 2.

Duh. Missed that :-)

Thanks for the suggestions. I have been using a 1000W PSU. As far as I remember, the same issue happened when I was using 1 GPU, but less frequently. To reproduce the issue, I ran only one GPU at full load (~270W) while the other one was idle, and the same freeze could be reproduced consistently. I guess I could try removing one of the GPUs tonight and see if I can still reproduce it.

Another thing I noticed: it happens more frequently when I use a small batch size, meaning the GPU finishes each batch faster and receives new batches at shorter intervals.

I will also try a newer driver.
One question: should I also switch from CUDA 10.0 to CUDA 10.1? Does it make a difference?

I listed rough numbers for the power supply. A 1000W PSU should be OK for a system with two GPUs. Make sure the power connectors are plugged in correctly and that the GPUs are firmly seated in the PCIe slots (they should be secured with screws or a push-down bar at the bracket so they cannot wiggle out of the PCIe slot).

You could try operating each GPU separately and also cycle the GPUs through the PCIe slots to check if the issue is related to a particular GPU or a particular PCIe slot.
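For the software half of that test (without physically pulling a card), one option is to hide all but one GPU from CUDA when launching the training job. The following is only a minimal sketch: the helper names `env_for_gpu` and `run_on_gpu` and the script name `train.py` are made up for illustration. `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER` are standard CUDA environment variables.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate devices in PCI bus order so the index matches nvidia-smi.
    env.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
    # Only the listed device is visible; it appears as device 0 in the child.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(gpu_index, command):
    """Run `command` (a list of arguments) with only `gpu_index` visible."""
    return subprocess.run(command, env=env_for_gpu(gpu_index)).returncode


# Hypothetical usage: run the same training script on each card in turn.
# run_on_gpu(0, ["python", "train.py"])
# run_on_gpu(1, ["python", "train.py"])
```

If the freeze only ever shows up with one particular index, that points at that card or its slot; swapping the cards between slots then separates the two.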

Instability issues are notoriously hard to diagnose remotely. As Robert Crovella stated, the most frequent hardware-related issue is insufficient power supply. The second most frequent hardware-related issue is over-heating. The open-fan design NVIDIA chose for the RTX line typically causes lower GPU temperature in systems with a single GPU, but can cause issues with high GPU temperatures in systems where multiple GPUs operate in close vicinity, as air flow is impeded. What GPU temperatures are reported by nvidia-smi?
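Since the machine locks up before anything can be inspected, it may help to record the nvidia-smi readings continuously and look at the last samples after the reboot. Below is a rough sketch, not a polished tool: the function names, the file name `gpu_log.csv`, and the one-second interval are arbitrary choices, while the `--query-gpu` and `--format` options are regular nvidia-smi options.

```python
import os
import subprocess
import time

QUERY = "index,temperature.gpu,power.draw,utilization.gpu"


def parse_row(row):
    """Parse one CSV row from nvidia-smi; unreadable fields become None."""
    def to_float(text):
        try:
            return float(text)
        except ValueError:
            return None  # e.g. "[N/A]" when a value is not reported

    index, temp, power, util = [field.strip() for field in row.split(",")]
    return {
        "gpu": int(index),
        "temp_c": to_float(temp),
        "power_w": to_float(power),
        "util_pct": to_float(util),
    }


def read_gpus():
    """Query all GPUs once via nvidia-smi and return a list of dicts."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_row(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one line per GPU per interval and push each write to disk."""
    with open(path, "a") as log:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for gpu in read_gpus():
                log.write(f"{stamp},{gpu['gpu']},{gpu['temp_c']},"
                          f"{gpu['power_w']},{gpu['util_pct']}\n")
            log.flush()
            # fsync makes it more likely the last samples survive a hard freeze.
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` in a second terminal while the training job runs would leave a record of temperature and power draw in the seconds before a freeze.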

The Intel i7 8700 has only 16 PCIe lanes. I assume they are configured here as 2 x 8 to operate the two GPUs?

You would also want to address any potential issues on the software side, as Robert Crovella pointed out above.

You can put a newer driver on the machine without changing CUDA version or any other part of your software stack.

That’s what I meant when I said that the latest driver “should be compatible with whatever software stack you are using.”

Thanks!

These are the GPU and CPU temperatures when the freeze occurs:
one GPU at full load: ~75-80 C
CPU cores: 75-82 C

I’ll make sure everything is firmly connected to the PSU tonight and try to reproduce it.

As for the drivers, I guess I’ll try the newest version tonight.

I will let you guys know of any updates. Thanks for the help. It’s just a bit tough for me to debug this without any log info, and I was almost going to RMA the cards, but a 4-minute gpu-burn test showed both working okay, and when browsing websites everything looks fine.

Sorting out these problems both on the HW and SW side is often a matter of careful trial.

You can put “if possible” after each of these:
Try a different PSU.
Try a different GPU.
Try a different motherboard.

Try a different GPU driver.
Try a different CUDA version.
Try a different PyTorch version.
Try a different cuDNN version.

Or if new enough, RMA.

As a test, you might also try providing a lot of additional cooling via a powerful fan directed at the system and/or GPUs. It’s possible that the freeze has nothing to do with the GPUs.

The temperatures reported look OK.

Thanks for the help, guys. Just some updates here.

I more or less reassembled some of the components last night, basically unplugging and replugging some cables. I didn’t see any obviously loose wires. Then I ran the same code to reproduce the problem. Surprisingly, it did not freeze. I don’t know if it’s a coincidence or if I actually solved the issue, but I left the computer running the same segmentation model this morning, so we will see how it goes.
I kind of doubt it’s 100% solved… maybe because I’ve thought that too many times and been wrong.

Meanwhile, I did a more thorough test of the CPU cores: on a very intensive task, the temperature can go up to 90+ C, which I think is generally not good. I bought a Corsair Air 540 case on Amazon and hope it will help airflow for both CPU and GPU cooling.

I guess I’ll just follow this list if the freeze happens again:
Try a different GPU driver.
Try a different CUDA version.
Try a different PyTorch version.
Try a different cuDNN version.

Or maybe then
Try a different PSU.
Try a different GPU.
Try a different motherboard.
which seem to be more expensive to try :)

Actually, unplugging and replugging parts can sometimes fix weird issues for good (or at least for a lengthy period of time). The basic mechanism behind that is that issues with poor electrical contact may be remedied, e.g. an oxidation layer or dirt may be sufficiently removed from contacts in the process.

Before you install a new fan on your CPU I suggest checking for dust accumulation on the CPU heat sink and fan. Dust accumulation is a common problem that can easily raise CPU temperatures by 10 degrees compared to a clean cooling combo. The same issue can affect GPUs.

Sadly, it happened again tonight. Not during a long training run this time; it happened suddenly while performing a short inference. I haven’t been able to reproduce it easily so far.
Tough game…

You stated that the system “freezes” when it fails. How long did you wait before you decided that the system had frozen for good (rather than being temporarily inaccessible)? Did the system remain frozen or did it eventually reboot by itself?

If you use dmesg or similar syslog facilities, are there any indications of trouble in the run-up to these freezes?
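To make that check a bit less tedious, something along these lines could pull the interesting lines out of the kernel log from the boot that froze. Treat it as a sketch under assumptions: the pattern list is just a guess at messages worth a look (NVIDIA Xid reports, machine-check and PCIe error messages, CPU thermal throttling), not a complete list, and `journalctl -k -b -1` only returns the previous boot if the journal is stored persistently.

```python
import re
import subprocess

# Assumed patterns of interest; extend or trim as needed.
SUSPICIOUS = re.compile(
    r"NVRM: Xid|fallen off the bus|Hardware Error|Machine check"
    r"|PCIe Bus Error|temperature above threshold",
    re.IGNORECASE,
)


def find_suspicious(lines):
    """Return the log lines that match any of the patterns above."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


def previous_boot_kernel_log():
    """Return kernel messages from the previous boot as a list of lines."""
    out = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()


# Hypothetical usage after rebooting from a freeze:
# for line in find_suspicious(previous_boot_kernel_log()):
#     print(line)
```

An empty result does not prove the hardware is fine; a hard lockup can happen before the kernel gets a chance to write anything.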

When the freeze happens, the system becomes unresponsive. The screen freezes at some point and still displays the last image, the mouse can’t move, and the keyboard appears disconnected (even pressing CapsLock gets no response). If I unplug and replug the DP cable, the same frozen image reappears. I waited for 10 minutes or so, and nothing changed.

I guess I’m not expert enough to understand all the logs, so I’ll just attach them here in case anything in them is useful. Thanks!

The logs were collected after a reboot.
nvidia-bug-report.log.gz (1.11 MB)
log_messages.txt (7.55 KB)

Hi guys,

I updated CUDA 10.0 to 10.1 with the newest cuDNN, updated to the newest driver (430.50), and changed Python to 3.7.3.
Still no luck. I’m still puzzled whether it’s a hardware problem or a software problem … lol
Can you guys give me some suggestions? Thanks a lot!

I don’t think this is going to help you in any way, but I just had the same problem with an RTX 2070 in an Ubuntu 16.04 system.
The problem got resolved when I switched the CUDA code to use an older GPU that was also present in the machine.

I am out of ideas, but based on the information presented thus far I see no indications that the problem is caused by the GPUs.