Repeatable system freezes under GPU load with Ubuntu 19.04 (2x2080Ti)

anjum.sayed48 · May 2, 2019, 4:24pm

Hi guys, I’m getting repeatable system freezes whenever there is substantial GPU load using PyTorch. After around 5-20 minutes of training normally (no errors displayed, no memory issues, temperatures below 80 degC), my system freezes (no mouse/keyboard, no tty, no remote SSH, no displayed error messages, reset button required).

Things I have tried:

Upgrading from 18.04 LTS to 19.04
Various versions of the Linux Kernel
Every compatible driver (currently using the default that is installed by 19.04, 418.56)
Reseating the GPUs
gpu-burn (one of the 2080Ti’s did have a memory issue, but has been RMA’d and replaced)
Swapping the GPUs and using each one individually

I’ve attached the output from nvidia-bug-report.sh. I’m also using a dual monitor setup in case that makes a difference.

There is also a discussion of some other troubleshooting steps I’ve taken over the last few months here: "Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04 - #9 by TuxKey" (GPU - Level1Techs Forums).

nvidia-bug-report.log.gz (1.37 MB)

generix · May 2, 2019, 4:35pm

There are actually no errors visible in the logs, but since you’re running CUDA with multiple GPUs, please disable the IOMMU, either in the BIOS or with the kernel parameter iommu=off, and check whether that resolves the issue. If not, please create a new nvidia-bug-report.log right after it happens again.
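If the kernel-parameter route is used, it may be worth confirming after a reboot that the parameter actually reached the kernel. A minimal sketch, assuming a Linux system that exposes the boot command line at /proc/cmdline; the helper names has_param and read_cmdline are made up for illustration:

```python
from pathlib import Path


def has_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token on the kernel command line."""
    # Parameters are whitespace-separated; match whole tokens only so that
    # "amd_iommu=off" is not mistaken for "iommu=off".
    return param in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()
```

On the affected machine, has_param(read_cmdline(), "iommu=off") should return True once the parameter is active.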

generix · May 2, 2019, 4:52pm

Addendum: since you also have X running, you might want to check if this applies:
https://devtalk.nvidia.com/default/topic/1043126/linux/xid-8-in-various-cuda-deep-learning-applications-for-nvidia-gtx-1080-ti/post/5291377/#5291377

anjum.sayed48 · May 2, 2019, 4:53pm

Thanks for your quick reply! I found two settings in the BIOS relating to IOMMU and I disabled them (I’m not entirely sure what they do, but posting in case anyone else finds it useful):

Advanced > AMD PBS > Enumerate all IOMMU in IVRS = Off
Advanced > AMD CBS > NBIO Common Options > NB Configuration > IOMMU = Disabled

I’ve just started running it again and I’ll report back with any changes

anjum.sayed48 · May 2, 2019, 5:58pm

No luck with the IOMMU settings. I also tried making an xorg.conf file (using sudo nvidia-xconfig) and then adding Option "Interactive" "0", however I’m still getting the system freezes.

I’ve attached another bug report and my xorg.conf file
xorg.conf.txt (1.26 KB)
nvidia-bug-report.log.gz (1.67 MB)
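When hand-editing xorg.conf, one easy mistake is pasting an option with typographic quotes copied from a web page, which would not read as a quoted option. A rough sketch of a sanity check, assuming the file text has already been read into a string; has_option is a hypothetical helper, not part of any NVIDIA or X.Org tool:

```python
import re


def has_option(conf_text: str, name: str, value: str) -> bool:
    """Return True if an uncommented line sets Option "name" "value" with straight quotes."""
    # xorg.conf keywords are case-insensitive, hence IGNORECASE.
    pattern = re.compile(
        r'\s*Option\s+"%s"\s+"%s"' % (re.escape(name), re.escape(value)),
        re.IGNORECASE,
    )
    # match() anchors at the start of each line, so lines beginning with "#"
    # (comments) never match.
    return any(pattern.match(line) for line in conf_text.splitlines())
```

For example, has_option(open("/etc/X11/xorg.conf").read(), "Interactive", "0") would check the file that nvidia-xconfig typically writes; the path is an assumption and may differ on other setups.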

generix · May 2, 2019, 6:58pm

Tough luck. Again, no errors are logged, so it’s very hard to debug. Since the system is also freezing completely, this might be anything.
Things to try:

Downgrade the kernel to 4.x; the NVIDIA driver still has issues with 5.x.
Stop X, then run the load until it crashes and hope something gets logged to the console.
Lower the system memory clocks in the BIOS.
Set PCIe to Gen2 in the BIOS.
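Since a hard freeze leaves nothing on screen, one way to get at least some data from the crash run is to sample the GPUs in the background and force every sample to disk, so the last line written before the reset survives. A minimal sketch, assuming nvidia-smi is on the PATH; the file name, field selection and function names are illustrative choices, not something from this thread:

```python
import os
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu`; adjust to taste.
FIELDS = "timestamp,index,temperature.gpu,power.draw,clocks.sm,memory.used"


def query_gpus() -> str:
    """Return one CSV line per GPU as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def append_durably(path: str, text: str) -> None:
    """Append text and fsync it so it is on disk before a possible freeze."""
    with open(path, "a") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())


def log_forever(path: str = "gpu_telemetry.csv", interval_s: float = 1.0) -> None:
    """Sample all GPUs every `interval_s` seconds until interrupted."""
    while True:
        append_durably(path, query_gpus())
        time.sleep(interval_s)
```

Calling log_forever() in a second terminal (or over SSH) before starting the training job would leave a CSV whose final rows show temperature, power draw and clocks just before the freeze. The fsync on every sample costs some I/O, but without it the last samples could still be sitting in the page cache when the machine locks up.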

anjum.sayed48 · May 4, 2019, 6:02am

I spent most of yesterday trying new things (BIOS settings etc.) and just for fun I tried a different DisplayPort on the older card (i.e. the non-RMA’d one), and it showed screen tearing. I removed that one from my system and reordered my cards and it has been working fine for the last 24 hours. I also dropped in an older GTX 1080 just to make sure it’s not a PCIe, kernel or driver issue and it’s been happy since.

My suspicion is that this card has also developed a fault, but it showed up in a strange way: the system froze under load (repeatably, though anywhere between 5 minutes and 12 hours after the load started) yet worked fine under light load, apart from the dodgy DisplayPort. The fault in my other card was much easier to track down, since PyTorch couldn’t allocate memory and gpu-burn failed.

Long story short I think this card will also need to be RMA’d. Thanks for your helpful suggestions though generix
