Hi guys, I’m getting repeatable system freezes whenever there is substantial GPU load using PyTorch. After around 5-20 minutes of normal training (no errors displayed, no memory issues, temperatures below 80 °C), my system freezes completely: no mouse/keyboard, no tty, no remote SSH, no error messages, and only the reset button gets it back.
Things I have tried:
- Upgrading from 18.04 LTS to 19.04
- Various versions of the Linux Kernel
- Every compatible driver (currently using the default that is installed by 19.04, 418.56)
- Reseating the GPUs
- gpu-burn (one of the 2080 Tis did have a memory issue, but it has been RMA’d and replaced)
- Swapping the GPUs and using each one individually
I’ve attached the output from nvidia-bug-report.sh. I’m also using a dual monitor setup in case that makes a difference.
There is also a discussion of some other troubleshooting steps I’ve taken over the last few months here: “Repeatable system freezes under GPU load with Threadripper & Ubuntu 18.04” (Level1Techs Forums, GPU section, post #9 by TuxKey).
nvidia-bug-report.log.gz (1.37 MB)
There are actually no errors visible in the logs, but since you’re running CUDA with multiple GPUs, please disable IOMMU either in the BIOS or with the kernel parameter iommu=off and check whether that resolves the issue. If not, please create a new nvidia-bug-report.log right after the next freeze.
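For anyone following along: on Ubuntu the parameter is usually added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by sudo update-grub and a reboot. To confirm it actually took effect, something like the sketch below can check the running kernel’s command line. The function names are made up for illustration, and it assumes /proc/cmdline is readable, which is normally the case on Linux.

```python
def kernel_param_present(cmdline: str, param: str) -> bool:
    """Return True if `param` appears as a whole token on the kernel command line.

    Matching whole tokens matters here: a plain substring search for
    "iommu=off" would also match "amd_iommu=off".
    """
    return param in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path, encoding="utf-8") as f:
        return f.read().strip()
```

After rebooting, `kernel_param_present(read_cmdline(), "iommu=off")` should return True if the bootloader change was picked up.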
Thanks for your quick reply! I found two settings in the BIOS relating to IOMMU and I disabled them (I’m not entirely sure what they do, but posting in case anyone else finds it useful):
Advanced > AMD PBS > Enumerate all IOMMU in IVRS = Off
Advanced > AMD CBS > NBIO Common Options > NB Configuration > IOMMU = Disabled
I’ve just started running it again and I’ll report back with any changes.
No luck with the IOMMU settings. I also tried making an xorg.conf file (using sudo nvidia-xconfig) and then adding Option "Interactive" "0"; however, I’m still getting the system freezes.
I’ve attached another bug report and my xorg.conf file
xorg.conf.txt (1.26 KB)
nvidia-bug-report.log.gz (1.67 MB)
Tough luck. Again, no errors were logged, so it’s very hard to debug. Since the system is also freezing completely, this might be anything.
Things to try:
- downgrade the kernel to 4.x; the NVIDIA driver still has issues with 5.x.
- stop X, then run the load until it crashes and hope something gets logged to the console.
- lower the system memory clocks in the BIOS
- set PCIe to Gen2 in the BIOS
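Since a hard freeze usually leaves nothing behind in the logs, one extra option is to record GPU state to disk while the load runs, so the last reading before the freeze survives the reset. Below is a minimal sketch of that idea, not a tested diagnostic tool. It assumes nvidia-smi is on the PATH; the helper names, the file name gpu_trace.csv and the one-second interval are arbitrary choices for illustration.

```python
import os
import subprocess
import time

# Fields accepted by `nvidia-smi --query-gpu`; adjust as needed.
QUERY_FIELDS = "timestamp,index,temperature.gpu,power.draw,utilization.gpu,memory.used"


def append_durably(path, line):
    """Append one line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write the file data to disk


def sample_gpus():
    """Return one CSV line per GPU as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def log_forever(path="gpu_trace.csv", interval_s=1.0):
    """Hypothetical helper: sample until killed or until the machine freezes."""
    while True:
        for line in sample_gpus():
            append_durably(path, line)
        time.sleep(interval_s)
```

Calling `log_forever()` from a second terminal (or over SSH) before starting the training run leaves a trace whose last lines show which GPU was doing what shortly before the freeze. The fsync on every line is deliberately heavy-handed, because buffered output would be lost on a hard reset.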
I spent most of yesterday trying new things (BIOS settings etc.) and just for fun I tried a different DisplayPort on the older card (i.e. the non-RMA’d one), and it showed screen tearing. I removed that one from my system and reordered my cards and it has been working fine for the last 24 hours. I also dropped in an older GTX 1080 just to make sure it’s not a PCIe, kernel or driver issue and it’s been happy since.
My suspicion is that this card has also developed a fault, but it manifested itself in a strange way: the system just froze under heavy load (repeatably, but anywhere between 5 minutes and 12 hours after the load started), while everything worked fine under light load (with the exception of the dodgy DisplayPort). The fault in my other card was much easier to track down, since PyTorch couldn’t allocate memory and gpu-burn failed.
Long story short, I think this card will also need to be RMA’d. Thanks for your helpful suggestions though, generix.