Ubuntu 18.04 and RTX 2080 SUPER systematically freezing

Dear all,
I’m trying to solve a persistent problem: my Ubuntu 18.04.4 LTS system randomly freezes when (apparently) the graphics card (RTX 2080 SUPER) is in use. Kernel logs don’t show anything useful (syslog, kern.log, and Xorg log attached).

When the freezing occurs, I can’t use either the mouse or the keyboard. Pressing any key on my keyboard also makes the Num Lock light go off, which makes it impossible to safely reboot my system with the Alt + SysRq method. Disconnecting/reconnecting the USB cables does not solve the problem.
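(For reference: the SysRq method only works if the magic SysRq key is enabled to begin with. On Ubuntu, kernel.sysrq defaults to 176, which only allows sync, remount read-only, and reboot, so enabling it fully would look something like this:

  # check the current value (Ubuntu ships with the restricted value 176)
  cat /proc/sys/kernel/sysrq
  # enable all SysRq functions until the next reboot
  sudo sysctl kernel.sysrq=1
  # make it persistent across reboots
  echo "kernel.sysrq=1" | sudo tee /etc/sysctl.d/90-sysrq.conf

Not that it helps much here, since the keyboard itself stops responding.)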

I’ve noticed this freezing problem occurs under two circumstances: (i) while training a deep learning model using TensorFlow 1.14 and CUDA, and (ii) while playing Dota 2 (although the Dota 2 freezing is fairly new and does not occur every time).

I have already tried the following “possible solutions”, but to no avail:

  1. Setting my fans to full speed/performance mode, thinking it could be an overheating problem (although it was unlikely, given that my PC is new);
  2. Blacklisting the nouveau driver in /etc/modprobe.d/blacklist.conf (see the snippet after this list);
  3. Disabling “Suspend to RAM” in the BIOS;
  4. Switching from gdm to lightdm (as recommended in this post);
  5. Switching from nvidia-driver-440 to nvidia-driver-435 (both proprietary drivers);
  6. Formatting my PC and reinstalling Ubuntu 18.04.4 LTS.
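For reference, the blacklist entries from step 2 look roughly like this (any .conf file under /etc/modprobe.d/ works; the file name itself doesn’t matter):

  # /etc/modprobe.d/blacklist.conf
  blacklist nouveau
  options nouveau modeset=0

followed by sudo update-initramfs -u and a reboot so the change takes effect.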

I don’t know if it’s useful, but I also have Windows 10 in dual boot with Ubuntu 18.04.4 LTS. I formatted my computer yesterday thinking it would solve the problem, so at this point I’m open to anything.

Any help would be appreciated. The nvidia-bug-report.log is also attached to this post.

Hardware & Other Settings:
OS: Ubuntu 18.04.4 LTS (Dual Boot: Windows 10)
Kernel: Linux 5.3.0-53-generic
Processor: Intel Core i9-9900KF 3.60GHz (5.0GHz Turbo)
Graphics: Asus Rog Strix GeForce RTX 2080 SUPER/PCIe/SSE2 8GB GDDR6 256Bit
GL Version: 4.5.0 NVIDIA 440.59
Motherboard: ASRock Z390 Extreme 4 Chipset Z390 Intel LGA 1151 ATX DDR4
Memory: DDR4 Corsair Vengeance RGB Pro (4x8GB) 3600MHz
Water Cooler: Corsair H115i Pro RGB 280mm
Power Supply: XFX 650W XTR Series ATX/EPS Full Modular 80PLUS GOLD, P1-650B-BEFX
Storage: SSD Corsair Force MP510 960GB M.2 2280 NVMe and HD Seagate Barracuda 1TB (only used as extra storage space, mounted on /mnt/data)

Attached Files:
nvidia-bug-report.log (1.4 MB)
xorg.log (146.8 KB)

Might be a mainboard issue. Please check if a bios update helps.

My current BIOS version is 4.00. Should I update to 4.30 (link)?

Worth a try.

My BIOS is updated to the latest version (4.30) and the problem persists. My system froze minutes ago while playing Dota 2.

Unfortunately, spontaneous freezes/reboots point to problems with the mainboard. The driver might be able to crash the OS/kernel but it shouldn’t be able to crash the board.
Also, it’s impossible to debug these kinds of things if you’re not the mainboard/bios manufacturer.
Similar case, different board:

Shouldn’t this problem occur on Windows too? Everything seems to work just fine over there.

Maybe yes, maybe no. From observation I can only tell that the Linux driver does more aggressive clocking than the Windows driver, resulting in the slightest mainboard/PSU flaws becoming apparent. Furthermore, the RTX series already raised the requirements regarding PCIe and power supply quality, and the SUPER series even more so.
Then, of course, there might be some bug in the kernel/driver that triggers some board-specific BIOS bug. I don’t really think so, since you’re running a pretty common board.
To work around it, you could try nvidia-smi -lgc to limit the GPU clocks.

Thank you for your kind considerations, @generix.

After today’s reinstall (from 18.04.4 LTS to 19.10) and the BIOS update (4.00 to 4.30), some settings I had been using were restored to defaults, so I performed steps 1–5 (list above) again just to be sure. After the BIOS update I played a Dota 2 match, and after about 30–40 minutes my PC froze again. In complete despair, I turned my PC off, opened its cover, and disconnected and reconnected my GPU. Since then I’ve been running one instance of my deep learning model and the freeze has not happened (but I think it’s a happy coincidence). I’ll let it run for today to see if anything has changed at all; if not, I’ll install Ubuntu 20.04. Since that version has a newer kernel, I think the chances of improving my situation are higher. When formatting, I keep my /home folder. Do you think that could be a problem in this case, since many configuration files are being kept?

About your workaround, what frequency do you suggest? sudo nvidia-smi -lgc 1900?

Maybe the mentioned “quality problems” were in fact a bad connection, e.g. oxidized slot pins/power connectors.
I guess -lgc takes lower and upper limits; -lgc 300,1900 might be worth a shot.
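Something along these lines; the exact range depends on what the card reports:

  # list the clock ranges the gpu supports
  nvidia-smi -q -d SUPPORTED_CLOCKS
  # lock gpu clocks to the 300–1900 MHz range (needs root)
  sudo nvidia-smi -lgc 300,1900
  # undo the lock later
  sudo nvidia-smi -rgc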

Unfortunately, it was indeed a happy coincidence. My PC froze again after playing for 40 minutes. I will install Ubuntu 20.04 now and try locking the lower and upper clock limits as suggested. Do you think keeping /home when formatting could be a problem in this case?

I’m on Ubuntu 20.04 now, with limited clock settings. I used “sudo nvidia-smi -lgc 300,1650” just as a starting point to monitor from. Let’s see how it goes.
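Note for anyone copying this: nvidia-smi clock locks reset on reboot, so they need to be reapplied. A minimal systemd unit for that (the unit name is just illustrative) could be:

  # /etc/systemd/system/nvidia-lock-clocks.service
  [Unit]
  Description=Lock NVIDIA GPU clocks at boot

  [Service]
  Type=oneshot
  ExecStart=/usr/bin/nvidia-smi -lgc 300,1650

  [Install]
  WantedBy=multi-user.target

enabled with sudo systemctl enable --now nvidia-lock-clocks.service.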

Hey there, @generix.

I’m here just to post an update. Limiting my GPU clock wasn’t able to stop the freezes. Setting the upper limit even lower (1100) didn’t help either, which made me rule out problems with my PSU. I came across this post yesterday and it caught my attention: someone was reporting a problem with NVIDIA’s adaptive PowerMizer mode. I tried the solution, and today I didn’t experience any freezes. I also disabled GNOME’s notifications, but I don’t know if that has anything to do with this problem.

In case anyone finds this post in the future, what seems to have solved the problem for me was putting the following command in my startup applications (it sets the GPU PowerMizer performance level to maximum performance):

nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
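To confirm the setting actually took effect after login, the attribute can be queried back (mode 1 corresponds to “Prefer Maximum Performance”):

  nvidia-settings -q "[gpu:0]/GPUPowerMizerMode"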

(Screenshot attached: “Screenshot from 2020-06-09 22-55-17”)

I also forgot to tell you how this problem started. I bought my new PC in January, and when it arrived I installed Ubuntu 18.04 and Windows 10 in a dual boot setup. Everything was fine until one day I booted into Windows. Right after that, I started experiencing the same errors I have today. I don’t recall how I solved the problem back then, but it somehow stopped happening. Last week or so I logged into Windows after a long time, and the problem came back. I don’t know if it updated something, but Windows certainly messed with some of my BIOS settings (I had to set my DRAM frequency again, for example). Is it a coincidence, or can Windows cause a problem like that?

I will come back here in a week or so to report if everything is still functioning. If so I will close this post and mark it as solved.

The only thing that comes to my mind is that some vendors distribute BIOS updates through Windows Update; in that case, ‘loading BIOS defaults’ should remove any bad settings.
The second thing would be the mainboard being left in some bad state after a Windows shutdown, but disconnecting the power and letting it sit unpowered for half an hour or so should bring it back. Though you already did that when removing the card.
Otherwise, Windows shouldn’t be able to change the BIOS settings unless some vendor software is installed for that purpose.
I don’t know if Windows’ ‘Fast Startup’ has any influence on that.
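(If you want to rule that out: Fast Startup depends on hibernation, so disabling hibernation from an elevated command prompt on the Windows side switches it off as well:

  powercfg /h off

That way Windows does a full shutdown instead of the hybrid one.)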

The error persists, and it is now also happening on Windows 10 (it behaves the same way as described in the first post). On Ubuntu I can’t even SSH in from another machine to see what’s happening or try to restart my PC. I’m running out of options, and I’m contacting the motherboard vendor right now to see if they can shed some light on this situation. :(

Edit: this post reveals interesting information regarding PSU problems. It sounds almost exactly like what I described. How can I know whether my PSU is damaged?

Unfortunately, those kinds of problems are only debuggable by swapping parts.

Dear all,

We are having the same issue with a GPU cluster.

We bought 40 (forty) Asus GeForce RTX GPUs (model TURBO-RTX2070S-8G-EVO) for a GPU cluster and are having problems with all of these cards.

Here is the problem description; we can provide additional, in-depth details at any time.
A single node has this configuration:
4x Asus GeForce RTX GPU cards, ASRock Fatal1ty X399 Professional Gaming motherboard, AMD Ryzen Threadripper 2950X processor, Corsair DIMM 256 GB DDR4-2666 octo-kit, SilverStone SST-ST1500-GS power supply (1500 W), WD Black SN750 500 GB solid state drive, 3x Noctua NF-F12 fans.
All firmware/BIOS up to date.

Software: Ubuntu Server 18.04, NVIDIA driver 440.59 (as reported by nvidia-smi), CUDA 10.2 (10.1 also installed), cuDNN.

The problem occurs when running TensorFlow: the system runs the training and suddenly shuts off, with no power.
There is no log and no console output. Temperatures of the system and the cards never go above 70 °C.
We also tried with a single GPU in the same hardware configuration. The system “freezes” (it still has power), again with no logs and no console output.

So it is not a lack of power. We actually tried different 1600 W power supplies; the issue is the same.

If we run the same hardware configuration but with four older Zotac Gaming GeForce RTX 2070 cards (not the ASUS 2070 SUPER Turbo version), the system has no problems.

We have contacted ASUS and the retailer’s support; both said they DO NOT support Linux.

Could not find a solution yet!

[Product Information]
Product Type: Graphics Card
Product Model: TURBO-RTX2070S-8G-EVO
Operating System / Firmware or BIOS version: Ubuntu Server 18.04

[Motherboard Vendor/Model]
ASRock Fatal1ty X399 Professional Gaming

[CPU vendor/processor number]
AMD Ryzen Threadripper 2950X

[Memory vendor/model/specification]
Corsair DIMM 256 GB DDR4-2666 Octo-Kit

Hello @toamna2012, I’m sad to hear you also have this problem. Do any of your logs show NVRM: Xid 61 errors? Since you’re on AMD Ryzen, this thread should be of interest to you.
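Something like this should surface them if they’re there:

  # search the kernel ring buffer and the persisted logs
  sudo dmesg | grep -i "NVRM: Xid"
  grep -i "NVRM: Xid" /var/log/kern.log /var/log/syslog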

Edit: I contacted ASRock yesterday with a carefully written explanation of my problem (which occurs on both Linux and Windows), and all they could say was, again, “We don’t support Linux”. It’s annoying; sometimes it seems they aren’t even reading.

So far, the Xid 61 error hasn’t occurred on Threadripper systems; furthermore, it’s rather a power management problem, happening at idle.
Since you both have problems with the combination of ASRock boards and RTX SUPERs, this might really be some board-specific problem. Maybe the SUPERs are drawing too much power over the PCIe bus for the mainboard’s voltage regulators, I don’t know.
A general problem with the PSU would have different symptoms: the GPU would just shut down and the driver would report an Xid 79 error while the system keeps running.
It’s often a problem with vendors: as soon as the word ‘Linux’ appears, they instantly cut the line. You’ll always have to reproduce the issue on Windows, and ‘Don’t mention the Linux’.

For me, running the same 4x TURBO-RTX2070S-8G-EVO GPUs (all four in one system) on an Intel i7 with a Gigabyte GA-X99 motherboard works without problems.
So it could be that only the AMD + ASRock + SUPER cards combination has this problem.

Still trying some settings from the other thread.