Ubuntu 18.04 and RTX 2080 SUPER systematically freezing

Thank you for your kind considerations, @generix.

After today’s formatting attempt (from 18.04.4 LTS to 19.10) and BIOS update (4.00 to 4.30), some settings I had been using were restored to defaults. I performed steps 1-5 (list above) again just to be sure. After the BIOS update, I played a Dota 2 match and after about 30-40 minutes my PC froze again. In complete despair, I turned my PC off, opened its cover, and disconnected and reconnected my GPU. Since then I’ve been running one instance of my deep learning model and the freeze has not happened (but I think it’s a happy coincidence). I’ll let it run for today to see if anything has changed at all; if not, I’ll install Ubuntu 20.04. Since that version has a newer kernel, I think the chances of improving my situation are higher. When formatting, I’m keeping my /home folder. Do you think that can be a problem in this case, since many configuration files are being kept?

About your workaround, what frequency do you suggest? sudo nvidia-smi -lgc 1900?

Maybe the mentioned “quality problems” were in fact a bad connection, e.g. oxidized slot pins/power connectors.
I guess -lgc takes lower and upper limits, so -lgc 300,1900 might be worth a shot.
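
Something like this is what I have in mind; a minimal sketch, assuming a driver new enough to support -lgc (415+), with the range values just as examples:

sudo nvidia-smi -pm 1  # enable persistence mode so the setting sticks
sudo nvidia-smi -lgc 300,1900  # lock graphics clocks between 300 and 1900 MHz
nvidia-smi --query-gpu=clocks.gr --format=csv  # verify the current graphics clock
sudo nvidia-smi -rgc  # reset to default clocks when done testing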


Unfortunately, it was indeed a happy coincidence. My PC froze again after playing for 40 minutes. I will install Ubuntu 20.04 now and try to lock the lower and upper limits as suggested. Do you think keeping /home when formatting can be a problem in this case?

I’m on Ubuntu 20.04 now with limited clock settings. I used “sudo nvidia-smi -lgc 300,1650” just as a starting point to monitor. Let’s see how it goes.

Hey there, @generix.

I’m here just to post an update. Limiting my GPU clock wasn’t able to stop the freezes. Setting the upper limit even lower (1100) didn’t help either, which made me rule out problems with my PSU. I came across this post yesterday and it caught my attention: someone was reporting a problem with NVIDIA’s adaptive mode. I tried the solution and today I didn’t experience any freezes. I also disabled GNOME’s notifications, but I don’t know if that has anything to do with this problem.

In case anyone looks for this post in the future, what seems to have solved the problem for me was adding the following command to my startup applications (it changes the GPU performance level to full performance):

nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"

[Screenshot from 2020-06-09: Startup Applications entry with the nvidia-settings command]
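
For anyone who prefers a file-based setup over the GUI, an equivalent GNOME autostart entry would look roughly like this (a sketch; the file name nvidia-powermizer.desktop is just an example):

# ~/.config/autostart/nvidia-powermizer.desktop (example file name)
[Desktop Entry]
Type=Application
Name=NVIDIA PowerMizer Max Performance
Exec=nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
X-GNOME-Autostart-enabled=true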

I also forgot to tell you how this problem started. I bought my new PC in January, and when it arrived I installed Ubuntu 18.04 and Windows 10 in a dual-boot setting. Everything was OK until one day I booted into Windows. Right after that, I started experiencing the same errors I have today. I don’t recall how I solved the problem, but it somehow stopped happening. Last week or so I logged into Windows after a long time, and the problem came back. I don’t know if it updated something, but Windows certainly messed with some of my BIOS configurations (I had to change my DRAM frequency settings again, for example). Is it a coincidence, or can Windows cause a problem like that?

I will come back here in a week or so to report whether everything is still functioning. If so, I will close this post and mark it as solved.

The only thing that comes to my mind is that some vendors distribute BIOS updates through Windows Update; in that case, loading BIOS defaults should remove any bad settings.
The second thing would be the mainboard being left in some bad state after a Windows shutdown, but disconnecting the power and letting it sit unpowered for half an hour or so should bring it back. Though you already did that when removing the card.
Otherwise, Windows shouldn’t be able to change the BIOS settings unless some vendor software is installed for that purpose.
I don’t know if Windows’ ‘Fast Startup’ has any influence on that.


The error persists and it is also happening on Windows 10 (it behaves the same way described in the first post). On Ubuntu I can’t even SSH in from another machine to see what’s happening or try to restart my PC. I’m running out of options and am contacting the motherboard vendor right now to see if they can shed some light on this situation. :(
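
In the meantime, I’m making the systemd journal persistent so that after the next freeze and hard reset I can at least read the log of the frozen boot (a sketch, assuming Ubuntu’s systemd defaults):

sudo mkdir -p /var/log/journal  # enable persistent journal storage
sudo systemd-tmpfiles --create --prefix /var/log/journal  # fix ownership/ACLs
# after the next freeze and reboot, inspect the previous boot:
journalctl -b -1 -e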

Edit: this post reveals interesting information regarding PSU problems. It sounds almost exactly like what I described. How can I know if my PSU is damaged?

Unfortunately, those kinds of problems are only debuggable by swapping parts.

Dear all,

We are having the same issue with a GPU cluster.

We bought 40 (forty) ASUS GeForce RTX GPUs, model TURBO-RTX2070S-8G-EVO, for a GPU cluster and have problems with all of these cards.

Here is the problem description; we can provide additional, in-depth details at any time.
A single node has this configuration:
4x ASUS GeForce RTX GPUs, ASRock Fatal1ty X399 Professional Gaming motherboard, AMD Ryzen Threadripper 2950X CPU, Corsair DIMM 256 GB DDR4-2666 octo-kit, SilverStone SST-ST1500-GS power supply (1500 W), WD Black SN750 500 GB solid state drive, 3x Noctua NF-F12 fans.
All firmware/BIOS up to date.

Software: Ubuntu Server 18.04, NVIDIA driver 440.59, CUDA 10.2 (10.1 also installed), cuDNN 7.6.5.32-1+cuda10.2.

The problem occurs when running TensorFlow: the system runs the training and suddenly stops with no power!
There is no log and no console output. Temperatures on the system and the cards never go above 70 °C.
We also tried with one single GPU in the same hardware config. The system “freezes” (it still has power), again with no logs and no console output.

So it is not a lack of power. We even tried different 1600 W power supplies; the issue is the same.

If we run the same hardware config but with four older Zotac Gaming GeForce RTX 2070 cards (not the ASUS 2070 SUPER Turbo version), the system has no problems.

We have contacted ASUS and retailer support; both said they DO NOT support Linux.

We could not find a solution yet!

[Product Information]
Product Type: Graphics Card
Product Model: TURBO-RTX2070S-8G-EVO
Operating System / Firmware or BIOS version: Ubuntu Server 18.04

[Motherboard Vendor/Model]
ASRock Fatal1ty X399 Professional Gaming

[CPU vendor/processor number]
AMD Ryzen Threadripper 2950X

[Memory vendor/model/specification]
Corsair DIMM 256 GB DDR4-2666 Octo-Kit

Hello @toamna2012, I’m sad to hear you also have this problem. Can you check any of your logs for NVRM: Xid 61 errors? Since you’re on AMD Ryzen, this thread should be of interest to you.
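
A quick way to check, assuming logs in the usual places:

sudo dmesg | grep -i "NVRM: Xid"  # kernel ring buffer, current boot
grep -i "NVRM: Xid" /var/log/syslog*  # older boots, if rotated
journalctl -k | grep -i Xid  # kernel messages from the systemd journal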

Edit: I contacted ASRock yesterday with a carefully written explanation of my problem (which occurs on Linux and Windows) and all they can say is also “We don’t support Linux”. It’s annoying, and sometimes it seems they aren’t even reading.

So far, the Xid 61 error hasn’t occurred on Threadripper systems; furthermore, it’s rather a power management problem, happening at idle.
Since you both have problems with the combination of ASRock boards and RTX SUPERs, this might really be some board-specific problem. Maybe the SUPERs are drawing too much power over the PCIe bus for the mainboard’s voltage regulators, I don’t know.
A general problem with the PSU would have different symptoms: the GPU would just shut down and the driver would report an Xid 79 error while the system keeps running.
It’s often a problem with vendors: as soon as the word ‘Linux’ appears, they instantly cut the line. You’ll always have to reproduce the issue on Windows, and ‘Don’t mention the Linux’.


For me, running the same 4x TURBO-RTX2070S-8G-EVO GPUs (all four in one system) on an Intel i7 with a Gigabyte GA-X99 motherboard works without problems.
So it could be that only the AMD + ASRock + SUPER card combination has this problem.

Still trying some settings from the other thread.

Hello @generix and @toamna2012. Some updates: I’ve been running my setup without problems or interruptions for the past three days. The only thing I changed was my DRAM operating frequency: from 3600 MHz (the maximum) down to 3000 MHz. After I noticed my settings were stable, I also used “sudo nvidia-smi -lgc 300,2115” to return to my old GPU frequency configuration, and it’s still OK. Do you have any insights about why this is happening? Is there any way I can check for DRAM problems before requesting an RMA for a motherboard replacement?

According to Intel
https://ark.intel.com/content/www/de/de/ark/products/190887/intel-core-i9-9900kf-processor-16m-cache-up-to-5-00-ghz.html
your CPU only supports 2666 MHz memory clocks, so you have been heavily overclocking the memory.
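
If you want to sanity-check the RAM before going the RMA route, a userspace stress test is one option; a sketch (size and pass count are arbitrary examples, and memtest86+ from boot media is more thorough):

sudo apt install memtester
sudo memtester 4G 3  # lock and test 4 GB of RAM, 3 passes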


Why did you use nvidia-smi -lgc 300,2115?

"NVIDIA has paired 8 GB GDDR6 memory with the GeForce RTX 2080 SUPER, which are connected using a 256-bit memory interface. The GPU is operating at a frequency of 1650 MHz, which can be boosted up to 1815 MHz, memory is running at 1937 MHz. "

Shouldn’t it be nvidia-smi -lgc 300,1815?


OMG, that’s crazy. I’m using “nvidia-smi -lgc 300,2115” because that’s what “PowerMizer” originally showed (figure attached). According to the GeForce RTX 2080 SUPER’s specifications, the maximum I should expect is really something around 1815 MHz.

I have a second computer here using a ZOTAC GTX 1080 AMP!. According to the specifications, this GPU can be boosted up to 1822 MHz. You can see below a screenshot of what “PowerMizer” is showing there. When I bought this computer I just installed NVIDIA’s driver and let it be. Is it overclocked by default? Why is it reaching such high frequencies when it shouldn’t?
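
For reference, here is how I’m checking what clocks the driver itself reports (assuming these nvidia-smi query options are available on 440.x, which they should be):

nvidia-smi -q -d CLOCK  # current and max clocks
nvidia-smi -q -d SUPPORTED_CLOCKS  # every supported graphics/memory clock pair
nvidia-smi --query-gpu=clocks.max.graphics --format=csv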

This phenomenon has been noticed before, especially with Turing cards: the NVIDIA Linux driver doesn’t use the stock clocks but the vendor-defined OC clocks. Depending on GPU temperature, it shouldn’t actually reach those clocks, though.


That’s interesting. I noticed yesterday that I was reaching something near 2050 MHz. My GPU never reached temperatures higher than 45 °C. On the second computer (ZOTAC GTX 1080 AMP!) the current clock is 1949 MHz and the temperature is 51 °C, pretty low.

Workaround:
Set in BIOS:
Suspend to RAM → DISABLED
Global C-States Control → DISABLED
ACPI_CST C1 Declaration → DISABLED
PCIE Reset Control → DISABLED

Then set nvidia-smi -pm 1 and nvidia-smi -lgc 1600,1605.
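
If those nvidia-smi settings need to survive reboots, a small systemd unit is one way to apply them at boot; a sketch (the unit name nvidia-clock-lock.service and paths are just examples):

# /etc/systemd/system/nvidia-clock-lock.service (example unit name)
[Unit]
Description=Lock NVIDIA GPU clocks at boot

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc 1600,1605

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now nvidia-clock-lock.service.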
