NVIDIA GeForce RTX 2080 SUPER strange fan behaviour

Hello.

When I run a CUDA compute load, I observe the following: all 3 fans of my GPU spin, nvidia-smi shows a fan speed of around 34%, and the temperature is around 40 C. After 1-2 minutes one of the fans stops and the other 2 fans start running at 100% speed, which is very noisy.

However, when this happens, nvidia-smi shows a fan speed of 0%:

$ nvidia-smi -q | grep fan -i
Fan Speed : 0 %

The current temperature is 47 C

$ nvidia-smi -q | grep "Current Temp" -i
GPU Current Temp : 47 C
Memory Current Temp : N/A
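To catch the moment when the reported fan speed drops to 0% while the card is warm, the two readings can be logged side by side. The following is only a rough sketch: the function name `flag_suspicious` and the 45 C threshold are made up for illustration, and it assumes input lines in the form produced by `nvidia-smi --query-gpu=fan.speed,temperature.gpu --format=csv,noheader,nounits`.

```shell
# Flag samples where the driver reports 0 % fan speed although the GPU is warm.
# Reads lines of the form "<fan speed>, <temperature>" on stdin, e.g. "0, 47".
# The threshold (first argument, default 45) is an arbitrary illustrative value.
flag_suspicious() {
    awk -F', *' -v t="${1:-45}" '
        $1 == 0 && $2 >= t { print "suspicious: fan=" $1 "% temp=" $2 "C"; next }
        { print "ok: fan=" $1 "% temp=" $2 "C" }
    '
}

# Live use (needs an NVIDIA GPU and driver), sampling every 5 seconds:
# nvidia-smi --query-gpu=fan.speed,temperature.gpu \
#     --format=csv,noheader,nounits -l 5 | flag_suspicious 45
```

Running this during a CUDA load would show whether the 0% reading starts at the same time the two fans jump to full speed.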

It looks like something is wrong with the fan control. Why do only 2 fans work? Why do they run at 100% RPM while the temperature is 47 C? Even after I shut down my CUDA program, this fan apocalypse lasts 10 minutes or more, while the GPU temperature drops below 30 C. Can anybody help me with this? I use Ubuntu 18.04.

Driver version:

ii nvidia-driver-470 470.63.01-0ubuntu0.18.04.2 amd64 NVIDIA driver metapackage

The fan curve is set by the manufacturer in VBIOS, often accompanied by a vendor-specific fan control. Which brand/model is the card? Please check first if both fan groups are controlled by the driver:

nvidia-settings -q [fan:0]/GPUCurrentFanSpeed
nvidia-settings -q [fan:1]/GPUCurrentFanSpeed

Hello, thank you for helping.

Which brand/model is the card?

Gigabyte

The fan curve is set by the manufacturer in VBIOS, often accompanied by a vendor-specific fan control.

OK, regarding this I can say that I have done nothing with the VBIOS; the card came as is from the store. Which exact vendor-specific fan control tools can I use?

Please check first if both fan groups are controlled by the driver:

$ nvidia-settings -q [fan:0]/GPUCurrentFanSpeed
Attribute 'GPUCurrentFanSpeed' (alexhoppus-B450-GAMING-X:0[fan:0]): 0.
The valid values for 'GPUCurrentFanSpeed' are in the range 0 - 100 (inclusive).
'GPUCurrentFanSpeed' is a read-only attribute.
'GPUCurrentFanSpeed' can use the following target types: Fan.

$ nvidia-settings -q [fan:1]/GPUCurrentFanSpeed
ERROR: Error resolving target specification 'fan:1' (No targets match target specification), specified in query
'[fan:1]/GPUCurrentFanSpeed'.

Another piece of information that might be useful: I didn’t notice this noise earlier. It only appeared after a year of card usage, and now it happens under any kind of GPU load. As far as I know I didn’t change anything, so it looks strange to me.

Just checked: the same fan behaviour occurs when running glmark2. Right after glmark2 starts, the fans go crazy.

I can’t really make heads or tails of it. The nvidia-settings output suggests that only one fan is NVIDIA standard and that the two misbehaving ones are vendor-specific. So I wouldn’t expect a driver update to be able to change their behaviour, yet at the same time this issue appeared spontaneously. I could even imagine this being a hardware issue (maybe broken temperature sensor).
Vendor-specific fans are only controllable through NVAPI, which is Windows-only.
So I’d rather recommend checking the fan behaviour in Windows to rule out a hardware issue.

I’d rather recommend checking the fan behaviour in Windows to rule out a hardware issue

On Windows I got the same behaviour after playing Witcher 3 for 5 minutes (same PC).

I could even imagine this being a hardware issue (maybe broken temperature sensor).

If this were, for example, a broken sensor, then it should report a wrong temperature, but it reports 30-40 C, which looks correct.

Sometimes it is possible to control the fans manually using

$ nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=X'

This way, fan:0 appears to control all 3 fans of the card. However, something interferes with this manual control and speeds the fans up to a very high rate. At that point I can’t control their speed.

My guess is that the automatic fan control logic embedded somewhere in the VBIOS spins the fans up so hard because of old thermal paste, which cools poorly. I will check that.
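For reference, a small wrapper around the manual-control command above could look like the sketch below. The helper names `clamp_speed`, `set_fan_speed` and `restore_auto_fan` are invented for illustration; the clamp simply keeps the value inside the 0 - 100 range that nvidia-settings reports as valid. On Linux, writing these attributes usually also requires fan control to be enabled through the Coolbits option in the X configuration, which is assumed here.

```shell
# Clamp a requested fan speed (integer percent) to the valid 0-100 range.
clamp_speed() {
    speed=$1
    if [ "$speed" -lt 0 ]; then speed=0; fi
    if [ "$speed" -gt 100 ]; then speed=100; fi
    echo "$speed"
}

# Switch gpu:0 to manual fan control and set fan:0 to the clamped target.
# Needs a running X session and an NVIDIA driver.
set_fan_speed() {
    target=$(clamp_speed "$1")
    nvidia-settings -a '[gpu:0]/GPUFanControlState=1' \
                    -a "[fan:0]/GPUTargetFanSpeed=${target}"
}

# Hand control back to the automatic fan curve.
restore_auto_fan() {
    nvidia-settings -a '[gpu:0]/GPUFanControlState=0'
}
```

Calling `set_fan_speed 40` and later `restore_auto_fan` makes it easy to compare manual and automatic behaviour under the same load.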

Hello.

I have tried the following:

  1. Replaced the thermal paste and thermal pads. This helped a little and the noise is not as loud now.
  2. Tried all the software I could find on Windows, such as AORUS Engine, EVGA Precision and many others. AORUS Engine, moreover, was downloaded from the official Gigabyte website. However, nothing seems to have any effect. I tried setting a temperature curve, setting a manual temperature, and disabling/enabling 3D Active Fan. Nothing works; the card behaves the same as before. Before that I tried to do the same with nvidia-smi on Linux. The problem occurs on both Linux and Windows.

As a reminder, here is the exact problem I am trying to solve:

Under GPU/compute (CUDA) load the fans are LOUD. For example, right now the temperature of my card under load is 48 C (while the target temperature is 84 C). I would like to trade less noise for a higher card temperature. No matter what I do, it works like this: dead silence below 60 C; after that the fans start and spin at a high rate even when the temperature falls back to 48 C. I can’t even tell the exact RPM, because the software on both Windows and Linux shows 0 RPM for the fan speed.

If anyone knows how to solve this, please let me know.

@generix

Thank you.

This really sounds like the fan tachometer on the GPU board is broken.

Is it possible to swap the two misbehaving fans?