When I put a CUDA compute load on my GPU, I observe the following: all 3 fans spin, nvidia-smi shows a fan speed of around 34%, and the temperature is around 40 C. After 1-2 min one of the fans stops and the other 2 fans start running at 100% speed, which is very noisy.
However, when this happens, nvidia-smi shows a fan speed of 0%:
$ nvidia-smi -q | grep fan -i
Fan Speed : 0 %
The current temperature is 47 C
$ nvidia-smi -q | grep "Current Temp" -i
GPU Current Temp : 47 C
Memory Current Temp : N/A
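A note for anyone reproducing this: to see exactly when the reported fan speed drops to 0 %, it can help to log temperature and fan speed once per second instead of querying by hand. Below is a minimal sketch, assuming nvidia-smi's --query-gpu CSV output with the timestamp, temperature.gpu and fan.speed fields. The helper name mark_zero_fan is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tool): reads CSV lines of the
# form "timestamp, temperature, fan speed" on stdin and marks every sample
# where the driver reports a numeric fan speed of 0.
mark_zero_fan() {
    awk -F', *' '{
        flag = ""
        # Only flag real numeric zeros, not values such as [N/A].
        if ($3 ~ /^[0-9]+$/ && $3 + 0 == 0) flag = "  <-- fan reported as 0 %"
        print $0 flag
    }'
}

# Usage (needs a GPU, so it is left commented out here):
# nvidia-smi --query-gpu=timestamp,temperature.gpu,fan.speed \
#            --format=csv,noheader,nounits -l 1 | mark_zero_fan
```

The timestamps of the flagged lines can then be compared with the moment the fans audibly speed up.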
It looks like something is wrong with the fan control. Why do only 2 fans work? Why do they run at 100% RPM while the temperature is 47 C? Even after I shut down my CUDA program, this fan apocalypse lasts 10 minutes or more, while the GPU temperature drops below 30 C. Can anybody help me with this? I use Ubuntu 18.04.
Driver version:
ii nvidia-driver-470 470.63.01-0ubuntu0.18.04.2 amd64 NVIDIA driver metapackage
The fan curve is set by the manufacturer in VBIOS, often accompanied by a vendor-specific fan control. Which brand/model is the card? Please check first if both fan groups are controlled by the driver:
The fan curve is set by the manufacturer in VBIOS, often accompanied by a vendor-specific fan control.
OK, regarding this I can state that I have done nothing with the VBIOS; the card came as is from the store. Which exact vendor-specific fan control tools can I use?
Please check first if both fan groups are controlled by the driver:
$ nvidia-settings -q [fan:0]/GPUCurrentFanSpeed
Attribute 'GPUCurrentFanSpeed' (alexhoppus-B450-GAMING-X:0[fan:0]): 0.
The valid values for 'GPUCurrentFanSpeed' are in the range 0 - 100 (inclusive).
'GPUCurrentFanSpeed' is a read-only attribute.
'GPUCurrentFanSpeed' can use the following target types: Fan.
$ nvidia-settings -q [fan:1]/GPUCurrentFanSpeed
ERROR: Error resolving target specification 'fan:1' (No targets match target specification), specified in query
'[fan:1]/GPUCurrentFanSpeed'.
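The two queries above can be wrapped in a small loop that probes several fan indices in one go. This is only a sketch: probe_fans and the NVSET variable are hypothetical names, and it assumes nvidia-settings is run inside an X session and that its -t (terse) option prints just the attribute value.

```shell
#!/bin/sh
# NVSET lets the command be swapped out (hypothetical convenience variable).
NVSET=${NVSET:-nvidia-settings}

# Hypothetical helper: query GPUCurrentFanSpeed for fan indices 0..N-1
# (default N=3) and report which fan targets the driver exposes.
probe_fans() {
    count=${1:-3}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Empty output or a failing command is treated as "no such target".
        if speed=$("$NVSET" -t -q "[fan:$i]/GPUCurrentFanSpeed" 2>/dev/null) \
            && [ -n "$speed" ]; then
            echo "fan:$i reports $speed %"
        else
            echo "fan:$i is not exposed by the driver"
        fi
        i=$((i + 1))
    done
}

# Usage: probe_fans 3
```

On the card in this thread it would presumably report a value for fan:0 only, matching the manual queries.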
Also, another piece of information that might be useful: I didn't notice this noise earlier. It only appeared after a year of using the card, and now it happens under any kind of GPU load. As far as I know I didn't change anything, so it looks strange to me.
I can't really make heads or tails of it. The nvidia-settings output points to only one fan being NVIDIA standard and the two misbehaving ones being vendor-specific. So I wouldn't expect a driver update to be able to change their behaviour, but at the same time this issue appeared spontaneously. I could even imagine this being a hardware issue (maybe a broken temperature sensor).
Vendor-specific fans are only controllable through NVAPI, which is Windows-only.
So I'd recommend checking the fan behaviour in Windows to rule out a hardware issue.
i’d rather recommend checking the fan behaviour in Windows to rule out a hw issue
On Windows, I got the same behaviour after playing Witcher 3 for 5 minutes (same PC).
I could even imagine this being a hardware issue (maybe broken temperature sensor).
If this were, for example, a broken sensor, then it should report a wrong temperature, but it reports 30-40 C, which looks correct.
Sometimes it is possible to control the fans manually using
$ nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=X'
This way fan:0 appears to control all 3 fans of the card. However, something interferes with this manual control and speeds the fans up to a very high rate. At that point I can't control their speed.
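For the periods when manual control does take effect, the X in the command above could come from a simple temperature-to-speed curve. The sketch below is purely illustrative: fan_target is a hypothetical helper, and the curve points (30 % up to 50 C, rising linearly to 100 % at 84 C) are assumptions, not values from the VBIOS or the vendor.

```shell
#!/bin/sh
# Hypothetical fan curve: maps a GPU temperature in C (integer) to a target
# fan speed in percent. The 50 C / 84 C / 30 % points are assumed values.
fan_target() {
    t=$1
    if [ "$t" -le 50 ]; then
        echo 30
    elif [ "$t" -ge 84 ]; then
        echo 100
    else
        # Linear ramp from 30 % at 50 C to 100 % at 84 C (integer maths).
        echo $(( 30 + ($t - 50) * 70 / 34 ))
    fi
}

# Usage (needs a GPU and an X session, so it is left commented out):
# t=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
# nvidia-settings -a '[gpu:0]/GPUFanControlState=1' \
#                 -a "[fan:0]/GPUTargetFanSpeed=$(fan_target "$t")"
```

Whether such a target is honoured depends on whatever is overriding the manual setting on this card, so this may not help while the override is active.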
My guess is that the automatic fan control logic embedded somewhere in the VBIOS speeds up the fans so aggressively because of old thermal paste, which cools poorly. I will check that.
Changed the thermal paste and thermal pads. This helped a little, and the noise is not as loud now.
I tried all the software I could find on Windows, such as AORUS Engine, EVGA Precision and many others. Moreover, AORUS Engine was downloaded from the official Gigabyte website. However, nothing seems to have any effect. I tried setting a temperature curve, setting a manual temperature, and disabling/enabling 3D Active Fan. Nothing works; the card behaves the same as before. Before that I tried to do the same with nvidia-smi on Linux. The problem occurs on both Linux and Windows.
To restate the exact problem I am trying to solve:
Under GPU compute (CUDA) load the fans are LOUD. For example, right now the temperature of my card under load is 48 C (while the target temperature is 84 C). I want less noise and a higher temperature for my card as a trade-off. No matter what I do, it works like this: dead silence below 60 C, then the fans start and spin at a high rate even when the temperature falls back to 48 C. I can't even state the exact RPM, because the software on both Windows and Linux shows 0 RPM for the fan speed.
If anyone knows how to solve this, please let me know.