My 2080 Ti’s fan jumps from 40% when the GPU temperature is 84 °C to over 90% (which makes a terrible noise) at 85 °C.
I want to set up a fan control profile so that the GPU fan ramps up a bit sooner. As far as I can tell, I need to enable Coolbits (although nvidia-settings says “You should never have to enable this.”), and even then nvidia-settings doesn’t offer a thermal profile, just a slider that sets a constant fan speed (not a function of temperature).
I talked to Nvidia’s general support, and they could only give me Windows tools for this, not Ubuntu ones, so they referred me here. Millions of people use Nvidia products with Ubuntu or other Linux distributions, so there has to be a good way to do what I want, but I can’t find it.
So here I am, confused. How should I go about it?
I took a look at the first 10 options. Many of them are irrelevant to my system (I have a desktop, not a laptop or a Mac…), most are outdated, and none are by Nvidia.
Which of those do you recommend for a desktop running Ubuntu 20.04?
Also, I think Nvidia warns users not to enable Coolbits (to the point of voiding the warranty). None of those repos are by Nvidia. While Nvidia does provide a tool for Windows users, I don’t see anything for Linux, even though most deep learning, for instance, is done on Linux. Is that the current situation, or am I misunderstanding it?
I expected you could hit ctrl+f and type nvidia on that page. This would have led you to e.g.
https://github.com/nan0s7/nfancurve
I actually doubt there is any official Nvidia fan-control software for Windows; most of these tools are released by board vendors.
Nvidia live chat support was willing to recommend a tool for Windows, which is most likely user friendly, but not one for Linux.
Anyway, since you did point to a specific repo, I gave it a shot.
As a prerequisite for using it, I have to enable Coolbits.
I have two 2080 Tis in my system, and unfortunately nvidia-settings creates a config for just one of them (device 0).
I tried manually copy-pasting the section and changing “Device0” to “Device1”, and that didn’t work.
I also tried running nvidia-xconfig with the all-GPUs flag, and that resulted in Ubuntu not loading at all (it did load again after I removed xorg.conf in recovery mode).
Try creating /etc/X11/xorg.conf.d/11-nvidia-coolbits.conf
Section "OutputClass"
Identifier "nvidia"
MatchDriver "nvidia-drm"
Driver "nvidia"
Option "Coolbits" "<desiredvalue>"
EndSection
I just tried that with <desiredvalue> set to 4 (i.e. Option "Coolbits" "4").
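For reference, Coolbits is a bitmask, and 4 is the bit that unlocks manual fan control in nvidia-settings. Below is a minimal Python sketch that decodes a value. The bit descriptions are paraphrased from the NVIDIA Linux driver README as I remember it, so treat them as an assumption and check the README for your driver version; `decode_coolbits` is a made-up helper name.

```python
# Coolbits bits, paraphrased from the NVIDIA Linux driver README (assumption:
# verify against the README shipped with your driver version).
COOLBITS = {
    1: "overclocking of older (pre-Fermi) GPUs",
    2: "attempt SLI between GPUs with different amounts of video memory",
    4: "manual GPU fan speed control",
    8: "clock offsets on the PowerMizer page",
    16: "overvoltage via nvidia-settings command-line options",
}


def decode_coolbits(value):
    """Return the list of features a Coolbits value switches on (hypothetical helper)."""
    return [desc for bit, desc in COOLBITS.items() if value & bit]


print(decode_coolbits(4))
print(decode_coolbits(12))
```

So a value of 4 on its own should be enough for fan control; values such as 12 or 28 additionally unlock the overclocking-related pages.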
In nvidia-settings I can now control the fan for device 0 only (not device 1). I’m not sure why, since the above settings should have applied to every GPU. I’ll try deleting xorg.conf and see if that helps.
The good news is that I am able to control the fan speed for GPU 0 in nvidia-settings (and Ubuntu did load).
The bad news is that even after deleting xorg.conf from /etc/X11 and creating /etc/X11/xorg.conf.d/11-nvidia-coolbits.conf as suggested, I am not able to control device 1. Unfortunately that is the problematic GPU (the one with the seemingly problematic fan profile).
I can see both of them in nvidia-smi:
(duh1) yoni@Garfield:~$ nvidia-smi
Thu Mar 4 15:22:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   35C    P8    17W / 260W |      6MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:65:00.0  On |                  N/A |
| 27%   29C    P8     8W / 250W |    222MiB / 11016MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1261      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1261      G   /usr/lib/xorg/Xorg                 87MiB |
|    1   N/A  N/A      1580      G   /usr/bin/gnome-shell               95MiB |
|    1   N/A  N/A      2035      G   ...AAAAAAAAA= --shared-files       38MiB |
+-----------------------------------------------------------------------------+
and I am able to train neural nets with both of them, but gpu1 makes a horrible sound…
Maybe it has something to do with it being connected to the screen? :P
I guess the second GPU only gets added as a GPU screen, which is not addressable, so Coolbits is disabled for it. Please try adding
Option "AllowNVIDIAGPUScreens" "false"
inside the OutputClass section of the file you created.
Still the same (can’t control gpu1’s fan through nvidia-settings).
To make sure we are talking about the same file:
(duh1) yoni@Garfield:/etc/X11/xorg.conf.d$ cat *
Section "OutputClass"
Identifier "nvidia"
MatchDriver "nvidia-drm"
Driver "nvidia"
Option "Coolbits" "4"
Option "AllowNVIDIAGPUScreens" "false"
EndSection
Is there anything else I can try?
You could create an xorg.conf with all gpus, like
Section "ServerLayout"
Identifier "dual"
Screen 0 "Screen0"
Screen 1 "Screen1" RightOf "Screen0"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:23:0:0"
Option "Coolbits" "4"
Option "AllowEmptyInitialConfiguration"
EndSection
Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:101:0:0"
Option "Coolbits" "4"
Option "AllowEmptyInitialConfiguration"
EndSection
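A note on where the BusID values come from: nvidia-smi prints the PCI bus number in hexadecimal (17 and 65 in the output above), while xorg.conf expects decimal, hence 23 and 101. A small Python sketch of that conversion, in case it helps with other cards (`xorg_bus_id` is a made-up helper name, not part of any NVIDIA tool):

```python
def xorg_bus_id(smi_bus_id):
    """Convert an nvidia-smi Bus-Id such as '00000000:65:00.0' (hexadecimal)
    into the decimal 'PCI:bus:device:function' form used in xorg.conf."""
    _domain, bus, dev_func = smi_bus_id.split(":")
    device, function = dev_func.split(".")
    return "PCI:{}:{}:{}".format(int(bus, 16), int(device, 16), int(function, 16))


print(xorg_bus_id("00000000:17:00.0"))  # PCI:23:0:0
print(xorg_bus_id("00000000:65:00.0"))  # PCI:101:0:0
```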
I’ve tried putting that content in the file I created in /etc/X11/xorg.conf.d, and now Ubuntu won’t load at all unless I delete that file or revert to the previous content, which enabled Coolbits for only one of my GPUs…
What can I try now?
My bad, please switch the busid entries of both device sections.
It’s cool, but still no bueno…
I’ve tried:
Section "ServerLayout"
Identifier "dual"
Screen 0 "Screen0"
Screen 1 "Screen1" RightOf "Screen0"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:101:0:0"
Option "Coolbits" "4"
Option "AllowEmptyInitialConfiguration"
EndSection
Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:23:0:0"
Option "Coolbits" "4"
Option "AllowEmptyInitialConfiguration"
EndSection
and again Ubuntu won’t even load unless I delete the file or revert.
I have just one physical screen (and two 2080 Tis), so maybe it’s the Screen 1 "Screen1" RightOf "Screen0" line? I don’t know.
Is there any output from my machine that would help you determine the issue?
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
Yes, it looks like it wants some Screen sections. Add
Section "Screen"
Identifier "Screen0"
Device "Device0"
Endsection
Section "Screen"
Identifier "Screen1"
Device "Device1"
Endsection
Alright, almost there! (Attached: nvidia-bug-report.log.gz, 466.8 KB.) Now Ubuntu does load, and I can control both GPUs’ fans via nvidia-settings (which, as far as I know, means Coolbits is now enabled for both).
Now there are just two more things:
- When my mouse pointer goes “off the screen” it becomes invisible. This only happens on the right side; on the left side the pointer doesn’t leave the screen (I hope that makes sense). This didn’t happen before, so I suppose it has to do with the new config. How do I fix it? I’ll attach both nvidia-bug-report and the config to this message.
- Which specific tool do you recommend for controlling the fans dynamically? (I want GPU 1’s fan speed to follow the temperature rather than stay at a static value.)
(duh1) yoni@Garfield:/etc/X11/xorg.conf.d$ ls
11-nvidia-coolbits.conf
(duh1) yoni@Garfield:/etc/X11/xorg.conf.d$ cat *.conf
Section "ServerLayout"
Identifier "dual"
Screen 0 "Screen0"
Screen 1 "Screen1" RightOf "Screen0"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:101:0:0"
Option "Coolbits" "4"
Option "AllowEmptyInitialConfiguration"
EndSection
Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BusID "PCI:23:0:0"
Option "Coolbits" "4"
Option "AllowEmptyInitialConfiguration"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Endsection
Section "Screen"
Identifier "Screen1"
Device "Device1"
Endsection
Try
Screen 1 "Screen1" Relative "Screen0" 0 3000
instead of
Screen 1 "Screen1" RightOf "Screen0"
This should position the invisible screen below the visible one, with a gap in between that the mouse pointer can’t jump across.
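On the question of a temperature-based fan profile: once Coolbits is enabled for both GPUs, the fan can be driven from a script, which appears to be roughly what nfancurve (linked earlier) does in shell. Below is a minimal Python sketch of the idea, not a finished tool. It assumes the `GPUFanControlState` and `GPUTargetFanSpeed` attributes of nvidia-settings and the `temperature.gpu` query of nvidia-smi; the curve points and the fan index are made-up placeholders, and the sketch only prints the commands instead of running them.

```python
# Illustrative curve as (temperature in C, fan speed in %). Placeholder values.
CURVE = [(40, 30), (60, 45), (75, 65), (85, 90)]


def fan_percent(temp_c, curve=CURVE):
    """Linearly interpolate a fan speed (%) for a temperature (C),
    clamping to the first and last points of the curve."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    if temp_c >= curve[-1][0]:
        return curve[-1][1]
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return round(s0 + (s1 - s0) * (temp_c - t0) / (t1 - t0))


def read_temp_command(gpu):
    """Command that prints the temperature of one GPU as a bare number."""
    return ["nvidia-smi", "-i", str(gpu),
            "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]


def set_fan_command(gpu, fan, percent):
    """Command that enables manual fan control on a GPU and sets one fan's speed.
    Which fan index belongs to which GPU is an assumption to verify locally."""
    return ["nvidia-settings",
            "-a", "[gpu:{}]/GPUFanControlState=1".format(gpu),
            "-a", "[fan:{}]/GPUTargetFanSpeed={}".format(fan, percent)]


# Dry run: show what would be executed for a few temperatures on GPU 1.
for temp in (35, 70, 84, 90):
    print(temp, " ".join(set_fan_command(1, 1, fan_percent(temp))))
```

A real loop would run `read_temp_command` through `subprocess.run` every few seconds and then run the matching `set_fan_command`. nvidia-settings needs access to the running X session (DISPLAY), and `nvidia-settings -q fans` should list which fan indices exist on your system.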