I’m running Fedora and previously had an RTX 3060 as my only GPU. Everything worked fine then.
Recently, I installed an A6000. As part of this install, I moved the 3060 to the secondary PCIe slot so the A6000 could go in the primary slot.
After updating the drivers (515.65) and xconfig, I noticed in nvidia-smi that the 3060’s fan speed was 0%, and it stayed at 0% even as the card’s temperature approached 60 °C.
I went into nvidia-settings (GUI) and tried enabling manual fan speed on the 3060 there, but this had no effect. On a whim, I also tried setting a manual fan speed on the A6000: its own fan stayed at 30%, but the 3060’s fans ramped up to the speed I had set for the A6000.
Using the command line:
nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=<whatever>'
This sets the 3060 to the specified <whatever> fan speed.
nvidia-settings -a '[gpu:1]/GPUFanControlState=1' -a '[fan:1]/GPUTargetFanSpeed=<whatever>'
This sets the A6000 to the specified <whatever> fan speed.
Note: something spooky happened here. Details further down…
nvidia-settings -a '[gpu:0]/GPUFanControlState=0'
This drops the 3060 back to 0% fan.
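For anyone who wants to repeat these experiments without retyping the attribute strings, here is a rough sketch of a wrapper. `set_fan` and `DRY_RUN` are names I made up for illustration, not part of nvidia-settings; the only attributes used are the two from the commands above (GPUFanControlState and GPUTargetFanSpeed). The GPU and fan indices are passed separately on purpose, since on this machine they don't line up the way the tool reports.

```shell
# Hypothetical helper, not part of nvidia-settings.
# Usage: set_fan GPU_INDEX FAN_INDEX SPEED_PERCENT
# With DRY_RUN=1 the command is printed instead of executed.
set_fan() {
    local gpu="$1" fan="$2" speed="$3"
    # Accept only a plain integer from 0 to 100.
    case "$speed" in
        ''|*[!0-9]*) echo "speed must be an integer 0-100" >&2; return 1 ;;
    esac
    if [ "$speed" -gt 100 ]; then
        echo "speed must be an integer 0-100" >&2
        return 1
    fi
    local state="[gpu:${gpu}]/GPUFanControlState=1"
    local target="[fan:${fan}]/GPUTargetFanSpeed=${speed}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show the exact command that would be run.
        printf '%s\n' "nvidia-settings -a '${state}' -a '${target}'"
    else
        nvidia-settings -a "$state" -a "$target"
    fi
}
```

Running `DRY_RUN=1 set_fan 0 0 60` only prints the command, which is handy for checking the indices before touching the hardware.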
Further adding to my confusion:
$ nvidia-settings -q fans --verbose
3 Fans on XXXX:1
[0] XXXX:1[fan:0] (Fan 0)
Has the following name:
FAN-0
Is connected to the following GPU:
XXXX:1[gpu:1] (NVIDIA RTX A6000)
[1] XXXX:1[fan:1] (Fan 1)
Has the following name:
FAN-1
Is not connected to any GPU.
[2] XXXX:1[fan:2] (Fan 2)
Has the following name:
FAN-2
Is connected to the following GPU:
XXXX:1[gpu:0] (NVIDIA GeForce RTX 3060)
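To make the reported mapping easier to read at a glance, the listing above can be boiled down to one line per fan. This is only a sketch: `fan_map` is a hypothetical name, and the awk patterns assume the output looks like what I pasted (a `[fan:N]` header line followed by either a `[gpu:N]` line or "Is not connected to any GPU"). Other driver versions may format it differently.

```shell
# Hypothetical helper: summarise `nvidia-settings -q fans --verbose`
# as "fan:N -> gpu:M (name)" lines. Reads the listing on stdin.
fan_map() {
    awk '
        # Fan header line, e.g. "[0] host:1[fan:0] (Fan 0)"
        match($0, /\[fan:[0-9]+\]/) {
            fan = substr($0, RSTART + 1, RLENGTH - 2)
            next
        }
        # Fan that the tool reports as unattached
        /Is not connected to any GPU/ && fan != "" {
            print fan " -> none"
            fan = ""
            next
        }
        # GPU line that follows "Is connected to the following GPU:"
        fan != "" && match($0, /\[gpu:[0-9]+\]/) {
            gpu = substr($0, RSTART + 1, RLENGTH - 2)
            name = $0
            sub(/^[^(]*\(/, "", name)   # drop everything up to "("
            sub(/\)[^)]*$/, "", name)   # drop the closing ")"
            print fan " -> " gpu " (" name ")"
            fan = ""
        }
    '
}

# Usage: nvidia-settings -q fans --verbose | fan_map
```

Keep in mind this only shows what nvidia-settings believes, which is exactly the thing that seems to be wrong here.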
Note: the 3060 does have 2 fans on it, but they both seem to run (or not run) at the same speed, and both are controlled by manually setting the fan:0 target speed. The A6000 has only 1 fan, which is controlled by manually setting fan:2. Setting fan:1 doesn’t appear to do anything.
Re: spooky thing from above: While writing this post, I have kept digging into what’s going on. It was true when I wrote it that fan:1 set the A6000’s fan speed, but that is no longer the case. Maybe on the first go something in the nvidia-settings GUI was setting fan:2, and it was just a fluke that fan:1 looked like it controlled the A6000 when I ran the first command to enable manual control on it. Either way, fan:0 belongs to the 3060 while nvidia-settings thinks it belongs to the A6000, and fan:2 belongs to the A6000 while nvidia-settings thinks it belongs to the 3060.
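Since eyeballing nvidia-smi is how I got fooled the first time, a less error-prone check is to snapshot the fan speeds before and after each manual change and diff them. The sketch below assumes snapshots taken with `nvidia-smi --query-gpu=index,name,fan.speed --format=csv,noheader`, which separates fields with a comma and a space; `changed_gpus` is a hypothetical name. Also note that the index nvidia-smi prints may not be the same as the `[gpu:N]` index in nvidia-settings, so go by the name column.

```shell
# Hypothetical helper: print every GPU whose fan.speed differs between
# two snapshots (each a multi-line string of "index, name, fan.speed").
changed_gpus() {
    { printf '%s\n' "$1"; printf '%s\n' '---'; printf '%s\n' "$2"; } |
    awk -F', ' '
        $0 == "---" { after = 1; next }          # marker between snapshots
        !after      { before[$1] = $3; next }    # remember the old speed
        ($1 in before) && before[$1] != $3 {
            print $1 ": " $2 " (" before[$1] " -> " $3 ")"
        }
    '
}

# Usage sketch (the sleep length is a guess at how long a fan takes to settle):
#   q='--query-gpu=index,name,fan.speed --format=csv,noheader'
#   before=$(nvidia-smi $q)
#   nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=60'
#   sleep 5
#   after=$(nvidia-smi $q)
#   changed_gpus "$before" "$after"
```

A fan under automatic control can change speed on its own between snapshots, so a change on the "wrong" card is only meaningful if it repeats.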
So it would seem that nvidia-settings is very confused about which fan(s) are connected to which GPU. I am currently planning to remove and reinstall the drivers in the blind hope that this solves my problem, but if anyone out there has other options, I’d like to hear them. (It would make my life exponentially easier if there were some kind of config file I could edit or a utility I could run that would correctly enumerate the GPUs and their fans.)
Any help understanding and correcting this issue would be greatly appreciated! If there’s any additional information I can provide that would help, please ask.