After adding a second GPU, automatic fan control always leaves the first GPU's fan at 0%

I’m running Fedora and previously had an RTX 3060 as my only GPU. Everything worked fine then.

Recently, I installed an A6000. As part of this install, I moved the 3060 to the secondary PCIe slot so the A6000 could go in the primary slot.

After updating drivers (515.65) and xconfig, I noticed in nvidia-smi that the 3060 fan speed was 0%, and it stayed at 0% even as its temperature approached 60 C.

I went into nvidia-settings (GUI) and tried enabling manual fan speed on the 3060 there, but this had no effect. On a whim, I also tried setting manual fan speed on the A6000. Its fan stayed at 30%, but the 3060 then ramped up to the speed I had set for the A6000.

Using the command line:

nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=<whatever>'

This sets the 3060 to the specified <whatever> fan speed.

nvidia-settings -a '[gpu:1]/GPUFanControlState=1' -a '[fan:1]/GPUTargetFanSpeed=<whatever>'

This sets the A6000 to the specified <whatever> fan speed.
Note: something spooky happened here. Details further down…

nvidia-settings -a '[gpu:0]/GPUFanControlState=0'

This drops the 3060 back to 0% fan.
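
For anyone repeating these tests, here is a minimal sketch of a helper that builds the same nvidia-settings call from separate GPU and fan indices, since in this thread the fan index does not always match the GPU index. The function name fan_cmd is my own invention, and it only prints the command rather than running it, so nothing changes until you paste the output yourself.

```shell
# fan_cmd is a hypothetical helper, not part of nvidia-settings.
# It prints (does not run) the manual fan command for one GPU/fan pair.
# GPU and fan indices are separate arguments because, in this thread,
# the fan index does not always match the GPU index.
fan_cmd() {
    gpu=$1 fan=$2 speed=$3
    # Reject anything that is not a plain non-negative integer.
    case $speed in
        ''|*[!0-9]*) echo "speed must be an integer 0-100" >&2; return 1 ;;
    esac
    [ "$speed" -le 100 ] || { echo "speed must be an integer 0-100" >&2; return 1; }
    printf "nvidia-settings -a '[gpu:%s]/GPUFanControlState=1' -a '[fan:%s]/GPUTargetFanSpeed=%s'\n" \
        "$gpu" "$fan" "$speed"
}

# Example: print the command for gpu:0 with fan:0 at 40%.
# fan_cmd 0 0 40
```

Printing instead of executing makes it easy to try different gpu/fan pairings one at a time and note which physical fan actually responds.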

Further adding to my confusion:

$ nvidia-settings -q fans --verbose

3 Fans on XXXX:1

    [0] XXXX:1[fan:0] (Fan 0)

      Has the following name:
        FAN-0

      Is connected to the following GPU:
        XXXX:1[gpu:1] (NVIDIA RTX A6000)

    [1] XXXX:1[fan:1] (Fan 1)

      Has the following name:
        FAN-1

      Is not connected to any GPU.

    [2] XXXX:1[fan:2] (Fan 2)

      Has the following name:
        FAN-2

      Is connected to the following GPU:
        XXXX:1[gpu:0] (NVIDIA GeForce RTX 3060)
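
To compare this listing between driver versions more easily, the verbose output can be boiled down to one line per fan. This is a rough sketch that assumes the output keeps the layout shown above; fan_map is a made-up name, and it only reports what nvidia-settings claims, which may not match the physical wiring.

```shell
# fan_map is a hypothetical helper: it reads the verbose fan listing on
# stdin and prints one "fan:N -> gpu:M (name)" line per fan.
fan_map() {
    awk '
        # A fan header looks like: [0] XXXX:1[fan:0] (Fan 0)
        match($0, /\[fan:[0-9]+\]/) {
            fan = substr($0, RSTART + 1, RLENGTH - 2)
            next
        }
        # The line under "Is connected to the following GPU:" names the GPU.
        fan != "" && match($0, /\[gpu:[0-9]+\]/) {
            gpu = substr($0, RSTART + 1, RLENGTH - 2)
            name = $0
            sub(/^[^(]*\(/, "", name)   # drop everything up to the first "("
            sub(/\)[^)]*$/, "", name)   # drop the closing ")" and anything after
            print fan " -> " gpu " (" name ")"
            fan = ""
            next
        }
        fan != "" && /Is not connected to any GPU/ {
            print fan " -> none"
            fan = ""
        }
    '
}

# Usage on a live system:
# nvidia-settings -q fans --verbose | fan_map
```

Saving the summary once per driver version and diffing the files shows at a glance whether the enumeration changed.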

Note: the 3060 does have 2 fans, but both seem to run (or not run) at the same speed, which is controlled by manually setting the fan:0 target speed. The A6000 has only 1 fan, which is controlled by manually setting fan:2. Setting fan:1 doesn’t appear to do anything.

Re: spooky thing from above: While writing this post, I have continued exploring what’s going on. It was true at the time I wrote it that fan:1 set the A6000’s fan speed, but that is no longer the case. Maybe on the first go something in the nvidia-settings GUI was setting fan:2, and it was just a fluke that it looked like fan:1 controlled it when I ran the first command to enable manual control. Either way, fan:0 belongs to the 3060 while nvidia-settings thinks it belongs to the A6000, and fan:2 belongs to the A6000 while nvidia-settings thinks it belongs to the 3060.

So it would seem that nvidia-settings is very confused about which fan(s) are connected to which GPU. I am currently planning to remove and reinstall the drivers in the blind hope that this solves my problem, but if anyone out there has other options, I’d like to hear them. (It would make my life exponentially easier if there were some kind of config file I could edit or a utility I could run that would correctly enumerate the gpus and their fans.)

Any help understanding and correcting this issue would be greatly appreciated! If there’s any additional information I can provide that would help, please ask.

Did you already check if this is a regression, e.g. by installing the 470 driver?

I haven’t gotten around to reinstalling the driver yet. That’ll be my project for the weekend. I will update with the results once I’ve done it.

My plan is to first reinstall 515.65 and if that doesn’t fix it, I will try again with the latest 470.

So there has been no change after reinstalling drivers, with either 515.65 or 470.141.03.

I’m not sure what else to try at this point.

Some new details:

I’m still running with the 470 driver and have been poking around a bit. While automatic fan speed control still leaves the 3060 at 0, I have noticed a few other things that look better than where I started from. Notably: nvidia-settings now correctly identifies which gpu each fan belongs to:

$ nvidia-settings -q fans --verbose

3 Fans on XXXX:1

    [0] XXXX:1[fan:0] (Fan 0)

      Has the following name:
        FAN-0

      Is connected to the following GPU:
        XXXX:1[gpu:0] (NVIDIA GeForce RTX 3060)

    [1] XXXX:1[fan:1] (Fan 1)

      Has the following name:
        FAN-1

      Is connected to the following GPU:
        XXXX:1[gpu:0] (NVIDIA GeForce RTX 3060)

    [2] XXXX:1[fan:2] (Fan 2)

      Has the following name:
        FAN-2

      Is connected to the following GPU:
        XXXX:1[gpu:1] (NVIDIA RTX A6000)

I hadn’t thought to check this after reinstalling 515.65, so I don’t know if this was also corrected there or not. (All I know is my gut says maybe.)

However, with GPUFanControlState=0 the 3060 still drops to 0%, which is my primary concern. I don’t believe I ever observed it doing this previously when it was the only GPU in the system, so I’m doubtful that this is just a power-saving idle state. I also expect that if that were the case I would have seen it kick on at some point by now.

Back on the 515.65 drivers, the fans are not enumerated correctly:

$ nvidia-settings -q fans --verbose

3 Fans on XXXX:1

    [0] XXXX:1[fan:0] (Fan 0)

      Has the following name:
        FAN-0

      Is connected to the following GPU:
        XXXX:1[gpu:1] (NVIDIA RTX A6000)

    [1] XXXX:1[fan:1] (Fan 1)

      Has the following name:
        FAN-1

      Is not connected to any GPU.

    [2] XXXX:1[fan:2] (Fan 2)

      Has the following name:
        FAN-2

      Is connected to the following GPU:
        XXXX:1[gpu:0] (NVIDIA GeForce RTX 3060)

So that’s definitely a problem in the driver. However, I can still manually set the fan speeds using their correct IDs.

After putting some load on the GPUs and letting them get over 60 C, despite how uneasy that makes me feel, it does in fact appear that the 3060’s fan idles at 0% while the GPU is under 60 C, and both it and the A6000 automatically spin up their fans under load.

I’m unhappy about this, as I would very much prefer to keep the idle temp under 40 C (which is very possible with even 30% fan), and it bothers me that there doesn’t appear to be any setting to control minimum/idle fan speed, but I guess I can live with it.
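
One possible workaround, since manual control does work here, is a small script that enforces its own idle floor. The sketch below is untested on this setup. The thresholds (30% floor, ramp from 40 C to 80 C) are numbers I picked for illustration, not NVIDIA defaults, and target_speed is a made-up function name. It also assumes the GPU index used by nvidia-smi matches the gpu index in nvidia-settings, which may not hold on every system.

```shell
# target_speed is a hypothetical helper: it maps a temperature in C to a
# fan percentage that never drops below a floor.
# All thresholds are illustrative assumptions, not NVIDIA defaults.
target_speed() {
    temp=$1
    min=${2:-30}      # assumed idle floor in percent
    if [ "$temp" -lt 40 ]; then
        speed=$min
    elif [ "$temp" -ge 80 ]; then
        speed=100
    else
        # Linear ramp from the floor at 40 C up to 100% at 80 C.
        speed=$(( min + (temp - 40) * (100 - min) / 40 ))
    fi
    echo "$speed"
}

# Possible polling loop (left commented out; adjust indices to your system):
# while true; do
#     t=$(nvidia-smi -i 0 --query-gpu=temperature.gpu --format=csv,noheader,nounits)
#     nvidia-settings -a '[gpu:0]/GPUFanControlState=1' \
#                     -a "[fan:0]/GPUTargetFanSpeed=$(target_speed "$t")"
#     sleep 5
# done
```

The catch is that this takes the fan out of automatic control entirely, so if the script dies the fan stays at the last speed it was given until GPUFanControlState is set back to 0.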

To NVIDIA devs: letting gpu fans idle at 0% is at the very least unnerving, as at a glance it’s indistinguishable from the fan not working at all (especially when another gpu in the same system idles at 30%). Idling at 30% would at least provide immediate assurance that the fan does in fact work.