RTX 4090 Fan state says "ERR!", performance is throttled

Description

I recently upgraded to an RTX 4090, and it has been getting into a bad state characterized by the following:

  • The “Fan” state in nvidia-smi reads “ERR!”
  • The performance state is stuck at P0.
  • The GPU utilization in nvidia-smi stays at 0%.
  • It is still possible to run programs on the GPU, but performance seems to be severely throttled (~10x slowdown).
  • The GPU fans never turn on (based on visual observation).

The GPU often enters this bad state on boot (roughly 60% of the time). A couple of times, the GPU has booted in a healthy state and later entered this bad state. The only way that I have found to recover from the bad state is to reboot.
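For anyone who wants to catch this state automatically instead of eyeballing nvidia-smi, here is a rough Python sketch. It polls the fan, performance state, and utilization fields through nvidia-smi's CSV query mode and flags any GPU whose fan reading is not a number. The helper names and the sample error string are my own assumptions; I have not confirmed exactly what text the query mode prints when the table view shows "ERR!", and a passively cooled card may legitimately report a non-numeric fan value, so treat the check as a heuristic.

```python
import subprocess

# Fields requested from nvidia-smi's CSV query mode.
QUERY = "index,fan.speed,pstate,utilization.gpu"


def parse_line(line):
    """Split one CSV row from nvidia-smi into a dict of raw strings."""
    index, fan, pstate, util = [field.strip() for field in line.split(",")]
    return {"index": index, "fan": fan, "pstate": pstate, "util": util}


def looks_bad(row):
    """Heuristic (assumption): a non-numeric fan reading means an error."""
    return not row["fan"].replace("%", "").strip().isdigit()


def query_gpus():
    """Run nvidia-smi once and return one parsed row per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        for row in query_gpus():
            status = "SUSPECT" if looks_bad(row) else "ok"
            print(f"GPU {row['index']}: fan={row['fan']} "
                  f"pstate={row['pstate']} util={row['util']} -> {status}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query nvidia-smi: {exc}")
```

Running this from a boot-time service or a cron job and logging the output would give a timestamp for when the card drops into the bad state, which may help correlate it with other system events.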

Environment

  • Ubuntu 22.04
  • Driver version 535.86.10

Things I have tried so far (all exhibited the same symptoms):

  • Ubuntu 20.04
  • Downgrading to driver version 525.*
  • Re-seating the card.
  • Tested both the Quiet and Normal VBIOS modes offered by my Gigabyte RTX 4090.
  • Ran gpu-burn (wilicc/gpu-burn on GitHub, a multi-GPU CUDA stress test) for about 2 minutes while the card was in a good state. This pushed the power draw above 450 W and the fans to ~70%, but it did not induce the bad state.

Running nvidia-smi -q reveals additional errors when the GPU/driver is in the bad state.

Bad:

Healthy:

nvidia-bug-report.log.gz (417.9 KB)
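The full nvidia-smi -q output is long, so one way to compare the two states is to save a dump from each (for example with nvidia-smi -q > healthy.txt and nvidia-smi -q > bad.txt, both file names hypothetical) and diff them. Below is a minimal Python sketch using the standard difflib module. The function name is an assumption of mine, and the sample lines in the usage note are made up for illustration, not taken from my actual output.

```python
import difflib


def changed_lines(healthy, bad):
    """Return only the removed (-) and added (+) lines between two dumps."""
    diff = list(difflib.unified_diff(
        healthy.splitlines(), bad.splitlines(),
        fromfile="healthy", tofile="bad", lineterm="", n=0,
    ))
    # Skip the two file-header lines, then drop the "@@" hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

For example, changed_lines("Fan Speed : 30 %", "Fan Speed : Unknown Error") would return the removed healthy line followed by the added bad line. Expect the timestamp line to show up in every real diff, since it changes between runs.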

I am seeing the exact same error on an RTX 3090 and am following for updates. The only difference is that my GPU gets stuck in the P5 state, unlike your P0. I am running two RTX 3090s.

To follow up, I solved this by upgrading my motherboard (ASUS PRIME X470-PRO) BIOS to the latest version.