Description
I recently upgraded to an RTX 4090, and it has been getting into a bad state characterized by the following:
- The “Fan” state in nvidia-smi reads “ERR!”
- The performance state is stuck at P0.
- The GPU utilization in nvidia-smi stays at 0%.
- It is still possible to run programs on the GPU, but performance seems to be severely throttled (~10x slowdown).
- The GPU fans never turn on (based on visual observation).
The GPU often enters this bad state on boot (roughly 60% of the time). A couple of times, the GPU has booted in a healthy state and later entered this bad state. The only way that I have found to recover from the bad state is to reboot.
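Since the bad state shows up in the nvidia-smi fields listed above, it can in principle be detected from a script after boot. The following is a minimal sketch, not something I have verified against the failing card: `fan.speed`, `pstate`, and `utilization.gpu` are standard `--query-gpu` fields, but the exact text the driver prints for a failed fan reading in CSV mode is an assumption, so the check simply treats anything that is not a percentage as suspect. The function names are made up for illustration.

```python
import subprocess

# Standard nvidia-smi --query-gpu field names.
QUERY_FIELDS = "fan.speed,pstate,utilization.gpu"


def parse_gpu_line(line):
    """Split one CSV row of nvidia-smi output into a dict of raw strings."""
    fan, pstate, util = [field.strip() for field in line.split(",")]
    return {"fan": fan, "pstate": pstate, "util": util}


def fan_reading_ok(fan):
    """Return True if the fan field looks like a percentage such as '30 %'.

    Assumption: a failed reading is reported as some non-numeric string
    (for example an error marker). Cards without a fan sensor would also
    be flagged by this check.
    """
    return fan.rstrip(" %").isdigit()


def query_gpus():
    """Run nvidia-smi once and return one parsed dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling `query_gpus()` shortly after boot and passing each entry's `"fan"` value to `fan_reading_ok` would flag the bad state without having to watch the fans.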
Environment
- Ubuntu 22.04
- Driver version 535.86.10
Things I have tried so far (all exhibited the same symptoms):
- Ubuntu 20.04
- Downgrading to driver version 525.*
- Re-seating the card.
- Tested both the Quiet and Normal VBIOS modes offered by my Gigabyte RTX 4090.
- Ran gpu-burn (wilicc/gpu-burn on GitHub, a multi-GPU CUDA stress test) for 2 minutes while the card was in a good state. This pushed the power draw to 450+ W and the fans to ~70%, but did not induce the bad state.
Running nvidia-smi -q reveals additional errors when the GPU/driver is in the bad state.
Bad:
Healthy:
nvidia-bug-report.log.gz (417.9 KB)
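For anyone who wants to compare the two nvidia-smi -q captures line by line, here is a small sketch using Python's standard `difflib`. It assumes both outputs have been saved as text; the function name is hypothetical. Fields that change on every run, such as the timestamp, will show up in the diff alongside the real errors.

```python
import difflib


def diff_reports(healthy_text, bad_text):
    """Return unified-diff lines between two saved `nvidia-smi -q` outputs."""
    return list(
        difflib.unified_diff(
            healthy_text.splitlines(),
            bad_text.splitlines(),
            fromfile="healthy",
            tofile="bad",
            lineterm="",
            n=0,  # no context lines, only the fields that differ
        )
    )
```

Lines prefixed with `-` come from the healthy capture and lines prefixed with `+` from the bad one.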