My server currently has two NVIDIA A30 GPUs. I’ve updated to the latest driver and CUDA version, but when I start the machine, both GPUs initially show a temperature of around 50°C. However, within a few minutes, the temperature of one GPU gradually climbs to 89°C, with its power draw around 100 W even though it is idle.
I’ve tried swapping slots, changing PCIe cables, updating the driver, and more, but the issue remains unresolved. It seems like a hardware-related problem.
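To capture the ramp and compare runs (for example across driver versions), a minimal logging sketch like the following can help. It assumes nvidia-smi is on PATH and Python 3 is available; the output file name gpu_idle_log.csv is just an example, and the query field names are standard but worth double-checking with `nvidia-smi --help-query-gpu` for your driver.

```python
# Rough sketch: log idle temperature, power, utilization and SM clock per GPU.
# Assumes nvidia-smi is on PATH; field names can be verified with
# `nvidia-smi --help-query-gpu`.
import csv
import subprocess
import time

FIELDS = "index,temperature.gpu,power.draw,utilization.gpu,clocks.sm"

def snapshot():
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # One line per GPU, e.g. "0, 89, 102.3, 0, 1440"
    return out.splitlines()

if __name__ == "__main__":
    with open("gpu_idle_log.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time"] + FIELDS.split(","))
        for _ in range(360):  # roughly one hour at 10 s intervals
            ts = time.strftime("%H:%M:%S")
            for row in snapshot():
                writer.writerow([ts] + [v.strip() for v in row.split(",")])
            f.flush()
            time.sleep(10)
```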
Has this situation only occurred after a change of driver, or has the hardware configuration changed?
If the latter, be aware that these cards have no fans and are designed to be installed in enclosures that provide adequate airflow from external fans. That said, it does seem strange for card 0 to be drawing 102 W at 0% load, unless it is damaged.
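One quick way to narrow down driver vs. hardware is to check whether the hot card is stuck in a high performance state (P0 with high clocks) instead of dropping to its idle state when nothing is running. A rough sketch, assuming nvidia-smi is on PATH; pstate, clocks.sm and clocks.mem are standard query fields:

```python
# Sketch: report the performance state and clocks of each GPU at idle.
# A card sitting at P0 with high SM clocks and 0% utilization suggests a
# driver/software cause rather than physical damage.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,pstate,clocks.sm,clocks.mem,power.draw,utilization.gpu",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, pstate, sm, mem, power, util = [v.strip() for v in line.split(",")]
    print(f"GPU {idx}: pstate={pstate}, sm_clock={sm}, mem_clock={mem}, "
          f"power={power}, util={util}")
```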
Thank you for your reply.
We recently started using both GPUs; previously, we were only using one GPU. When I noticed abnormal temperatures on GPU0, I tried the above methods to fix it.
The A30 is passively cooled, so if its temperature rises, the card itself must be drawing more power, even though no processes are running. I also believe this is a hardware issue.
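For completeness, this is roughly how the "no processes running" claim can be double-checked while the card heats up. It assumes nvidia-smi is on PATH; the compute-app field names can be verified with `nvidia-smi --help-query-compute-apps`.

```python
# Sketch: confirm nothing is actually running on either GPU while it heats up.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=gpu_uuid,pid,process_name",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print(out if out else "No compute processes found on any GPU.")
```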
You are using driver 570. I reported a small power-consumption increase on an unused GPU here: Increased idle consumption with driver 570. I’m not sure it is related, but it’s interesting.
Have you tried downgrading to driver 565 to see if the problem still happens?