NVIDIA-SMI Shows ERR! on both Fan and Power Usage

Please enable nvidia-persistenced to start on boot and check whether that resolves the issue.
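On a systemd-based distribution that would typically be something like the following sketch (the exact unit name can vary with how the driver was packaged):

$ # Enable the persistence daemon now and at every boot
$ sudo systemctl enable --now nvidia-persistenced
$ # Confirm it is active
$ systemctl status nvidia-persistenced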

nvidia-persistenced is already running.

You’ll have to take into account that most 2080 Ti designs use axial fans, as opposed to most Titan X cards, which use radial (blower) fans; this has a great impact on airflow.

I should have mentioned that we use ZOTAC blower-type cards.
We also have another virtually identical node, running the same system image and the same ZOTAC cards, where we do not see this behavior. The only difference we could spot was the motherboard BIOS: 3.0b on the node behaving well, and 3.1 on the node behaving badly.

Hi everyone,
I use a GTX 1080 with driver version 415.27 and CUDA 9, and I have faced the same problem these days. When I train a net for a few hours (the temperature is about 83C), the error occurs and the system becomes so slow that I have to restart it.
Today I found that the main cause of the problem might be overheating: when I trained with the side panel of the computer case open, the temperature dropped to 77C and I completed about 16 hours of training without errors. (PS: I also used nvidia-settings to set the fan speed to 90% by hand, roughly as sketched below.)
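For reference, setting the fan speed by hand with nvidia-settings looks roughly like this (a sketch; it needs an X server with the Coolbits option enabled, and the attribute names can differ between driver versions):

$ # Take manual control of the fan on GPU 0 and pin it to 90%
$ nvidia-settings -a '[gpu:0]/GPUFanControlState=1' -a '[fan:0]/GPUTargetFanSpeed=90'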

We found out that the cards we had purchased from ZOTAC had an extra back plate which was increasing the width of the card just enough to prevent any air circulation between the stacked GPUs.
Removing this plate greatly improves the situation, although we still notice some mild throttling (SW Thermal Slowdown) because the fan speed is limited by default to around 53%.
By additionally using a fan-control script that works on headless nodes (we adapted https://github.com/boris-dimitrov/set_gpu_fans_public), we can keep the GPU temperature under 70C and no longer see any throttling.
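For anyone trying the same thing: the fan-speed attributes are normally only writable when Coolbits is enabled in the X configuration, which on a headless node can be generated roughly like this (a sketch assuming the stock nvidia-xconfig tool; option availability may differ per driver version):

$ # Write an xorg.conf that enables fan control (Coolbits) on every GPU, even without a monitor attached
$ sudo nvidia-xconfig --enable-all-gpus --cool-bits=4 --allow-empty-initial-configuration

An X server still has to be running afterwards for nvidia-settings to apply fan-speed changes.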
That being said, our node still crashes without any obvious error for a reason that we are still investigating, hopefully a bad card.

I had a very similar issue, where GPU FAN and Pwr:Usage/Cap showed ERR!
I could see that the GPUs were running pretty hot; the temperatures were 50C and 58C for GPU 0 and GPU 1, respectively.
None of the methods helped me resolve it.
Instead I placed a small table fan blowing air onto the GPUs and did a reboot after a few minutes. Not only did the GPUs cool down (the temperatures came back to 30C and 32C), but the ERR was also gone and I was able to successfully use both GPUs again.

I hope this helps someone!

Had this issue as well without the temperature going very high. The card affected was one of my four GTX 1080 Ti cards, using CUDA 10. The solution was to reduce the max power from 280 W to 260 W with the following command:

sudo nvidia-smi -i 0 -pl 260
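If you want to check the allowed range before lowering the cap, something like this should print the default, minimum and maximum enforceable power limits (a sketch; the output layout varies by driver version):

$ # Show the power readings and limits for GPU 0
$ nvidia-smi -q -d POWER -i 0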

I have the same problem after running a deep learning application in Jupyter Notebook under Miniconda 4.8.3.

The GPU ran at a hot 80C for a dozen minutes or so while running a deep learning sample project, and then it showed the ERR.

It is probably caused by zombie GPU memory: TensorFlow takes over the GPU memory and nvidia_uvm does not release it after the deep learning application ends. Or, in the worst case, it could be a GPU fan hardware problem.

For instance, the GPU memory usage is sometimes still shown as 2100MiB / 7921MiB after completing the AlexNet model and closing Jupyter Notebook and the Ubuntu terminal. When I keep entering the $ nvidia-smi command, the process listing does not change; the processes appear dead or stuck.
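When that happens, it may be worth checking which processes (if any) still have the NVIDIA device nodes open before resorting to a reboot; a sketch (fuser comes from the psmisc package, and <PID> stands for whatever it reports):

$ # Show processes that still hold /dev/nvidia* open
$ sudo fuser -v /dev/nvidia*
$ # If a stale process is listed, killing it may free the memory without a reboot
$ sudo kill -9 <PID>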

After rebooting the system with $ sudo reboot, ERR! appears in the GPU Fan column.

I tried the following combinations of commands. They sometimes worked and sometimes did not.

1. Clear the system memory.

$ sudo reboot

2. Remove nvidia_uvm and reboot again

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
$ sudo reboot
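Note that rmmod fails if the module is still in use, so it can help to check the reference counts first; a sketch:

$ # Show the loaded NVIDIA modules and their use counts (nvidia_uvm must show 0 before rmmod can succeed)
$ lsmod | grep nvidia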

I have already configured nvidia-persistenced.service. The GPU is newly purchased. I do not know whether this is a CUDA driver (software) problem or a fan hardware problem. I have never run into this problem with my other Nvidia RTX GPUs.

Please advise how to solve the problem permanently.

Error Messages:

$ nvidia-smi
Tue Aug 4 14:11:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070     On  | 00000000:01:00.0  On |                  N/A |
|ERR!   34C    P8   ERR! / 215W |    247MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1138      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      1219      G   /usr/bin/gnome-shell               49MiB |
|    0   N/A  N/A      1509      G   /usr/lib/xorg/Xorg                107MiB |
|    0   N/A  N/A      1663      G   /usr/bin/gnome-shell               69MiB |
+-----------------------------------------------------------------------------+

I met the issue of Fan ERR! & Pwr:Usage/Cap ERR! while training a small project for 20 minutes, with the temperature reaching 80C. With regard to the ERR, my solutions are listed as follows.

1. Solution to the ERR when no DNN training is running.

There is no GPU usage by any DNN at all, so I remove and reload nvidia_uvm with the following commands.

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
$ sudo reboot

2. Solution to the ERR after completing DNN training under Jupyter Notebook

After completing DNN training, Jupyter still shows persisted GPU memory usage such as 2100MiB, and then the GPU Fan & Pwr:Usage/Cap ERR appears. I use the following steps to clear the ERR.

1). Clear the GPU memory and remove nvidia_uvm

(1). Clear the GPU memory.

$ sudo reboot

(2). Remove nvidia_uvm

$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
$ sudo reboot

Afterwards, the system temporarily returns to normal status.

2). Then insert the following code into the last cell of the Jupyter Notebook:

from numba import cuda

# Release the CUDA context held by the notebook kernel so its GPU memory is freed
cuda.select_device(0)
cuda.close()

After adopting the methods above, the ERR is removed and the Ubuntu system keeps working normally for a longer time.

However, my issue is that the ERR still appears when the system starts (when coming back to work in the morning, or sometimes after restarting the system).

I am still figuring out a way to permanently remove the ERR. If someone has a better solution, it is welcome.

Notes:

1. The old Nvidia driver (below 450.57) limits the fan speed

The old Nvidia driver limits the fan speed, so the GPU runs at a high temperature with a much lower fan speed. The newer driver 450.57 (or probably 450.56) has the MIG functionality and allows the GPU fan to spin much faster, so the working temperature of the GPU drops sharply. (A quick way to check this is sketched after these notes.)

2. gpu_burn and the power cap have their own limitations

While the GPU is back in normal status, gpu_burn does not help, and adjusting the power cap is not helpful once nvidia-persistenced.service is set up.
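To check whether the fan really spins faster and the temperature really drops on the newer driver, the values can be polled with something like this (a sketch; the query fields are standard nvidia-smi options):

$ # Print fan speed, temperature and driver version every 5 seconds
$ nvidia-smi --query-gpu=fan.speed,temperature.gpu,driver_version --format=csv -l 5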

I have the issue without any high temperatures or extensive use of the card. The issue also comes with poor graphics performance. Simply rebooting “solves” the problem, but it would be really nice to actually solve it. The current output of nvidia-smi that I get is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:08:00.0  On |                  N/A |
|ERR!   63C    P5   ERR! / 170W |    908MiB /  5931MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Similar issue.
Four RTX 3090 GPUs working together on CentOS 7.6
CUDA 11.0 (V11.0.194)
Driver version 450.51.05
(1) Run the program with GPUs 0 (top), 2 and 3 (bottom): these GPUs work well (even for 30 minutes).
(2) Run the same program with GPU 1: the fan ERR shows up (within 5 minutes).
(3) Run the same program with GPUs 0, 1, 2, 3 simultaneously: GPUs 0, 2 and 3 still work well, and GPU 1's fan ERR shows up again (within 5 minutes).
The temperature of the faulty GPU 1 is not too high, even lower than GPUs 0, 2 and 3.
The highest temperature and fan speed are on GPU 3 (78% fan, 72C), but it still works well.

Rebooting seems to reset the GPU status and makes the poor GPU 1 work well for a little while, but the fan ERR! shows up again later while running.
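If a full reboot is not always possible, a per-GPU reset might be worth trying instead (a sketch; nvidia-smi --gpu-reset must run as root, the GPU has to be idle with no processes attached, and it is not supported on every board or driver):

$ # Reset only the misbehaving GPU (index 1)
$ sudo nvidia-smi --gpu-reset -i 1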


I am facing the same problem frequently.
Can anyone suggest a solution?
Thanks in advance.