NVIDIA-SMI Shows ERR! on both Fan and Power Usage

Hi, I installed a 2080 Ti and have been running several DL jobs on it. After 20 minutes or so it always freezes the system, and nvidia-smi shows ERR! for both Fan and Power Usage. I'm on driver version 415.18 with CUDA 9.2. Any idea what's going on?

My training first shows:
RuntimeError: cuda runtime error (73) : an illegal instruction was encountered

Then nvidia-smi becomes:
Sat Dec 8 17:19:57 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18       Driver Version: 415.18       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:09:00.0  On |                  N/A |
| 23%   34C    P8    10W / 250W |    142MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   38C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:42:00.0 Off |                  N/A |
| 23%   37C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:43:00.0 Off |                  N/A |
|ERR!   52C    P2   ERR! / 300W |      1MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1526      G   /usr/lib/xorg/Xorg                            57MiB |
|    0      1571      G   /usr/bin/gnome-shell                          82MiB |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log.gz (1.77 MB)

CUDA 9.2 and an RTX GPU don't go well together. Try CUDA 10, if possible.

But it seems like nvidia-smi reports CUDA version 10.0. Do you think CUDA 10 is still the problem?

Switching to CUDA 10 still doesn't fix it.

nvidia-smi only displays the CUDA driver version, i.e. the maximum supported CUDA version, not the installed CUDA version.
Since it also fails with CUDA 10, maybe check the card for a hardware failure using cuda-memtest.
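To make the distinction concrete, here is a small sketch of where each number comes from. The header parsing below uses a sample line copied from the output above; the installed toolkit version comes from nvcc instead (assuming the toolkit is on your PATH):

```shell
# The "CUDA Version" in the nvidia-smi banner is the maximum CUDA version
# the *driver* supports. Extract it from a captured header line
# (sample line copied from the output above):
header='| NVIDIA-SMI 415.18       Driver Version: 415.18       CUDA Version: 10.0     |'
echo "$header" | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p'   # prints 10.0

# The *installed* toolkit version is reported by nvcc instead:
#   nvcc --version
```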

@gy46 did you figure out what the problem was? I'm having the exact same issue. The sequence of nvidia-smi outputs is:

Sun Jan  6 17:15:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 62%   78C    P2   256W / 260W |   9995MiB / 10989MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14277      C   ...se/build/examples/openpose/openpose.bin  9985MiB |
+-----------------------------------------------------------------------------+

Sun Jan  6 17:16:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 52%   63C    P8    35W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Sun Jan  6 17:19:18 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
|ERR!   55C    P0   ERR! / 260W |     23MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3518      C   -                                             13MiB |
+-----------------------------------------------------------------------------+

Same problem here; both DL training and gaming can trigger it.
By the way, my GPU memory is Micron, and I'm using a two-PSU system.

I also have this problem. Mine is a Supermicro server with 8 GTX 1080 Ti GPUs running driver 410.78.

Please advise what I should do to fix this. The GPUs report ERR! for both Fan and power usage. It happens intermittently, and the only fix so far is to reboot the machine, but we have many ongoing jobs on the server, which makes us reluctant to reboot. The problem causes extreme latency and lag for all running processes.

See the logs below; the fan for GPU 2 has died.

Sat Jan 26 00:01:25 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:1A:00.0 Off |                  N/A |
| 34%   57C    P2   168W / 250W |  11097MiB / 11178MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:1B:00.0 Off |                  N/A |
| 38%   61C    P2   213W / 250W |  10989MiB / 11178MiB |     80%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  On   | 00000000:3D:00.0 Off |                  N/A |
|ERR!   40C    P2   ERR! / 250W |   1748MiB / 11178MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:3E:00.0 Off |                  N/A |
| 38%   55C    P2   170W / 250W |   7729MiB / 11178MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  On   | 00000000:88:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |  11097MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  On   | 00000000:89:00.0 Off |                  N/A |
| 24%   42C    P2    77W / 250W |  10631MiB / 11178MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  On   | 00000000:B1:00.0 Off |                  N/A |
| 32%   48C    P2   157W / 250W |  10987MiB / 11178MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  On   | 00000000:B2:00.0 Off |                  N/A |
| 28%   48C    P2   134W / 250W |  10839MiB / 11178MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     90049      C   ...e/thomas/anaconda3/envs/py36/bin/python 11085MiB |
|    1     15156      C   /usr/bin/python                            10975MiB |
|    2     19494      C   python                                       457MiB |
|    2     27366      C   python                                       383MiB |
|    2     63450      C   python                                       449MiB |
|    2     89990      C   python                                       457MiB |
|    3     68712      C   python                                      7719MiB |
|    4     45411      C   ...e/thomas/anaconda3/envs/py36/bin/python 11085MiB |
|    5     77979      C   python                                     10621MiB |
|    6     53147      C   /usr/bin/python                            10973MiB |
|    7     29612      C   /usr/bin/python                             9911MiB |
|    7     45754      C   python                                       461MiB |
|    7     58107      C   python                                       457MiB |
+-----------------------------------------------------------------------------+
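Since rebooting is costly here, one stopgap is to detect the ERR! state as soon as it appears instead of discovering it through lag. A minimal watchdog sketch (hedged: it greps nvidia-smi's human-readable table, which is not a stable interface across driver versions):

```shell
# Sketch of an ERR! watchdog: grep an nvidia-smi snapshot for the failure
# marker. Treat this as a rough monitor, not a robust tool.
check_for_err() {
    # $1 = one full nvidia-smi snapshot as a string
    printf '%s\n' "$1" | grep -c 'ERR!'
}

sample='|ERR!   40C    P2   ERR! / 250W |   1748MiB / 11178MiB |     95%      Default |'
check_for_err "$sample"   # prints 1; a non-zero count means a card is in the ERR! state
# in a cron job you would call: check_for_err "$(nvidia-smi)"
```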

Hi, have you solved this problem? Many thanks!

Hi, I still have that issue. Please help. It happens sometimes.

Hi, how frequently does this happen? For me, it's almost every week.

I spoke with a specialist, and he told me to reseat the GPUs. Since I have 3 cards, he told me to do it for all of them.

I will try tomorrow. In the meantime, if you also try it and see any difference, could you please let me know? Likewise if you find any other solution.

Thanks.

The issue still exists. I tried reseating the card, but it doesn't solve the problem.

Make sure to disable IOMMU in the BIOS. If the issue persists, use gpu-burn to test your hardware and post the results.

This issue is caused by high temperatures.

First, move the affected card to the coolest location in your workstation.

Second, set a power limit [1] and the fan speed [2] to ensure the peak temperature does not exceed 75C.

[1] Cap the power limit somewhere between 150W and 200W:
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 150   # or any value up to 200

[2] https://github.com/boris-dimitrov/set_gpu_fans_public

Using these methods, I have restored two 1080 Ti cards that had the same issue.
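As a sketch of how one might pick the cap relative to the card's TDP (the 250W TDP and 60% fraction below are illustrative assumptions, not values from this thread; check your card's supported range first with `nvidia-smi -q -d POWER`):

```shell
# Derive a power cap as a fraction of TDP. 250W and 60% are assumptions
# for illustration only; adjust for your card.
tdp=250
limit=$(( tdp * 60 / 100 ))
echo "$limit"    # prints 150

# Applying it requires root and a GPU (shown for reference only):
#   sudo nvidia-smi -pm 1          # enable persistence mode
#   sudo nvidia-smi -pl "$limit"   # set the power cap in watts
```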

Hi mb55407, thanks for your message.

In my case the problem is not the card itself but its location. The one in the middle always shows this error. Also, the temperature seems fine.

When you were getting this error, was the temperature higher than 75C?

Yes, after running a task at 90C for 2 days, my second card, which has the highest temperature, got this error message.

I swapped the locations of my second and fourth cards, set up the temperature controls above, and the message has never appeared again.

The same thing happens on my Titan V, but only after ~6 hours of training at ~85C. I'll try power-limiting it to 150W and holding it at ~77C. I'm using driver version 415.27 and CUDA 10, along with a 2080 Ti and Titan RTX in the same system.

That solution didn't work for me; the card ERR'd again.

sudo grep NVRM /var/log/messages

If Xid error 62 appears, your card has a hardware fault.
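To pull the Xid code out of a log line automatically, something like the following works. The sample line is an illustrative stand-in for the general NVRM format, not copied from a real log; on Debian/Ubuntu the messages may live in /var/log/syslog or `journalctl -k` instead of /var/log/messages:

```shell
# Extract the Xid code from an NVRM kernel-log line. The sample line is
# illustrative of the general format, not a real log entry.
line='NVRM: Xid (PCI:0000:43:00): 62, pid=1234'
echo "$line" | sed -n 's/.*Xid ([^)]*): \([0-9]*\).*/\1/p'   # prints 62
```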

I’m wondering if anyone was able to solve this.
We’re getting similar issues on an 8-GPU supermicro system, even when only using 3 GPUs. Issue seems to occur whenever we have two cards next to each other in the slots.
The same machine is super stable with GTX Titan X. It was crashing every few hours when we tried swapping to Titan Xp, and now we get a similar behavior with RTX 2080 Ti. We didn’t try doing a thorough troubleshooting with the Xp, but with the 2080 Ti we are now trying to narrow down the issues.
It does not seem to be the cards per se, as the same card can perform well in one slot and misbehave in another. Typical misbehavior: for one or two of the 3 GPUs, power draw ramps up to near 200W but drops back to 100W within a few seconds, nvidia-smi reports "SW Thermal Slowdown", fan speed climbs to ~70% (sometimes showing Fan ERR!), and GPU temperature rises above 90C. Meanwhile, one of the 3 cards (sometimes 2) behaves perfectly fine: fan speed stays around 40%, temperature around 70C, power draw above 200W.
gpu_burn does not report any issue with the cards.
NVIDIA driver version is 410.79.
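For anyone trying to catch the "SW Thermal Slowdown" state without watching the terminal, the throttle reasons can be grepped out of `nvidia-smi -q -d PERFORMANCE`. A rough sketch (the exact field names can vary across driver versions, so the pattern below is an assumption, not a stable interface):

```shell
# Check whether the driver is reporting an active slowdown in the
# "Clocks Throttle Reasons" section. Parsing this text output is fragile;
# treat it as a rough filter.
throttle_active() {
    printf '%s\n' "$1" | grep -c 'Slowdown.*Active'
}

sample='        SW Thermal Slowdown            : Active'
throttle_active "$sample"   # prints 1 when a slowdown line is active
# live usage: throttle_active "$(nvidia-smi -q -d PERFORMANCE)"
```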