NVIDIA-SMI Shows ERR! on both Fan and Power Usage

gy46 · December 8, 2018, 10:08pm

Hi, I installed a 2080 Ti and run several DL jobs on it. It turns out that after 20 minutes or so it always froze the system and have ERR! shown in both the Fan and PowerUsage from the nvidia-smi. I have driver version 415.18 and running on CUDA 9.2. Any idea what’s going on?

My training first shows:
RuntimeError: cuda runtime error (73) : an illegal instruction was encountered

±----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1526 G /usr/lib/xorg/Xorg 57MiB |
| 0 1571 G /usr/bin/gnome-shell 82MiB |
±----------------------------------------------------------------------------+

nvidia-bug-report.log.gz (1.77 MB)

generix · December 8, 2018, 10:23pm

Cuda 9.2 and an RTX gpu don’t go well together. Try with cuda 10, if possible.

gy46 · December 8, 2018, 10:45pm

But it seems like nvidia-smi gives CUDA verion 10.0. Do you think CUDA10 is still the problem?

gy46 · December 8, 2018, 10:51pm

Changing into CUDA 10 still doesn’t work.

generix · December 8, 2018, 11:14pm

nvidia-smi is just displaying the cuda driver version, i.e the maximum supported cuda version, not the installed cuda version.
Since it also fails with cuda 10, maybe check the card for a hw failure using cuda-memtest.

michaizl0j · January 6, 2019, 10:45pm

@gy46 did you figure out what the problem was? I’m having the same exact issue. The sequence of nvidia-smi outputs are:

Sun Jan  6 17:15:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 62%   78C    P2   256W / 260W |   9995MiB / 10989MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14277      C   ...se/build/examples/openpose/openpose.bin  9985MiB |
+-----------------------------------------------------------------------------+

Sun Jan  6 17:16:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 52%   63C    P8    35W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Sun Jan  6 17:19:18 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
|ERR!   55C    P0   ERR! / 260W |     23MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3518      C   -                                             13MiB |
+-----------------------------------------------------------------------------+

454870810 · January 13, 2019, 11:05am

Same problem, both dl training and playing game can cause this problem.
By the way my gpu mem is Micron and I’m using a 2-psu system.

nxphi47 · January 26, 2019, 6:35am

I also have problem with this. Mine is the supermicro with 8 GTX 1080 Ti GPUs running on driver 410.78

Please advise what I should do to fix this. The Gpus report Fan as ERR! and watt also ERR! It sometimes happens and the only way to fix at the moment is to reboot the machine, but we have a lot of ongoing jobs on the server, which makes us reluctant to reboot the server. This problem cause extreme latency and delay and lags for all processes currently.

See the logs below, the Fan for GPU2 has died.

NaoRobot · February 15, 2019, 3:32pm

Hi, have you solved this problem? Many thanks!

nxphi47 · February 15, 2019, 5:49pm

Hi, I still have that issue. Please help. It happens sometimes.

NaoRobot · February 15, 2019, 6:36pm

Hi how frequent does this happen? For me, it happens almost every week.

I spoke with the specialist. And he told me to reseat the GPUs. Since I have 3 cards, he told me to do it for all of them.

I will try tomorrow. But in the meantime if you also try, and see any difference could you please let me know? or if you find any other solution.

Thanks.

gy46 · February 15, 2019, 6:54pm

The issue still exists. I tried reseating the card, it doesn’t solve the problem.

generix · February 15, 2019, 7:03pm

Make sure to disable iommu in bios. If issue persists, use gpu-burn to test your hardware and post the results.

mb55407 · February 19, 2019, 8:03am

This issue is due to the higher temperature.

First, you should reseat the question card to the coolest location of your workstation.

Second, set the power limitation [1] and fan speed [2] to ensure the peak temperature does not exceed 75C.

[1] Change them to 150W-to-200W
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 150(200)

[2] https://github.com/boris-dimitrov/set_gpu_fans_public

Using these methods, I have restored two 1080ti cards which have the same issues.

NaoRobot · February 19, 2019, 2:46pm

Hi mb55407, thanks for your message.

In my case the problem is not due to the card, it is due to the location. The one in the middle always shows this error. Also the temperature seems to be fine.

When you were getting this error, was the temperature higher than 75C?

mb55407 · February 20, 2019, 2:51am

Yes, after running a task with 90C for 2 days, my second card which has the highest temperature got this error message.

I switch the location of my second and fourth card and sets the above temperature control solutions, and then this message never appears again.

rtang123 · February 20, 2019, 8:36am

The same thing happens on my Titan V, but only after ~6 hours of training at ~85C. I’ll try underwatting it to 150W and holding it at ~77C. I’m using driver version 415.27 and CUDA 10, along with a 2080 Ti and Titan RTX in the same system.

rtang123 · February 20, 2019, 5:31pm

That solution doesn’t work for me – the card ERR’d again.

mb55407 · February 21, 2019, 9:02am

sudo grep NVRM /var/log/messages

If XID Error 62 exists, your card has broken.

EarElephant · April 10, 2019, 1:52am

I’m wondering if anyone was able to solve this.
We’re getting similar issues on an 8-GPU supermicro system, even when only using 3 GPUs. Issue seems to occur whenever we have two cards next to each other in the slots.
The same machine is super stable with GTX Titan X. It was crashing every few hours when we tried swapping to Titan Xp, and now we get a similar behavior with RTX 2080 Ti. We didn’t try doing a thorough troubleshooting with the Xp, but with the 2080 Ti we are now trying to narrow down the issues.
It does not seem to be the cards per se as the same card can perform well in a slot, and misbehave in another slot. Typical misbehavior is: for one or two of the 3 GPUs, GPU Power Draw ramps up to near 200W, but within a few seconds gets back down to 100W, with “SW Thermal Slowdown” reported by nvidia-smi, fan speed getting to ~70% and sometimes giving Fan ERR!, and GPU Temp getting over 90C. Meanwhile, one of the 3 cards (sometimes 2) behaves perfectly fine, fan speed stays around 40%, temp around 70C, power draw over 200W.
gpu_burn does not report any issue with the cards.
NVIDIA driver version is 410.79.

Topic		Replies	Views
2080Ti got ERR soon after starting DL training Linux	11	1745	February 2, 2019
Incorrect power management with PRIME configuration Linux	25	4180	September 6, 2022
nvidia-smi power limit on GTX 1060 Linux	61	53376	February 13, 2018
nvidia 387.12 breaks power reading in nvidia-smi. Linux	27	12512	March 30, 2021
nvidia-smi "No devices were found" error CUDA Setup and Installation	23	61956	February 14, 2021
`nvidia-smi -q` shows several "Unknown Error"; GPU ignored by pytorch Linux cuda , ubuntu	3	1851	September 6, 2023
Painfully long driver initialization with many GPUs -- affects ALL drivers (Nvidia, please do someth... Linux	38	5314	October 12, 2021
why "all CUDA-capable devices are busy or unavailable" ? CUDA Programming and Performance	34	64027	April 20, 2011
Driver issue on Ubuntu 19.10 Linux ubuntu	16	4155	April 5, 2020
NVIDIA driver has a habit of keeping my GPU at the highest performance level Linux	23	5952	July 8, 2022

NVIDIA-SMI Shows ERR! on both Fan and Power Usage

Related topics