nvidia-smi showing ERR! in all fields for one of the GPUs (A40)

One of my GPUs just shows ERR! in all fields of nvidia-smi. I also can't kill the running process, and I can't reset the GPU either. Rebooting fixed it for a while, but on the third run it happened again with the same GPU.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  ERR!                On   | 00000000:01:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |  40949MiB / 46068MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:25:00.0 Off |                    0 |
|  0%   36C    P8    32W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   35C    P8    32W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:61:00.0 Off |                    0 |
|  0%   33C    P8    29W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A40          On   | 00000000:81:00.0 Off |                    0 |
|  0%   34C    P8    30W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A40          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   32C    P8    28W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   33C    P8    30W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A40          On   | 00000000:E1:00.0 Off |                    0 |
|  0%   31C    P8    28W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17960      C                                   40947MiB |
+-----------------------------------------------------------------------------+

nvidia-smi --gpu-reset -i 0
The following GPUs could not be reset:
  GPU 00000000:01:00.0: Unknown Error
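One way to narrow this down (a suggestion, assuming a standard Linux driver setup) is to check the kernel log for NVIDIA Xid error codes, which the driver prints when a GPU faults; for example, Xid 79 ("GPU has fallen off the bus") typically points at a hardware or power problem rather than a software one:

```shell
# Search the kernel log for NVIDIA Xid error codes (the driver logs
# "NVRM: Xid" lines when a GPU faults; e.g. Xid 79 means
# "GPU has fallen off the bus", usually a hardware/power issue).
sudo dmesg -T | grep -i 'NVRM: Xid'
```

The Xid number in any matching line can be looked up in NVIDIA's Xid error documentation to tell driver bugs, application errors, and hardware faults apart.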

Does anyone have an idea what's happening and how to solve it?

It might be overheating due to an airflow problem; please monitor the temperatures. Otherwise, the card might be broken.

I have the small problem that at some point in the test nvidia-smi no longer seems to be callable and just hangs. I also tried logging to a file and querying only the GPU temperature, but the file stays empty.

Do you know another way to log GPU temps? lm-sensors doesn't seem to pick them up. Running Ubuntu 20.04.

The usual command would be
nvidia-smi q -dTEMPERATURE -l2 -f nv-temp.log
which loops every 2 seconds, logging to the file until it is interrupted.

I had to change your command a little because it was invalid:

nvidia-smi --query-gpu=temperature.gpu -l 2 -i 0 --format=csv -f ~/nv-temp.log

Sadly that didn't solve the problem, since it apparently doesn't write to the file continuously but only on termination.
And I can't terminate it cleanly because it hangs, so it never writes to the file.

I tried to solve this with a bash loop

#!/bin/bash
# Append the temperature of GPU $1 to logfile $2 once per second.

while sleep 1
do
	nvidia-smi --query-gpu=temperature.gpu -i "$1" --format=csv >> "$2"
done
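Since nvidia-smi itself can hang once the GPU is in this state, it may also help to wrap each query in coreutils `timeout` so a stuck call never blocks the loop. A sketch, assuming GNU coreutils is installed; the 10-second limit is an arbitrary choice:

```shell
#!/bin/bash
# Append the temperature of GPU $1 to logfile $2 once per second,
# but kill any nvidia-smi call that hangs for more than 10 seconds.

while sleep 1
do
	timeout 10 nvidia-smi --query-gpu=temperature.gpu -i "$1" --format=csv >> "$2" \
		|| echo "[query timed out or failed]" >> "$2"
done
```

That way the log keeps growing even after the card faults, and the timestamps of the first failure lines show roughly when it died.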

This reports that the error occurs after only a few seconds of load. The last temperature it records is 69C; after that it takes a really long time to add lines, and they just report an error.

temperature.gpu
69
temperature.gpu
[Unknown Error]

So I guess the card is broken?

Thanks for all the help!
cheers
Kilian

I’d say yes, it’s broken.