One of my GPUs is showing ERR! in all fields of nvidia-smi. I also can't kill the running process or reset the GPU. Rebooting fixed the issue temporarily, but on the third run it happened again with the same GPU.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  ERR!                On   | 00000000:01:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |  40949MiB / 46068MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:25:00.0 Off |                    0 |
|  0%   36C    P8    32W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   35C    P8    32W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:61:00.0 Off |                    0 |
|  0%   33C    P8    29W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A40          On   | 00000000:81:00.0 Off |                    0 |
|  0%   34C    P8    30W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A40          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   32C    P8    28W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   33C    P8    30W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A40          On   | 00000000:E1:00.0 Off |                    0 |
|  0%   31C    P8    28W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17960      C                                   40947MiB |
+-----------------------------------------------------------------------------+
nvidia-smi --gpu-reset -i 0
The following GPUs could not be reset:
GPU 00000000:01:00.0: Unknown Error
Does anyone have an idea what is happening and how to solve it?