Multiple Servers With A6000 Presenting ERR! for Fan RPM at random

Hi,
We are seeing an issue on multiple systems, each with multiple RTX A6000 (Ampere) GPUs, where the Fan column in nvidia-smi randomly shows ERR!. Once this happens, the GPU is unusable and does not respond to commands. As a workaround, you can reboot the system, after which it may work for anywhere from 10 minutes to 3 days. The problem occurs under full load as well as with no load.
We have updated the kernel, the GPU drivers, and vGPU.

We also built a newer system, several months newer in terms of when the parts were ordered, and so far it has not shown the symptoms.

I do see older posts describing a similar issue. Most of them seem to have gone unresolved, and a couple seem to have been fixed by seemingly random changes.

I would appreciate any feedback and can provide more info as needed.

(Below is an example. The updates mentioned above have since been applied.)
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               On  |   00000000:01:00.0 Off |                    0 |
|ERR!   29C    P8              9W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               On  |   00000000:21:00.0 Off |                    0 |
|ERR!   28C    P8              8W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               On  |   00000000:41:00.0 Off |                    0 |
|ERR!   29C    P8             13W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               On  |   00000000:61:00.0 Off |                    0 |
|ERR!   28C    P8             12W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A6000               On  |   00000000:81:00.0 Off |                    0 |
| 30%   25C    P8             18W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A6000               On  |   00000000:A1:00.0 Off |                    0 |
| 30%   25C    P8             12W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A6000               On  |   00000000:C1:00.0 Off |                    0 |
| 30%   25C    P8              7W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A6000               On  |   00000000:E1:00.0 Off |                    0 |
| 30%   24C    P8              7W /  300W |       2MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
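
In case it helps anyone trying to pin down when a GPU drops out, here is a minimal Python sketch of a logger that could be left running on an affected system. It polls nvidia-smi through its CSV query interface (`--query-gpu=index,pci.bus_id,fan.speed` with `--format=csv,noheader,nounits`) and reports any GPU whose fan field is not a plain number. The function names `find_fan_errors`, `poll_once`, and `watch` are made up for illustration, and I am assuming that a GPU in the ERR! state reports some non-numeric text in that field; I have not confirmed the exact string, so the check deliberately does not depend on it.

```python
import subprocess
import time

# Real nvidia-smi query flags; the choice of fields is just what seemed useful here.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,fan.speed",
    "--format=csv,noheader,nounits",
]


def find_fan_errors(csv_text):
    """Return (index, bus_id, raw_fan_value) for rows whose fan field is not a number.

    Assumption: a healthy GPU reports an integer percentage and a faulty one
    reports some other text. Passively cooled GPUs may also report a
    non-numeric value and would be listed as well.
    """
    bad = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",", 2)]
        if len(fields) != 3:
            continue  # skip anything that is not an "index, bus_id, fan" row
        index, bus_id, fan = fields
        if not fan.isdigit():
            bad.append((index, bus_id, fan))
    return bad


def poll_once(timeout_s=30):
    """Run nvidia-smi once; return the list of suspect GPUs, or None if it hung."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None  # nvidia-smi itself did not answer within the timeout
    return find_fan_errors(result.stdout)


def watch(interval_s=60):
    """Print a timestamped line whenever a GPU looks faulty. Runs until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        bad = poll_once()
        if bad is None:
            print(stamp, "nvidia-smi timed out", flush=True)
        else:
            for index, bus_id, fan in bad:
                print(stamp, f"GPU {index} ({bus_id}) fan={fan!r}", flush=True)
        time.sleep(interval_s)
```

Calling `watch()` from a script and redirecting its output to a file would give a timestamp for the first failure on each GPU, which could then be lined up against the kernel log for the same time window.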
