NVIDIA RTX A5000 shows Fan ERR! in nvidia-smi (NVIDIA-SMI 510.47.03, Driver Version: 510.47.03, CUDA Version: 11.6, Ubuntu 20.04.5)

We have a server with four NVIDIA RTX A5000 GPUs running Ubuntu 20.04.5 LTS with Driver Version: 510.47.03 and CUDA Version: 11.6. The cards are installed in a Supermicro SYS-420GP-TNR chassis.
There are two issues I can't resolve:

  1. The GPU cards are installed in PCIe slots 3, 5, 7, and 9. After I finished installing the NVIDIA driver, I only saw 3 GPU cards.
  2. Whether we run a training process or no workload at all, nvidia-smi shows:
    root@lab:~# nvidia-smi
    Tue Nov 29 17:15:54 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA RTX A5000    On   | 00000000:57:00.0 Off |                  Off |
    |ERR!   29C    P8     4W / 230W |     13MiB / 24564MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA RTX A5000    On   | 00000000:D1:00.0 Off |                  Off |
    | 30%   27C    P8    16W / 230W |     13MiB / 24564MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA RTX A5000    On   | 00000000:D2:00.0 Off |                  Off |
    |ERR!   29C    P8     5W / 230W |     13MiB / 24564MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA RTX A5000    On   | 00000000:D5:00.0 Off |                  Off |
    | 30%   28C    P8    17W / 230W |     13MiB / 24564MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      2098      G   /usr/lib/xorg/Xorg                  4MiB |
    |    0   N/A  N/A      5494      G   /usr/lib/xorg/Xorg                  4MiB |
    |    1   N/A  N/A      2098      G   /usr/lib/xorg/Xorg                  4MiB |
    |    1   N/A  N/A      5494      G   /usr/lib/xorg/Xorg                  4MiB |
    |    2   N/A  N/A      2098      G   /usr/lib/xorg/Xorg                  4MiB |
    |    2   N/A  N/A      5494      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A      2098      G   /usr/lib/xorg/Xorg                  4MiB |
    |    3   N/A  N/A      5494      G   /usr/lib/xorg/Xorg                  4MiB |
    +-----------------------------------------------------------------------------+
After that we can’t use those cards and must reboot the system.
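
To catch both symptoms from a script instead of reading the table by eye, something like the following might help. It is only a rough sketch, not a fix: it parses the CSV form of the nvidia-smi report, compares the GPU count with the number of cards expected (4 on this server), and flags any fan reading that is not a plain percentage. The helper names (`parse_gpu_report`, `find_problems`, `read_driver_report`) are made up for illustration, and I am assuming a failed fan reading shows up as some non-percentage string in the CSV output; I have not confirmed the exact text the driver prints there.

```python
import csv
import io
import re
import subprocess

# Standard nvidia-smi query fields; the output is one CSV line per GPU,
# e.g. "1, 00000000:D1:00.0, 30 %".
QUERY = ["nvidia-smi",
         "--query-gpu=index,pci.bus_id,fan.speed",
         "--format=csv,noheader"]


def read_driver_report():
    """Run nvidia-smi and return its CSV output as text (needs the driver installed)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_gpu_report(csv_text):
    """Turn 'index, bus_id, fan' lines into a list of dicts; malformed lines are skipped."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue
        index, bus_id, fan = (field.strip() for field in row)
        gpus.append({"index": int(index), "bus_id": bus_id, "fan": fan})
    return gpus


def find_problems(gpus, expected_count):
    """Report a GPU count mismatch and any fan reading that is not 'NN %'."""
    problems = []
    if len(gpus) != expected_count:
        problems.append(f"expected {expected_count} GPUs, driver reports {len(gpus)}")
    for gpu in gpus:
        # Assumption: a healthy reading looks like "30 %"; anything else is suspicious.
        if not re.fullmatch(r"\d+\s*%", gpu["fan"]):
            problems.append(
                f"GPU {gpu['index']} ({gpu['bus_id']}): fan reading is {gpu['fan']!r}")
    return problems
```

On the server this would be called as `find_problems(parse_gpu_report(read_driver_report()), expected_count=4)`, for example from a cron job, so the bad state is logged with the bus IDs before the reboot.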