nvidia-smi does not recognize one of two A100 cards

Hi NV Team,

Please advise on how to resolve the following issue.
Issue:
After restarting the server, nvidia-smi suddenly stopped recognizing one of our two A100 cards.
Note:
The lspci command still shows both cards, as below.

nvidia@a100-server-b:~$ lspci | grep NVIDIA
24:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
e1:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
nvidia@a100-server-b:~$ nvidia-smi
Tue Jun 15 01:34:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:24:00.0 Off |                    0 |
| N/A   39C    P0    38W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia@a100-server-b:~$ uname -r
4.15.0-144-generic
nvidia@a100-server-b:~$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
nvidia@a100-server-b:~$  
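
For reference, here is a minimal Python sketch of how the two outputs above can be compared to identify which card the driver is not reporting. The function names (`short_addr`, `lspci_gpus`, `smi_gpus`, `missing_from_smi`) are hypothetical helpers written for this post, not part of any NVIDIA tool. The sample strings are copied from the outputs above, and the sketch assumes all GPUs are in a single PCI domain, so the domain prefix is ignored.

```python
import re

# Sample text copied from the lspci and nvidia-smi outputs in this post.
LSPCI_SAMPLE = """\
24:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
e1:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
"""

SMI_SAMPLE = """\
|   0  NVIDIA A100-PCI...  On   | 00000000:24:00.0 Off |                    0 |
"""

HEX = r"[0-9a-fA-F]"


def short_addr(addr):
    """Reduce a PCI address to lowercase bus:device.function.

    Assumption: one PCI domain, so any leading domain field is dropped.
    """
    return ":".join(addr.lower().split(":")[-2:])


def lspci_gpus(lspci_text):
    """Collect addresses of NVIDIA devices listed in `lspci` output."""
    pattern = re.compile(
        rf"^((?:{HEX}{{4}}:)?{HEX}{{2}}:{HEX}{{2}}\.[0-7])\s+.*NVIDIA"
    )
    found = set()
    for line in lspci_text.splitlines():
        match = pattern.match(line)
        if match:
            found.add(short_addr(match.group(1)))
    return found


def smi_gpus(smi_text):
    """Collect Bus-Id values (e.g. 00000000:24:00.0) from nvidia-smi output."""
    pattern = rf"\b{HEX}{{8}}:{HEX}{{2}}:{HEX}{{2}}\.[0-7]\b"
    return {short_addr(addr) for addr in re.findall(pattern, smi_text)}


def missing_from_smi(lspci_text, smi_text):
    """Return devices visible on the PCI bus but absent from nvidia-smi."""
    return sorted(lspci_gpus(lspci_text) - smi_gpus(smi_text))


print(missing_from_smi(LSPCI_SAMPLE, SMI_SAMPLE))  # prints ['e1:00.0']
```

With the outputs from this server, the card at e1:00.0 is the one that lspci shows but nvidia-smi does not.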

If you need more information, please let me know.
Best regards.
Kaka
