nvidia-bug-report.log.gz (5.1 MB)
When running nvidia-smi, GPU 3 shows ERR.
After rebooting the server, I reinstalled the same driver version, and the server is currently running on it. The latest driver could not be installed because the user did not approve the upgrade.
Please see the attached nvidia-bug-report.log for reference.
gpu 00000000:ca:00.0 (gpu 3)
Product Name : Unknown Error
GPU Model
GPU 0 : NVIDIA A100 80GB PCIe
GPU 1 : NVIDIA A100 80GB PCIe
GPU 2 : NVIDIA A100 80GB PCIe
GPU 3 : NVIDIA A100 80GB PCIe
We are monitoring to see if it recurs after reinstalling the driver.
Please check whether the ERR is caused by a driver issue or a hardware issue.
Lenovo says there is no GPU problem.
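For the monitoring, a rough sketch of what could be run periodically is below. It assumes the failing GPU still shows up in the `nvidia-smi --query-gpu` CSV output with an error string in one of its fields; if nvidia-smi aborts entirely instead, its stderr message is scanned as well, but it may not match the three-column layout. The function names `find_failing_gpus` and `query_gpus` are made up for illustration.

```python
import subprocess

# Strings seen for the failing GPU in this report ("ERR" in the table view,
# "Unknown Error" as the product name). Other failures may print other text.
ERROR_MARKERS = ("ERR", "Unknown Error")


def find_failing_gpus(csv_text):
    """Return (index, bus_id, name) for each CSV row containing an error marker."""
    failing = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank lines and anything not in the 3-column layout
        if any(marker in field for field in fields for marker in ERROR_MARKERS):
            failing.append(tuple(fields))
    return failing


def query_gpus():
    """Run nvidia-smi and return its output (stdout followed by stderr)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is expected when a GPU is in an error state
    )
    return result.stdout + result.stderr
```

Calling `find_failing_gpus(query_gpus())` from a cron job and logging any non-empty result with a timestamp would record when the ERR comes back, which may help correlate it with entries in the kernel log.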
Model : LENOVO SR650 V2
GPU : A100 (80 GB) x 4
OS : RHEL8.3
Driver : 515.65.01 / CUDA : 11.7