nvidia-smi reports "GPU is lost" on Ubuntu 16.04 with AMD Ryzen 7 2700X CPU, X470 motherboard and two GTX 1080 Ti GPUs

Hi, I am getting “Unable to determine the device handle for GPU 0000:0B:00.0: GPU is lost. Reboot the system to recover this GPU” when I run nvidia-smi. It's an Ubuntu 16.04 system with two GeForce GTX 1080 Ti cards and an AMD Ryzen 7 2700X eight-core CPU on an X470 motherboard.
I ran “nvidia-bug-report.sh” and am attaching the generated “nvidia-bug-report.log”. The log shows an Xid 79 error. From searching the forum, this might indicate an overheating or power problem. I ran ‘inxi -b’ and both cards are detected, as follows.
–=====
Graphics: Card-1: NVIDIA GP102 [GeForce GTX 1080 Ti]
Card-2: NVIDIA GP102 [GeForce GTX 1080 Ti]
Display Server: N/A driver: nvidia tty size: 158x47 Advanced Data: N/A out of X
–=====
If I run ‘nvidia-smi -i 0’, the first card's information is shown. Here is that output.
–=====
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:0A:00.0  On |                  N/A |
| 23%   37C    P8     9W / 250W |     36MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1739      G   /usr/lib/xorg/Xorg                            33MiB |
+-----------------------------------------------------------------------------+
-=====
I just downgraded the driver to check if it resolves the issue.

I will have access to the server room where the machine sits (it is not a headless server, though) tomorrow, as office hours are over for today. In the meantime, is there a way to check whether the card is dead? Even by physical examination, is it possible to tell whether the card is dead or damaged due to overheating? I was running some PyTorch code, and the temperature generally does not go beyond 87 °C; it mostly stays in the 84-85 °C range. Is there a way to check what the last recorded temperature was?
Any pointers or help would be appreciated.

Many thanks,
Abir

nvidia-bug-report.log (3.2 MB)

Simply reseating the cards in their slots mitigated the issue.
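
On the earlier question about the last recorded temperature: as far as I know, nvidia-smi only reports current readings, so a history has to be logged separately while the GPU is still reachable. Below is a minimal sketch of such a logger, assuming Python 3.7+ and that nvidia-smi is on the PATH. The function names, the log file name "gpu_temps.log" and the 10-second interval are my own illustrative choices, not anything provided by the driver. When nvidia-smi exits with an error (for example the "GPU is lost" message), the sketch records that message instead of a reading.

```python
import csv
import io
import subprocess
import time
from datetime import datetime

# Fields requested from nvidia-smi; all four are standard --query-gpu fields.
QUERY = "index,pci.bus_id,temperature.gpu,power.draw"


def _to_float(value):
    """Return value as a float, or None for entries such as '[Not Supported]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse the output of
    nvidia-smi --query-gpu=<QUERY> --format=csv,noheader,nounits
    into a list of dicts. Lines without exactly four fields are skipped."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # blank line or an error message, not a reading
        index, bus_id, temp, power = (f.strip() for f in fields)
        if not index.isdigit():
            continue
        rows.append({
            "index": int(index),
            "bus_id": bus_id,
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
        })
    return rows


def log_forever(path="gpu_temps.log", interval_s=10):
    """Append one timestamped line per GPU to `path` every `interval_s` seconds.
    Hypothetical helper: the path and interval are arbitrary defaults."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    with open(path, "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                # Keep whatever nvidia-smi printed so the failure time is on record.
                message = (result.stdout + " " + result.stderr).strip()
                log.write(f"{stamp} ERROR {message}\n")
            for row in parse_gpu_csv(result.stdout):
                log.write(
                    f"{stamp} gpu={row['index']} bus={row['bus_id']} "
                    f"temp_c={row['temp_c']} power_w={row['power_w']}\n"
                )
            log.flush()  # make sure the last reading survives a crash or reboot
            time.sleep(interval_s)
```

Calling log_forever() from a background process before starting a training run would leave the last readings for each card in the log file, which can then be inspected after a reboot.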