Not seeing all the GeForce RTX 2080 Ti GPUs when running nvidia-smi in Ubuntu 18.04 LTS Server

We have about seven SuperMicro servers, each with eight GeForce RTX 2080 Ti GPUs. On some of them, a GPU disappears from nvidia-smi for no apparent reason. Running lshw -c display still lists all eight GPUs, so they are recognized by the OS but not by nvidia-smi (only seven show up below):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:1A:00.0 Off |                  N/A |
| 36%   35C    P8     8W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3D:00.0 Off |                  N/A |
| 31%   33C    P8    19W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:3E:00.0 Off |                  N/A |
| 30%   35C    P8     4W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:88:00.0 Off |                  N/A |
| 31%   32C    P8    11W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  On   | 00000000:89:00.0 Off |                  N/A |
| 31%   33C    P8     2W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  On   | 00000000:B1:00.0 Off |                  N/A |
| 30%   33C    P8    14W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 208...  On   | 00000000:B2:00.0 Off |                  N/A |
| 31%   34C    P8     8W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
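
To find out which physical slot has dropped out, one option is to compare the PCI addresses the OS enumerates against the bus IDs nvidia-smi reports. The following is a minimal Python sketch, not a tested tool: the function names (`normalize`, `gpus_seen_by_pci`, `gpus_seen_by_driver`, `missing_gpus`, `check_host`) are made up for illustration, and it assumes `lspci -D` and `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` are available on the host.

```python
import re
import subprocess

# Matches a PCI address such as 0000:1a:00.0 (lspci) or 00000000:1A:00.0 (nvidia-smi).
PCI_ADDR = re.compile(r"^([0-9a-fA-F]{4,8}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])")


def normalize(addr):
    """Return the address as lowercase dddd:bb:dd.f, or None if it is not a PCI address."""
    m = PCI_ADDR.match(addr.strip())
    if m is None:
        return None
    domain, bus, dev, fn = m.groups()
    return f"{int(domain, 16):04x}:{bus.lower()}:{dev.lower()}.{fn}"


def gpus_seen_by_pci(lspci_text):
    """Collect NVIDIA display functions from `lspci -D` output (skips audio/USB functions)."""
    found = set()
    for line in lspci_text.splitlines():
        is_display = "VGA compatible controller" in line or "3D controller" in line
        if "NVIDIA" in line and is_display:
            addr = normalize(line.split()[0])
            if addr is not None:
                found.add(addr)
    return found


def gpus_seen_by_driver(smi_text):
    """Collect bus IDs from nvidia-smi query output, ignoring any non-address lines."""
    return {a for a in map(normalize, smi_text.splitlines()) if a is not None}


def missing_gpus(lspci_text, smi_text):
    """Addresses present on the PCI bus but absent from nvidia-smi, sorted."""
    return sorted(gpus_seen_by_pci(lspci_text) - gpus_seen_by_driver(smi_text))


def check_host():
    """Run both tools on the local machine (assumes they are installed and on PATH)."""
    lspci = subprocess.run(["lspci", "-D"], capture_output=True, text=True).stdout
    smi = subprocess.run(
        ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout
    return missing_gpus(lspci, smi)
```

Calling `check_host()` on an affected server should return the address of any card that the OS enumerates but the driver failed to bring up, which can then be matched to a physical slot.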

Normally I would call it a bad GPU and send it back for RMA, but this happens quite often. Any suggestions?
nvidia-bug-report.log.gz (3.69 MB)

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Added the nvidia-bug-report.log.gz.

Looks like a HW failure:

[   41.089112] NVRM: RmInitAdapter failed! (0x26:0x65:1106)
[   41.089170] NVRM: rm_init_adapter failed for device bearing minor number 1

Check it in another system, then RMA.
2080Tis are delicate.
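
With several servers to check, the two NVRM messages quoted above can be searched for in each machine's kernel log. This is a minimal Python sketch under assumptions: `find_init_failures` is a made-up helper name, and the patterns are built only from the two log lines in this thread, so other driver versions may word them differently.

```python
import re

# Patterns taken from the two kernel log lines quoted in this thread.
RM_INIT = re.compile(r"NVRM: RmInitAdapter failed! \(([^)]*)\)")
MINOR = re.compile(
    r"NVRM: rm_init_adapter failed for device bearing minor number (\d+)"
)


def find_init_failures(log_text):
    """Return the failure codes and the device minor numbers found in a kernel log."""
    codes = RM_INIT.findall(log_text)
    minors = sorted({int(n) for n in MINOR.findall(log_text)})
    return {"codes": codes, "minors": minors}
```

Feeding it the output of `dmesg` (or the kernel log section of nvidia-bug-report.log) gives the minor numbers of the devices that failed to initialize. The minor number should correspond to the `/dev/nvidiaN` device node, which is not necessarily the same as the index nvidia-smi prints.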