GPU 0000:0B:00.0: GPU is lost when running a PyTorch program for CNN training

We have an HP-DL580-G7 server with four 1200W power supplies operating at 95% efficiency (4560W of usable power). It has 4 CPUs, 128GB of RAM, 3 HDDs, and 6 GeForce GTX 1080 Ti GPUs from EVGA. 4 GPUs are inside the server, and in the past the other 2 GPUs were outside, connected with extended PCIe cables. We always had issues with GPU device 3 being lost when all 6 GPUs were enabled, so we decided to continue with only 4 for now.

But this morning, while training a YOLOv2 VGG16 neural network, we lost GPU device 0 at epoch 0. Temperature and memory usage were monitored during the run, and we did not see any issues there. I attached the log file. Thank you.
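For reference, here is the power arithmetic above as a small Python sketch. The 250 W per-GPU figure is an assumption (NVIDIA's reference board power for the GTX 1080 Ti; the actual limit on our EVGA cards may differ), and the function names are only illustrative. CPU, RAM, and disk draw are not included, so this is a rough check of GPU load against PSU capacity, not a full power budget.

```python
# Rough power-budget check for the server described above.
PSU_COUNT = 4
PSU_RATED_W = 1200
PSU_EFFICIENCY = 0.95  # efficiency figure quoted in the post

# ASSUMPTION: reference board power for a GTX 1080 Ti; vendor cards may differ.
GPU_BOARD_POWER_W = 250


def usable_power_w(psu_count, rated_w, efficiency):
    """Usable power from all supplies combined, in watts."""
    return psu_count * rated_w * efficiency


def gpu_power_w(gpu_count, per_gpu_w=GPU_BOARD_POWER_W):
    """Nominal sustained draw of the GPUs alone, in watts."""
    return gpu_count * per_gpu_w


if __name__ == "__main__":
    usable = usable_power_w(PSU_COUNT, PSU_RATED_W, PSU_EFFICIENCY)
    for n in (4, 6):
        draw = gpu_power_w(n)
        print(f"{n} GPUs: ~{draw} W of {usable:.0f} W usable ({draw / usable:.0%})")
```

This only compares nominal sustained draw with the combined supply capacity; it says nothing about short power spikes or about how much each PCIe slot or riser cable can deliver.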
nvidia-bug-report.log.gz (2.2 MB)