GPU lost (Xid 79) during TensorFlow training

Dear all,

I’m using two Tesla V100 cards in a Tyan FT77A-B7059 GPU server (PCIe 3.0, BIOS updated to 1.05, which at least supports the K80). A test with gpu_burn shows that the 250 W of electrical power is delivered without problems and that the airflow is sufficient to keep the cards at 82 degC, but I run into trouble as soon as I train a neural network with TensorFlow (e.g. DeepLearningFrameworks/Tensorflow_MultiGPU.ipynb at master · ilkarman/DeepLearningFrameworks · GitHub):

nvidia-smi sporadically reports “GPU lost” / Xid 79 (see also the attached nvidia bug log):

Mar 1 16:11:40 ml-comp kernel: NVRM: Xid (PCI:0000:89:00): 79, GPU has fallen off the bus.

A software reboot is not sufficient to bring the card back; a full power-down is needed. I tried the usual hints (changing or replugging the power cables and PCIe slots), but without success. Thermal problems shouldn’t be responsible, as only temperatures around 60-70 degC are reported around the time of the incident, and gpu_burn ran flawlessly.
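
For anyone who wants to double-check the temperature and power readings next to a training run, a small NVML-based logger along these lines should work. This is only a minimal sketch: it assumes the pynvml Python bindings are installed, and the one-second polling interval is arbitrary.

```python
# Sketch: log temperature and power draw of all GPUs once per second,
# so the values at the moment of an Xid 79 can be checked afterwards.
# Assumes the pynvml bindings (nvidia-ml-py) are installed.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # mW -> W
            print(f"{time.strftime('%H:%M:%S')} GPU{i}: {temp} C, {power_w:.0f} W",
                  flush=True)
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```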

I’m using CentOS 7.6 with NVIDIA driver 410.79 and TensorFlow 1.19 in a Docker container using the nvidia-docker2 runtime.

Thanks in advance for any suggestions! H.

UPDATE: limiting the power to 100 W apparently avoids the problem, as first tests show…
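
In case someone wants to reproduce the workaround, the same 100 W cap can also be applied programmatically via NVML, roughly equivalent to `nvidia-smi -pl 100`. Again only a sketch under assumptions: it needs the pynvml bindings, must run as root, and clamps the target to whatever range the board actually allows.

```python
# Sketch: apply a 100 W power cap to all GPUs via NVML (roughly `nvidia-smi -pl 100`).
# Requires root. NVML expects the limit in milliwatts, and it must lie within the
# board's allowed range, so the constraints are queried first.
import pynvml

TARGET_W = 100  # desired cap in watts

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    lo_mw, hi_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)
    limit_mw = max(lo_mw, min(hi_mw, TARGET_W * 1000))
    pynvml.nvmlDeviceSetPowerManagementLimit(h, limit_mw)
    print(f"GPU {i}: power limit set to {limit_mw / 1000:.0f} W")
pynvml.nvmlShutdown()
```

Note that the limit set this way does not survive a reboot unless it is reapplied (e.g. from a startup script).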