GPU lost (Xid 79) during TensorFlow training

Dear all,

I’m using two Tesla V100 cards in a Tyan FT77A-B7059 GPU server (PCIe 3.0, BIOS updated to 1.05, which at least supports the K80). A test with gpu_burn shows that the 250 W of electrical power is delivered without problems and that the airflow is sufficient to keep the cards at 82 degC, but I run into trouble as soon as I train a neural network with TensorFlow (e.g. DeepLearningFrameworks/Tensorflow_MultiGPU.ipynb at master · ilkarman/DeepLearningFrameworks · GitHub):

nvidia-smi sporadically reports “GPU lost” / Xid 79 (see also the attached nvidia bug log):

Mar 1 16:11:40 ml-comp kernel: NVRM: Xid (PCI:0000:89:00): 79, GPU has fallen off the bus.

A software reboot is not sufficient to bring the card back; a full power-down is needed. I tried the usual hints (changing or replugging the power cables and PCIe slots), but without success. Thermal problems shouldn’t be responsible, as only temperatures around 60-70 degC are reported around the time of the incident, and gpu_burn ran flawlessly.
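
For anyone who wants to double-check the temperature and power readings next to a training run, a small NVML-based logger along these lines should work. This is only a minimal sketch: it assumes the pynvml Python bindings are installed, and the one-second polling interval is arbitrary.

```python
# Sketch: log temperature and power draw of all GPUs once per second,
# so the values at the moment of an Xid 79 can be checked afterwards.
# Assumes the pynvml bindings (nvidia-ml-py) are installed.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # mW -> W
            print(f"{time.strftime('%H:%M:%S')} GPU{i}: {temp} C, {power_w:.0f} W",
                  flush=True)
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```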

I’m using CentOS 7.6 with NVIDIA driver 410.79 and TensorFlow 1.19 in a Docker container using the nvidia-docker2 runtime.

Thanks in advance for any suggestions! H.

UPDATE: limiting the power to 100 W apparently avoids the problem, as first tests show…
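
In case someone wants to reproduce the workaround, the same 100 W cap can also be applied programmatically via NVML, roughly equivalent to `nvidia-smi -pl 100`. Again only a sketch under assumptions: it needs the pynvml bindings, must run as root, and clamps the target to whatever range the board actually allows.

```python
# Sketch: apply a 100 W power cap to all GPUs via NVML (roughly `nvidia-smi -pl 100`).
# Requires root. NVML expects the limit in milliwatts, and it must lie within the
# board's allowed range, so the constraints are queried first.
import pynvml

TARGET_W = 100  # desired cap in watts

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    lo_mw, hi_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)
    limit_mw = max(lo_mw, min(hi_mw, TARGET_W * 1000))
    pynvml.nvmlDeviceSetPowerManagementLimit(h, limit_mw)
    print(f"GPU {i}: power limit set to {limit_mw / 1000:.0f} W")
pynvml.nvmlShutdown()
```

Note that the limit set this way does not survive a reboot unless it is reapplied (e.g. from a startup script).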