K80 GPU disappears when trying to run two TensorFlow applications (one on each GPU) simultaneously.

Hi Everybody.

When we try to execute two instances of a TensorFlow example on a K80 GPU under Linux, the GPU generates a critical error and we have to shut down and restart the server in order to recover it.
We ran several tests:

  1. Just one instance of the program, without the CUDA_VISIBLE_DEVICES variable set: it allocates memory on both GPUs, uses only one, and works well.
  2. Just one instance of the program with CUDA_VISIBLE_DEVICES set to 0 or 1 (indicating which GPU to use): it allocates and uses only the indicated GPU and works without problems.
  3. Two instances of the sample program running simultaneously, with CUDA_VISIBLE_DEVICES set to 0 for one and 1 for the other (see the sketch after this list): this causes a loss of communication between the host and the GPU card, and we have to shut down and restart the server in order to recover the GPUs and the server.
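For reference, a minimal sketch of how each instance pins itself to one GPU before importing TensorFlow (illustrative only; the real test is the standard TensorFlow example, not this script):

import os

# Select one physical GPU per process; this must be set before TensorFlow
# initialises CUDA. The first instance uses "0", the second uses "1".
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Whichever device is selected above is always seen as /gpu:0 inside the process.
with tf.device("/gpu:0"):
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b

with tf.Session() as sess:
    print(sess.run(c))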

If anybody has any idea about what could be happening, I would appreciate it very much.

Thanks in advance

H.
P.S.: I include the error messages below.

The operating system is Linux (Gentoo, kernel 4.9.34) and the NVIDIA driver is 384.59.

The error shown is:

2017-07-31 16:33:04.815885: E tensorflow/stream_executor/cuda/cuda_driver.cc:1098] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: No stack trace available
2017-07-31 16:33:04.815980: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed

The message in the console is:

[Jul31 16:33] NVRM: GPU at PCI:0000:83:00: GPU-0694dbcb-6cb3-c7bb-05f8-4185fe20d67c
[  +0.000018] NVRM: GPU Board Serial Number: 0323615069723
[  +0.000004] NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
[  +0.000001] NVRM: GPU at 0000:83:00.0 has fallen off the bus.
[  +0.000015] NVRM: GPU is on Board 0323615069723.
[  +0.000011] NVRM: A GPU crash dump has been created. If possible, please run
              NVRM: nvidia-bug-report.sh as root to collect this data before
              NVRM: the NVIDIA kernel module is unloaded.
[  +0.000011] NVRM: GPU at PCI:0000:84:00: GPU-e405f17d-0c15-c2b1-eaf6-2f50a0c524ec
[  +0.000019] NVRM: GPU Board Serial Number: 0323615069723
[  +0.000002] NVRM: Xid (PCI:0000:84:00): 79, GPU has fallen off the bus.
[  +0.000001] NVRM: GPU at 0000:84:00.0 has fallen off the bus.
[  +0.000019] NVRM: GPU is on Board 0323615069723.

We set up the BIOS to recognise the GPU memory; the relevant lspci output is:

83:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 53
	NUMA node: 1
	Region 0: Memory at fa000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 39f800000000 (64-bit, prefetchable) 
	Region 3: Memory at 39fc00000000 (64-bit, prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee20000  Data: 4034
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
--
	Kernel driver in use: nvidia
	Kernel modules: nvidia_drm, nvidia

84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 54
	NUMA node: 1
	Region 0: Memory at f9000000 (32-bit, non-prefetchable) 
	Region 1: Memory at 39f000000000 (64-bit, prefetchable) 
	Region 3: Memory at 39f400000000 (64-bit, prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee20000  Data: 4044
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
--
	Kernel driver in use: nvidia
	Kernel modules: nvidia_drm, nvidia

ff:08.0 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0 (rev 01)
	Subsystem: Super Micro Computer Inc Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

ff:08.2 Performance counters: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0 (rev 01)
	Subsystem: Super Micro Computer Inc Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Kernel driver in use: bdx_uncore

ff:08.3 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0 (rev 01)
	Subsystem: Super Micro Computer Inc Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Capabilities: [40] Express (v1) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-

Why Gentoo, may I ask? Is Gentoo even officially supported by CUDA?

This better not be a /g/ meme.

Are those K80s installed in a server that was properly certified by the OEM for use with the K80, and did those K80s actually ship from that OEM in the server? (I see it is an SMC server.)

Proper power delivery and proper cooling airflow are two things necessary to keep Tesla GPUs happy. For the cooling issue, I would monitor GPU temperatures up to the point at which they drop off the bus.
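For example, something like the following, logged once per second until the crash, should show whether temperature or power draw spikes right before the GPUs drop off the bus (available fields can be checked with nvidia-smi --help-query-gpu):

nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm --format=csv -l 1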

Why Gentoo? A long history (10 years of it…). We have had no problems with this distribution, and we run close to 100 Linux servers.

By optimising the OS for the specific CPU architecture we get up to 10% more speed compared with CentOS 6.0, but I think that is another discussion.

We have another server with two cards (a TITAN Xp and a TITAN X) and it works very well (with the same distribution, kernel, drivers and test program).

This problem occurs on two different servers, each with one K80 card and configured identically (SuperMicro SYS-2028TP-DTR with an X10DRT-P motherboard). That is why we think it is not a GPU hardware failure.

I forgot to mention: running the test program under cuda-memcheck, it does not fail.

Thanks in any case; if you have any idea why the K80 could fail under these conditions, I would appreciate it.

:-D

H.

Does the program fail if you use tf.device() to run another copy alongside in the same process, instead of setting CUDA_VISIBLE_DEVICES and running two instances?
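Something along these lines, as a minimal sketch (not your actual model; just two matmuls placed on the two GPUs in one process):

import tensorflow as tf

# Build one small workload per GPU inside a single process.
outputs = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        x = tf.random_normal([2048, 2048])
        outputs.append(tf.matmul(x, x))

# allow_soft_placement falls back to CPU if an op has no GPU kernel.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(outputs)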

Yes, they are installed in a certified server. The room temperature is 19 °C.

The temperature of the cards peaks at around 60 °C.

I have now tested what you suggest (downloaded the cifar10_multi_gpu_train code from TensorFlow), and it fails too:

python3 cifar10_multi_gpu_train.py --num_gpus=2

2017-07-31 18:35:54.284307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1
2017-07-31 18:35:54.284316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y
2017-07-31 18:35:54.284320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y
2017-07-31 18:35:54.284334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0)
2017-07-31 18:35:54.284340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:84:00.0)
2017-07-31 18:35:54.858678: I tensorflow/core/common_runtime/simple_placer.cc:675] Ignoring device specification /device:GPU:1 for node 'tower_1/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-07-31 18:35:54.858723: I tensorflow/core/common_runtime/simple_placer.cc:675] Ignoring device specification /device:GPU:0 for node 'tower_0/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /device:CPU:0
2017-07-31 18:35:58.427516: step 0, loss = 4.68 (137.1 examples/sec; 0.933 sec/batch)
2017-07-31 18:35:59.883505: step 10, loss = 4.63 (7392.7 examples/sec; 0.017 sec/batch)
2017-07-31 18:36:00.205297: step 20, loss = 4.64 (8538.2 examples/sec; 0.015 sec/batch)
2017-07-31 18:36:01.240517: E tensorflow/stream_executor/cuda/cuda_driver.cc:1098] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: No stack trace available
2017-07-31 18:36:01.240518: E tensorflow/stream_executor/cuda/cuda_driver.cc:1098] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: No stack trace available
2017-07-31 18:36:01.240546: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_UNKNOWN
2017-07-31 18:36:01.240592: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_UNKNOWN
2017-07-31 18:36:01.240625: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
2017-07-31 18:36:01.240626: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
2017-07-31 18:36:01.240632: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
2017-07-31 18:36:01.240632: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

The console messages are:

[ 7063.325682] NVRM: GPU at PCI:0000:83:00: GPU-0694dbcb-6cb3-c7bb-05f8-4185fe20d67c
[ 7063.325697] NVRM: GPU Board Serial Number: 0323615069723
[ 7063.325699] NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
[ 7063.325699] NVRM: GPU at 0000:83:00.0 has fallen off the bus.
[ 7063.325712] NVRM: GPU is on Board 0323615069723.
[ 7063.325718] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[ 7063.325726] NVRM: GPU at PCI:0000:84:00: GPU-e405f17d-0c15-c2b1-eaf6-2f50a0c524ec
[ 7063.325743] NVRM: GPU Board Serial Number: 0323615069723
[ 7063.325744] NVRM: Xid (PCI:0000:84:00): 79, GPU has fallen off the bus.
[ 7063.325744] NVRM: GPU at 0000:84:00.0 has fallen off the bus.
[ 7063.325760] NVRM: GPU is on Board 0323615069723.

The same problem…

Checking the power consumption graphs shows peaks of 308 W (the power supply is rated at 800 W).

A PSU rated for 800W should be sufficient for a system with a single K80 unless you have a very power-hungry host system.

The total sum of specified power for all components should not exceed 60% of rated PSU output for stable operation (this accounts for power spikes, component aging, etc). That would give you a power budget of 480W in this case. The K80 itself is rated at 300W, leaving 180W for other system components. If the system has dual CPUs, those remaining 180W are likely not sufficient. If it has a single CPU, it is likely sufficient.
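As a quick check of that arithmetic with the figures assumed so far (rule of thumb only, not a measurement):

# Hypothetical power-budget check for an 800 W PSU and a 300 W K80.
psu_rated_w = 800
k80_board_power_w = 300
budget_w = 0.6 * psu_rated_w                    # ~60% of rated output -> 480 W
host_headroom_w = budget_w - k80_board_power_w  # -> 180 W left for CPUs, RAM, disks, fans
print(budget_w, host_headroom_w)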

Checking the system, it has a 1280 W power supply, not 800 W as I thought.
It has two Intel Xeon E5-2640 v4 CPUs at 2.4 GHz and 128 GB of RAM.
In theory that should be enough to power the system, but it is not.

I disabled auto-boost and now it works well. It looks like a power consumption problem :-|
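For anyone finding this later: the usual way to turn off autoboost on a K80 is something like the following, run as root (the exact syntax may depend on the driver version):

nvidia-smi --auto-boost-default=0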

Thanks in any case for your help. I will contact my provider to check with SuperMicro what could be happening!