GPUs temporarily disappear during runtime (driver 384.59)

We have a three-Pascal-GPU setup (GTX 1080, P40, Titan X).
One, two, or all three GPUs disappear from nvidia-smi during runtime and reappear after some time (seconds to minutes).
The GPUs are also unavailable to programs (TensorFlow etc.) during that time, so the problem is not limited to nvidia-smi.

I suspect the problem is driver-related, since it first appeared with the new version (384.59).

We cannot roll back to the old version shipped with the CUDA 8 package (version 375.66) due to another serious bug.
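To capture exactly when each GPU drops out and comes back, we run a small watchdog that polls nvidia-smi and logs changes. This is only a rough sketch (the polling interval and log format are arbitrary choices of ours); it assumes `nvidia-smi` is on the PATH and otherwise uses only the Python standard library:

```python
# Rough watchdog sketch: poll `nvidia-smi -L` and print a timestamped
# line whenever a GPU disappears from or reappears in the listing.
# NOTE: interval and output format are arbitrary; adjust as needed.
import subprocess
import time
from datetime import datetime


def list_gpus():
    """Return the set of GPU description lines from `nvidia-smi -L`."""
    out = subprocess.check_output(["nvidia-smi", "-L"],
                                  universal_newlines=True)
    return {line.strip() for line in out.splitlines() if line.strip()}


def diff_gpus(before, after):
    """Return (disappeared, reappeared) GPU sets between two snapshots."""
    return before - after, after - before


def watch(interval=5.0):
    """Poll forever, logging every change in the visible GPU set."""
    seen = list_gpus()
    while True:
        time.sleep(interval)
        current = list_gpus()
        gone, back = diff_gpus(seen, current)
        stamp = datetime.now().isoformat()
        for gpu in sorted(gone):
            print("{} DISAPPEARED: {}".format(stamp, gpu))
        for gpu in sorted(back):
            print("{} REAPPEARED:  {}".format(stamp, gpu))
        seen = current

# run on the affected machine with: watch()
```

The timestamps can then be correlated with kernel log entries from the same moment.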

Our setup details:

== nvidia-smi ===================================================
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 29%   44C    P0    41W / 180W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:04:00.0     Off |                  N/A |
|  0%   52C    P0    56W / 250W |      0MiB / 12189MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 0000:83:00.0     Off |                    0 |
| N/A   69C    P0   128W / 250W |  21897MiB / 22912MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     89766    C   /home/a/dl/bin/python                        21895MiB |
+-----------------------------------------------------------------------------+

== cat /etc/issue ===============================================
Linux b 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux b 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== env ==========================================================
LD_LIBRARY_PATH /home/a/torch/install/lib:/home/a/torch/install/lib::/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/:/usr/local/lib/:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/:/usr/local/lib/
DYLD_LIBRARY_PATH /home/a/torch/install/lib:/home/a/torch/install/lib:

== cuda libs ===================================================
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-7.5/doc/man/man7/libcudart.so.7
/usr/local/cuda-7.5/doc/man/man7/libcudart.7
/usr/local/cuda-7.5/lib/libcudart_static.a
/usr/local/cuda-7.5/lib/libcudart.so.7.5.18
/usr/local/cuda-7.5/lib64/libcudart_static.a
/usr/local/cuda-7.5/lib64/libcudart.so.7.5.18

== OS ==================

Linux version 4.4.0-89-generic (buildd@lgw01-18) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #112-Ubuntu

== Devs ==================

$ lspci -nnk | grep -i nvidia
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b80] (rev a1)
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
02:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10f0] (rev a1)
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b00] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:119a]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10ef] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:119a]
83:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1b38] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11d9]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

== Hardware ==================

512 GB RAM

SMBIOS 2.8 present.
Handle 0x0200, DMI type 2, 8 bytes
Base Board Information
Manufacturer: Dell Inc.
Product Name: 0W9WXC
Version: A05
Serial Number: .3TZRV42.CN779214BO015N.

Machine: System: Dell product: PowerEdge T630
Mobo: Dell model: 0W9WXC v: A05 Bios: Dell v: 2.4.2 date: 01/09/2017

It doesn't look to me like the P40 is a supported GPU in the Dell T630 server. I'm not sure where you got that P40 or how or why you put it in a Dell T630.

Tesla GPUs should only be put into servers that are certified by the manufacturer for that GPU.

I'm pretty sure you are running the P40 in an unsupported configuration, so it's not surprising to me that it's not working well for you.