CUDA devices randomly appear/disappear; NVRM: rm_init_adapter failed for device; 367 OK, 375+ not

Problem:
With newer NVIDIA drivers, devices randomly "disappear". The system is a GPU computing server with four GPUs (Tesla M40 24GB).
Any ideas would be highly appreciated.

Best,

Henrik

uname -a
Linux pk02 4.8.0-41-generic #44~16.04.1-Ubuntu SMP Fri Mar 3 17:11:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
(same problem with older kernels; 4.4, if I remember correctly)

No problems with the older GPU driver (367.57):
nvidia-smi
Tue Mar 21 12:42:23 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      Off  | 0000:0D:00.0     Off |                    0 |
| N/A   32C    P0    56W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      Off  | 0000:13:00.0     Off |                    0 |
| N/A   30C    P0    57W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      Off  | 0000:8E:00.0     Off |                    0 |
| N/A   27C    P0    57W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40 24GB      Off  | 0000:91:00.0     Off |                    0 |
| N/A   29C    P0    62W / 250W |      0MiB / 22939MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |

corresponding kernel messages:
Mar 17 16:08:49 pk01 kernel: [ 4.747264] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016
Mar 17 16:08:49 pk01 kernel: [ 4.826195] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 367.57 Mon Oct 3 20:32:57 PDT

With 375:
corresponding kernel messages after calling nvidia-smi:
Mar 16 15:47:58 pk01 kernel: [ 4.433469] nvidia: module license 'NVIDIA' taints kernel.
Mar 16 15:47:58 pk01 kernel: [ 4.676885] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.39 Tue Jan 31 20:47:00 PST 2017 (using threaded interru$
Mar 16 15:47:58 pk01 kernel: [ 4.754903] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 375.39 Tue Jan 31 19:41:48 PS$
Mar 16 15:48:04 pk01 kernel: [ 21.253751] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 16 15:48:04 pk01 kernel: [ 21.253852] NVRM: rm_init_adapter failed for device bearing minor number 2
Mar 16 15:48:11 pk01 kernel: [ 27.453034] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 16 15:48:11 pk01 kernel: [ 27.453125] NVRM: rm_init_adapter failed for device bearing minor number 3

With 378:
corresponding kernel messages:
Mar 17 15:27:01 pk01 kernel: [ 4.613626] nvidia: module license 'NVIDIA' taints kernel.
Mar 17 15:27:01 pk01 kernel: [ 4.613627] nvidia: module license 'NVIDIA' taints kernel.
Mar 17 15:27:01 pk01 kernel: [ 4.951851] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 378.13 Tue Feb 7 20:10:06 PST 2017 (using threaded interru$
Mar 17 15:27:01 pk01 kernel: [ 4.968068] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 378.13 Tue Feb 7 18:30:08 PS$
Mar 17 15:27:01 pk01 kernel: [ 15.047878] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 17 15:27:01 pk01 kernel: [ 15.047938] NVRM: rm_init_adapter failed for device bearing minor number 0
Mar 17 15:28:56 pk01 kernel: [ 136.997420] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 17 15:28:56 pk01 kernel: [ 136.997493] NVRM: rm_init_adapter failed for device bearing minor number 0
Mar 17 15:29:03 pk01 kernel: [ 143.881740] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 17 15:29:03 pk01 kernel: [ 143.882296] NVRM: rm_init_adapter failed for device bearing minor number 1
Mar 17 15:29:13 pk01 kernel: [ 153.795640] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 17 15:29:13 pk01 kernel: [ 153.795763] NVRM: rm_init_adapter failed for device bearing minor number 3
Mar 17 15:29:27 pk01 kernel: [ 167.695291] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 17 15:29:27 pk01 kernel: [ 167.695361] NVRM: rm_init_adapter failed for device bearing minor number 0
Mar 17 15:29:34 pk01 kernel: [ 175.038085] NVRM: RmInitAdapter failed! (0x53:0xffff:1857)
Mar 17 15:29:34 pk01 kernel: [ 175.038360] NVRM: rm_init_adapter failed for device bearing minor number 1
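
To see at a glance which devices are affected, the failures in the kernel messages above can be tallied per minor number. The following is only a rough sketch: the function name `count_init_failures` is made up for illustration, and it assumes the log lines contain the exact "rm_init_adapter failed for device bearing minor number N" wording shown in this post.

```python
import re
from collections import Counter

# Matches the NVRM message that names the failing device by minor number.
# The wording is taken from the kernel messages quoted in this post.
FAILED_MINOR = re.compile(
    r"NVRM: rm_init_adapter failed for device bearing minor number (\d+)"
)


def count_init_failures(lines):
    """Return a Counter mapping minor number -> number of failed inits."""
    failures = Counter()
    for line in lines:
        match = FAILED_MINOR.search(line)
        if match:
            failures[int(match.group(1))] += 1
    return failures


# Two lines copied from the 375.39 log above, used as sample input.
sample = [
    "Mar 16 15:48:04 pk01 kernel: [ 21.253852] NVRM: rm_init_adapter failed for device bearing minor number 2",
    "Mar 16 15:48:11 pk01 kernel: [ 27.453125] NVRM: rm_init_adapter failed for device bearing minor number 3",
]

for minor, count in sorted(count_init_failures(sample).items()):
    print(f"minor {minor}: {count} failed init(s)")
```

The function accepts any iterable of lines, so an open log file can be passed in directly instead of the sample list. On the 378.13 messages above, such a tally shows that minor numbers 0, 1 and 3 all fail at some point, which fits the impression that no single card is at fault.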

nvidia-bug-report.log.gz (527 KB)