GTX 1080 Ti GPUs do not run correctly on Ubuntu 16.04

We have four GTX 1080 Ti cards installed in a physical server running Ubuntu 16.04.
The GPUs have been going down one by one since Oct 1, and now only one GPU shows up in nvidia-smi.

Latest dmesg output:

[151117.917922] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[151118.058834] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:1426)
[151118.058860] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 1
[151118.083664] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:1426)
[151118.083689] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 1
[151118.106863] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x31:0xffff:2502)
[151118.106896] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[151118.245241] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x31:0xffff:2502)
[151118.245276] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[151118.965878] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x31:0xffff:2502)
[151118.965910] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[151119.103959] NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x31:0xffff:2502)
[151119.103988] NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
[151119.244525] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:1426)
[151119.244550] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 1
[151119.269441] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:1426)
[151119.269466] NVRM: GPU 0000:03:00.0: rm_init_adapter failed, device minor number 1
[151119.292448] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x31:0xffff:2502)
[151119.292482] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[151119.430791] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x31:0xffff:2502)
[151119.430817] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
➜  ls /proc/driver/nvidia/gpus
0000:02:00.0  0000:03:00.0  0000:82:00.0  0000:83:00.0

Only the GPU at 0000:83:00.0 appears in the output of nvidia-smi.
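
To make a long dmesg log like the one above easier to read, the RmInitAdapter failures can be grouped by PCI address and error code. The following is only a minimal sketch, not an NVIDIA tool: the function name summarize_failures is hypothetical, and the regular expression assumes the exact line format shown in this post.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [151118.058834] NVRM: GPU 0000:03:00.0: RmInitAdapter failed! (0x23:0xffff:1426)
# The pattern is an assumption based on the log format quoted in this post.
FAILURE = re.compile(
    r"NVRM: GPU (?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def summarize_failures(dmesg_text):
    """Return {pci_address: {error_code: count}} for RmInitAdapter failures."""
    summary = defaultdict(lambda: defaultdict(int))
    for line in dmesg_text.splitlines():
        match = FAILURE.search(line)
        if match:
            summary[match.group("bdf")][match.group("code")] += 1
    # Convert nested defaultdicts to plain dicts for easier printing and comparison.
    return {bdf: dict(codes) for bdf, codes in summary.items()}
```

Fed the excerpt above, this groups the failures as 0000:02:00.0 and 0000:82:00.0 with code 0x31:0xffff:2502, and 0000:03:00.0 with code 0x23:0xffff:1426; 0000:83:00.0 has no failure lines. The function takes the log as a string, so the text can come from a saved dmesg dump or from sys.stdin.read() when dmesg is piped in.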

How can I solve this? Our laboratory uses these GPUs to accelerate our AI workloads. Any help would be appreciated, thanks.

I have tried upgrading the driver from 440 to 535 and blocking the nouveau module.

nvidia-bug-report.log.gz (451.2 KB)

First of all, please set nvidia-persistenced to start on boot, since you’re running headless. There also seems to be a cooling problem: the remaining GPU is at 51°C while idle. Have you checked whether the cards are blocking each other’s airflow? Please also check whether the failed GPUs come back to life after a reboot; if they don’t, you may have to check for faulty hardware.
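
The temperature check suggested above could be repeated after each reboot with a small script. This is a rough sketch under stated assumptions: it relies on the nvidia-smi query fields pci.bus_id, temperature.gpu and utilization.gpu, the helper names are hypothetical, and the thresholds (45°C, 5% utilization) are illustrative values, not NVIDIA guidance.

```python
import subprocess

# nvidia-smi query for bus ID, temperature (C) and utilization (%), without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=pci.bus_id,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse 'bus_id, temp, util' rows into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        bus_id, temp, util = parts
        try:
            rows.append({"bus_id": bus_id, "temp_c": int(temp), "util_pct": int(util)})
        except ValueError:
            continue  # skip rows whose values are not numeric
    return rows


def warm_while_idle(rows, temp_limit_c=45, idle_util_pct=5):
    """Return bus IDs of GPUs that are nearly idle yet at or above temp_limit_c.

    Both thresholds are assumptions chosen for illustration only.
    """
    return [
        r["bus_id"]
        for r in rows
        if r["util_pct"] <= idle_util_pct and r["temp_c"] >= temp_limit_c
    ]


def read_gpus():
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)
```

Calling warm_while_idle(read_gpus()) on the server lists the GPUs that are warm while idle; comparing the bus IDs returned by read_gpus() with the entries in /proc/driver/nvidia/gpus shows which cards are still missing after a reboot.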