Hi Support Team,
This is a short description of my platform:
- Ubuntu 14.04.5 LTS
- 4 NVIDIA Tesla M40 cards
- CUDA Driver Version / Runtime Version: 7.5 / 7.5
- CUDA Capability Major/Minor version number: 5.2
Description of my issue:
I was running a job using only one of the cards for a couple of hours. My job stopped running and at the same time the syslog started repeatedly showing the following error:
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Failed to unlink socket: No such file or directory
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Failed to unlink PID file: No such file or directory
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Shutdown (18757)
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Started (17210)
Nov 16 20:22:18 NVC-BARMLS01 nvidia-persistenced: Failed to unlink socket: No such file or directory
Nov 16 20:22:18 NVC-BARMLS01 nvidia-persistenced: Failed to unlink PID file: No such file or directory
Nov 16 20:22:18 NVC-BARMLS01 nvidia-persistenced: Shutdown (17210)
Nov 16 20:22:21 NVC-BARMLS01 nvidia-persistenced: Started (17239)
...
After a long period of repeating the same error, the syslog showed:
Nov 17 09:10:52 NVC-BARMLS01 nvidia-persistenced: Failed to unlink socket: No such file or directory
Nov 17 09:10:52 NVC-BARMLS01 nvidia-persistenced: Failed to unlink PID file: No such file or directory
Nov 17 09:10:52 NVC-BARMLS01 nvidia-persistenced: Shutdown (44319)
Nov 17 09:10:55 NVC-BARMLS01 nvidia-persistenced: Started (44333)
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.279896] NVRM: GPU at 0000:89:00.0 has fallen off the bus.
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297456] pcieport 0000:80:03.0: AER: Uncorrected (Non-Fatal) error received: id=8018
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297462] pcieport 0000:80:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=8018(Requester ID)
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297463] pcieport 0000:80:03.0: device [8086:6f08] error status/mask=00004000/00000000
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297465] pcieport 0000:80:03.0: [14] Completion Timeout (First)
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297468] pcieport 0000:80:03.0: broadcast error_detected message
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297476] i40e 0000:87:00.0: i40e_pci_error_detected: error 1
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.298334] NVRM: A GPU crash dump has been created. If possible, please run
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.298334] NVRM: nvidia-bug-report.sh as root to collect this data before
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.298334] NVRM: the NVIDIA kernel module is unloaded.
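In case it helps with triage, here is a minimal sketch of how the restart loop and the bus drop could be summarized from the syslog. It assumes the standard syslog line layout shown in the excerpts above; the function name `summarize` and the counter keys are my own, not part of any NVIDIA tool.

```python
import re
import sys
from collections import Counter

# Assumed layout: "Nov 16 20:22:00 HOST tag: message" (as in the excerpts above).
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+([^:\[]+)(?:\[\d+\])?:\s+(.*)$"
)
# PCI address of a GPU reported as lost by the NVRM kernel message.
BUS_RE = re.compile(r"GPU at ([0-9a-fA-F:.]+) has fallen off the bus")


def summarize(lines):
    """Count nvidia-persistenced events and collect 'fallen off the bus' reports."""
    counts = Counter()
    bus_events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        timestamp, _host, tag, msg = match.groups()
        if tag == "nvidia-persistenced":
            if msg.startswith("Started"):
                counts["started"] += 1
            elif msg.startswith("Shutdown"):
                counts["shutdown"] += 1
            elif msg.startswith("Failed to unlink"):
                counts["unlink_failed"] += 1
        elif tag == "kernel":
            bus = BUS_RE.search(msg)
            if bus:
                bus_events.append((timestamp, bus.group(1)))
    return counts, bus_events


if __name__ == "__main__":
    # /var/log/syslog is the Ubuntu default; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    with open(path, errors="replace") as handle:
        counts, bus_events = summarize(handle)
    print(dict(counts))
    for timestamp, pci_address in bus_events:
        print(f"{timestamp}: GPU at {pci_address} fell off the bus")
```

This only counts the messages quoted in this post; it does not diagnose the cause of the PCIe completion timeout.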
I’ve followed the instructions for collecting the bug report. However, I do not know how to give it to you.
Please note that this issue affected the PCIe bus and took many other PCI devices, and the server itself, down with it. We lost network connectivity and the server was basically unresponsive. We had to reboot it.
Thanks,
Djamal
nvidia-bug-report.log.gz (65.7 KB)