kernel: [7766925.279896] NVRM: GPU at 0000:89:00.0 has fallen off the bus

DjamalAbide · November 18, 2016, 4:18pm

Hi Support Team,

This is a short description of my platform:

Ubuntu 14.04.5 LTS
4 NVIDIA Tesla M40 cards
CUDA Driver Version / Runtime Version: 7.5 / 7.5
CUDA Capability Major/Minor version number: 5.2

Description of my issue:

I was running a job using only one of the cards for a couple of hours. My job stopped running and at the same time the syslog started repeatedly showing the following error:

Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Failed to unlink socket: No such file or directory
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Failed to unlink PID file: No such file or directory
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Shutdown (18757)
Nov 16 20:22:00 NVC-BARMLS01 nvidia-persistenced: Started (17210)
Nov 16 20:22:18 NVC-BARMLS01 nvidia-persistenced: Failed to unlink socket: No such file or directory
Nov 16 20:22:18 NVC-BARMLS01 nvidia-persistenced: Failed to unlink PID file: No such file or directory
Nov 16 20:22:18 NVC-BARMLS01 nvidia-persistenced: Shutdown (17210)
Nov 16 20:22:21 NVC-BARMLS01 nvidia-persistenced: Started (17239)
...

After a long period of repeating the same error, the syslog showed:

Nov 17 09:10:52 NVC-BARMLS01 nvidia-persistenced: Failed to unlink socket: No such file or directory
Nov 17 09:10:52 NVC-BARMLS01 nvidia-persistenced: Failed to unlink PID file: No such file or directory
Nov 17 09:10:52 NVC-BARMLS01 nvidia-persistenced: Shutdown (44319)
Nov 17 09:10:55 NVC-BARMLS01 nvidia-persistenced: Started (44333)
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.279896] NVRM: GPU at 0000:89:00.0 has fallen off the bus.
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297456] pcieport 0000:80:03.0: AER: Uncorrected (Non-Fatal) error received: id=8018
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297462] pcieport 0000:80:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=8018(Requester ID)
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297463] pcieport 0000:80:03.0:   device [8086:6f08] error status/mask=00004000/00000000
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297465] pcieport 0000:80:03.0:    [14] Completion Timeout     (First)
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297468] pcieport 0000:80:03.0: broadcast error_detected message
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.297476] i40e 0000:87:00.0: i40e_pci_error_detected: error 1
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.298334] NVRM: A GPU crash dump has been created. If possible, please run
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.298334] NVRM: nvidia-bug-report.sh as root to collect this data before
Nov 17 09:11:09 NVC-BARMLS01 kernel: [7766925.298334] NVRM: the NVIDIA kernel module is unloaded.
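
As a reading aid for the AER lines above, here is a minimal Python sketch that decodes the two hex fields in them: the `error status/mask` pair and the `id=8018` requester ID. The function names are hypothetical, the bit table is only a subset of the PCIe AER Uncorrectable Error Status register, and skipping masked bits is an assumption of this sketch rather than a statement about how the kernel reports them.

```python
# Subset of bit positions in the PCIe AER Uncorrectable Error Status register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_aer_status(status_hex, mask_hex="0"):
    """Return the names of the error bits set in status and not set in mask."""
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    active = status & ~mask  # assumption of this sketch: skip masked bits
    return [
        AER_UNCORRECTABLE_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if active & (1 << bit)
    ]


def decode_requester_id(id_hex):
    """Split a 16-bit PCIe requester ID into bus:device.function."""
    rid = int(id_hex, 16)
    bus = rid >> 8             # upper 8 bits
    device = (rid >> 3) & 0x1F  # next 5 bits
    function = rid & 0x7        # lowest 3 bits
    return f"{bus:02x}:{device:02x}.{function}"


if __name__ == "__main__":
    print(decode_aer_status("00004000", "00000000"))
    print(decode_requester_id("8018"))
```

With the values from my log, the status decodes to bit 14 (Completion Timeout), matching the `[14]` line, and the requester ID decodes to `80:03.0`, the same root port that reported the error.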

I’ve followed the instructions for collecting the bug report. However, I do not know how to send it to you.

Please note that this issue affected the PCIe bus and took many other PCI devices, and the server itself, down with it. We lost network connectivity and the server was basically unresponsive. We had to reboot it.

Thanks,
Djamal

nvidia-bug-report.log.gz (65.7 KB)

DjamalAbide · November 18, 2016, 7:19pm

Hi all,

I’ve attached my ‘nvidia-bug-report.log.gz’

Thanks,
Djamal.
