Ubuntu 17.10, Nvidia 390.48, CUDA 9.1, GPU has fallen off the bus

Hello,

I have already searched this forum and other sites for a solution to this problem, but found none. There are many posts about it, but none with a helpful answer. I know there can be various causes for this error, but I hope an expert can give me a hint. I'm out of ideas.

Problem: After some time (approx. 30 minutes to 2 hours) of mining, one GPU gets lost. Here is the relevant part of the bug report linked below:

Apr 21 15:59:48 user kernel: NVRM: GPU at PCI:0000:01:00: GPU-1d2e0e8c-69d8-2596-20cd-454daa1bf595
Apr 21 15:59:48 user kernel: NVRM: GPU Board Serial Number: 
Apr 21 15:59:48 user kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
Apr 21 15:59:48 user kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Apr 21 15:59:48 user kernel: NVRM: GPU is on Board .
Apr 21 15:59:48 user kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                 NVRM: nvidia-bug-report.sh as root to collect this data before
                                 NVRM: the NVIDIA kernel module is unloaded.

Bug report: https://pastebin.com/4TL7BkBL
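
For anyone going through a longer log, here is a minimal Python sketch that pulls the Xid events out of kernel log text in the format shown above. The regular expression and the function name `parse_xid_events` are my own for illustration, and the pattern is only checked against the line format quoted in this post.

```python
import re

# Matches lines such as:
#   "... kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus."
# The pattern is an assumption based on the log excerpt in this post.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<msg>.*)"
)


def parse_xid_events(log_text):
    """Return a list of (timestamp, pci_address, xid_code, message) tuples."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue
        # The first three whitespace-separated fields are treated as the
        # syslog timestamp, e.g. "Apr 21 15:59:48".
        timestamp = " ".join(line.split()[:3])
        events.append(
            (timestamp, match["pci"], int(match["code"]), match["msg"].strip())
        )
    return events


sample = (
    "Apr 21 15:59:48 user kernel: NVRM: Xid (PCI:0000:01:00): 79, "
    "GPU has fallen off the bus."
)
for event in parse_xid_events(sample):
    print(event)
```

Counting events per PCI address over several crashes would show whether it is always the same slot or riser that drops out.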

System Info:

  • 6x Palit GeForce GTX 1070 Ti Dual 8GB, connected via risers (currently 5 of the 6 cards are connected)
  • PSU: 1x be quiet! Dark Power Pro 1000W, 1x Seasonic Prime Platinum 1200W
  • Mainboard: MSI Z270-A PRO
  • CPU: Intel Core i5-7500, 3.4 GHz
  • RAM: 16GB
  • SSD: 64GB
  • OS: Ubuntu 17.10 (GNU/Linux 4.13.0-38-generic x86_64)
  • Driver: 390.48
  • CUDA: 9.1.85

Application info:

  • ethminer 0.15.0dev4 with CUDA options, started via SSH on a headless server
  • Overclocking: none. Only persistence mode on, power limit at 100 W, and fan speed at 70%
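
For reference, settings like the persistence mode and power limit listed above are commonly applied with `nvidia-smi`. The sketch below only builds and prints the commands. It assumes the cards are numbered 0 to 4 and uses a hypothetical helper name, `build_commands`; it is not necessarily how they were set on this rig. The fan speed is left out because it is usually set through `nvidia-settings`, which needs a running X server.

```python
def build_commands(gpu_indices, power_limit_watts):
    """Build nvidia-smi command lines for persistence mode and a power limit.

    The GPU indices are an assumption; check `nvidia-smi -L` for the real ones.
    """
    # Enable persistence mode on all GPUs.
    commands = [["nvidia-smi", "-pm", "1"]]
    # Set the same power limit (in watts) on each listed GPU.
    for idx in gpu_indices:
        commands.append(
            ["nvidia-smi", "-i", str(idx), "-pl", str(power_limit_watts)]
        )
    return commands


# Print the commands instead of running them; they normally need root.
for cmd in build_commands(range(5), 100):
    print(" ".join(cmd))
```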

I already tried:

  • Reseating the GPU
  • Setting the fan speed so that temperatures stay at 56°C max
  • Removing the driver and doing a fresh install
  • Disabling audio and setting the bus speed to 96 in the BIOS options

Can anyone help, please? I have been trying to find a solution for weeks, without success.

Best Regards,

Julian

That’s your thread:
https://devtalk.nvidia.com/default/topic/1016888/?comment=5235192
In short, check power supply, check the risers, check the board and its slots.
Some people have used aluminium foil to enhance the risers, or used better ones.
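
To narrow down which riser or slot is at fault, it can help to log each card's readings regularly so the last values before a drop are on record. Below is a rough Python sketch under a few assumptions: the function names are made up for illustration, the expected GPU count is passed in by hand, and `nvidia-smi` may also fail entirely once a GPU is lost, in which case every index is reported missing.

```python
import subprocess

# Standard nvidia-smi query fields: GPU index, PCI bus id,
# core temperature (C) and power draw (W).
QUERY_FIELDS = "index,pci.bus_id,temperature.gpu,power.draw"


def parse_gpu_rows(csv_text):
    """Parse 'csv,noheader,nounits' output into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip error messages and malformed lines
        index, bus_id, temp, power = parts
        try:
            rows.append({
                "index": int(index),
                "bus_id": bus_id,
                "temp_c": float(temp),
                "power_w": float(power),
            })
        except ValueError:
            continue  # e.g. a field reported as unsupported or unavailable
    return rows


def read_gpus():
    """Query nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_rows(result.stdout)


def missing_gpus(rows, expected_count):
    """Return the indices in 0..expected_count-1 that did not report."""
    seen = {row["index"] for row in rows}
    return [i for i in range(expected_count) if i not in seen]
```

Calling `read_gpus()` every few seconds from a loop or cron job, appending the rows to a file, and checking `missing_gpus(rows, 5)` would show the bus id of the card that disappears. Swapping that card's riser or slot and watching whether the fault follows the riser or stays with the slot is then a quick test.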