NVRM: GPU at 0000:01:00.0 has fallen off the bus

We are having this problem on both of our CUDA servers (runlevel 3, no X). After boot, it happens at random after some time:

[32575.267062] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[32576.291482] irq 16: nobody cared (try booting with the "irqpoll" option)
[32576.291485] Pid: 0, comm: kworker/0:1 Tainted: P 2.6.39-gentoo-r3 #9
[32576.291487] Call Trace:
[32576.291488] [] __report_bad_irq+0x40/0xa9
[32576.291496] [] note_interrupt+0x14b/0x1b4
[32576.291499] [] handle_irq_event_percpu+0x178/0x190
[32576.291501] [] handle_irq_event+0x2c/0x48
[32576.291504] [] handle_fasteoi_irq+0x78/0x98
[32576.291507] [] handle_irq+0x83/0x8c
[32576.291508] [] do_IRQ+0x48/0xaf
[32576.291512] [] common_interrupt+0x13/0x13
[32576.291513] [] ? sched_clock_cpu+0x46/0xd1
[32576.291519] [] ? mwait_idle+0x9f/0xc6
[32576.291521] [] ? mwait_idle+0x4c/0xc6
[32576.291523] [] cpu_idle+0x5a/0x91
[32576.291525] [] start_secondary+0x180/0x184
[32576.291527] handlers:
[32576.291527] [] (usb_hcd_irq+0x0/0x5b)
[32576.291531] [] (nv_kern_isr+0x0/0x58 [nvidia])
[32576.291668] Disabling IRQ #16

After some googling, we found a post saying that putting the NVIDIA driver in persistence mode would help
with this problem, so we ran nvidia-smi -pm 1, but nothing changed. The problem still occurs
at random times after boot, making all CUDA applications hang, including nvidia-smi.

Please help ASAP,
thank you.

We are using NVIDIA driver 275.21:
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 275.2

The same thing is happening with the latest 280.13 drivers… Anyone? We are using a flexible PCIe riser cable to fit the card in a 1U case. Could this be the source of our problems?

It seems the riser cables were the root of our problem. We switched from a 1U to a 4U case so we could avoid risers entirely, and we haven’t had any issues since. Don’t use any kind of riser; they will only give you grief.
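
For anyone who wants to catch this failure as soon as it happens rather than when applications start hanging, here is a minimal sketch that scans kernel log text for the NVRM message quoted at the top of this thread. The function name find_bus_drops and the regular expression are my own and only illustrative; the pattern assumes the message wording and the bracketed timestamp format shown in the log above, which may differ with other driver versions or kernel log settings.

```python
import re

# Matches the kernel log line quoted in this thread, e.g.
# "[32575.267062] NVRM: GPU at 0000:01:00.0 has fallen off the bus."
# Assumption: the wording and timestamp format are as shown in that log.
FALLEN_OFF_BUS = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU at (?P<pci>[0-9a-fA-F:.]+) "
    r"has fallen off the bus"
)


def find_bus_drops(log_text):
    """Return a list of (seconds_since_boot, pci_address) tuples,
    one for each matching line in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = FALLEN_OFF_BUS.search(line)
        if match:
            events.append((float(match.group("ts")), match.group("pci")))
    return events
```

Feed it the text output of dmesg (or the contents of your kernel log file) from a cron job and alert when the returned list is not empty. It only detects the event; in our case the actual fix was getting rid of the riser.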