Tesla K10 "has fallen off the bus"

Hello,

I have the following setup:

Ubuntu 12.04 64-bit 3.2.0-32-generic
Supermicro GPU SuperServer 7047GR-TPRF with latest motherboard BIOS update
Tesla K10
2x Intel Xeon E5-2650
CUDA 5.0.35 with the 310.19 driver installed after the CUDA installation. (The driver packaged with 5.0.35 only sees one of the K10's two GPUs.)

When I run code that uses the GPU heavily, it hangs after 30 minutes to 1 hour, and syslog reports something like:

Dec 3 04:25:56 test1 kernel: [243058.035134] NVRM: GPU at 0000:85:00.0 has fallen off the bus.
Dec 3 04:25:56 test1 kernel: [243058.122474] NVRM: GPU at 0000:85:00.0 has fallen off the bus.
Dec 3 04:25:56 test1 kernel: [243058.209846] NVRM: GPU at 0000:86:00.0 has fallen off the bus.
Dec 3 04:25:56 test1 kernel: [243058.279744] NVRM: GPU at 0000:86:00.0 has fallen off the bus.
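For anyone comparing logs across machines, here is a minimal Python sketch that counts these messages per PCI bus ID. It assumes the message format is exactly what the excerpt above shows; the names `count_fallen_gpus` and `scan_log` are made up for illustration, and the log path you pass in (for example /var/log/syslog on Ubuntu) is an assumption about your setup.

```python
import re
from collections import Counter

# Matches the NVRM message format shown in the syslog excerpt above.
FALLEN_RE = re.compile(r"NVRM: GPU at ([0-9a-fA-F:.]+) has fallen off the bus")


def count_fallen_gpus(lines):
    """Return a Counter mapping PCI bus ID -> number of matching messages."""
    counts = Counter()
    for line in lines:
        match = FALLEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log(path):
    """Count messages in a log file; the path is supplied by the caller."""
    with open(path, errors="replace") as log:
        return count_fallen_gpus(log)
```

Calling `scan_log("/var/log/syslog")` on an affected machine would show which bus IDs dropped off and how often, which may help tell a single bad card from a whole-chassis problem.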

So I googled for this error and found this:

I followed those instructions and verified that Persistence mode was on. That did not fix the problem, even after several retries.

The only way to restore access to the GPU after this has occurred is to reboot the machine.

Unfortunately, I cannot attach the output of nvidia-bug-report.sh at this time. But I’ll try to answer questions as best I can.

Any ideas?

Have you had any success tracking this down? I’m having the exact same problem on the same hardware.

To both posters: if you have the same hardware (SuperMicro), please get your support from the vendor. Presumably they can help debug this issue, considering you both bought expensive systems that are most likely supported by them. My only advice is what I posted here:

https://devtalk.nvidia.com/default/topic/540185/nvidia-card-has-fallen-off-the-bus/

However, in your particular case it might be hard or impossible to do those troubleshooting steps.

SuperMicro’s support is notoriously terrible. You, the customer, are almost completely on your own.

From what I’ve been able to track down, every time this problem has happened the GPU has been very warm, around 100 C. Normally we don’t see the GPUs getting that hot. Does anyone have any experience/numbers on how hot they expect the K10 to get?
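One way to check whether the failures line up with temperature is to log GPU temperatures while the job runs. Below is a rough Python sketch, assuming your driver's nvidia-smi supports the `--query-gpu` CSV interface (older drivers may not). The function names are hypothetical, and the limit passed to `hot_gpus` is whatever threshold you choose, not a K10 specification.

```python
import subprocess

# Assumes an nvidia-smi build that supports --query-gpu with CSV output.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' rows into {gpu_index: temp_celsius}."""
    temps = {}
    for row in csv_text.strip().splitlines():
        fields = [field.strip() for field in row.split(",")]
        # Skip rows that are malformed or report a non-numeric temperature.
        if len(fields) != 2 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        temps[int(fields[0])] = int(fields[1])
    return temps


def hot_gpus(temps, limit_c):
    """Return the sorted GPU indices at or above limit_c (caller-chosen)."""
    return sorted(index for index, temp in temps.items() if temp >= limit_c)


def read_temperatures():
    """Run nvidia-smi once and return the parsed temperatures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

Running `read_temperatures()` every few seconds from a loop or cron job and writing the result to a file with a timestamp would give you a temperature trace to compare against the syslog timestamps.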

@ekimd

Warm is an understatement. The 100 C mark seems excessive and is most likely the cause of the problem. I’ve been careful not to let my GTX Titan go higher than 90 C when testing different BIOSes for overclocking purposes. You might want to look into a way of better cooling the system. Looking at the 7047GR-TPRF case, it looks fairly large with decent airflow.

Is this an actively or passively cooled Tesla K10? If it’s passive, it shouldn’t be in a system like this without some way of generating enough airflow in the proper direction.

If it’s an actively cooled one, you might want to consider something like a PCI slot cooler if there is space for one on your setup and/or more powerful fans/better airflow for the case. Google has a few pictures of a few different PCI slot coolers, so you should be able to find one that works with your configuration if you want to go that route.

@ekimd

In my original post, the solution was to use the external PCI fans that come with that system. I didn’t have those properly installed at first. After doing so, all was well and that system has since been upgraded to 4x K10s without issues. I have recently, however, had two other systems with the same configuration start having this problem. It always seems thermal in nature. This time though, all the fans are installed properly AFAIK. :(