GPUs (K40c) freeze randomly during compute jobs

I have been running computation jobs on our Fedora server, which is fitted with four K40c GPUs; the job is iterative. It ran without problems before, but today it caused all four GPUs to freeze completely: nvidia-smi hangs as well, and top shows no running GPU jobs. I encountered this problem once before, and now exactly the same thing is happening again. The job I ran last time was similar to, but not exactly the same as, this one. Rebooting the server got everything working normally again, but the problem is like a hidden detonator. Can anyone offer some helpful comments? I tried to attach the nvidia-bug-report log file, but it's too large…

nvidia-bug-report.log (1.25 MB)

I am not going to dig through the log, but here are some generic steps I would take if this were my system.

(1) Is the power supply in the system capable of driving the main board and four K40c cards with about 25% headroom? As a guess, your system would probably want a 1500W power supply. My recommendation is always to use a high-quality PSU, e.g. one that is 80 Plus Gold/Platinum rated.

(2) Are all the GPUs properly seated in their slots? Boards can flex a bit under insertion pressure, preventing the PCIe connector from making full contact. Are all GPU power supply cables correctly hooked up, and selected such that the load from the GPUs is distributed evenly across the available cable groups of the PSU?

(3) Use nvidia-smi to continuously monitor the system under full load. Are GPU power consumption and temperature in the expected range? The K40c is an actively cooled part that requires proper airflow in the case; otherwise some GPUs will draw in air pre-heated by other components, including neighboring GPUs.
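For example, a minimal monitoring loop using nvidia-smi's query interface (field names as listed by `nvidia-smi --help-query-gpu`; assumes a reasonably recent driver) might look like this:

```shell
# Print index, power draw, temperature, and utilization for all GPUs
# in CSV form, refreshing once per second (-l 1). Stop with Ctrl-C.
nvidia-smi --query-gpu=index,power.draw,temperature.gpu,utilization.gpu \
           --format=csv -l 1
```

Watching this while the job runs at full load makes it easy to spot a card that runs hotter or draws more power than its neighbors.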

(4) Make sure you have the latest system BIOS for your system installed. High-quality products usually offer free system BIOS updates for at least the first three years after purchase.

(5) Make sure you use NVIDIA’s latest released (non-beta) driver package for your platform.

Thank you for your suggestions! I'm contacting Microway to see whether they think it's a hardware issue.

So this seems to be a vendor-configured machine, rather than a home-built server, making problems with the PSU or power cabling, as well as PCIe connectivity issues, seem unlikely. I would check the software installation including the system BIOS before calling the hardware vendor, and also make sure the application itself is sound, by running it under control of cuda-memcheck if possible.
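For instance, a typical cuda-memcheck invocation (with `./your_app` standing in as a placeholder for the actual binary and its arguments) would be:

```shell
# Run the application under cuda-memcheck; --leak-check full additionally
# reports device memory leaks at exit. "./your_app" is a placeholder name.
cuda-memcheck --leak-check full ./your_app
```

Note that the app runs considerably slower under cuda-memcheck, so this is best tried on a reduced problem size first.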

Another idea: does your system work when running with just one GPU? If so, try each of the four GPUs in turn to see whether a specific GPU is causing the issues. If one GPU does seem to be the cause, swap it with one of the other GPUs. If the problem "follows" that card, it is probably a bad GPU. If the problem instead now occurs with the new GPU in the "bad slot", it might be a motherboard hardware problem or a BIOS problem. (Many motherboards have BIOS issues with PCIe address-space allocation for four or more GPUs.)

If each of the four cards works fine on its own, the cause could still be almost anything, but at least you have a clue.
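One way to script the per-GPU test is with the CUDA_VISIBLE_DEVICES environment variable, which restricts which physical GPU the CUDA runtime can see (here `./your_app` is again a placeholder for the actual job):

```shell
# Run the job once per GPU; CUDA_VISIBLE_DEVICES limits the CUDA runtime
# to a single physical device, so each card is exercised in isolation.
for id in 0 1 2 3; do
    echo "=== testing GPU $id ==="
    CUDA_VISIBLE_DEVICES=$id ./your_app || echo "run on GPU $id failed"
done
```

If only one iteration of the loop ever hangs, that points at a specific card or slot.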

Thank you for your feedback! The problem seems to be pretty random: the same program runs smoothly for a while until it breaks down unexpectedly… I don't even know when I'll be able to reproduce the problem, but I'll try monitoring it with cuda-memcheck until the problem arises again.

The difficult part of testing this is that the problem occurs randomly. I'm still not able to reproduce it, so I don't know how long it will take for the problem to pop up again…

One thing you might want to do is to log system parameters (power consumption, temperature, memory use, tasks running, etc.) continuously. When the next crash occurs, you can then inspect the recorded information to check whether anything unusual happened just before the crash.
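A simple sketch of such a logger, using nvidia-smi snapshots and a hypothetical log path, could run in the background via nohup or a systemd unit:

```shell
# Append a timestamped nvidia-smi snapshot to a log file every 10 seconds,
# so the last entries before a freeze can be inspected afterwards.
# The log path is just an example; choose one that survives a reboot.
LOG=/var/tmp/gpu-monitor.log
while true; do
    date '+%Y-%m-%d %H:%M:%S' >> "$LOG"
    nvidia-smi --query-gpu=index,power.draw,temperature.gpu,memory.used \
               --format=csv,noheader >> "$LOG"
    sleep 10
done
```

Since nvidia-smi itself hangs when the GPUs freeze, the last successfully written entries mark roughly when the freeze happened.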

Rarely occurring random crashes are every engineer's nightmare. One typical approach is to attempt to increase the frequency of the crashes to allow more experiments in a given time period. For example, it might turn out that particular parts of an app are much more likely to cause the crash, or that processing particular data sets triggers it more frequently. As you may imagine, such a debug process can go on for weeks, and may involve an entire team of engineers brainstorming and running experiments, worst case 24/7. Been there, done that, got the t-shirt :-)