GPU is lost during execution of either Tensorflow or Theano code

When training either of two different neural networks, one with Tensorflow and the other with Theano, the execution sometimes freezes after a random amount of time (anywhere from a few minutes to a few hours, mostly a few hours), and running “nvidia-smi” then returns this message:
“Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU”

I monitored the GPU during a 13-hour run, and everything seems stable - see graph attached (or here https://unsee.cc/pusogiba/)
Also, this behavior repeats on another GPU on the same machine.

I’m working with:

  • Ubuntu 14.04.5 LTS
  • GPUs are Titan Xp
  • CUDA 8.0
  • CuDNN 5.1
  • Tensorflow 1.3
  • Theano 0.8.2

I’m not sure how to approach this problem, can anyone please suggest ideas of what can cause this and how to diagnose/fix this?
Many thanks!

gpu02.png

What’s the system platform? Your graph tells us cooling is adequate, how about power supply? What’s the rated output of the PSU, and how many GPUs are in this system? The description suggests there could be brownouts under load. The GPU draws power from the PCIe slot, as well as from a 6-pin and an 8-pin PCIe power connector. Are these plugged in properly?

Thanks for the quick reply.

The system is Ubuntu 14.04, there are 4 GPU devices, but I’m limiting every execution to a single device (with CUDA_VISIBLE_DEVICES).

I’m not sure how to monitor the power supply, what could be a simple way of doing this?
(for generating this graph I followed this guide - http://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries)

When I check with “nvidia-smi” during execution I see about 50W-70W usage out of 250W.

As this is a remote server, I’ll need to check the connectors and get back to you.

What is of interest is the maximum power consumption when GPU usage of the app is most intense. Loop the nvidia-smi output and record the highest value reported.
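One way to loop nvidia-smi and keep the highest reading is sketched below. This is only a rough sketch: the function names are made up for illustration, it assumes Python 3.7+ and a driver that reports `power.draw`, and polling once per second will not catch millisecond-scale spikes.

```python
import subprocess
import time


def parse_power_draw(output):
    """Parse `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits` output.

    Returns one float (watts) per line; entries that are not numbers
    (for example "[Not Supported]") are skipped.
    """
    readings = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            readings.append(float(line))
        except ValueError:
            continue
    return readings


def read_power_draw(gpu_index=0):
    """Ask nvidia-smi for the current power draw of one GPU."""
    cmd = [
        "nvidia-smi",
        "-i", str(gpu_index),
        "--query-gpu=power.draw",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_power_draw(result.stdout)


def track_peak_power(duration_s=60.0, interval_s=1.0, gpu_index=0,
                     reader=read_power_draw):
    """Poll the reader until duration_s has elapsed; return the highest value seen.

    Returns None if no numeric reading was obtained.
    """
    peak = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for watts in reader(gpu_index):
            if peak is None or watts > peak:
                peak = watts
        time.sleep(interval_s)
    return peak
```

Start the training job, then call something like `track_peak_power(duration_s=3600, gpu_index=0)` in a separate process and compare the result against the board's power limit.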

If you are always just using a single GPU out of four, it seems very unlikely that you are overloading the PSU. But it is still possible that the GPUs are not properly hooked up to the PSU. Make sure the 6-pin and 8-pin connectors are pushed all the way in (there should be a tab on the connector that engages when that is the case). There should be no 6-pin to 8-pin converters or Y-splitters used in the power cables.

Your problem may also be mechanical, leading to unreliable signalling across the PCIe interconnect: GPUs may not be properly seated in their slots, may not be properly secured inside the case (bracket screwed down etc), or may be subject to vibration from either inside or outside the case. There may be cracked traces on the motherboard.

Any possibility of electromagnetic interference? Unlikely, but make sure that the machine isn’t placed in close proximity to large electric motors, medical imaging equipment, or radiation sources.

The error message suggests that the GPU stopped responding to PCIe commands sent from the host system controller, so the flakiness you see is likely something hardware related.

I found these attached log records from the exact time the GPU was lost.
Does this add more information?

What kind of computer are the 4 Titan Xp plugged into? If it is a server, which server: what is the manufacturer and model number?
What GPU driver are you using?
What is the size of your power supply (i.e. what is the wattage rating)?

For heavy usage of 4 Titan Xp, I would recommend a large system power supply on the order of 1600W or more.

To rule out power as an issue, and especially since you are already restricting job usage to a single GPU, you could remove 3 of the Titan Xp and then run your test with just a single GPU. If there are no issues, add another GPU and repeat, then add another GPU and repeat, etc.

As noted in #2 and #6, it would be good to know the host system (vendor, SKU) and the power rating of the PSU. The log in #5 just confirms that the GPU stopped responding to PCIe commands sent by the host controller.

Other than an under-dimensioned PSU, the problem might also be that the electric input to the PSU is extremely noisy (e.g. spikes, brownouts) and that the PSU is not of sufficient quality to filter out all that noise, negatively impacting system reliability.

I would recommend an 80 PLUS Platinum rated PSU. Not only are they very efficient, they tend to use much more robust designs than cheap garden-variety PSUs and their individual components are usually of higher quality.

It’s of course also possible that components on the motherboard or on one of the GPUs are damaged (e.g. through handling of hardware without proper ESD precautions). The test suggested by txbob, systematically cycling GPUs through the PCIe slots, should indicate whether the issue is correlated with a particular GPU or a particular slot.

I’ll do the power checks once I have access to the server (I’m working remotely).

txbob, about the details you asked for:

Machine: System: ASUS product: All Series
Mobo: ASUSTeK model: RAMPAGE V EXTREME version: Rev 1.xx Bios: American Megatrends version: 3301 date: 06/28/2016:

Driver Version: 375.66

Examining Syslog further shows that the following messages appear about 10 times a second:

[53810.323904] pcieport 0000:00:03.0: AER: Corrected error received: id=0018
[53810.323915] pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018(Receiver ID)
[53810.323919] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00000040/00002000
[53810.323922] pcieport 0000:00:03.0: [ 6] Bad TLP

According to the link below, this is related to PCIe Active State Power Management, which keeps the link in a lower power state. It is suggested there to set the boot parameter “pcie_aspm=off” to solve this. Do you think that could help?
https://askubuntu.com/questions/863150/pcie-bus-error-severity-corrected-type-physical-layer-id-00e5receiver-id

Many thanks for the help!
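To judge whether a change such as “pcie_aspm=off” makes any difference, it may help to count the corrected errors per PCIe port before and after. The sketch below does that for lines in the format pasted above; the regular expression and the function name are illustrative assumptions, and other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# [53810.323915] pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected,
#   type=Data Link Layer, id=0018(Receiver ID)
AER_LINE = re.compile(
    r"pcieport (?P<port>[0-9a-fA-F:.]+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<type>[^,]+),"
)


def count_aer_errors(lines):
    """Count PCIe bus error reports, keyed by (port, severity, type)."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            key = (match.group("port"),
                   match.group("severity"),
                   match.group("type"))
            counts[key] += 1
    return counts
```

For example, `count_aer_errors(open("/var/log/syslog", errors="replace"))` would tally the reports in the system log, assuming that is where your distribution writes kernel messages. After rebooting, `/proc/cmdline` shows whether the boot parameter was actually applied.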

Disabling PCIe power management on the host seems worth a try. GPUs have their own power management for the PCIe link, so I doubt there is any significant difference in overall power consumption from turning off the host’s management.

No idea whether this is an option for you, but you may also want to consider updating the system BIOS. The most recent version I could find for RAMPAGE V EXTREME is BIOS 3701 from June 2017.

I don’t have recent experience with Asus, but for the HP and Dell systems I have used over the past ten years I have always kept the SBIOS up to date, with no ill effect.

I am facing a similar problem, and I want to double check it is really a power issue before investing in a new PSU. My complete build can be found here: https://pcpartpicker.com/user/batchnormalized/saved/wkjW3C. I am using:

  • Linux Mint 19.2
  • Tensorflow 1.14
  • CUDA version 10.2
  • Driver version: 440.59
  • CuDNN version: 7.6.0
  • Keras version: 2.3.0
  • A single GeForce RTX 2080 Ti

My exact issue is that while training Alexnet in Keras with Tensorflow the screen will suddenly go black, the fans will turn up high, and then I am unable to interact with the computer. The only thing I can do is a hardware reset. I wrote some scripts using nvidia-smi, powerstat, and sensors to capture temperature and power consumption readings every second from the CPU and GPU. I also captured kernel and syslogs at the time of the failure. Those can be found all here, along with an HTML file visualizing the temperature and power data: https://drive.google.com/open?id=1ms4FgJZUrn2TCg7vXKeJIYxlC2ZJO3Df.

Sometimes the issue above will happen, and other times the computer will crash and reboot on its own, either mid-training or right after being rebooted manually following the aforementioned failure.

It does not appear from the GPU logs that either the temperature or the power is growing out of control. This makes the problem all the more confusing: if the power showed clear spikes, it would be obvious that the problem was PSU related. I am currently using an EVGA SuperNOVA G3 1000 W 80+ Gold Certified Fully Modular ATX Power Supply in my build, which according to the build creation website should be more than enough to handle the power draw of all the components. If anyone has suggestions about how to narrow this down further to a power issue, that would be great. Thank you!

I’ll try checking the power pins and PCI-e connection in the meantime since I have the exact same error logged as that posted by OP. I already tried updating the BIOS and the Nvidia drivers and that did not solve the issue.

It is difficult to diagnose potential hardware issues over the internet. The success rate in these forums is only about 20-25%. I looked at a few logs and did not spot any red flags. The GPU does not seem to have been particularly heavily loaded at the point where the system went belly up.

The wattage of your PSU is not a problem. Your system specs suggest total power of around 400W, and your 1000W PSU should be good for a system load of 600W even when using a very conservative approach. This approach already incorporates head room for power spikes in the millisecond range. EVGA SuperNOVA is known to me as a quality product line, so as long as the part is genuine and not counterfeit I have no reason to suspect that there is something inherently wrong with the PSU.
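As a rough worked example of that arithmetic: the 0.6 derating factor below is only inferred from the 1000 W and 600 W figures in this post, not an official rule, and the function name is made up for illustration.

```python
def psu_headroom(rated_watts, estimated_load_watts, derate=0.6):
    """Return (usable budget, remaining margin) in watts.

    derate is the fraction of the PSU rating treated as safe sustained load;
    0.6 is an assumption taken from the 1000 W -> 600 W figures in the post.
    """
    budget = rated_watts * derate
    margin = budget - estimated_load_watts
    return budget, margin


# Figures from this thread: 1000 W PSU, roughly 400 W estimated system load.
budget, margin = psu_headroom(1000, 400)
print(f"budget: {budget:.0f} W, margin: {margin:.0f} W")
```

A positive margin means the estimated load fits within the conservative budget; a negative one would suggest the PSU is under-dimensioned.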

If power supply and cooling for both CPU and GPU are in good working order (and it seems you already double checked on those), there is the possibility that system components were incorrectly assembled, damaged during assembly, or were defective to begin with.

If this system is self-assembled, it might be a good idea to check all connectors (power cables, PCIe slots, DRAM slots, CPU socket) to make sure they are clean and have proper contact. Make sure the GPU is mechanically supported, i.e. secured at the bracket.

Double check system BIOS settings including DRAM timing. Unless you did something special, DRAM timings should default to the configuration spec’ed by the DRAM sticks themselves.

Thanks for getting back to me so quickly! The system is in fact self-assembled, and I had some trouble fitting the PSU into the case, so I suspect it may be a connection issue. I’m about to open up the computer right now and try unplugging and replugging all of the power cables and reseating the GPU in its PCI-e slot. I’ll let you all know if that worked.

It looks like it worked! All I did was disconnect and reconnect the GPU entirely to make sure it was properly seated in the PCI-e slot and that the power cables were properly connected to it. I also checked the power connections to the motherboard and other components just in case. I’ve since run probably over 60 epochs of training without any crashes or black screens. Funny that something so seemingly simple could cause such a big problem. Thanks for all the help! I had been stuck on this issue for a couple of weeks.