This is my first post in Devtalk. I am using a GTX 1080 Ti Founders Edition for deep learning. I had been training a model for the past week, and a couple of days ago the GPU shut down with this message:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU
After that message, whenever I try to train on the GPU, I see that the power draw is throttled to about 70 W, even with 8 GB of 11 GB memory in use and 100% utilization (it used to be about 200 W before that error).
Also, the GPU clock is throttled to 139MHz. I ran the phoronix-test-suite benchmark and got a score of 24, compared with the roughly 120 that is typical for a 1080 GPU.
The first things you want to check are power supply (PSU, cabling) and cooling (does the GPU’s fan spin up, is the airflow to the GPU fan unobstructed?). Also, check whether the GPU is still firmly seated in the PCIe slot (it should be mechanically secured at the bracket).
What’s the power rating of the PSU? Is there more than one GPU in this system?
The most likely cause of what you are seeing is insufficient power supply. Maybe a power cable for the GPU got disconnected. This shouldn’t happen if the connector on the cable is properly engaged with the tab of the connector on the GPU. Less likely is that there is permanent damage to the PSU, possibly caused by continuous very high load or quality issues. Ideally, the total nominal power consumption of all system components would not exceed 60% of the rated wattage of the PSU.
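To make the 60% rule of thumb concrete, here is a minimal sketch of the arithmetic. The component wattages are placeholders I made up for illustration (only the 250 W board power of a GTX 1080 Ti is a published figure), and `psu_headroom` is a hypothetical helper name, so substitute the nominal numbers for your own parts.

```python
def psu_headroom(psu_rated_watts, component_watts, max_load_fraction=0.60):
    """Apply the 60% rule of thumb.

    Returns (total_watts, budget_watts, within_budget), where budget_watts is
    the share of the PSU rating that the nominal load should stay under.
    """
    total = sum(component_watts.values())
    budget = psu_rated_watts * max_load_fraction
    return total, budget, total <= budget


# Hypothetical system: the CPU and "everything else" figures are assumptions.
components = {"gpu": 250, "cpu": 140, "board_drives_fans": 60}
total, budget, ok = psu_headroom(800, components)
print(f"nominal load {total} W, budget {budget:.0f} W, within budget: {ok}")
```

If the nominal load comes out above the budget, that does not mean the PSU will fail, only that it has less margin than I would like for continuous heavy load.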
In my experience, in general, PSUs are the system components most likely to fail, followed by DRAM. I recommend the use of 80 PLUS Platinum rated PSUs for workstations, and 80 PLUS Titanium rated PSUs for servers, as these are usually built from higher-quality components, in addition to being very energy efficient.
An 800W PSU for a system with a single GPU of this type should be fine. Examine the power cabling carefully. There should be no Y-splitter or 6-pin to 8-pin converter. Unplug and re-connect the power cables. You should hear a click as the connectors engage fully. Is there any visible damage to the metal parts of the connectors? Is there worn insulation? I am not sure how an end-user would check a PSU for proper operation, other than trying a different PSU of the same wattage to see whether this helps.
Is this machine located in an office or computer room, or a more adverse environment, e.g. high humidity, extreme altitude, vibrations (e.g. ship, factory floor), or near large electric machinery?
Check/monitor GPU temperature with nvidia-smi when you have the load on it. The bug log you attached is useless for this inquiry, as it captures the state where the GPU has already fallen off the bus, so the nvidia-smi query in it is just reporting that.
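One way to capture the state while the load is running (rather than after the GPU is lost) is to poll nvidia-smi in CSV mode from a small script. This is only a sketch: the choice of fields, the 5-second interval, and the helper names `parse_sample`, `read_gpu`, and `monitor` are my own assumptions, not something prescribed by the tool.

```python
import subprocess
import time

# Query fields understood by `nvidia-smi --query-gpu=...`.
FIELDS = ["temperature.gpu", "power.draw", "clocks.sm",
          "pstate", "fan.speed", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line (noheader, nounits) into a field -> string dict."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_gpu(index=0):
    """Query one GPU once; raises if nvidia-smi fails (e.g. GPU is lost)."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={index}",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def monitor(samples=12, interval_s=5.0, index=0):
    """Print a timestamped sample every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_gpu(index))
        time.sleep(interval_s)
```

Start the training job first, then call `monitor()` in a second terminal and post the output. The last few samples before a failure are the interesting ones.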
Please remember to also examine the GPU temperature as suggested by txbob. I don’t know what exactly to expect for a GTX 1080 Ti, but up to 80 deg C should still be normal. In general, you would want “GPU Current Temp” sufficiently below “GPU Slowdown Temp”.
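If you want the margin between those two values as a number, something along these lines should work on the text printed by `nvidia-smi -q -d TEMPERATURE`. I am assuming the fields appear as `GPU Current Temp : NN C` and `GPU Slowdown Temp : NN C`; the exact layout can differ between driver versions, and `thermal_headroom` and `read_report` are hypothetical helper names.

```python
import re
import subprocess


def thermal_headroom(report):
    """Return (current, slowdown, headroom) in deg C.

    Returns None if either field is missing or not numeric (e.g. "N/A").
    """
    def field(name):
        m = re.search(rf"{re.escape(name)}\s*:\s*(\d+)\s*C", report)
        return int(m.group(1)) if m else None

    current = field("GPU Current Temp")
    slowdown = field("GPU Slowdown Temp")
    if current is None or slowdown is None:
        return None
    return current, slowdown, slowdown - current


def read_report():
    """Fetch the temperature section of the full nvidia-smi query."""
    return subprocess.run(
        ["nvidia-smi", "-q", "-d", "TEMPERATURE"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `thermal_headroom(read_report())` while the GPU is under load gives the margin directly. A headroom that shrinks toward zero would point at cooling rather than power.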
When the GPU is running flat out it is in performance state P0. P2 is the highest (?) power saving state. There are also even lower power-saving states, such as P8 and P12. I think P2 is used when neither compute nor 3D-graphics tasks are running on the GPU, and it drives only the operating system’s GUI.
The GPU gets power through the PCIe slot (up to 75W are allowed by the spec, although with most NVIDIA GPUs it is just 40W to 50W) and the rest is supplied via the PCIe power cables (6-pin: up to 75W; 8-pin: up to 150W).
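As a quick illustration of how those limits add up, the sketch below sums the spec maximums for a given set of auxiliary connectors. I am assuming the Founders Edition card uses one 6-pin and one 8-pin connector (please check your card), and `max_board_power` is a hypothetical helper name.

```python
SLOT_W = 75                                # PCIe slot, spec maximum
CONNECTOR_W = {"6-pin": 75, "8-pin": 150}  # auxiliary PCIe power cables


def max_board_power(connectors):
    """Upper bound on board power for the given working aux connectors."""
    return SLOT_W + sum(CONNECTOR_W[c] for c in connectors)


print(max_board_power(["6-pin", "8-pin"]))  # 300: slot plus both cables
print(max_board_power([]))                  # 75: slot only, no cable power
```

With no auxiliary cable delivering power, the ceiling is the 75 W slot limit, which is in the neighborhood of the roughly 70 W reported at the top of this thread.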
The 23% fan speed seems consistent with the relatively small amount of power dissipated in P2 state, and it would seem to indicate that the fan is working and regulated properly according to power consumption. The fact that power consumption is limited to about 70W suggests to me that the power supply via the PCIe power cable might be missing, for whatever reason, causing the GPU to be limited to the power supplied via the PCIe slot (in which, as OP confirmed, the card is firmly seated).
It is very difficult to diagnose such issues remotely without access to the machine. Our success rate resolving such issues in these forums is only about 20%. It’s possible that some hardware defect has developed in the PSU or the GPU itself, but I wouldn’t know how to drill down on that remotely. You may want to engage the help of a knowledgeable local person who has physical access to the machine.