1080GTX Ti GPU clock and power drawn is throttled all of a sudden!

sachinpuranik2007 · June 28, 2018, 6:37pm

Hello All,

This is my first post on Devtalk. I am using a GTX 1080 Ti Founders Edition for deep learning. I had been training a model for the past week, and a couple of days ago the GPU shut down with this message:

Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

Since that message, whenever I try to train on the GPU, the power draw is throttled to about 70 W, even with 8 GB of the 11 GB of memory in use and 100% utilization. (It used to be about 200 W before that error.)

Also, the GPU clock is throttled to 139 MHz. I ran the phoronix-test-suite benchmark and got a score of 24, compared with about 120 expected for a typical 1080 GPU.

I also ran the bug report script. Here is the output.
https://drive.google.com/open?id=0B2v3VYhjV4-3R1pMWm1kcVFjSU9mVjhSR0NSVjJPSXpZUnFr

I am not sure what's going wrong or how to debug this issue. Any leads would be appreciated.

Thanks in advance.
nvidia-bug-report.log.gz (304 KB)

njuffa · June 28, 2018, 6:59pm

The first things you want to check are the power supply (PSU, cabling) and cooling (does the GPU’s fan spin up, is the airflow to the GPU fan unobstructed?). Also, check whether the GPU is still firmly seated in the PCIe slot (it should be mechanically secured at the bracket).

What’s the power rating of the PSU? Is there more than one GPU in this system?

The most likely cause of what you are seeing is insufficient power supply. Maybe a power cable for the GPU got disconnected. This shouldn’t happen if the connector on the cable is properly engaged with the tab of the connector on the GPU. Less likely is that there is permanent damage to the PSU, possibly caused by continuous very high load or quality issues. Ideally, the total nominal power consumption of all system components would not exceed 60% of the rated wattage of the PSU.

In my experience, in general, PSUs are the system components most likely to fail, followed by DRAM. I recommend the use of 80 PLUS Platinum rated PSUs for workstations, and 80 PLUS Titanium rated PSUs for servers, as these are usually built from higher-quality components, in addition to being very energy efficient.
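As a rough illustration of the 60% guideline above, the check is simple arithmetic. The sketch below is only that; the function names and the component wattages in the usage note are hypothetical and not taken from OP's system.

```python
def recommended_max_load(psu_rated_watts, max_fraction=0.6):
    """Largest nominal system load that keeps the PSU at or below
    max_fraction of its rated wattage (0.6 is the guideline above)."""
    return psu_rated_watts * max_fraction


def psu_is_adequate(psu_rated_watts, component_watts, max_fraction=0.6):
    """True if the summed nominal power of all components fits under
    the recommended maximum load for this PSU."""
    total = sum(component_watts)
    return total <= recommended_max_load(psu_rated_watts, max_fraction)
```

For example, `psu_is_adequate(800, [250, 95, 75])` checks a hypothetical 800 W PSU against an assumed 250 W GPU, 95 W CPU, and 75 W for everything else. Substitute the nominal figures from your own components' data sheets.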

sachinpuranik2007 · June 28, 2018, 7:07pm

Yes, I can see that the GPU’s fan is spinning and the GPU is firmly seated. The cabinet is in an open space. I have just one GPU in the system, and the PSU is rated at 800W.

njuffa · June 28, 2018, 7:13pm

An 800W PSU for a system with a single GPU of this type should be fine. Examine the power cabling carefully. There should be no Y-splitter or 6-pin to 8-pin converter. Unplug and re-connect the power cables. You should hear a click as the connectors engage fully. Is there any visible damage to the metal parts of the connectors? Is there worn insulation? I am not sure how an end-user would check a PSU for proper operation, other than trying a different PSU of the same wattage to see whether this helps.

Is this machine located in an office or computer room, or a more adverse environment, e.g. high humidity, extreme altitude, vibrations (e.g. ship, factory floor), or near large electric machinery?

Robert_Crovella · June 28, 2018, 7:27pm

Check/monitor the GPU temperature with nvidia-smi while the load is on it. The bug log you attached is useless for this inquiry because it captures the state where the GPU has already fallen off the bus, so the nvidia-smi query in it just reports that.
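One way to watch these values while the training job runs is to poll `nvidia-smi --query-gpu` and parse its CSV output. The following is a minimal sketch, not a definitive tool: the helper names and the two thresholds in `looks_throttled` are my own assumptions, and the query fields are the ones I expect a 2018-era driver to support.

```python
import subprocess

# Query fields for `nvidia-smi --query-gpu`; see `nvidia-smi --help-query-gpu`.
FIELDS = ["temperature.gpu", "power.draw", "clocks.sm",
          "pstate", "fan.speed", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_gpu(index=0):
    """Run nvidia-smi once for the given GPU index and return a parsed sample."""
    cmd = ["nvidia-smi", f"--id={index}",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.strip().splitlines()[0])


def looks_throttled(sample, min_clock_mhz=1000.0, busy_pct=90.0):
    """Heuristic (thresholds are assumptions): the GPU is busy but the SM
    clock is far below what a loaded card should show."""
    try:
        clock = float(sample["clocks.sm"])
        util = float(sample["utilization.gpu"])
    except (KeyError, ValueError):
        return False  # field missing or reported as not supported
    return util >= busy_pct and clock < min_clock_mhz
```

Calling `read_gpu()` in a loop with `time.sleep(5)` between samples, while the training job is active, gives a log of temperature, power draw, and clock that is more useful than a snapshot taken after the GPU is lost.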

njuffa · June 28, 2018, 7:54pm

While you are at it (looking at the output of nvidia-smi -q, I mean), check “Fan Speed”, “Performance State”, and “Clocks Throttle Reasons”.
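If you would rather pull those three items out of the full `nvidia-smi -q` dump programmatically, something like the sketch below could work. It assumes the usual indented `Key : Value` layout of that output, with the throttle reasons listed as indented lines under a `Clocks Throttle Reasons` header; `parse_query` and `active_reasons` are hypothetical helper names, not part of any NVIDIA tool.

```python
def parse_query(text):
    """Extract Fan Speed, Performance State, and the Clocks Throttle Reasons
    block from the text printed by `nvidia-smi -q`.

    Returns (fields, reasons): two dicts of stripped strings."""
    wanted = {"Fan Speed", "Performance State"}
    fields, reasons = {}, {}
    section_indent = None  # indent of the throttle-reasons header, if inside it
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        key, sep, value = raw.strip().partition(":")
        key, value = key.strip(), value.strip()
        if section_indent is not None:
            if indent > section_indent:
                if sep:  # an indented "Reason : Active/Not Active" line
                    reasons[key] = value
                continue
            section_indent = None  # dedent: the section has ended
        if key == "Clocks Throttle Reasons" and not sep:
            section_indent = indent
        elif key in wanted and sep:
            fields.setdefault(key, value)
    return fields, reasons


def active_reasons(reasons):
    """Names of the throttle reasons currently reported as Active."""
    return [name for name, state in reasons.items() if state == "Active"]
```

Feed it the captured output, e.g. `parse_query(subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout)`, and look at `active_reasons(...)` while the load is running.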

sachinpuranik2007 · June 28, 2018, 9:03pm

Fan Speed : 23 %
Performance State : P2
Clocks Throttle Reasons
Applications Clocks Setting : Not Active
Clocks
Applications Clocks
Default Applications Clocks
Max Clocks
Max Customer Boost Clocks
Clock Policy

I don’t know what P2 indicates.

sachinpuranik2007 · June 28, 2018, 9:07pm

Please find the updated log attached.
nvidia-bug-report.log.gz (165 KB)

sachinpuranik2007 · June 28, 2018, 9:09pm

I will check this and get back to you.

njuffa · June 28, 2018, 9:25pm

Please remember to also examine the GPU temperature as suggested by txbob (Robert_Crovella). I don’t know what exactly to expect for a GTX 1080 Ti, but up to 80 deg C should still be normal. In general, you would want “GPU Current Temp” sufficiently below “GPU Slowdown Temp”.

When the GPU is running flat out it is in performance state P0. P2 is the highest (?) power saving state. There are also even lower power-saving states, such as P8 and P12. I think P2 is used when neither compute nor 3D-graphics tasks are running on the GPU, and it drives only the operating system’s GUI.

The GPU gets power through the PCIe slot (up to 75W are allowed by the spec, although with most NVIDIA GPUs it is just 40W to 50W) and the rest is supplied via the PCIe power cables (6-pin: up to 75W; 8-pin: up to 150W).

The 23% fan speed seems consistent with the relatively small amount of power dissipated in the P2 state, and it would seem to indicate that the fan is working and regulated properly according to power consumption. The fact that power consumption is limited to about 70W suggests to me that the power supply via the PCIe power cables might be missing, for whatever reason, causing the GPU to be limited to the power supplied via the PCIe slot (in which, as OP confirmed, the card is firmly seated).
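To make the arithmetic behind that reasoning explicit, here is a small sketch using the per-source limits quoted above (slot: 75W, 6-pin: 75W, 8-pin: 150W). The function names are hypothetical, and the connector layout in the usage note (one 6-pin plus one 8-pin on the Founders Edition card) is my assumption; check it against the actual card.

```python
# Upper limits per power source, as quoted above.
SLOT_W = 75        # PCIe slot
SIX_PIN_W = 75     # each 6-pin PCIe power cable
EIGHT_PIN_W = 150  # each 8-pin PCIe power cable


def max_board_power(six_pin=0, eight_pin=0, slot=True):
    """Upper bound on the power a card can draw from the given sources."""
    total = SLOT_W if slot else 0
    return total + six_pin * SIX_PIN_W + eight_pin * EIGHT_PIN_W


def consistent_with_slot_only(observed_watts):
    """True if the observed draw fits within what the slot alone can supply."""
    return observed_watts <= max_board_power()
```

With the assumed connectors, `max_board_power(six_pin=1, eight_pin=1)` gives 300, while `max_board_power()` gives 75 for the slot alone. The roughly 70W that OP observes passes `consistent_with_slot_only(70)`, whereas the earlier draw of about 200W would not, which is why missing auxiliary power is a plausible explanation.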

It is very difficult to diagnose such issues remotely without access to the machine. Our success rate resolving such issues in these forums is only about 20%. It’s possible that some hardware defect has developed in the PSU or the GPU itself, but I wouldn’t know how to drill down on that remotely. You may want to engage the help of a knowledgeable local person who has physical access to the machine.

njuffa · June 29, 2018, 6:33am

My knowledge about performance states may be outdated. I just noticed the following statement by user “generix” in another thread:

So the GPU staying in P2 state when running CUDA kernels may be expected, but the throttling to the low clock rate reported by OP definitely is not.

vacaloca · June 30, 2018, 3:03am

OP, if you are able to test the GPU in a different system and the behavior follows the GPU, that would point to the GPU being defective.

spudz76 · July 31, 2018, 3:10pm

You can also use https://github.com/DeadManWalkingTO/NVidiaProfileInspectorDmW

And shut off the ridiculous Force P2 garbage so it runs P0 all the time like a normal GPU.
