GPU throttling at low temperature

I have a 3090 that is throttling the graphics cores and I’m trying to figure out if the card is bad or if it’s something else.

The nvtop screenshot shows that GPU0's cores are running at only 28MHz under load (100% usage) at a low temp of 63C. Memory is at 9501MHz, which I think is fine. The card is only drawing 127 watts.

I also tried the Python wrapper for NVML to read the throttle reasons. I check the driver version, get a handle to device 0, and then read the throttle reason bitmask. It comes back as 0x68, which decomposes into the three reasons (0x20, 0x40, and 0x8) quoted below. But since the temp is low (63C), I don't see why it should be throttling.

>>> from pynvml import *
>>> nvmlInit()
>>> print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
Driver Version: 550.120
>>> handle = nvmlDeviceGetHandleByIndex(0)
>>> print(f"0x{nvmlDeviceGetCurrentClocksThrottleReasons(handle):0x}")
0x68
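
Decoding that 0x68 against pynvml's bit-flag constants (a quick sketch, not exhaustive of every flag; the names mirror nvml.h):

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
mask = nvmlDeviceGetCurrentClocksThrottleReasons(handle)

# Map individual throttle-reason bits to readable names.
reasons = {
    nvmlClocksThrottleReasonGpuIdle: "GPU idle",
    nvmlClocksThrottleReasonSwPowerCap: "SW power cap",
    nvmlClocksThrottleReasonHwSlowdown: "HW slowdown",
    nvmlClocksThrottleReasonSwThermalSlowdown: "SW thermal slowdown",
    nvmlClocksThrottleReasonHwThermalSlowdown: "HW thermal slowdown",
    nvmlClocksThrottleReasonHwPowerBrakeSlowdown: "HW power brake",
}
for bit, name in reasons.items():
    if mask & bit:
        print(f"0x{bit:x}: {name}")

For 0x68 this prints the SW thermal slowdown, HW thermal slowdown, and HW slowdown bits, matching the docs: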

#define nvmlClocksThrottleReasonSwThermalSlowdown 0x0000000000000020LL
SW Thermal Slowdown
The current clocks have been optimized to ensure the following is true:

  • Current GPU temperature does not exceed GPU Max Operating Temperature
  • Current memory temperature does not exceed Memory Max Operating Temperature

#define nvmlClocksThrottleReasonHwThermalSlowdown 0x0000000000000040LL
HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
This is an indicator of:

  • temperature being too high

#define nvmlClocksThrottleReasonHwSlowdown 0x0000000000000008LL
HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
This is an indicator of:

  • temperature being too high
  • External Power Brake Assertion is triggered (e.g. by the system power supply)
  • Power draw is too high and Fast Trigger protection is reducing the clocks
  • May be also reported during PState or clock change
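
Since both thermal flags key off the "Max Operating Temperature" values, it's worth asking the driver what those thresholds actually are. A minimal pynvml sketch, assuming the threshold queries are supported on this board (some return Not Supported):

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Current core temperature and the driver-reported limits, in C.
print("core temp:", nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU))
print("slowdown at:", nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN))
print("shutdown at:", nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_SHUTDOWN))
print("GPU max operating:", nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_GPU_MAX))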

Also, here are some other settings:

$ nvidia-smi -q -d CLOCK -i 0

==============NVSMI LOG==============

Timestamp                                 : Sun Oct 20 00:58:30 2024
Driver Version                            : 550.120
CUDA Version                              : 12.4

Attached GPUs                             : 2
GPU 00000000:21:00.0
    Clocks
        Graphics                          : 0 MHz
        SM                                : 0 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
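
For what it's worth, the 0 MHz graphics/SM readings above presumably just caught the card at idle between runs; the same fields can be polled from pynvml while the load is actually running:

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Current vs. maximum clocks in MHz, mirroring the nvidia-smi fields above.
for name, clk in (("graphics", NVML_CLOCK_GRAPHICS), ("sm", NVML_CLOCK_SM), ("memory", NVML_CLOCK_MEM)):
    print(f"{name}: {nvmlDeviceGetClockInfo(handle, clk)} / {nvmlDeviceGetMaxClockInfo(handle, clk)} MHz")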

If you haven’t already, it might be worth checking the condition of the power connectors. If the card’s not getting full power, reduced clocks can be the result.
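
A quick way to rule out a plain software power cap at the same time is to compare the live draw against the enforced limit, something like:

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Both values come back in milliwatts.
draw = nvmlDeviceGetPowerUsage(handle)
limit = nvmlDeviceGetEnforcedPowerLimit(handle)
print(f"drawing {draw / 1000:.0f} W of a {limit / 1000:.0f} W limit")

If the draw sits way under the limit while the clocks are pinned low, that points at the hardware slowdown path rather than the power limiter.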

I ended up finding a GitHub repository that inspects the memory region holding some additional sensor data, specifically the hotspot (junction) temperature. For whatever reason, Linux doesn’t seem to have built-in tools for this. Anyway, it shows that the hotspot temperature is hitting 104C, which is well above the graphics and VRAM temperatures of 60C and 52C, and a clear reason for throttling.

The temperature difference between the graphics or VRAM sensors and the hotspot should really be no more than 10-20C with decent heat spreading. My interpretation is that I should open up the GPU and check whether the thermal pads are poorly placed. I’ll post an update on how that turns out.
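
For reference, plugging the numbers in (just the arithmetic, using the readings above):

# Readings reported by the register-reading tool.
core_c, vram_c, hotspot_c = 60, 52, 104

# With decent heat spreading the hotspot should sit roughly 10-20C above core/VRAM.
delta = hotspot_c - max(core_c, vram_c)
print(f"hotspot delta: {delta}C ->", "suspect pad/paste contact" if delta > 20 else "looks normal")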

Nice find. Interesting to note the fan goes to 100% as well. I was going to suggest the heatsink may be clogged, but then you’d expect the overall temp to be elevated too. I’ll be interested to see how you resolve it.

Just RMA’ed it, unfortunately. It was an Amazon “refurbished” unit, and I didn’t want the hassle of refund complications if repadding didn’t resolve it after I’d opened the card, which would be visible due to the tamper-evident stickers covering the screw holes.