GPU throttling at low temperature

I have a 3090 that is throttling the graphics cores and I’m trying to figure out if the card is bad or if it’s something else.

The nvtop screenshot shows that GPU0's cores are running at only 28MHz under load (100% usage) at a low temp of 63C. Memory is at 9501MHz, which I think is fine. The card is only drawing 127 watts.

I also tried the Python wrapper for NVML to read the throttle reasons. I check the driver version, get a handle to device 0, and then read the throttle reason bitmask. It comes back as 0x68, which decomposes into the three reasons (0x20, 0x40, and 0x8) quoted below. But since the temp is low (63C), I don't see why it should be throttling.

>>> from pynvml import *
>>> nvmlInit()
>>> print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
Driver Version: 550.120
>>> handle = nvmlDeviceGetHandleByIndex(0)
>>> print(f"0x{nvmlDeviceGetCurrentClocksThrottleReasons(handle):0x}")
0x68
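
Decoding that 0x68 against pynvml's bit-flag constants (a quick sketch, not exhaustive of every flag; the names mirror nvml.h):

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
mask = nvmlDeviceGetCurrentClocksThrottleReasons(handle)

# Map individual throttle-reason bits to readable names.
reasons = {
    nvmlClocksThrottleReasonGpuIdle: "GPU idle",
    nvmlClocksThrottleReasonSwPowerCap: "SW power cap",
    nvmlClocksThrottleReasonHwSlowdown: "HW slowdown",
    nvmlClocksThrottleReasonSwThermalSlowdown: "SW thermal slowdown",
    nvmlClocksThrottleReasonHwThermalSlowdown: "HW thermal slowdown",
    nvmlClocksThrottleReasonHwPowerBrakeSlowdown: "HW power brake",
}
for bit, name in reasons.items():
    if mask & bit:
        print(f"0x{bit:x}: {name}")

For 0x68 this prints the SW thermal slowdown, HW thermal slowdown, and HW slowdown bits, matching the docs: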

#define nvmlClocksThrottleReasonSwThermalSlowdown 0x0000000000000020LL
SW Thermal Slowdown
The current clocks have been optimized to ensure the following is true:

  • Current GPU temperature does not exceed GPU Max Operating Temperature
  • Current memory temperature does not exceed Memory Max Operating Temperature

#define nvmlClocksThrottleReasonHwThermalSlowdown 0x0000000000000040LL
HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
This is an indicator of:

  • temperature being too high

#define nvmlClocksThrottleReasonHwSlowdown 0x0000000000000008LL
HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
This is an indicator of:

  • temperature being too high
  • External Power Brake Assertion is triggered (e.g. by the system power supply)
  • Power draw is too high and Fast Trigger protection is reducing the clocks
  • May be also reported during PState or clock change
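
Since both thermal flags key off the "Max Operating Temperature" values, it's worth asking the driver what those thresholds actually are. A minimal pynvml sketch, assuming the threshold queries are supported on this board (some return Not Supported):

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Current core temperature and the driver-reported limits, in C.
print("core temp:", nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU))
print("slowdown at:", nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN))
print("shutdown at:", nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_SHUTDOWN))
print("GPU max operating:", nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_GPU_MAX))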

Also, here are some other settings:

$ nvidia-smi -q -d CLOCK -i 0

==============NVSMI LOG==============

Timestamp                                 : Sun Oct 20 00:58:30 2024
Driver Version                            : 550.120
CUDA Version                              : 12.4

Attached GPUs                             : 2
GPU 00000000:21:00.0
    Clocks
        Graphics                          : 0 MHz
        SM                                : 0 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 9751 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
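
For what it's worth, the 0 MHz graphics/SM readings above presumably just caught the card at idle between runs; the same fields can be polled from pynvml while the load is actually running:

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Current vs. maximum clocks in MHz, mirroring the nvidia-smi fields above.
for name, clk in (("graphics", NVML_CLOCK_GRAPHICS), ("sm", NVML_CLOCK_SM), ("memory", NVML_CLOCK_MEM)):
    print(f"{name}: {nvmlDeviceGetClockInfo(handle, clk)} / {nvmlDeviceGetMaxClockInfo(handle, clk)} MHz")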

If you haven’t already, it might be worth checking the condition of the power connectors. If the card’s not getting full power, reduced clocks can be the result.
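
A quick way to rule out a plain software power cap at the same time is to compare the live draw against the enforced limit, something like:

from pynvml import *

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

# Both values come back in milliwatts.
draw = nvmlDeviceGetPowerUsage(handle)
limit = nvmlDeviceGetEnforcedPowerLimit(handle)
print(f"drawing {draw / 1000:.0f} W of a {limit / 1000:.0f} W limit")

If the draw sits way under the limit while the clocks are pinned low, that points at the hardware slowdown path rather than the power limiter.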

I ended up finding a GitHub repository that inspects the memory region holding some additional sensor data, specifically the hotspot (junction) temperature. For whatever reason, Linux doesn’t seem to have built-in tools for this. Anyway, it shows that the hotspot temperature is hitting 104C, which is well above the graphics and VRAM temperatures of 60C and 52C, and a clear reason for throttling.

The temperature difference between the graphics or VRAM sensors and the hotspot should really be no more than 10-20C with decent heat spreading. My interpretation is that I should open up the GPU and check whether the thermal pads are poorly placed. I’ll post an update on how that turns out.
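
For reference, plugging the numbers in (just the arithmetic, using the readings above):

# Readings reported by the register-reading tool.
core_c, vram_c, hotspot_c = 60, 52, 104

# With decent heat spreading the hotspot should sit roughly 10-20C above core/VRAM.
delta = hotspot_c - max(core_c, vram_c)
print(f"hotspot delta: {delta}C ->", "suspect pad/paste contact" if delta > 20 else "looks normal")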

Nice find. Interesting to note the fan goes to 100% as well. I was going to suggest the heatsink may be clogged, but then you’d expect the overall temp to be elevated too. I’ll be interested to see how you resolve it.

Just RMA’ed it, unfortunately. It was an Amazon “refurbished” unit, and I didn’t want the hassle of refund complications if repadding didn’t resolve it after I’d opened the card, which would be visible due to the tamper-evident stickers covering the screw holes.