I have a 3090 that is throttling the graphics cores and I’m trying to figure out if the card is bad or if it’s something else.
The nvtop screenshot shows that GPU0 cores are running at only 28MHz under load at 100% usage and a low temp of 63C. Memory is at 9501MHz, which is fine, I think. It is only drawing 127 watts.
I also tried the Python wrapper for NVML to read the throttle reasons: I checked the driver version, got a handle to device 0, and then read the throttle reasons. The three reasons reported (0x40, 0x20, and 0x8) are below. But since the temperature is low (63C), I don’t see why it should throttle.
#define nvmlClocksEventReasonSwThermalSlowdown 0x0000000000000020LL
SW Thermal Slowdown
The current clocks have been optimized to ensure the following is true:
Current GPU temperature does not exceed GPU Max Operating Temperature
Current memory temperature does not exceed Memory Max Operating Temperature
#define nvmlClocksThrottleReasonHwThermalSlowdown 0x0000000000000040LL
HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
This is an indicator of:
temperature being too high
#define nvmlClocksThrottleReasonHwSlowdown 0x0000000000000008LL
HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
This is an indicator of:
temperature being too high
External Power Brake Assertion is triggered (e.g. by the system power supply)
Power draw is too high and Fast Trigger protection is reducing the clocks
May also be reported during PState or clock change
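For anyone who wants to repeat the check, here is a minimal sketch of the steps described above (driver version, handle to device 0, throttle reasons), assuming the `pynvml` module from the `nvidia-ml-py` package is installed. The helper names `decode_throttle_reasons` and `read_gpu0_reasons` are made up for illustration, and the table only covers the three bits quoted in this post; NVML defines other reason bits that are reported here simply as "other bits".

```python
# Bit values copied from the three NVML definitions quoted in this post.
# NVML defines further reason bits; they are deliberately not listed here.
REASONS = {
    0x08: "HW Slowdown",
    0x20: "SW Thermal Slowdown",
    0x40: "HW Thermal Slowdown",
}


def decode_throttle_reasons(mask):
    """Return the names of the known bits set in mask, lowest bit first.

    Any set bits that are not in REASONS are appended as one hex entry.
    """
    names = [name for bit, name in sorted(REASONS.items()) if mask & bit]
    other = mask & ~sum(REASONS)  # sum of the dict keys is the known-bit mask
    if other:
        names.append(f"other bits 0x{other:x}")
    return names


def read_gpu0_reasons():
    """Read the current throttle-reason bitmask for GPU 0 (needs an NVIDIA driver)."""
    import pynvml  # imported lazily so the decoder works without a GPU

    pynvml.nvmlInit()
    try:
        print("driver:", pynvml.nvmlSystemGetDriverVersion())
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    finally:
        pynvml.nvmlShutdown()
```

On the affected machine, `decode_throttle_reasons(read_gpu0_reasons())` would be the call to make. The three reasons listed in this post combine to the mask 0x68.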
$ nvidia-smi -q -d CLOCK -i 0
==============NVSMI LOG==============
Timestamp : Sun Oct 20 00:58:30 2024
Driver Version : 550.120
CUDA Version : 12.4
Attached GPUs : 2
GPU 00000000:21:00.0
Clocks
Graphics : 0 MHz
SM : 0 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2115 MHz
SM : 2115 MHz
Memory : 9751 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
SM Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Memory Clock Samples
Duration : Not Found
Number of Samples : Not Found
Max : Not Found
Min : Not Found
Avg : Not Found
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
If you haven’t already, it might be worth checking the condition and state of the power connector. If the card’s not getting full power, reduced clocks can be the result.
I ended up finding this GitHub repository that allows inspecting memory that holds some additional sensor data, specifically the hotspot temperature. For whatever reason, Linux doesn’t seem to have built-in tools for this. Anyway, it shows that the hotspot temperature is hitting 104C, which is well above the graphics and VRAM temperatures of 60C and 52C and is a clear reason for throttling.
The temperature difference between the graphics or VRAM sensor and the hotspot should really be no more than 10-20C with decent heat spreading. My interpretation is that I should open the GPU and check whether the thermal pads are poorly placed. I’ll update with how that turns out.
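To put numbers on that gap, here is a small sketch using the readings from this post. The 20C limit is just the upper end of the rule of thumb above, not a vendor specification, and `hotspot_deltas` is a made-up helper name.

```python
# Upper end of the 10-20C rule of thumb from the post (an assumption, not a spec).
MAX_EXPECTED_DELTA_C = 20


def hotspot_deltas(hotspot_c, sensors_c):
    """Return hotspot minus each other sensor reading, in degrees C."""
    return {name: hotspot_c - temp for name, temp in sensors_c.items()}


# Readings reported in this thread: hotspot 104C, graphics 60C, VRAM 52C.
deltas = hotspot_deltas(104, {"graphics": 60, "vram": 52})
suspicious = {name: d for name, d in deltas.items() if d > MAX_EXPECTED_DELTA_C}
print(suspicious)  # {'graphics': 44, 'vram': 52}
```

Both gaps are more than double the rule-of-thumb limit, which fits the idea of poor thermal contact rather than a generally hot card.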
Nice find. Interesting to note that the fan goes to 100% as well. I was going to suggest the heatsink may be clogged, but then you’d expect the overall temperature to be elevated as well. I’ll be interested to see how you resolve it.
Just RMA’ed it, unfortunately. It was an Amazon “refurbish”, and I didn’t want the hassle of refund complications if I opened it and repadding didn’t resolve the problem. Opening it would be visible because of the tamper-evident stickers that cover the screw holes.