@amrits
The issue is still present in 560.35.05.
I guess it isn’t fixed in release 570 either.
Can you please acknowledge the issue, raise the severity and fix this?
It’s been 2 years already and I find it really unacceptable.
As others have already said, the temperature target is almost completely disregarded, which makes it useless.
Please look at how the thermal policy works in the Windows drivers and replicate the same.
In Windows, even with very intensive workloads, the GPU temperature doesn’t go more than a few degrees over the target.
I use an RTX 3090 in my NAS (Debian Bullseye) for inference, and the small form factor makes it very thermally constrained.
The GPU target temperature (65 °C) is not even remotely respected: the GPU goes over it by 15-20 °C.
This happens while doing translations with a 4b gemma3 model on Ollama, which doesn’t cause more than 40% GPU utilization.
It’s not a benchmark, it’s not even a very intensive tensor workload.
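For anyone who wants to reproduce the overshoot, here is a minimal Python sketch that reads the temperature, graphics clock, power draw and utilization through `nvidia-smi --query-gpu` and reports how far the GPU is above the target. The helper names (`parse_sample`, `overshoot`, `read_sample`) are made up for this illustration, and the parser assumes every field comes back as a number (it will fail if the driver reports `[N/A]` for one of them).

```python
import subprocess

# Query fields listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["temperature.gpu", "clocks.gr", "power.draw", "utilization.gpu"]


def parse_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, (float(v) for v in values)))


def overshoot(sample, target_c=65.0):
    """Degrees Celsius above the target (0.0 when at or below it)."""
    return max(0.0, sample["temperature.gpu"] - target_c)


def read_sample(gpu_index=0):
    """Take one reading from the driver; needs nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])
```

Calling `overshoot(read_sample())` in a loop during an inference run should show whether the 65 °C target is being honoured.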
The temperature target is the only way to reliably limit the temperature without depending on external factors.
Right now, I’m forced to use the power limit and, of course, I have to adjust the limit based on ambient temperature.
That means I will discover the limit is too high only after a crash.
Not only that, the power limit is completely bugged as well.
Right now I have to set 175W, and below 200W the performance drops massively.
The GPU clock is almost always around 400-500 MHz, with short bursts to 1500 MHz, and despite that the temps are still high; it’s just plain horrible.
The only option left is using the -lgc switch to limit the clock.
It works perfectly and with a minimal performance drop (the temperature during inference goes up mostly because of the boost clock).
I can set 0,1710 and keep decent temperatures even in the worst conditions (my target is max 75 °C) with excellent performance.
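As a rough sketch of that workaround, the snippet below builds the `nvidia-smi` calls for locking the GPU clock range (`-lgc min,max`) and for removing the lock again (`-rgc`). The function names are hypothetical, and it assumes GPU index 0 and that the script runs with enough privileges to change clocks.

```python
import subprocess


def lock_clock_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi call that locks the GPU clock range."""
    if not 0 <= min_mhz <= max_mhz:
        raise ValueError("expected 0 <= min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index),
            "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clock_cmd(gpu_index=0):
    """Build the call that removes the clock lock again (-rgc)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def run_cmd(cmd):
    """Run a command; changing clocks normally requires root."""
    subprocess.run(cmd, check=True)
```

For example, `run_cmd(lock_clock_cmd(0, 1710))` applies the 0,1710 range mentioned above, and `run_cmd(reset_clock_cmd())` undoes it.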
But even the clock limit is bugged!
It’s my home NAS and the idle power consumption is crucial; I don’t want to waste energy.
A lower limit of 0 should, as one would expect, allow the GPU to go idle (0 clock). But it doesn’t.
The idle clock is 200-something MHz, which means 25W at idle instead of 11W.
An additional 14W at idle for absolutely nothing, 24 hours a day; it’s a crime.
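To put the idle figures in perspective, here is the arithmetic as a small Python sketch. It uses only the 25W and 11W readings quoted above and assumes the machine sits idle around the clock all year, which is a simplification.

```python
IDLE_MEASURED_W = 25   # idle draw with the clock limit set, from the post
IDLE_EXPECTED_W = 11   # idle draw the GPU should reach, from the post

extra_w = IDLE_MEASURED_W - IDLE_EXPECTED_W        # wasted watts at idle
extra_wh_per_day = extra_w * 24                    # watt-hours per day
extra_kwh_per_year = extra_wh_per_day * 365 / 1000  # kilowatt-hours per year

print(f"extra draw: {extra_w} W")
print(f"per day:    {extra_wh_per_day} Wh")
print(f"per year:   {extra_kwh_per_year:.1f} kWh")
```

That works out to roughly 0.34 kWh per day, or a bit over 120 kWh per year under that assumption.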
Please try to fix it. You may think it’s irrelevant, but it’s crucial for a large part of your customer base.