nvidia-smi -gtt doesn't work on 535.104.05

Arch Linux Desktop PC
6.4.12-zen1-1-zen
Happens on both open and proprietary versions of 535.104.05
GPU 0: NVIDIA GeForce RTX 3060 Ti (UUID: GPU-ba73bc75-4c91-6012-1365-c8e673737f6b)

Steps to reproduce:

  1. nvidia-smi -gtt 65
  2. run any heavy graphical app

Expected behavior:
The GPU starts to throttle at the set temperature, and the temperature doesn't rise above the set value.
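For anyone reproducing, this is roughly how I watch it (a sketch; the query fields assume a reasonably recent nvidia-smi):

  # set the target temperature (requires root)
  sudo nvidia-smi -gtt 65
  # confirm it was applied
  nvidia-smi -q -d TEMPERATURE | grep -i target
  # watch temperature, SM clock and power draw once per second while the heavy app runs
  nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,power.draw --format=csv -l 1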

Previously, this setting did work as expected.

Uploading on this forum never works for me, so here is the nvidia-bug-report:
https://github.com/NVIDIA/open-gpu-kernel-modules/files/12440800/nvidia-bug-report.log.gz

Original thread on GitHub:

@ewbteewbte
Thanks for writing to us. I have filed bug 4260165 internally for tracking purposes.
I will try to replicate the issue on my test system first and update on further proceedings.

Setup - Dell Precision T7610 + Genuine Intel(R) CPU @ 2.30GHz + Ubuntu 22.04.1 LTS + kernel 5.19.0-46-generic + NVIDIA GeForce GTX 1650 SUPER + Driver 535.104.05 + Display DELL G3223D
I tried the steps below and am seeing the temperature throttle at 66-67C at maximum. Can you please confirm whether you are seeing a similar range, or whether it increases further in your setup? (A simple logging sketch follows the steps below.)

  1. Run command “nvidia-smi -gtt 65”
  2. Launched 5 instances of the Unigine Heaven benchmark; the GPU temperature throttles at 66-67C at maximum.
  3. Tried the above two steps a couple of times and observed the same behavior.
  4. Later I rebooted the system and then ran 5 instances of the benchmark; the temperature quickly climbs to 74-75C.
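To capture the range, something like this can log the peak temperature over a run (a sketch, not part of the original report; the query fields assume a recent nvidia-smi):

  # log temperature and SM clock once per second during the benchmark
  nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm --format=csv,noheader -l 1 > gtt-log.csv
  # afterwards, show the hottest sample
  sort -t, -k2 -n gtt-log.csv | tail -n 1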

It does increase up to 87C, after which I quit the benchmark app or game because I don't want to risk damaging my GPU.
Normally I use “-gtt 80”, and my GPU had never surpassed 80C since I bought it a year ago.
Anyway, I provided the log file; shouldn't that be enough?

@ewbteewbte
Please confirm which benchmarks you tried.

Unigine Superposition, Elden Ring, Dark Souls 3

@ewbteewbte
Do you know the last passing driver version where the issue doesn't occur?

Hard to tell; the last time I used demanding apps was in April or May.
I run all games with a 60 fps limit, and lately I was only playing games like Project Zomboid, which don't utilize the GPU much, so I couldn't notice the change in behavior until I tried GPU-heavy games again.

@ewbteewbte
Is it possible for you to test with a 530-branch driver, or even a bit older, to see if the problem exists in earlier branches as well?
I am still not able to reproduce the issue on a couple of my test systems.

It is not possible. Could it be 30-series specific?

@amrits is it tied to the Coolbits setting? I have always used “12”.
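For reference, I enable Coolbits the usual way with nvidia-xconfig (a sketch of the command, in case it matters; 12 combines the manual-fan-control and clock-offset bits):

  # writes Option "Coolbits" "12" into the X configuration (requires root)
  sudo nvidia-xconfig --cool-bits=12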

@ewbteewbte
I am seeing similar behavior with Arch Linux + kernel 6.4.12-arch1-1 + NVIDIA GeForce RTX 3080 + Driver 535.104.05, where the GPU temperature peaks around 74C after running the Unigine Superposition benchmark.
I shall check for the cause and update.

Good! Just in case: with 535.113.01 the issue is still present.

With 545.29.02 the issue is still present.

With 545.29.06 the issue is still present.

With 550.54.14 it does attempt to lower the GPU clocks a little, but the temperature is still able to surpass the value set by -gtt
(for instance, it drops from about 2100 MHz to roughly 1900 MHz).
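A quick way to see whether the thermal slowdown actually engages while the load is running (a sketch; the exact wording of the throttle/event reasons section varies between driver branches):

  # watch temperature and SM clock while the game or benchmark runs
  watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,clocks.sm --format=csv,noheader'
  # check the reported slowdown reasons and the configured target temperature
  nvidia-smi -q -d PERFORMANCE,TEMPERATURE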

Hi all,
We have analyzed the issue from our local repro and observed that the thermal policy is functioning as expected; however, the workload is too intense to reduce the temperature further. Accordingly, we need to revise the thermal settings.

@amrits

The issue is still present in 560.35.05.
I guess it isn't fixed in the 570 release either.

Can you please acknowledge the issue, raise its severity, and fix this?
It has been 2 years already, and I find that really unacceptable.

As others have already said, the temperature target is almost disregarded, which makes it useless.
Please look at how the thermal policy works in the Windows drivers and replicate the same behavior.

The GPU temperature in Windows, even with very intensive workloads, doesn't go more than a few degrees over the target.

I use an RTX 3090 in my NAS (Debian Bullseye) for inference, with a very high thermal constraint due to the small form factor.
The target temperature (65C) is not even remotely respected; the GPU goes over it by 15-20C.
This is while doing translations with a 4B Gemma 3 model on Ollama, which doesn't cause more than 40% GPU utilization.
It's not a benchmark; it's not even a very intensive tensor workload.

The temperature target is the only way to reliably limit the temperature without depending on external factors.
Right now I'm forced to use the power limit and, of course, I have to adjust that limit based on ambient temperature,
which means I will only discover it's too high after a crash.
Not only that, the power limit is completely bugged as well.
Right now I have to set 175 W, and below 200 W performance drops massively.
The GPU clock is almost always around 400-500 MHz, with short bursts to 1500 MHz, and despite that the temperatures are still high; it's just plainly horrible.
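For completeness, this is roughly what the power-limit workaround looks like (a sketch; 175 W is the value mentioned above, and the limit usually needs to be reapplied after a reboot):

  # show the current limit and the allowed range
  nvidia-smi -q -d POWER
  # cap the board power at 175 W (requires root)
  sudo nvidia-smi -pl 175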

There is only one option left: using the -lgc switch to limit the clock.
It works perfectly, with a minimal performance drop (the temperature during inference goes up mostly because of the boost clock).
I can set “-lgc 0,1710” and keep decent temperatures even in the worst conditions (my target is 75C max) with excellent performance.
But even the clock limit is bugged!
This is my home NAS and idle power consumption is crucial; I don't like and I don't want to waste energy.
The lower limit at 0 should, as expected, allow the GPU to drop to its idle clock. But it doesn't.
The clock at idle is 200-something MHz, which means 25 W at idle instead of 11 W.
An additional 14 W at idle for absolutely nothing, 24 hours a day; it's a crime.
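For reference, the clock-limit workaround and the check I use for the idle behaviour (a sketch):

  # lock the GPU clock range to 0-1710 MHz (requires root)
  sudo nvidia-smi -lgc 0,1710
  # verify what the GPU actually idles at (clock and power every 5 seconds)
  nvidia-smi --query-gpu=clocks.sm,power.draw --format=csv -l 5
  # undo the lock
  sudo nvidia-smi -rgc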

Please try to fix it. You might think it's irrelevant, but it's crucial for a large part of your customer base.