Nvidia-smi -gtt option since 460.27 causes major performance issues on laptops

The -gtt option of nvidia-smi defaults to 75C and cannot be changed on most laptop GPUs. Most laptop GPUs are designed to run hotter than 75C with the preset fan profile, so affected laptops spend most of their time in the lowest power state.

I’ve tried every stable driver release since the first 460 release, and all of them have this issue, up to the latest 465 release. The last usable driver is 455.45.01, and it’s only usable up to the 5.4 LTS kernel.

I don’t really understand the exact issue. Can you please explain it in a bit more depth?

This issue can be reproduced easily by running a CUDA load in hybrid mode on an Optimus laptop: watch the GPU temperature reach 75C, at which point the GPU throttles to the lowest power state.
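For anyone who wants to log this while the CUDA load is running, here is a minimal Python sketch. It assumes nvidia-smi is on the PATH and that the GPU reports numeric values for all four query fields (some laptop GPUs may report N/A for clocks, which this sketch does not handle). The function names and the example line in the comment are mine, for illustration only.

```python
import subprocess

# Standard --query-gpu properties: temperature, performance state, SM and memory clocks.
FIELDS = "temperature.gpu,pstate,clocks.sm,clocks.mem"


def parse_sample(line):
    """Parse one CSV line such as '75, P8, 300, 405' (made-up values) into a dict."""
    temp, pstate, sm, mem = [part.strip() for part in line.split(",")]
    return {"temp_c": int(temp), "pstate": pstate, "sm_mhz": int(sm), "mem_mhz": int(mem)}


def read_sample():
    """Run nvidia-smi once and return the parsed reading for the first GPU listed."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])
```

Calling read_sample() once per second during the load and printing the result should show whether the P-state and clocks drop at the moment the temperature reaches 75C.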

The -gtt (--gpu-target-temp) option added to nvidia-smi in the 460 drivers sets the temperature, in degrees Celsius, at which GPU thermal throttling occurs. However, the value is immutable on most laptop GPUs, and even though it is immutable, it seems to default to 75C and is not ignored by the GPU, which drops its core and memory clocks to the lowest power state when the GPU temperature reaches 75C. Laptop GPUs are hence throttled far too early, at 75C, which is well within their healthy operating range. Since laptop fan profiles barely spin up the fans at 75C, GPUs affected by this issue can only operate at high performance for a short time by manually setting the fans to max speed; otherwise they spend most of the time in a loop of scaling up to a higher power state and then being throttled until they cool down to 60C.
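To make the throttle loop visible in a log, something like the following sketch could pair each drop to a slower P-state with the temperature at which the GPU recovers. It assumes the load is constant, so that any P-state change is thermal rather than idle downclocking, and it relies on nvidia-smi numbering P-states from P0 (fastest) upward. The sample readings are made up for illustration, not measured.

```python
def pstate_level(pstate):
    """Convert 'P0'..'P12' to an int; a larger number means lower performance."""
    return int(pstate.lstrip("P"))


def throttle_cycles(samples):
    """Return (drop_temp, recover_temp) pairs from a list of (temp_c, pstate) samples.

    A drop is a move to a slower P-state than the previous sample; the recovery
    is the next sample that moves back to a faster P-state.
    """
    cycles = []
    drop_temp = None
    for (_, prev_ps), (temp, ps) in zip(samples, samples[1:]):
        prev_level, level = pstate_level(prev_ps), pstate_level(ps)
        if drop_temp is None and level > prev_level:
            drop_temp = temp  # throttling started at this temperature
        elif drop_temp is not None and level < prev_level:
            cycles.append((drop_temp, temp))  # clocks came back at this temperature
            drop_temp = None
    return cycles


# Made-up readings for illustration only.
samples = [(70, "P0"), (74, "P0"), (75, "P8"), (68, "P8"), (60, "P0"), (72, "P0"), (75, "P8")]
print(throttle_cycles(samples))  # [(75, 60)]
```

A drop that has not recovered by the end of the log (the last sample above) is left out of the result.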

I’ve tested 2 laptops: an Acer Triton 300 with a 2070 Max-Q and an Asus Zephyrus S with a 2080 Max-Q. Both exhibit this “feature”. According to GreenWithEnvy readings, the 2070 in the Triton 300 has a critical temperature of 87C, slows down at 93C and shuts down at 98C as defined in the VBIOS, yet it throttles to the lowest power state at 75C with 460 or later drivers. The 2080 in the Zephyrus S behaves similarly.

I’ve bisected the driver versions: the last good version, which throttles correctly according to the VBIOS definitions, is 455.45.01, and any 460 or later driver throttles far too early at 75C. According to the changelogs, the first 460 release introduced the -gtt (--gpu-target-temp) option in nvidia-smi, and this issue begins there, so I’m quite sure that the new feature is related to the incorrect throttling behavior described above.

Ok, got it. So the nvidia driver sets the temperature target on (notebook) gpus which don’t support setting a different target through nvidia-smi.
Just to make sure, you ran nvidia-smi -gtt as root?
Does nvidia-smi -q at least report the target or just N/A?

I just ran those commands on 465.24.02, output as below:

$ sudo nvidia-smi -gtt 90
GPU Target Temperature Threshold not supported for GPU 00000000:01:00.0.
Treating as warning and moving on.
All done.

$ nvidia-smi -q | grep Temp
Temperature
GPU Current Temp : 55 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 87 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
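
In case it helps others compare their own output, here is a small Python sketch that turns the text above into numbers and shows the target as missing. The function name is my own, and it assumes every line is either a section header or a "Name : value" pair as in this output.

```python
def parse_temperatures(text):
    """Map each 'Name : value' line to an int in Celsius, or None for N/A."""
    readings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # section headers such as 'Temperature' have no colon
        value = value.strip()
        readings[name.strip()] = None if value == "N/A" else int(value.split()[0])
    return readings


# The output posted above, copied as-is.
output = """\
Temperature
GPU Current Temp : 55 C
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 87 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
"""

temps = parse_temperatures(output)
print(temps["GPU Slowdown Temp"], temps["GPU Target Temperature"])  # 93 None
```

So the reported slowdown threshold is 93C and no target is reported at all, while the throttling I observe starts at 75C.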

So the effect is only really noticeable indirectly, from current temp never exceeding 75°C while the clocks are at minimum in that case?
Please create an nvidia-bug-report.log.gz in that situation (GPU at 100%) and send it to linux-bugs[at]nvidia.com; maybe it will draw some attention to this.