Severe throttling on Thinkpad T14 Gen 1 with GeForce MX330

I am experiencing severe throttling on my NVIDIA GPU. I have a ThinkPad T14 Gen 1 with a GeForce MX330. I have followed the guides to install the drivers (Howto/NVIDIA - RPM Fusion) and to make my NVIDIA GPU primary (How to Set Nvidia as Primary GPU on Optimus-based Laptops :: Fedora Docs). I am on version 465.27 of the driver, on a Fedora 34 Workstation install.

I am seeing constant throttling, even at idle. Right now, just idling, I am seeing:

nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:19:52 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Active
        Display Clock Setting             : Not Active

The SW Thermal Slowdown line indicates that the GPU is being throttled, despite it sitting at only 59 degrees Celsius. Running glxgears and checking the clocks, I get:

nvidia-smi -q -d CLOCK

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:23:43 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Clocks
        Graphics                          : 139 MHz
        SM                                : 139 MHz
        Memory                            : 405 MHz
        Video                             : 544 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1911 MHz
        SM                                : 1911 MHz
        Memory                            : 3504 MHz
        Video                             : 1708 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    SM Clock Samples
        Duration                          : 18446744073709.55 sec
        Number of Samples                 : 100
        Max                               : 1531 MHz
        Min                               : 139 MHz
        Avg                               : 0 MHz
    Memory Clock Samples
        Duration                          : 18446744073709.55 sec
        Number of Samples                 : 100
        Max                               : 3504 MHz
        Min                               : 405 MHz
        Avg                               : 0 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A

So the GPU is clearly being heavily throttled.
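A quick way to surface this from the command line is to filter the report down to just the active throttle reasons (a minimal sketch; it merely greps the `nvidia-smi -q -d PERFORMANCE` output shown above):

```shell
# Print only the throttle reasons that are currently active, given an
# `nvidia-smi -q -d PERFORMANCE` report on stdin.
active_throttle_reasons() {
    grep -E 'Idle|Clocks Setting|Power Cap|Slowdown|Sync Boost|Clock Setting' \
        | grep ': Active'
}
```

Running `nvidia-smi -q -d PERFORMANCE | active_throttle_reasons` on the output above prints only the SW Thermal Slowdown line.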

My guess is that this is related to the following settings:

nvidia-smi -q -d TEMPERATURE

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:25:04 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Temperature
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 57 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
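As far as I can tell, the trigger here is GPU Max Operating Temp (57 C), not the 97 C Slowdown Temp: the SW thermal slowdown seems to kick in as soon as the current temperature reaches that lower limit. A small filter makes the remaining headroom visible (a sketch that parses the report above):

```shell
# Print the headroom (in C) between GPU Current Temp and GPU Max Operating
# Temp, given an `nvidia-smi -q -d TEMPERATURE` report on stdin.
temp_headroom() {
    awk -F': ' '/GPU Current Temp/       {cur = $2 + 0}
                /GPU Max Operating Temp/ {max = $2 + 0}
                END {print max - cur " C of headroom"}'
}
```

With the values above, `nvidia-smi -q -d TEMPERATURE | temp_headroom` reports only 1 C of headroom at idle.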

Interestingly, if I enable thermald with the --adaptive flag, I get this:

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:29:56 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Temperature
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 75 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

And the throttling goes away and performance is suddenly much improved.

So apparently thermald can change this setting, but I cannot do so manually, since “GPUMaxOperatingTempThreshold” is a read-only variable:

nvidia-settings -a GPUMaxOperatingTempThreshold=80

ERROR: The attribute 'GPUMaxOperatingTempThreshold' specified in assignment 'GPUMaxOperatingTempThreshold=80' cannot be assigned (it is a read-only
       attribute).

I am now on Fedora 34 but I saw the exact same problem on Ubuntu 20.10.

I don’t really know what’s going on here, but it seems strange that I should have to run thermald just to escape this throttling problem (and even then, I think 75C is too low a temperature to be throttling at). To be honest, I don’t really understand the interplay between GPU Slowdown Temp and GPU Max Operating Temp; they seem synonymous to me.
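For anyone wanting to reproduce the thermald workaround persistently, this is roughly what I do, using a systemd drop-in (a sketch; the unit name and the exact ExecStart flags below are assumptions from my system, so check yours first with `systemctl cat thermald.service`):

```shell
# Run thermald with the --adaptive flag persistently via a systemd drop-in.
sudo mkdir -p /etc/systemd/system/thermald.service.d
sudo tee /etc/systemd/system/thermald.service.d/adaptive.conf >/dev/null <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/sbin/thermald --systemd --dbus-enable --adaptive
EOF
sudo systemctl daemon-reload
sudo systemctl restart thermald.service
```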

Here’s the full output from nvidia-smi:

Sat May  8 15:23:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| N/A   67C    P0    N/A /  N/A |    578MiB /  2002MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2762      G   /usr/libexec/Xorg                 293MiB |
|    0   N/A  N/A      2953      G   /usr/bin/gnome-shell               88MiB |
|    0   N/A  N/A      4524      G   ...AAAAAAAAA= --shared-files      134MiB |
|    0   N/A  N/A      5395      G   ...e/Steam/ubuntu12_32/steam       18MiB |
|    0   N/A  N/A      5604      G   ./steamwebhelper                    1MiB |
|    0   N/A  N/A      6303      G   ...AAAAAAAAA= --shared-files        6MiB |
|    0   N/A  N/A      7422      G   anki                               27MiB |
|    0   N/A  N/A     21305      G   /usr/bin/gjs                        2MiB |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log.gz (1.2 MB)

Since thermald is able to set this, it seems to be a system profile configured via ACPI. The NVIDIA driver won’t help in that case. You could check with the thermald developers/source how that’s accomplished.

The NVIDIA driver might be responding to some kind of system profile, but in that case I wonder whether what it is doing is really reasonable. thermald is nowadays disabled by default on some modern laptops with internal power management, like the Lenovo ThinkPad T14 that I have, so I would assume this issue might be affecting many people.

If the notebook manufacturer designed it that way, how could the driver override it, possibly damaging the hardware?
https://mjg59.dreamwidth.org/54923.html

Is there any output in dmesg when it throttles or in general?

No, there’s nothing there (in dmesg).

Here is the output of dmesg | grep -iP "nvidia|gpu|graphics|video|thermal":

[    0.000000] Command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.11.18-300.fc34.x86_64 root=UUID=c1325a0f-113a-4ad5-bf64-be324ff943b8 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
[    0.156686] Reserving Intel graphics memory at [mem 0x8b800000-0x8f7fffff]
[    0.165326] Kernel command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.11.18-300.fc34.x86_64 root=UUID=c1325a0f-113a-4ad5-bf64-be324ff943b8 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
[    0.332283] mce: CPU0: Thermal monitoring enabled (TM1)
[    0.350029] thermal_sys: Registered thermal governor 'fair_share'
[    0.350031] thermal_sys: Registered thermal governor 'bang_bang'
[    0.350032] thermal_sys: Registered thermal governor 'step_wise'
[    0.350033] thermal_sys: Registered thermal governor 'user_space'
[    0.551119] ACPI: Added _OSI(Linux-Dell-Video)
[    0.551119] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.705965] ACPI: \_SB_.PR00: _OSC native thermal LVT Acked
[    0.827183] pci 0000:00:02.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    1.111107] efifb: No BGRT, not showing boot graphics
[    1.115656] thermal LNXTHERM:00: registered as thermal_zone0
[    1.115660] ACPI: Thermal Zone [THM0] (79 C)
[    1.834077] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
[    1.834348] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input13
[    1.850519] ACPI: Video Device [PEGP] (multi-head: no  rom: yes  post: no)
[    1.850562] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4c/LNXVIDEO:01/input/input14
[    5.062263] thinkpad_acpi: This ThinkPad has standard ACPI backlight brightness control, supported by the ACPI video driver
[    5.063220] intel_pch_thermal 0000:00:12.0: enabling device (0000 -> 0002)
[    5.072854] proc_thermal 0000:00:04.0: enabling device (0000 -> 0002)
[    5.076995] proc_thermal 0000:00:04.0: Creating sysfs group for PROC_THERMAL_PCI
[    5.232387] nvidia: loading out-of-tree module taints kernel.
[    5.232400] nvidia: module license 'NVIDIA' taints kernel.
[    5.278696] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.299920] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[    5.300795] nvidia 0000:2d:00.0: enabling device (0006 -> 0007)
[    5.442330] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
[    5.539976] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.27  Thu Apr 22 23:21:03 UTC 2021
[    5.568987] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    5.576283] nvidia-uvm: Loaded the UVM driver, major device number 509.
[    5.596264] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  465.27  Thu Apr 22 23:12:47 UTC 2021
[    5.615240] [drm] [nvidia-drm] [GPU ID 0x00002d00] Loading driver
[    5.637909] thermal thermal_zone7: failed to read out thermal zone (-61)
[    6.298111] videodev: Linux video capture interface: v2.00
[    6.425911] uvcvideo: Found UVC 1.10 device Integrated Camera (04f2:b6d0)
[    6.435983] uvcvideo: Found UVC 1.50 device Integrated Camera (04f2:b6d0)
[    6.753420] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:2d:00.0 on minor 1
[    7.267256] uvcvideo: Found UVC 1.00 device HD Pro Webcam C920 (046d:0892)
[    7.269010] usbcore: registered new interface driver uvcvideo
[    7.269011] USB Video Class driver (1.1.1)

Some more information on things I’ve tried that haven’t helped so far.

Hi Johan, I have the exact same problem with my T14 Gen 1 under Ubuntu 21. I may have solved it using the “acpi call” method described here. In Ubuntu, this requires:

sudo apt-get install acpi-call-dkms
sudo modprobe acpi_call
echo '\_SB.PCI0.LPCB.EC._Q6D' | sudo tee /proc/acpi/call >/dev/null

I then get

❯ nvidia-settings -q GPUMaxOperatingTempThreshold 
Attribute 'GPUMaxOperatingTempThreshold' (eriq-ThinkPad-T14-Gen-1:0.0): 77.

Meaning that I can now use the graphics chip up to 77°C, instead of 57°C as before.
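For Fedora, which the original poster is on, the rough equivalent of those Ubuntu steps might be the following (a sketch; the akmod-acpi_call package name from RPM Fusion is an assumption, and the EC call may need re-running after a reboot or suspend):

```shell
# Fedora counterpart of the Ubuntu steps above (assumes the RPM Fusion
# repositories are already enabled, as they are for the NVIDIA driver).
sudo dnf install akmod-acpi_call
sudo modprobe acpi_call
# Fire the same embedded-controller query event as on Ubuntu:
echo '\_SB.PCI0.LPCB.EC._Q6D' | sudo tee /proc/acpi/call >/dev/null
```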

Thanks for the tip! I still think 77°C seems low for throttling, though, but maybe I’m wrong?

Too low? Not at all! See this, with an RTX 3080 laptop:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P0    N/A /  N/A |      5MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       958      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

Temperature
    GPU Current Temp                  : 42 C
    GPU Shutdown Temp                 : 103 C
    GPU Slowdown Temp                 : 100 C
    GPU Max Operating Temp            : 87 C
    GPU Target Temperature            : N/A
    Memory Current Temp               : N/A
    Memory Max Operating Temp         : N/A

But I encountered a similar problem: while running some deep learning tasks, SW Power Cap is activated and power is limited to 115 W. On Windows 10, it can run at up to 165 W.

Thanks EriqJB. I found that your suggestion to use the acpi_call kernel module solved the problem, increasing (essentially activating) my GPU’s performance. Curiously, your command yields 87°C as the max temperature on my system (ostensibly the same MX330), rather than 77°C.

It’s also curious that after running this command, the Intel integrated GPU also got much better performance.

Judging by glxgears -info with vertical sync deactivated (vblank_mode=0 for Intel or __GL_SYNC_TO_VBLANK=0 for NVIDIA), both GPUs scored around 1500 fps with the default max temperature of 57°C (which is also the typical idle temperature when the GPU is not in use). When I raise the max temperature as you suggested, NVIDIA rises to 5000 fps, but Intel rises to 10000 fps in glxgears.

In other applications nvidia surpasses intel as expected, e.g. 40-60 fps nvidia compared to 30-40 fps intel (both getting only 15 fps in that application without the GPU max temperature tweak).
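As a side note on measuring this: glxgears prints a line every few seconds like `34992 frames in 5.0 seconds = 6998.285 FPS`, and those can be averaged with a small filter (a sketch):

```shell
# Average the FPS figures from glxgears output (lines like
# "34992 frames in 5.0 seconds = 6998.285 FPS"), read from stdin.
avg_fps() {
    awk '/FPS/ {sum += $(NF-1); n++} END {if (n) printf "%.1f\n", sum/n}'
}
```

For example, `vblank_mode=0 timeout 20 glxgears | avg_fps` (assuming GNU coreutils `timeout`) gives a rough average over a 20-second run.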

I’m half worried the laptop might melt when running the GPU up to 87°C (in practice it got up to 78°C). It also looks like this setting raises performance for the Intel integrated GPU as well as the NVIDIA GPU. I haven’t found documentation for this _SB.PCI0.LPCB.EC._Q6D setting. It would be helpful to know what it does, and to be more confident that it’s not going to brick the laptop.

Is there any more documentation for this acpi_call setting apart from the wiki.archlinux.org reference that @EriqJB linked to?