Severe throttling on ThinkPad T14 Gen 1 with GeForce MX330

I am experiencing severe throttling on my NVIDIA GPU. I have a ThinkPad T14 Gen 1 with a GeForce MX330. I followed the guides to install the drivers (Howto/NVIDIA - RPM Fusion) and to make the NVIDIA GPU primary (How to Set Nvidia as Primary GPU on Optimus-based Laptops :: Fedora Docs). I am running driver version 465.27 on Fedora 34 Workstation.

The GPU throttles constantly, even at idle. Right now, just idling, I am seeing:

nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:19:52 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Active
        Display Clock Setting             : Not Active

Here SW Thermal Slowdown is Active, which indicates that the GPU is being throttled even though it is only at 59 degrees Celsius. Running glxgears and checking the clocks, I get:

nvidia-smi -q -d CLOCK

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:23:43 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Clocks
        Graphics                          : 139 MHz
        SM                                : 139 MHz
        Memory                            : 405 MHz
        Video                             : 544 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1911 MHz
        SM                                : 1911 MHz
        Memory                            : 3504 MHz
        Video                             : 1708 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    SM Clock Samples
        Duration                          : 18446744073709.55 sec
        Number of Samples                 : 100
        Max                               : 1531 MHz
        Min                               : 139 MHz
        Avg                               : 0 MHz
    Memory Clock Samples
        Duration                          : 18446744073709.55 sec
        Number of Samples                 : 100
        Max                               : 3504 MHz
        Min                               : 405 MHz
        Avg                               : 0 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A

So the GPU is clearly being heavily throttled: under load, the graphics clock sits at 139 MHz against a maximum of 1911 MHz.
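To make the degree of throttling concrete, here is a minimal Python sketch that reads excerpts of the two reports quoted above and extracts the active throttle reasons and the current-to-maximum clock ratio. The helper names (parse_sections, active_throttle_reasons, clock_fraction) are made up for illustration, and the parser assumes the indented "key : value" layout shown in this post; other driver versions may format the report differently.

```python
import re

# Excerpts copied from the nvidia-smi reports quoted in this post.
PERFORMANCE_REPORT = """\
GPU 00000000:2D:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
        SW Thermal Slowdown               : Active
        Display Clock Setting             : Not Active
"""

CLOCK_REPORT = """\
GPU 00000000:2D:00.0
    Clocks
        Graphics                          : 139 MHz
        SM                                : 139 MHz
        Memory                            : 405 MHz
        Video                             : 544 MHz
    Max Clocks
        Graphics                          : 1911 MHz
        SM                                : 1911 MHz
        Memory                            : 3504 MHz
        Video                             : 1708 MHz
"""

# A "key : value" line has whitespace on both sides of the colon; header
# lines such as "Max Clocks" or "GPU 00000000:2D:00.0" do not.
KEY_VALUE = re.compile(r"^\s*(.+?)\s+:\s+(.+?)\s*$")


def parse_sections(report):
    """Group "key : value" lines under the most recent header line.

    Deeper nesting (e.g. under "HW Slowdown") is flattened into its section.
    """
    sections = {}
    current = None
    for line in report.splitlines():
        match = KEY_VALUE.match(line)
        if match and current is not None:
            sections[current][match.group(1)] = match.group(2)
        elif not match and line.strip():
            current = line.strip()
            sections.setdefault(current, {})
    return sections


def active_throttle_reasons(report):
    """Return the names of all throttle reasons whose state is 'Active'."""
    reasons = parse_sections(report).get("Clocks Throttle Reasons", {})
    return [name for name, state in reasons.items() if state == "Active"]


def clock_fraction(report, domain):
    """Current clock of `domain` as a fraction of its maximum clock."""
    sections = parse_sections(report)
    current_mhz = int(sections["Clocks"][domain].split()[0])
    max_mhz = int(sections["Max Clocks"][domain].split()[0])
    return current_mhz / max_mhz


print(active_throttle_reasons(PERFORMANCE_REPORT))
print(f"Graphics clock at {clock_fraction(CLOCK_REPORT, 'Graphics'):.1%} of maximum")
```

With the figures above, the only active reason is SW Thermal Slowdown and the graphics clock is running at roughly 7% of its maximum.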

My guess is that this is related to the following settings:

nvidia-smi -q -d TEMPERATURE

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:25:04 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Temperature
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 57 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

Interestingly, if I enable thermald with the --adaptive flag, I get this:

==============NVSMI LOG==============

Timestamp                                 : Sat May  8 13:29:56 2021
Driver Version                            : 465.27
CUDA Version                              : 11.3

Attached GPUs                             : 1
GPU 00000000:2D:00.0
    Temperature
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 75 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

And the throttling goes away and performance is suddenly much improved.
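The difference between the two reports is easiest to see as headroom between the current temperature and GPU Max Operating Temp. The sketch below compares the two excerpts; read_temps and headroom are hypothetical helper names, and the idea that SW Thermal Slowdown engages as the current temperature approaches Max Operating Temp is only my guess from these numbers, not something the driver documentation confirms here.

```python
import re

# Temperature excerpts copied from the two reports above.
WITHOUT_THERMALD = """\
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 57 C
        GPU Target Temperature            : N/A
"""

WITH_THERMALD_ADAPTIVE = """\
        GPU Current Temp                  : 56 C
        GPU Shutdown Temp                 : 102 C
        GPU Slowdown Temp                 : 97 C
        GPU Max Operating Temp            : 75 C
        GPU Target Temperature            : N/A
"""

TEMP_LINE = re.compile(r"^\s*(.+?)\s+:\s+(\d+) C\s*$", re.MULTILINE)


def read_temps(report):
    """Map each "<name> : <n> C" line to an integer; N/A fields are skipped."""
    return {name: int(value) for name, value in TEMP_LINE.findall(report)}


def headroom(report):
    """Degrees between the current temperature and Max Operating Temp."""
    temps = read_temps(report)
    return temps["GPU Max Operating Temp"] - temps["GPU Current Temp"]


print("headroom without thermald:", headroom(WITHOUT_THERMALD), "C")
print("headroom with thermald --adaptive:", headroom(WITH_THERMALD_ADAPTIVE), "C")
```

Without thermald the GPU has 1 C of headroom at 56 C; with thermald --adaptive it has 19 C, which matches the throttling disappearing.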

So apparently thermald can change this setting, but I cannot change it manually because “GPUMaxOperatingTempThreshold” is a read-only attribute:

nvidia-settings -a GPUMaxOperatingTempThreshold=80

ERROR: The attribute 'GPUMaxOperatingTempThreshold' specified in assignment 'GPUMaxOperatingTempThreshold=80' cannot be assigned (it is a read-only
       attribute).

I am now on Fedora 34 but I saw the exact same problem on Ubuntu 20.10.

I don’t really know what’s going on here, but it seems strange that I should have to run thermald just to escape this throttling problem (and even then, I think 75 C is too low a temperature to throttle at). To be honest, I don’t really understand the interplay between GPU Slowdown Temp and GPU Max Operating Temp. They seem synonymous to me.

Here’s the full output from nvidia-smi:

Sat May  8 15:23:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| N/A   67C    P0    N/A /  N/A |    578MiB /  2002MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2762      G   /usr/libexec/Xorg                 293MiB |
|    0   N/A  N/A      2953      G   /usr/bin/gnome-shell               88MiB |
|    0   N/A  N/A      4524      G   ...AAAAAAAAA= --shared-files      134MiB |
|    0   N/A  N/A      5395      G   ...e/Steam/ubuntu12_32/steam       18MiB |
|    0   N/A  N/A      5604      G   ./steamwebhelper                    1MiB |
|    0   N/A  N/A      6303      G   ...AAAAAAAAA= --shared-files        6MiB |
|    0   N/A  N/A      7422      G   anki                               27MiB |
|    0   N/A  N/A     21305      G   /usr/bin/gjs                        2MiB |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log.gz (1.2 MB)

Since thermald is able to set this, it appears to be a system profile configured through ACPI. The NVIDIA driver won’t help in that case. You could check with the thermald developers, or in the thermald source, how that’s accomplished.

The NVIDIA driver might be responding to some kind of system profile, but in that case I wonder whether what it is doing is really reasonable. thermald is nowadays disabled by default on some modern laptops with internal power management, like my Lenovo ThinkPad T14, so I would assume this issue affects many users.

If the notebook manufacturer designed it that way, how could the driver override it, possibly damaging the hardware?
https://mjg59.dreamwidth.org/54923.html

Is there any output in dmesg when it throttles or in general?

No, there’s nothing there (in dmesg).

Here is the output of dmesg | grep -iP "nvidia|gpu|graphics|video|thermal":

[    0.000000] Command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.11.18-300.fc34.x86_64 root=UUID=c1325a0f-113a-4ad5-bf64-be324ff943b8 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
[    0.156686] Reserving Intel graphics memory at [mem 0x8b800000-0x8f7fffff]
[    0.165326] Kernel command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-5.11.18-300.fc34.x86_64 root=UUID=c1325a0f-113a-4ad5-bf64-be324ff943b8 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
[    0.332283] mce: CPU0: Thermal monitoring enabled (TM1)
[    0.350029] thermal_sys: Registered thermal governor 'fair_share'
[    0.350031] thermal_sys: Registered thermal governor 'bang_bang'
[    0.350032] thermal_sys: Registered thermal governor 'step_wise'
[    0.350033] thermal_sys: Registered thermal governor 'user_space'
[    0.551119] ACPI: Added _OSI(Linux-Dell-Video)
[    0.551119] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.705965] ACPI: \_SB_.PR00: _OSC native thermal LVT Acked
[    0.827183] pci 0000:00:02.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    1.111107] efifb: No BGRT, not showing boot graphics
[    1.115656] thermal LNXTHERM:00: registered as thermal_zone0
[    1.115660] ACPI: Thermal Zone [THM0] (79 C)
[    1.834077] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
[    1.834348] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input13
[    1.850519] ACPI: Video Device [PEGP] (multi-head: no  rom: yes  post: no)
[    1.850562] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:4c/LNXVIDEO:01/input/input14
[    5.062263] thinkpad_acpi: This ThinkPad has standard ACPI backlight brightness control, supported by the ACPI video driver
[    5.063220] intel_pch_thermal 0000:00:12.0: enabling device (0000 -> 0002)
[    5.072854] proc_thermal 0000:00:04.0: enabling device (0000 -> 0002)
[    5.076995] proc_thermal 0000:00:04.0: Creating sysfs group for PROC_THERMAL_PCI
[    5.232387] nvidia: loading out-of-tree module taints kernel.
[    5.232400] nvidia: module license 'NVIDIA' taints kernel.
[    5.278696] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.299920] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[    5.300795] nvidia 0000:2d:00.0: enabling device (0006 -> 0007)
[    5.442330] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
[    5.539976] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.27  Thu Apr 22 23:21:03 UTC 2021
[    5.568987] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    5.576283] nvidia-uvm: Loaded the UVM driver, major device number 509.
[    5.596264] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  465.27  Thu Apr 22 23:12:47 UTC 2021
[    5.615240] [drm] [nvidia-drm] [GPU ID 0x00002d00] Loading driver
[    5.637909] thermal thermal_zone7: failed to read out thermal zone (-61)
[    6.298111] videodev: Linux video capture interface: v2.00
[    6.425911] uvcvideo: Found UVC 1.10 device Integrated Camera (04f2:b6d0)
[    6.435983] uvcvideo: Found UVC 1.50 device Integrated Camera (04f2:b6d0)
[    6.753420] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:2d:00.0 on minor 1
[    7.267256] uvcvideo: Found UVC 1.00 device HD Pro Webcam C920 (046d:0892)
[    7.269010] usbcore: registered new interface driver uvcvideo
[    7.269011] USB Video Class driver (1.1.1)
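For anyone who wants to narrow a log like this down further, here is a small Python sketch that applies the same topic filter as the grep above and then keeps only lines that also mention a problem. The PROBLEM keyword list and the suspicious_lines helper are assumptions chosen for illustration, not an established convention; the excerpt is copied from the output above.

```python
import re

# A few lines copied from the dmesg output above.
DMESG_EXCERPT = """\
[    1.115660] ACPI: Thermal Zone [THM0] (79 C)
[    5.278696] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.300795] nvidia 0000:2d:00.0: enabling device (0006 -> 0007)
[    5.637909] thermal thermal_zone7: failed to read out thermal zone (-61)
[    6.298111] videodev: Linux video capture interface: v2.00
"""

# Same topic pattern as the grep in this post, case-insensitive.
TOPIC = re.compile(r"nvidia|gpu|graphics|video|thermal", re.IGNORECASE)
# Hypothetical keyword list for lines that report a problem.
PROBLEM = re.compile(r"fail|error|throttl", re.IGNORECASE)


def suspicious_lines(log):
    """Return lines that match the topic filter and mention a problem."""
    return [
        line
        for line in log.splitlines()
        if TOPIC.search(line) and PROBLEM.search(line)
    ]


for line in suspicious_lines(DMESG_EXCERPT):
    print(line)
```

On this excerpt it keeps the module signature warning and the thermal_zone7 read failure. Neither mentions throttling, which is consistent with there being nothing throttle-related in dmesg.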

Some more information on things I’ve been trying out but which haven’t helped so far