Poor overall performance and excessive heat generation with GeForce GTX 1070 with Max-Q Design

Running driver version: 396.24
OS: pop-os 18.04 (Ubuntu 18.04 with Nvidia drivers by System76)
X Server version: 11.0
X Server vendor version: 1.19.6 (11906000)
NV-CONTROL Version: 1.29
Resolution: 3840x2160
CPU: core i7 8th gen 6 core
Memory: 32GB
Laptop model: Oryx Pro - System76 (new)

When I start the laptop, the video card fans produce a constant noise that does not stop after the laptop has booted and is idle (GPU fan speed 4823 according to sensors), and the laptop heats up. When I open a webpage in Chrome or Firefox, scrolling stutters. It is not smooth at all; I can even see a refresh line in the middle of my screen. When I move a window, I can also see a refresh line at the window’s edges.

When I start some development work using IntelliJ IDEA, the fans spin louder and the heat increases dramatically as well. I’ve also tried a 1920x1080 resolution, but the same happens.

When I switch to the Intel onboard graphics, the laptop is silent and graphics performance increases.

I’ve tried contacting System76 and they don’t seem to have a solution.

I don’t know what settings I could try.

Any tips? Thanks! :)

hestersco@pop-os:~ $ nvidia-smi 
Sat Sep 15 18:44:29 2018       
| NVIDIA-SMI 396.24                 Driver Version: 396.24                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   56C    P0    41W /  N/A |   1185MiB /  8119MiB |     11%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      2289      G   /usr/lib/xorg/Xorg                           665MiB |
|    0      2447      G   /usr/bin/gnome-shell                         330MiB |
|    0      2929      G   ...-token=30A8EBE621B4C6709CC029AEEE6A741F    81MiB |
|    0      3862      G   /usr/bin/nvidia-settings                       2MiB |
|    0      6125      G   /usr/lib/firefox/firefox                       2MiB |
|    0      6401      G   /usr/lib/firefox/firefox                       2MiB |
|    0      7727      G   /usr/lib/firefox/firefox                       2MiB |
|    0      8313      G   ...quest-channel-token=5488405317577830755    97MiB |

nvidia-bug-report.log.gz (183 KB)

To get rid of tearing, use the kernel parameter nvidia-drm.modeset=1; this enables PRIME sync.
Right after start, your CPU is already at the temperature throttling threshold; please check what’s eating the CPU using ‘top’.
Also, the GPU is at 33W power draw without much utilization, something’s fishy there.
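
One way to look into a draw like this is to log the performance state, power draw and utilization together and see which samples combine low utilization with high power. The following is only a sketch: it assumes the nvidia-smi in use accepts the --query-gpu fields pstate, power.draw and utilization.gpu, and the function name flag_idle_draw and both thresholds are made up for illustration, not vendor figures.

```shell
# Reads lines of "pstate, power_watts, util_percent" on stdin, as produced by:
#   nvidia-smi --query-gpu=pstate,power.draw,utilization.gpu \
#              --format=csv,noheader,nounits -l 5
# Prints only the samples where utilization is low but power draw is high.
# flag_idle_draw and its default thresholds are hypothetical.
flag_idle_draw() {
  max_util="${1:-15}"   # percent: at or below this counts as near idle
  min_watts="${2:-20}"  # watts: above this is suspicious for an idle GPU
  awk -F', *' -v u="$max_util" -v w="$min_watts" '
    $3 + 0 <= u && $2 + 0 > w { printf "%s %sW at %s%% util\n", $1, $2, $3 }
  '
}
```

Piping the nvidia-smi command from the comment into flag_idle_draw while the desktop sits idle should show whether the card stays in P0 at a high wattage or drops to a lower state.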

Thanks for your help. I configured modeset=1 and the tearing is a lot less, nearly gone :)

gnome-shell is often between 8% and 16% CPU usage according to top. I’m not sure what to make of that; maybe there are some gnome plugins that I should disable (I’ve got plugins like temperature sensors, weather, etc.). I also wonder whether the processor heat could be influenced by the GPU temperature. When running on nvidia graphics, the laptop gets hot overall in the area above my keyboard.

Any suggestions how to investigate the fishy GPU power draw?

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                          
 2548 hesters+  20   0 4446240 428400 102228 S   8,9  1,3   2:02.79 gnome-shell                                                      
 2806 hesters+  20   0 99,770g 315140  90660 S   3,3  1,0   0:51.35 geary                                                            
 2419 hesters+  20   0  991044 281244 190380 S   2,0  0,9   1:03.69 Xorg                                                             
 5945 hesters+  20   0 9710820 813488 315068 S   1,7  2,5   1:44.90 firefox                                                          
  273 root     -51   0       0      0      0 S   1,0  0,0   0:05.33 irq/131-nvidia                                                   
 2809 hesters+  20   0 1164008 317356  48876 S   1,0  1,0   0:21.52 albert                                                           
   83 root      20   0       0      0      0 I   0,7  0,0   0:00.55 kworker/0:1                                                      
 3807 hesters+  20   0  739160  73688  35500 S   0,7  0,2   0:01.57 x-terminal-emul                                                  
 6152 hesters+  20   0 2041180 238584 105948 S   0,7  0,7   1:22.54 Web Content                                                      
 6497 hesters+  20   0 1989312 331264 248648 S   0,7  1,0   0:08.83 Web Content                                                      
    8 root      20   0       0      0      0 I   0,3  0,0   0:01.01 rcu_sched                                                        
  187 root      20   0       0      0      0 I   0,3  0,0   0:00.75 kworker/u24:3                                                    
  304 root      20   0       0      0      0 S   0,3  0,0   0:00.53 system76-polld                                                   
  462 root      20   0       0      0      0 S   0,3  0,0   0:00.19 jbd2/dm-1-8                                                      
  465 root       0 -20       0      0      0 I   0,3  0,0   0:00.30 kworker/u25:4                                                    
 1515 gdm       20   0 4910388 199660  93080 S   0,3  0,6   0:05.33 gnome-shell                                                      
 6013 hesters+  20   0 2004808 220960 107356 S   0,3  0,7   0:14.80 Web Content                                                      
 8040 hesters+  20   0   52576   4320   3436 R   0,3  0,0   0:00.15 top                                                              
    1 root      20   0  225488   9192   6612 R   0,0  0,0   0:02.60 systemd

Using PRIME sync, aka nvidia-drm.modeset=1, should eliminate tearing completely. Maybe it’s not working and you’re just experiencing a placebo effect now; please create a new nvidia-bug-report.log and attach it.
Since your GPU is drawing 33 watts near idle, no wonder your keyboard is getting hot. Depending on the cooling design of your notebook, this can influence the temperature of your CPU.
It’s kind of a sticky mess you’re in right now. I suspect gnome-shell is taking a lot of CPU due to some ‘optimizations’ for nvidia GPUs that are known to be flaky, so some distros revert them. Still, that shouldn’t have so much impact that your GPU draws that much power. The Max-Q design is a reduced-power design that should trigger the driver to take a power-efficient approach over the normal render-maxing one. So there’s definitely something wrong with the driver as well.
Any chance you could revert to a 390.x driver to check if bugs have been introduced in the 396 driver line?
To eliminate cross effects from the gnome-shell/mutter optimizations, could you try with another DE like xfce?

I will try xfce and see what that does. I will also file a support ticket with System76 asking how I should install driver version 390.x, because the nvidia driver is installed as a dependency of system76-driver-nvidia. I want to be sure I revert in a safe way because I use this machine for my work as well, so I cannot do without it. The installed packages are shown below.

I will try XFCE tomorrow because it’s night time here now. Thanks again :)

hestersco@pop-os:~ $ apt show system76-driver-nvidia 
Package: system76-driver-nvidia
Version: 18.04.29~1536865630~18.04~6372b0f
Priority: extra
Section: utils
Source: system76-driver
Maintainer: System76, Inc. <dev@system76.com>
Installed-Size: 19,5 kB
Depends: system76-driver (>= 18.04.29~1536865630~18.04~6372b0f), ubuntu-drivers-common, nvidia-driver-390
Download-Size: 3.128 B
APT-Manual-Installed: yes
APT-Sources: http://ppa.launchpad.net/system76/pop/ubuntu bionic/main amd64 Packages
Description: Latest nvidia driver for System76 computers
 This dummy package depends on the latest driver tested with and recommended for
 System76 products with an nvidia GPU.
 When this package is installed, you will automatically be upgraded to newer
 nvidia driver versions after System76 has thouroughly tested them.
 This driver will generally depend on a newer nvidia driver than the official
 nvidia-current-updates Ubuntu package.
 If you don't want to be automatically upgraded to newer nvidia drivers, simply
 remove this package.

hestersco@pop-os:~ $ apt list --installed | grep nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
libnvidia-common-396/bionic,bionic,now 396.24-0ubuntu1~pop1 all [installed,automatic]
libnvidia-compute-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
libnvidia-decode-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
libnvidia-encode-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
libnvidia-fbc1-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
libnvidia-gl-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
libnvidia-ifr1-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
nvidia-compute-utils-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
nvidia-dkms-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
nvidia-driver-390/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed]
nvidia-driver-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
nvidia-kernel-common-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
nvidia-kernel-source-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
nvidia-settings/bionic,now 390.42-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]
system76-driver-nvidia/bionic,bionic,now 18.04.29~1536865630~18.04~6372b0f all [installed]
xserver-xorg-video-nvidia-396/bionic,now 396.24-0ubuntu1~pop1 amd64 [installed,automatic]

I just tried installing xubuntu-desktop, but logging into an xfce session leads me back to the login screen. I can still boot back into a gnome session. I chose gdm3 as the display manager; maybe I should switch to lightdm. I will test that tomorrow.

When I run dmesg I see some weird stuff similar to what you mentioned earlier:

[   11.211469] ieee80211 phy0: Selected rate control algorithm 'iwl-mvm-rs'
[   11.211618] (NULL device *): hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[   11.211627] thermal thermal_zone3: failed to read out thermal zone (-61)
[   11.212638] iwlwifi 0000:03:00.0 wlp3s0: renamed from wlan0
[   11.234273] Adding 4193788k swap on /dev/mapper/cryptswap.  Priority:-2 extents:1 across:4193788k SSFS
[   18.108443] mei_me 0000:00:16.0: wait hw ready failed
[   18.108462] mei_me 0000:00:16.0: hw_start failed ret = -62
[   18.108472] mei_me 0000:00:16.0: H_RST is set = 0x80060631
[   20.124544] mei_me 0000:00:16.0: wait hw ready failed
[   20.124547] mei_me 0000:00:16.0: hw_start failed ret = -62
[   20.124557] mei_me 0000:00:16.0: H_RST is set = 0x80060631
[   22.140509] mei_me 0000:00:16.0: wait hw ready failed
[   22.140512] mei_me 0000:00:16.0: hw_start failed ret = -62
[   22.140516] mei_me 0000:00:16.0: reset: reached maximal consecutive resets: disabling the device
[   22.140517] mei_me 0000:00:16.0: reset failed ret = -19
[   22.140518] mei_me 0000:00:16.0: link layer initialization failed.
[   22.140519] mei_me 0000:00:16.0: init hw failure.
[   22.140620] mei_me 0000:00:16.0: initialization failed.
[   22.140829] snd_hda_intel 0000:00:1f.3: enabling device (0000 -> 0002)
[   22.141034] snd_hda_intel 0000:01:00.1: enabling device (0000 -> 0002)
[   22.141096] snd_hda_intel 0000:01:00.1: Disabling MSI
[   22.141100] snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
[   22.141402] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_componen
[   34.115227] Bluetooth: RFCOMM ver 1.11
[   34.778932] rfkill: input handler disabled
[   35.036250] CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
[   35.036251] CPU9: Core temperature above threshold, cpu clock throttled (total events = 1)
[   35.036252] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036253] CPU11: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036254] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036255] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036256] CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036257] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036258] CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036259] CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036260] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036261] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036263] CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.036271] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[   35.037293] CPU9: Core temperature/speed normal
[   35.037294] CPU3: Core temperature/speed normal
[   35.037294] CPU7: Package temperature/speed normal
[   35.037295] CPU2: Package temperature/speed normal
[   35.037296] CPU8: Package temperature/speed normal
[   35.037297] CPU5: Package temperature/speed normal
[   35.037297] CPU11: Package temperature/speed normal
[   35.037298] CPU6: Package temperature/speed normal
[   35.037299] CPU0: Package temperature/speed normal
[   35.037299] CPU10: Package temperature/speed normal
[   35.037300] CPU4: Package temperature/speed normal
[   35.037301] CPU1: Package temperature/speed normal
[   35.037301] CPU3: Package temperature/speed normal
[   35.037302] CPU9: Package temperature/speed normal
[   46.528922] rfkill: input handler enabled
[   99.696090] rfkill: input handler disabled

While you’re at it, you could search for or open an issue report with Pop!_OS for the gnome-shell/mutter nvidia optimizations issue and point them to: https://bugzilla.gnome.org/show_bug.cgi?id=789186

I filed an issue with POP-OS (https://github.com/pop-os/pop/issues/356).

I tried using lightdm, but my system didn’t boot anymore; in recovery mode I switched back to gdm3 and things work again. I purged xubuntu-desktop and will give it a try from a pendrive. Although the nvidia drivers will not be loaded then, I might install it on a USB drive and install the drivers there.

I also ran the game Xonotic with all settings on ultra, and it ran very smoothly, which makes me think it’s a gnome/gdm3 issue. Booting into xfce with nvidia drivers did not work.

See this how to get rid of the mei errors by blacklisting the driver:
Unrelated to the temperature problem, though.

Try adding
options nvidia NVreg_RegistryDwords="OverrideMaxPerf=0x1"
to /etc/modprobe.d/nvidia.conf
For me this forces GPU (1060) in lowest power state with power draw between 3 and 9W.

Note 1:
options nvidia NVreg_RegistryDwords="PowerMizerEnable=0x1;PerfLevelSrc=0x2222;PowerMizerDefault=0x1;PowerMizerDefaultAC=0x1"
has the same effect for me.

Note 2:
Adding the flags to xorg.conf has no effect for me (driver 396.54) while adding them to driver options works.
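
For anyone trying the module options above, here is a rough sketch of adding such a line without duplicating it on repeated runs. The helper name add_option_line is made up for illustration; note that the option value has to use plain ASCII double quotes, not typographic ones.

```shell
# Appends a line to a config file only if that exact line is not already there.
# Usage: add_option_line FILE LINE
# add_option_line is a hypothetical helper, not part of any driver tooling.
add_option_line() {
  conf="$1"
  line="$2"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
}
```

Run from a root shell, a call would look like add_option_line /etc/modprobe.d/nvidia.conf 'options nvidia NVreg_RegistryDwords="OverrideMaxPerf=0x1"'. On Ubuntu-based systems the initramfs may also need regenerating (for example with update-initramfs -u) before a reboot picks up the change; that step is an assumption and not something confirmed in this thread.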

I disabled the mei_me driver and my system boots faster :D.

I’ve contacted the manufacturer and sent them log files; they say it might be a hardware issue. So they will fix it under warranty.

After configuring nvidia-drm.modeset=1 I cannot use external displays anymore. I fear I’ve got to undo that setting.

@CoBoMi, thanks for your suggestions. I might give them a try if the RMA with the manufacturer takes a while.

I don’t think there is anything wrong with your laptop. It’s just that the NVIDIA driver has deeply broken power management.

From my tests under the Adaptive power profile, as soon as I move the mouse the GPU jumps to full frequency (1.4GHz GPU, 8GHz for me), stays there for 30 seconds, and only reaches max power saving by second 35.
So when I use the laptop, the GPU is almost always in the P0 state, with a power draw of 25W (idle), temperatures over 55C and fans screaming angrily at me.
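
To put a number on how much time the GPU spends in each state, the samples can simply be counted. This is a minimal sketch, assuming nvidia-smi supports the pstate query field; the function name pstate_counts is made up for illustration.

```shell
# Reads one performance state per line (P0, P8, ...) on stdin, e.g. from:
#   timeout 60 nvidia-smi --query-gpu=pstate --format=csv,noheader -l 1
# and prints "STATE COUNT" for each state seen.
# pstate_counts is a hypothetical helper name.
pstate_counts() {
  sort | uniq -c | awk '{ print $2, $1 }'
}
```

With one sample per second, the counts are roughly the seconds spent in each state during the capture, so a minute of light desktop use that reports almost only P0 would match the behaviour described above.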

Below are my nvidia-smi readings without and with OverrideMaxPerf at idle with just a terminal and Nvidia Settings opened.

| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   56C    P0    25W /  N/A |    176MiB /  6078MiB |      7%      Default |

and with OverrideMaxPerf

| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8     2W /  N/A |    318MiB /  6078MiB |      5%      Default |

Since yours is a 1070 Max-Q, 41W is normal and very close to the 40W TDP.
If the external monitor works without modeset, it seems to me like a configuration/driver issue.

I tried adding options nvidia NVreg_RegistryDwords="OverrideMaxPerf=0x1" to nvidia.conf. It indeed did not heat up that much, but performance was the worst. I removed that line again, but it stayed in power level 0 after reboot. I then saved something via nvidia-settings and got stuck after logging in. I had to manually switch to Intel graphics on the command line. My nvidia setup is now completely useless. I tried resetting xorg.conf, no success; I tried resetting .nvidia-settings-rc from the backup file, no success. I might have to purge and reinstall the nvidia driver package, but I am not sure that will help. I also need this laptop for work tomorrow, so I fear messing it up. I already broke a sweat a moment ago :P

You could chime in here https://devtalk.nvidia.com/default/topic/1002912/very-slow-ramp-down-from-high-to-low-clock-speeds-leading-to-a-significantly-increased-power-consumption/

I fixed my nvidia set-up by removing xorg.conf and deleting .nvidia-settings-rc in my home dir.
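
For anyone repeating this, a more cautious variant is to move the files aside instead of deleting them, so they can be restored if needed. This is a sketch of that swapped-in approach, not what was actually run here; the helper name move_aside is made up, and /etc/X11/xorg.conf is only an assumed location for xorg.conf.

```shell
# Moves a file aside with a timestamped suffix instead of deleting it.
# Prints the backup path; does nothing if the file does not exist.
# move_aside is a hypothetical helper name.
move_aside() {
  f="$1"
  [ -e "$f" ] || return 0
  backup="$f.bak.$(date +%Y%m%d%H%M%S)"
  mv -- "$f" "$backup" && printf '%s\n' "$backup"
}
```

For example, move_aside "$HOME/.nvidia-settings-rc" as the normal user, and move_aside /etc/X11/xorg.conf from a root shell if the file lives there.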

It’s back on the old adaptive mode, where it is immediately on level 4 (max perf, max memory transfer rate); after not doing anything it slowly scales down to 0, where it doesn’t use much power (as explained in the link provided by @birdie).

However, @birdie and @CoBoMi, I agree that the power management might be deeply broken, but at the highest performance level right after boot I would expect fluid graphics. Scrolling down the answers on this forum isn’t smooth at all; it’s smoother on Intel graphics. When I play a game it runs OK, though, with the same heat production. The fact is that my 4-5 year old laptop, also with an nvidia card but running the nouveau drivers, ran a lot smoother and didn’t heat up so fast.

Just working on my laptop using nvidia drivers almost overheats the system, and the manufacturer has acknowledged this isn’t normal and doesn’t happen with their test models. They also acknowledged it might be a hardware issue as it has happened before, and therefore they are looking into RMA after I provided all the log files while the system was under normal load.

Sorry, I was so concerned about thermals that I missed the part about scrolling.
For me, even locked at the max power saving setting in P8 state with 607MHz GPU and 810MHz memory, everything is nice and smooth.

What kind of monitor setup are you using? External monitor or laptop screen?
Are you sure it is not screen tearing? Are the bars in this video straight while playing full screen?

Hi CoBoMi,

Sorry for the late response. That video is indeed showing me screen tearing (it doesn’t even have to be full screen). When I connect an external monitor, the tearing does not happen there, but it does on my laptop screen. I know this could be a settings issue, but I tried different refresh rates and the tearing still happens with every possible setting.

I just got a replacement laptop for the time being and this one is going back for the thermal problem. I will post if the repairs helped.

Thanks for the help :)

I believe that may be the cause of your scrolling stuttering.
Most likely PRIME Sync is not enabled; you can check it with

xrandr --verbose

and look for PRIME Synchronization: 0 (0 disabled, 1 enabled)

PRIME Sync requires kernel modesetting in the driver; to enable it, add nvidia-drm.modeset=1 to the kernel command line or

options nvidia-drm modeset=1

to /etc/modprobe.d/nvidia.conf
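
The xrandr check above can be narrowed down to just the relevant value. This is a small sketch, assuming the property appears in the output as a line of the form "PRIME Synchronization: N"; the function name prime_sync_values is made up for illustration.

```shell
# Reads `xrandr --verbose` output on stdin and prints the value of every
# "PRIME Synchronization" property it finds (1 = enabled, 0 = disabled).
# prime_sync_values is a hypothetical helper name.
prime_sync_values() {
  # "+ 0" turns the field into a plain number, dropping trailing whitespace
  awk -F': *' '/PRIME Synchronization:/ { print $2 + 0 }'
}
```

Usage would be xrandr --verbose | prime_sync_values. If nothing is printed, the property is not exposed at all, which would suggest modeset is not active; reading /sys/module/nvidia_drm/parameters/modeset should show Y once the option has taken effect.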