Xorg crashes and is unresponsive with driver option NVreg_DynamicPowerManagement=0x02

Hi, everyone.

I own a Dell Precision 7530 with an Intel UHD Graphics P630 and an NVIDIA Quadro RTX 3000 GPU, set up in hybrid graphics mode (the BIOS allows me to switch between hybrid and dGPU-only).

I’ve followed the PRIME offloading section of the NVIDIA Linux driver manual, the Arch Wiki section on the topic, and another forum post here. As a result, I have the following in /etc/X11/xorg.conf.d/xorg.conf:

Section "ServerLayout"
    Identifier "layout"
    Screen 0 "iGPU"
    Option "AllowNVIDIAGPUScreens"
EndSection

Section "Device"
    Identifier "iGPU"
    Driver "modesetting"
    BusID "PCI:0:2:0"
EndSection

Section "Screen"
    Identifier "iGPU"
    Device "iGPU"
EndSection

Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "PCI:1:0:0"
EndSection

With this configuration, PRIME offloading works perfectly, and I can even run Proton games at high performance. Arch Linux’s repository also provides a convenient nvidia-prime package with a useful script, prime-run, which wraps a command in the appropriate environment variables so that the program runs on the NVIDIA dGPU.
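For anyone without that package, the idea behind such a wrapper can be sketched in a few lines of shell. The function name offload_run is hypothetical, and this is an illustration rather than a copy of the packaged script; the environment variables are the ones described in the PRIME render offload chapter of the NVIDIA driver README.

```shell
#!/bin/sh
# Hypothetical helper (not the packaged prime-run script): run a command with
# the PRIME render offload environment variables set for that command only.
offload_run() {
    env __NV_PRIME_RENDER_OFFLOAD=1 \
        __GLX_VENDOR_LIBRARY_NAME=nvidia \
        __VK_LAYER_NV_optimus=NVIDIA_only \
        "$@"
}
```

Assuming glxinfo is installed, something like `offload_run glxinfo | grep "OpenGL renderer"` should then name the NVIDIA GPU instead of the Intel one.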

Next, given my GPU is of the Turing generation (TU106), I have also followed the Runtime D3 power management section in the NVIDIA manual, and now I have the following:

/lib/udev/rules.d/80-nvidia-pm.rules

# Remove NVIDIA USB xHCI Host Controller devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c0330", ATTR{remove}="1"

# Remove NVIDIA USB Type-C UCSI devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{remove}="1"

# Remove NVIDIA Audio devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x040300", ATTR{remove}="1"

# Enable runtime power management for NVIDIA VGA/3D controller devices on driver bind
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="auto"
ACTION=="bind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="auto"

# Disable runtime power management for NVIDIA VGA/3D controller devices on driver unbind
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", TEST=="power/control", ATTR{power/control}="on"
ACTION=="unbind", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030200", TEST=="power/control", ATTR{power/control}="on"
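To confirm that the bind rules above actually fired, the kernel's runtime PM policy for the device can be read back from sysfs. This is a minimal sketch: the helper name gpu_pm_control is hypothetical, and the default path assumes the dGPU sits at 0000:01:00.0, as the BusID in my xorg.conf suggests.

```shell
#!/bin/sh
# Print the runtime PM policy ("auto" or "on") for a PCI device directory.
# The default path is an assumption based on the dGPU being at 0000:01:00.0.
gpu_pm_control() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    cat "$dev_dir/power/control"
}
```

If the rules were applied, calling gpu_pm_control with no arguments should print auto while the nvidia driver is bound.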

/etc/modprobe.d/nvidia.conf

options nvidia "NVreg_DynamicPowerManagement=0x01"
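After a reboot, it is worth checking that the loaded module really picked up this option. The sketch below assumes the driver exposes its registry keys in /proc/driver/nvidia/params as "Name: value" lines without the NVreg_ prefix; the helper name nv_dpm_value is hypothetical.

```shell
#!/bin/sh
# Print the DynamicPowerManagement value reported by the loaded nvidia module.
# The params file is an optional argument so another file can be substituted.
nv_dpm_value() {
    params_file="${1:-/proc/driver/nvidia/params}"
    # Match the key exactly so longer keys sharing the same prefix are skipped,
    # then print the field after the colon.
    awk -F': *' '$1 == "DynamicPowerManagement" { print $2 }' "$params_file"
}
```

Calling nv_dpm_value with no arguments reads the live file; if it prints nothing, the module is not loaded or the key is reported under a different name.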

Now, setting the option in the last file to 0x02, as recommended in the guide, and rebooting makes Xorg crash and become unresponsive whenever:

  1. the tty is switched (Ctrl + Alt + Fn keys),
  2. the OS is shut down or rebooted from the DE running on Xorg,
  3. or the notebook is sent to sleep.

Setting the option to 0x01 works, but Xorg is listed as an active process in nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 3000     Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8     6W /  N/A |     64MiB /  5934MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15001      G   /usr/lib/Xorg                                 62MiB |
+-----------------------------------------------------------------------------+

My average power draw is also around 15-20 W, which, on my 97 Wh battery, nets only around 5 hours. If the dGPU were to truly power down with RTD3 power management as it does in Windows, I would expect roughly 4 additional hours, nearly doubling my battery life.
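The runtime figures here are just capacity divided by average draw. A minimal sketch, using a hypothetical helper named battery_hours, makes them easy to re-check:

```shell
#!/bin/sh
# Estimate battery runtime in hours from capacity (Wh) and average draw (W).
battery_hours() {
    awk -v wh="$1" -v w="$2" 'BEGIN { printf "%.2f\n", wh / w }'
}
```

With my numbers, `battery_hours 97 20` gives 4.85 and `battery_hours 97 15` gives 6.47. Working backwards, reaching about 9 hours on the same battery would need an average draw of roughly 10.8 W.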

I have also made another post regarding this matter, but I believe I was less clear there. Below is an excerpt of the relevant log entries from that post:

Feb 11 08:26:19.230740 arch-p7530 kernel: nvidia: loading out-of-tree module taints kernel.
Feb 11 08:26:19.230824 arch-p7530 kernel: nvidia: module license 'NVIDIA' taints kernel.
Feb 11 08:26:19.237414 arch-p7530 kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Feb 11 08:26:19.244065 arch-p7530 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Feb 11 08:26:19.244101 arch-p7530 kernel: nvidia 0000:01:00.0: enabling device (0006 -> 0007)
Feb 11 08:26:19.244256 arch-p7530 kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Feb 11 08:45:00.340617 arch-p7530 kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
Feb 11 08:45:00.833955 arch-p7530 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Feb 11 08:45:00.834010 arch-p7530 kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Feb 11 08:45:00.887277 arch-p7530 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.59  Thu Jan 30 00:59:18 UTC 2020
Feb 11 08:45:00.890604 arch-p7530 kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Feb 11 08:45:01.720738 arch-p7530 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Feb 11 08:45:01.817667 arch-p7530 kernel: Modules linked in: nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) snd_seq_dummy snd_seq snd_seq_device ipmi_devintf ipmi_msghandler msr ccm algif_aead des_generic libdes arc4 algif_skcipher cmac md4 algif_hash af_alg snd_hda_codec_hdmi btusb uvcvideo btrtl btbcm btintel videobuf2_vmalloc videobuf2_memops bluetooth snd_hda_codec_realtek videobuf2_v4l2 snd_hda_codec_generic videobuf2_common videodev ecdh_generic ecc mc joydev snd_sof_pci mousedev snd_sof_intel_byt snd_sof_intel_ipc snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda dell_rbtn snd_sof snd_hda_ext_core hid_multitouch snd_soc_acpi_intel_match iwlmvm dell_wmi dell_laptop snd_soc_acpi hid_generic ledtrig_audio x86_pkg_temp_thermal intel_powerclamp dell_smbios snd_soc_core coretemp iTCO_wdt mei_wdt mei_hdcp iTCO_vendor_support intel_rapl_msr mac80211 snd_compress kvm_intel dell_wmi_descriptor ac97_bus dcdbas i915 intel_wmi_thunderbolt mxm_wmi wmi_bmof dell_smm_hwmon libarc4 snd_pcm_dmaengine
Feb 11 08:45:01.817826 arch-p7530 kernel:  kvm nls_iso8859_1 nls_cp437 snd_hda_intel vfat snd_intel_dspcfg i2c_algo_bit irqbypass crct10dif_pclmul iwlwifi fat snd_hda_codec drm_kms_helper crc32_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep fuse drm cfg80211 aesni_intel snd_pcm ofpart crypto_simd cmdlinepart cryptd intel_spi_pci glue_helper intel_spi intel_cstate snd_timer spi_nor intel_uncore mei_me intel_gtt snd agpgart intel_rapl_perf intel_lpss_pci input_leds e1000e mtd i2c_i801 processor_thermal_device ucsi_acpi syscopyarea rfkill soundcore i2c_nvidia_gpu intel_lpss i2c_hid sysfillrect mei idma64 typec_ucsi intel_rapl_common tpm_crb sysimgblt fb_sys_fops hid intel_pch_thermal intel_soc_dts_iosf typec battery int3403_thermal int340x_thermal_zone tpm_tis tpm_tis_core tpm dell_smo8800 wmi rng_core evdev int3400_thermal intel_hid acpi_thermal_rel mac_hid ac sparse_keymap pkcs8_key_parser crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 rtsx_pci_sdmmc mmc_core serio_raw atkbd libps2 ahci libahci
Feb 11 08:45:01.817921 arch-p7530 kernel:  libata xhci_pci crc32c_intel scsi_mod rtsx_pci xhci_hcd i8042 serio [last unloaded: nvidia]
Feb 11 08:45:01.819291 arch-p7530 kernel:  nv_drm_atomic_helper_disable_all+0xec/0x290 [nvidia_drm]
Feb 11 08:45:01.819370 arch-p7530 kernel:  nv_drm_master_drop+0x22/0x60 [nvidia_drm]
Feb 11 08:45:21.897556 arch-p7530 kernel: nvidia 0000:01:00.0: can't suspend (nv_pmops_runtime_suspend [nvidia] returned -5)
Feb 11 08:55:19.160553 arch-p7530 kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:2:0:4040
Feb 11 08:55:24.160558 arch-p7530 kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:2:0:4040
Feb 11 08:55:29.160556 arch-p7530 kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:2:0:4040
Feb 11 08:55:34.163894 arch-p7530 kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:2:0:4040
Feb 11 08:55:39.163893 arch-p7530 kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:2:0:4040

Also attached is my nvidia-bug-report.log.gz file as requested. I do hope we can get to the bottom of this and maybe get some devs on this matter. Thanks!
nvidia-bug-report.log.gz (833 KB)

I do hope I’ve provided enough information, and hopefully this issue can be fixed. I see 141 views, but nary a response; I’m not sure if I’m doing something wrong.

This is also an issue that I noticed. I’ve been looking for more information on this for ages and it’s at least nice that someone else has reported it!

This is actually a topic that has gotten little attention so far, even though serious issues have existed for a long time.
In general:

  • don’t use nvidia-smi to check anything; it wakes up the GPU
  • the D3 power management feature relies on the kernel’s general runtime PM features
  • those features don’t really work reliably (I’ve seen a suspend power draw of 57 W(!))
  • @aplattner mentioned that future driver releases will provide more detailed info through a /proc (or /sys) interface

So there’s a lot to be ironed out, not only in the driver but also in the kernel.
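Until the driver exposes more detail, the kernel's own runtime PM state can be read from sysfs, which should not wake the GPU the way nvidia-smi does. This is a sketch only: the helper name gpu_runtime_status is hypothetical, and the default path assumes the dGPU is at 0000:01:00.0, as in the original post.

```shell
#!/bin/sh
# Print the kernel's runtime PM state for a PCI device, for example
# "active" or "suspended". Reads sysfs only; does not query the driver.
gpu_runtime_status() {
    dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    cat "$dev_dir/power/runtime_status"
}
```

A value of suspended after the GPU has been idle for a while suggests runtime D3 is actually being entered; a value that stays at active matches the behaviour described in this thread.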

Wow. I posted this nearly four months ago, and I completely forgot that I even did so.

If the source code were open, we users could inspect it and propose fixes, but it isn’t, so we can’t. I do hope @aplattner offers some fixes. NVIDIA was supposed to host a talk on open source during GTC, but that was cancelled, too.

A lot of us just want working notebook GPUs in Linux…