RTX 2070 + Fedora 31: nvidia_modeset error freezes computer just before login


I’m experiencing an error, described below, on the following Linux system:

Fedora 31 - kernel: 5.5.17-200.fc31.x86_64
GPU: ASUS RTX 2070 SUPER with driver 440.82, plus CUDA 10.1 and cuDNN 7.6.5
CPU: Intel Core i7-8700 (8th gen)
RAM: 32 GB (2 DIMMs)
Motherboard: Gigabyte Z390 AORUS PRO WIFI

My RTX 2070 runs headless: my monitor is connected to the onboard video card.

The GPU works perfectly: I can train neural networks with TensorFlow on the GPU and play games on it. I don’t think there is a hardware problem; I have never had a single freeze at runtime, and temperatures are always fine. However, roughly one boot in every three or four, the computer freezes before reaching the desktop. The keyboard is dead: I cannot switch to a different TTY, and the Caps Lock light does not even turn on when I press the key. All I can do is hold the power button for 7 seconds to force the computer off.

When this happens, after logging in on the next boot, I see an error message in Fedora’s Problem Reporting tool:

A kernel problem occurred, but your kernel has been tainted (flags:POE). Explanation:
P - Proprietary module has been loaded.
O - Out-of-tree module has been loaded.
E - Unsigned module has been loaded.
Kernel maintainers are unable to diagnose tainted reports. Tainted modules: nvidia_drm,nvidia_modeset,nvidia.
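
For reference, these flag letters correspond to bits in the kernel's taint bitmask, which is exposed in /proc/sys/kernel/tainted. Below is a minimal Python sketch of decoding it, covering only the four flags that appear in this report (P, W, O, E). The names decode_taint and read_taint are illustrative and not part of any existing tool.

```python
# Subset of kernel taint bits (bit position -> flag letter).
# Only the flags seen in this report are listed; the kernel defines more.
TAINT_BITS = {
    0: "P",   # proprietary module has been loaded
    9: "W",   # the kernel has issued a WARNING before
    12: "O",  # out-of-tree module has been loaded
    13: "E",  # unsigned module has been loaded
}


def decode_taint(value):
    """Return the known flag letters set in a taint bitmask, lowest bit first."""
    return "".join(
        letter for bit, letter in sorted(TAINT_BITS.items()) if value & (1 << bit)
    )


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the running kernel's taint bitmask and decode it."""
    with open(path) as f:
        return decode_taint(int(f.read().strip()))
```

Calling read_taint() on the affected machine should return the same letters the report shows, possibly with W added once a warning has fired.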

If I click “DETAILS”, I see this report:

reason   WARNING: CPU: 2 PID: 820451 at mm/vmalloc.c:2282 __vunmap+0x1e9/0x210 [nvidia_modeset]

WARNING: CPU: 2 PID: 820451 at mm/vmalloc.c:2282 __vunmap+0x1e9/0x210
Modules linked in: nvidia_uvm(OE) ipmi_devintf rfcomm ccm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat intel_rapl_msr intel_rapl_common snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda x86_pkg_temp_thermal intel_powerclamp snd_sof coretemp snd_soc_skl kvm_intel snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_hdmi snd_soc_acpi kvm snd_hda_codec_realtek ucsi_ccg snd_hda_codec_generic typec_ucsi
 irqbypass iTCO_wdt ledtrig_audio iTCO_vendor_support iwlmvm snd_soc_core mei_hdcp typec snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg crct10dif_pclmul mac80211 crc32_pclmul snd_usb_audio snd_hda_codec libarc4 ghash_clmulni_intel snd_hda_core snd_usbmidi_lib intel_cstate intel_uncore iwlwifi intel_rapl_perf btusb snd_rawmidi snd_hwdep btrtl snd_seq btbcm pcspkr btintel wmi_bmof intel_wmi_thunderbolt snd_seq_device i2c_i801 cfg80211 bluetooth joydev snd_pcm mc snd_timer snd mei_me ecdh_generic ecc mei rfkill soundcore i2c_nvidia_gpu ie31200_edac intel_pch_thermal acpi_pad ip_tables nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) i915 ipmi_msghandler i2c_algo_bit mxm_wmi drm_kms_helper e1000e crc32c_intel nvme drm nvme_core wmi video pinctrl_cannonlake pinctrl_intel fuse
CPU: 2 PID: 820451 Comm: kworker/u24:6 Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO WIFI/Z390 AORUS PRO WIFI-CF, BIOS F10 06/05/2019
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:__vunmap+0x1e9/0x210
Code: 41 5d 41 5e e9 78 37 03 00 31 d2 31 f6 48 c7 c7 ff ff ff ff e8 d8 fc ff ff eb b5 48 89 fe 48 c7 c7 e8 36 39 9c e8 f9 51 e3 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e c3 4c 89 e6 48 c7 c7 10 37 39 9c e8
RSP: 0000:ffffb9e60c6afcd8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9372771d9008 RCX: 0000000000000007
RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff93729e299cc0
RBP: 0000000000000a20 R08: 00007cae180e423a R09: ffffffff9d25fc64
R10: 000000000000061e R11: 000000000001f074 R12: ffff936cb80e9a20
R13: 0000000000000004 R14: ffff9372771da008 R15: ffff9372771d9008
FS:  0000000000000000(0000) GS:ffff93729e280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000007cb60a001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 _nv002417kms+0xea/0x150 [nvidia_modeset]
 ? _nv000328kms+0x2d/0x1d0 [nvidia_modeset]
 ? _nv002227kms+0x2d5/0x6d0 [nvidia_modeset]
 ? nvKmsResume+0x43/0x80 [nvidia_modeset]
 ? nvkms_resume+0x1b/0x40 [nvidia_modeset]
 ? nvidia_resume+0x67/0x70 [nvidia]
 ? pci_pm_thaw+0x80/0x80
 ? nv_pmops_resume+0xf/0x20 [nvidia]
 ? dpm_run_callback+0x4f/0x140
 ? device_resume+0x136/0x200
 ? async_resume+0x19/0x50
 ? async_run_entry_fn+0x39/0x160
 ? process_one_work+0x1b4/0x370
 ? worker_thread+0x50/0x3c0
 ? kthread+0xf9/0x130
 ? process_one_work+0x370/0x370
 ? kthread_park+0x90/0x90
 ? ret_from_fork+0x35/0x40
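
A trace like this is easier to read once each frame is split into symbol and module. The Python sketch below does that for text pasted from the report. It assumes the frame format shown above, treats a leading "?" as the kernel's marker for a frame the unwinder could not verify, and uses illustrative function names (parse_frames, resume_frames). Frames such as nvidia_resume, pci_pm_thaw and device_resume suggest the warning fired while a power-management resume (thaw) path was running, though that is an inference from the symbol names only.

```python
import re

# Matches lines such as " ? nvidia_resume+0x67/0x70 [nvidia]".
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?"           # optional "?" marker
    r"(?P<symbol>[\w.]+)"                   # function name
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"            # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"      # optional [module]
)


def parse_frames(trace_text):
    """Return (symbol, module, reliable) tuples for each call-trace line."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            module = m.group("module") or "kernel"  # no [module] means core kernel
            reliable = m.group("unreliable") is None
            frames.append((m.group("symbol"), module, reliable))
    return frames


def resume_frames(frames):
    """Return symbols whose names hint at a suspend/hibernate resume path."""
    return [
        sym for sym, _, _ in frames
        if "resume" in sym.lower() or "thaw" in sym.lower()
    ]
```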

I’m not even close to understanding what these reports are saying, but it seems that the NVIDIA driver module is failing, and that something is trying to allocate memory (malloc) and failing.

Can you please help me debug this error? Let me know if there is any other information about my system, logs, or anything else that I should share.

This only happens at boot time. The system works perfectly at runtime: I can train models for hours and play games, and I have never had a single crash or freeze.

Thanks in advance!!

EDIT 1: Adding “nomodeset” to my kernel parameters makes the system freeze every time, just before entering the graphical environment. Just in case, this is my GRUB config file (/etc/sysconfig/grub):

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora_localhost--live-swap rd.lvm.lv=fedora_localhost-live/root rd.lvm.lv=fedora_localhost-live/swap rhgb quiet rd.driver.blacklist=nouveau"
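
To confirm which of these parameters the running kernel actually received, the contents of /proc/cmdline can be compared against the config file. Below is a minimal Python sketch of parsing such a command line. The function name parse_cmdline is illustrative, and values are collected into lists because a parameter such as rd.lvm.lv can appear more than once.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: [values]}.

    Bare flags such as "quiet" are stored with the value True.
    """
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params.setdefault(name, []).append(value if sep else True)
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read().strip())
```

For example, checking "resume" in read_running_cmdline() shows whether a hibernation resume device is configured, and "nomodeset" shows whether that parameter took effect.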

Setting ‘nomodeset’ disables the Intel driver, so that freeze is not necessarily noteworthy.
Please check for a system BIOS update first.
Do you perform a full power-off, or do you suspend/hibernate the system before the crash happens?

Hi generix, thanks for your reply. This machine is new, so I’ll start by checking for BIOS updates.
I always perform a full power-off. But now that you mention it, when the computer freezes and I’m forced to power it off manually, I notice that the machine doesn’t turn off completely: the LEDs on the motherboard stay on, just like when you hibernate the system.

I’ll try the updates and report back. Thanks again.