555.58.02-10 nvidia-open driver crash

I am using nvidia-open 555.58.02-10 on Arch Linux (kernel 6.10.2.arch1-1). The NVIDIA driver crashes with the following error:

[   49.333817] ------------[ cut here ]------------
[   49.333826] WARNING: CPU: 6 PID: 1733 at include/linux/rwsem.h:80 follow_pte+0x1de/0x200
[   49.333840] Modules linked in: hid_logitech_hidpp usbhid uhid overlay ccm snd_seq_dummy snd_hrtimer rfcomm snd_seq snd_seq_device nft_masq nft_reject_ipv4 nf_nat_tftp nf_conntrack_tftp tun bridge stp llc cmac algif_hash algif_skcipher af_alg nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables qrtr bnep snd_ctl_led snd_soc_skl_hda_dsp snd_soc_hdac_hdmi snd_soc_intel_hda_dsp_common snd_sof_probes snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_soc_dmic snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp intel_uncore_frequency intel_uncore_frequency_common snd_sof snd_sof_utils snd_soc_hdac_hda snd_soc_acpi_intel_match soundwire_generic_allocation mousedev joydev snd_soc_acpi soundwire_bus
[   49.333906]  snd_soc_avs hid_sensor_als hid_sensor_trigger snd_soc_hda_codec industrialio_triggered_buffer kfifo_buf snd_hda_ext_core hid_sensor_iio_common x86_pkg_temp_thermal snd_soc_core industrialio intel_powerclamp hid_sensor_custom snd_compress coretemp ac97_bus snd_pcm_dmaengine hid_sensor_hub iwlmvm snd_hda_intel kvm_intel intel_ishtp_hid snd_intel_dspcfg vfat processor_thermal_device_pci fat processor_thermal_device mac80211 uvcvideo snd_intel_sdw_acpi snd_hda_scodec_cs35l41_spi processor_thermal_wt_hint btusb snd_hda_scodec_cs35l41_i2c iTCO_wdt kvm snd_hda_codec videobuf2_vmalloc libarc4 processor_thermal_rfim spi_pxa2xx_platform snd_hda_scodec_cs35l41 dell_laptop hid_multitouch btrtl intel_pmc_bxt dw_dmac uvc snd_hda_core ptp processor_thermal_rapl hid_generic snd_hda_cs_dsp_ctls btintel videobuf2_memops dell_wmi pps_core intel_rapl_msr iwlwifi snd_hwdep videobuf2_v4l2 snd_soc_cs_amp_lib intel_rapl_common dell_smbios rapl iTCO_vendor_support dell_wmi_sysman btbcm ucsi_acpi mei_hdcp mei_pxp dcdbas
[   49.333971]  intel_cstate intel_uncore psmouse dell_smm_hwmon dell_wmi_ddv firmware_attributes_class pcspkr spi_nor snd_pcm cs_dsp btmtk processor_thermal_wt_req dell_wmi_descriptor wmi_bmof videodev intel_pmc_core i2c_i801 typec_ucsi intel_lpss_pci snd_soc_cs35l41_lib snd_timer processor_thermal_power_floor nvidia_drm(OE) mtd bluetooth cfg80211 snd i2c_hid_acpi typec videobuf2_common intel_hid intel_lpss mei_me i2c_smbus int3403_thermal processor_thermal_mbox intel_ish_ipc int3400_thermal pmt_telemetry mc mei thunderbolt nvidia_modeset(OE) crc16 i2c_mux rfkill soundcore idma64 intel_ishtp igen6_edac roles intel_vsec dptf_power int340x_thermal_zone serial_multi_instantiate i2c_hid acpi_pad acpi_tad acpi_thermal_rel pmt_class pinctrl_tigerlake sparse_keymap mac_hid nvidia_uvm(OE) nvidia(OE) sg crypto_user loop nfnetlink zram ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm
[   49.334044]  drm_exec dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel serio_raw sha512_ssse3 sha256_ssse3 atkbd libps2 sha1_ssse3 nvme vivaldi_fmap aesni_intel spi_intel_pci rtsx_pci_sdmmc mmc_core xhci_pci crypto_simd i8042 nvme_core cryptd rtsx_pci nvme_auth spi_intel xhci_pci_renesas serio i915 i2c_algo_bit drm_buddy video wmi ttm drm_display_helper cec intel_agp intel_gtt
[   49.334079] CPU: 6 PID: 1733 Comm: nv_queue Tainted: G     U  W  OE      6.10.2-arch1-1 #1 a727c214dbee27eb0624871a8199f6116f5b74c2
[   49.334085] Hardware name: Dell Inc. XPS 15 9530/09J5GK, BIOS 1.13.0 04/11/2024
[   49.334087] RIP: 0010:follow_pte+0x1de/0x200
[   49.334093] Code: cc cc cc 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 b2 f5 ff ff 48 8b 35 4b e4 5c 01 48 81 e6 00 00 00 c0 eb 8d <0f> 0b 48 3b 1f 0f 83 50 fe ff ff bd ea ff ff ff eb b6 49 8b 3c 24
[   49.334096] RSP: 0018:ffffa54603237b48 EFLAGS: 00010246
[   49.334101] RAX: 0000000000000000 RBX: 0000794c3d66e000 RCX: ffffa54603237b88
[   49.334104] RDX: ffffa54603237b80 RSI: 0000794c3d66e000 RDI: ffff98d1d4fe5648
[   49.334106] RBP: ffffa54603237bc8 R08: ffffa54603237d20 R09: 0000000000000000
[   49.334108] R10: 0000000000000001 R11: 0000000000000003 R12: ffffa54603237b88
[   49.334110] R13: ffffa54603237b80 R14: ffff98d18d959600 R15: 0000000000000000
[   49.334112] FS:  0000000000000000(0000) GS:ffff98d8ef100000(0000) knlGS:0000000000000000
[   49.334115] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   49.334123] CR2: 000019340291f000 CR3: 00000007cf220000 CR4: 0000000000f50ef0
[   49.334126] PKRU: 55555554
[   49.334128] Call Trace:
[   49.334131]  <TASK>
[   49.334133]  ? follow_pte+0x1de/0x200
[   49.334138]  ? __warn.cold+0x8e/0xe8
[   49.334145]  ? follow_pte+0x1de/0x200
[   49.334155]  ? report_bug+0xff/0x140
[   49.334161]  ? handle_bug+0x3c/0x80
[   49.334168]  ? exc_invalid_op+0x17/0x70
[   49.334170]  ? asm_exc_invalid_op+0x1a/0x20
[   49.334178]  ? follow_pte+0x1de/0x200
[   49.334183]  follow_phys+0x49/0x110
[   49.334190]  untrack_pfn+0x55/0x120
[   49.334193]  unmap_single_vma+0xa6/0xe0
[   49.334201]  zap_page_range_single+0x122/0x1d0
[   49.334209]  unmap_mapping_range+0x116/0x140
[   49.334217]  ? __pfx__main_loop+0x10/0x10 [nvidia 0d1816c0841377128ffa8d815f5887b7782ef6a7]
[   49.334500]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia 0d1816c0841377128ffa8d815f5887b7782ef6a7]
[   49.334688]  RmHandleIdleSustained+0x3b/0x140 [nvidia 0d1816c0841377128ffa8d815f5887b7782ef6a7]
[   49.335051]  ? gpumgrGetGpu+0x69/0xa0 [nvidia 0d1816c0841377128ffa8d815f5887b7782ef6a7]
[   49.335433]  rm_execute_work_item+0xda/0x150 [nvidia 0d1816c0841377128ffa8d815f5887b7782ef6a7]
[   49.335788]  _main_loop+0x95/0x150 [nvidia 0d1816c0841377128ffa8d815f5887b7782ef6a7]
[   49.335983]  kthread+0xcf/0x100
[   49.335988]  ? __pfx_kthread+0x10/0x10
[   49.335993]  ret_from_fork+0x31/0x50
[   49.335999]  ? __pfx_kthread+0x10/0x10
[   49.336002]  ret_from_fork_asm+0x1a/0x30
[   49.336008]  </TASK>
[   49.336009] ---[ end trace 0000000000000000 ]---
[ 6719.607469] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to previous error (AE_AML_LOOP_TIMEOUT) (20240322/psparse-529)
[ 6719.607508] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20240322/psparse-529)
[ 6720.634058] pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 6721.637481] pcieport 0000:00:01.0: retraining failed
[ 6722.880846] pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 6723.880745] pcieport 0000:00:01.0: retraining failed
[ 6723.880760] nvidia 0000:01:00.0: not ready 1023ms after resume; waiting
[ 6724.933994] nvidia 0000:01:00.0: not ready 2047ms after resume; waiting
[ 6727.067401] nvidia 0000:01:00.0: not ready 4095ms after resume; waiting
[ 6731.334301] nvidia 0000:01:00.0: not ready 8191ms after resume; waiting
[ 6739.654277] nvidia 0000:01:00.0: not ready 16383ms after resume; waiting
[ 6756.507640] nvidia 0000:01:00.0: not ready 32767ms after resume; waiting

This happens during a normal working session (not during suspend/resume). After the crash, some apps (like Chrome) generally won’t start for a few minutes. Also, after the crash lspci takes a long time to complete, and when it does, the NVIDIA device is no longer listed.
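
One way to check from a script whether the GPU is still enumerated is to read the PCI entries in sysfs instead of waiting on lspci. This is a minimal sketch, assuming the standard `/sys/bus/pci/devices` layout and NVIDIA's PCI vendor ID `0x10de`. The function name `nvidia_pci_devices` is made up for illustration, and the thread does not confirm whether the sysfs entry actually disappears in this failure mode.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_pci_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return PCI addresses (e.g. '0000:01:00.0') whose vendor file matches NVIDIA.

    Hypothetical helper: sysfs_root is a parameter so the function can be
    pointed at a test directory.
    """
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue  # attribute missing or unreadable; skip this entry
        if vendor == NVIDIA_VENDOR_ID:
            found.append(dev.name)
    return found


if __name__ == "__main__":
    devices = nvidia_pci_devices()
    print("NVIDIA devices:", devices if devices else "none found")
```

Running it once before and once after the crash would show whether `0000:01:00.0` (the address in the logs above) is still present.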

Same problem on Fedora with kernel 6.10.3-200.fc40.x86_64.

[ 27.934445] CPU: 2 PID: 2080 Comm: nv_queue Tainted: P W OE 6.10.3-200.fc40.x86_64 #1
[ 27.934446] Hardware name: Acer Nitro ANV15-51/Sportage_RTH, BIOS V1.09 01/08/2024
[ 27.934447] RIP: 0010:follow_pte+0x1f0/0x220
[ 27.934448] Code: cc cc cc 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 a0 f4 ff ff 48 8b 35 99 3a 86 01 48 81 e6 00 00 00 c0 eb 89 <0f> 0b 48 3b 1f 0f 83 42 fe ff ff bd ea ff ff ff eb b2 49 8b 3c 24
[ 27.934449] RSP: 0018:ffffaf074188fb68 EFLAGS: 00010246
[ 27.934450] RAX: 0000000000000000 RBX: 00007f099a9cf000 RCX: ffffaf074188fbb0
[ 27.934451] RDX: ffffaf074188fba8 RSI: 00007f099a9cf000 RDI: ffff94d0fe62d6b0
[ 27.934452] RBP: ffffaf074188fbf0 R08: ffffaf074188fd48 R09: 0000000000000000
[ 27.934452] R10: 0000000000000002 R11: 0000000000000168 R12: ffffaf074188fbb0
[ 27.934453] R13: ffffaf074188fba8 R14: ffff94d08c509b80 R15: 0000000000000000
[ 27.934453] FS: 0000000000000000(0000) GS:ffff94d42d100000(0000) knlGS:0000000000000000
[ 27.934454] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.934455] CR2: 00007fd6cc746e60 CR3: 0000000119380000 CR4: 0000000000f50ef0
[ 27.934455] PKRU: 55555554
[ 27.934456] Call Trace:
[ 27.934457] <TASK>
[ 27.934458] ? follow_pte+0x1f0/0x220
[ 27.934459] ? __warn.cold+0x8e/0xe8
[ 27.934460] ? follow_pte+0x1f0/0x220
[ 27.934463] ? report_bug+0xff/0x140
[ 27.934464] ? handle_bug+0x3c/0x80
[ 27.934466] ? exc_invalid_op+0x17/0x70
[ 27.934467] ? asm_exc_invalid_op+0x1a/0x20
[ 27.934469] ? follow_pte+0x1f0/0x220
[ 27.934470] ? unmap_page_range+0x17dc/0x18c0
[ 27.934471] follow_phys+0x49/0x110
[ 27.934473] untrack_pfn+0x55/0x120
[

A similar issue is reported here: Nvidia driver kernel random call trace - #5 by LinuxGaming81734

Same here with Gentoo on kernel 6.10.4. The same happens on both 555.58.02 and 550.107.02:

[ 56.154495] ------------[ cut here ]------------
[ 56.154504] WARNING: CPU: 8 PID: 720 at follow_pte+0x13b/0x150
[ 56.154523] Modules linked in: razermouse(O) razerkbd(O) r8152 mii libphy snd_hda_codec_hdmi snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component i915 snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic snd_soc_acpi_intel_match nvidia_drm(O) snd_soc_acpi snd_sof_pci snd_sof_xtensa_dsp nvidia_modeset(O) soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_common iwlmvm snd_soc_hdac_hda snd_sof_intel_hda_mlink uvcvideo snd_sof_intel_hda snd_sof videobuf2_vmalloc uvc videobuf2_memops videobuf2_v4l2 videobuf2_common btusb btintel snd_sof_utils snd_hda_ext_core snd_hda_intel snd_intel_dspcfg i2c_algo_bit nvidia(O) snd_intel_sdw_acpi drm_buddy iwlwifi snd_hda_codec ttm e1000e snd_hda_core drm_display_helper rtsx_pci_sdmmc mei_pxp mei_hdcp cec vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
[ 56.154616] CPU: 8 PID: 720 Comm: nv_queue Tainted: G U O 6.10.4-gentoo #1
[ 56.154623] Hardware name: Dell Inc. Precision 3581/0W18NX, BIOS 1.14.0 06/12/2024
[ 56.154627] RIP: 0010:follow_pte+0x13b/0x150
[ 56.154636] Code: 00 00 74 0c 48 89 03 31 c0 5b 5d c3 cc cc cc cc 48 8b 7d 00 e8 86 eb d0 00 e8 01 70 e6 ff b8 ea ff ff ff 5b 5d c3 cc cc cc cc <0f> 0b e9 e1 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90
[ 56.154641] RSP: 0018:ffffb1174510bb98 EFLAGS: 00010246
[ 56.154646] RAX: 00007fe20bef0000 RBX: ffffb1174510bbb0 RCX: ffffb1174510bbb8
[ 56.154649] RDX: 0000000000000000 RSI: 00007fe20bef0000 RDI: ffff9fd3c96d4be0
[ 56.154652] RBP: ffffb1174510bbb8 R08: ffff9fd38292b9c0 R09: 0000000000000000
[ 56.154655] R10: 000f42073095fc80 R11: 0000000000000000 R12: 00007fe20bf00000
[ 56.154665] R13: ffffb1174510bbf8 R14: ffffb1174510bc00 R15: 0000000000000000
[ 56.154667] FS: 0000000000000000(0000) GS:ffff9fe2cf600000(0000) knlGS:0000000000000000
[ 56.154671] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 56.154674] CR2: 00007f9abc949c68 CR3: 0000000c2563e000 CR4: 0000000000f50ef0
[ 56.154678] PKRU: 55555554
[ 56.154680] Call Trace:
[ 56.154685] <TASK>
[ 56.154691] ? __warn+0x7b/0x120
[ 56.154702] ? follow_pte+0x13b/0x150
[ 56.154709] ? report_bug+0x14a/0x180
[ 56.154720] ? handle_bug+0x3a/0x70
[ 56.154728] ? exc_invalid_op+0x17/0x70
[ 56.154735] ? asm_exc_invalid_op+0x1a/0x20
[ 56.154745] ? follow_pte+0x13b/0x150
[ 56.154752] follow_phys+0x35/0xe0
[ 56.154759] untrack_pfn+0x52/0x120
[ 56.154765] unmap_single_vma+0xa1/0xe0
[ 56.154776] ? nvidia_modeset_resume+0x30/0x250 [nvidia]
[ 56.155018] zap_page_range_single+0xe3/0x190
[ 56.155030] ? _raw_spin_lock_irqsave+0x16/0x50
[ 56.155040] ? os_acquire_spinlock+0xd/0x20 [nvidia]
[ 56.155218] ? portSyncSpinlockAcquire+0x1d/0x50 [nvidia]
[ 56.155490] unmap_mapping_range+0x10c/0x130
[ 56.155501] nv_revoke_gpu_mappings+0x62/0xa0 [nvidia]
[ 56.155651] nv_rdcr4+0x739/0xa10 [nvidia]
[ 56.155990] ? gpumgrGetGpu+0x69/0x370 [nvidia]
[ 56.156324] rm_execute_work_item+0xda/0x140 [nvidia]
[ 56.156658] ? nv_get_kern_phys_address+0x84/0xf0 [nvidia]
[ 56.156793] nvidia_modeset_resume+0xa5/0x250 [nvidia]
[ 56.156951] kthread+0xd3/0x110
[ 56.156959] ? __pfx_kthread+0x10/0x10
[ 56.156965] ret_from_fork+0x2c/0x50
[ 56.156972] ? __pfx_kthread+0x10/0x10
[ 56.156977] ret_from_fork_asm+0x1a/0x30
[ 56.156986] </TASK>
[ 56.156987] ---[ end trace 0000000000000000 ]---
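
To compare the traces posted so far, it can help to reduce each one to the frames the kernel did not mark with `?`. This is a rough sketch, not a complete oops parser: it assumes the dmesg line format seen in this thread, and the helper names `reliable_frames` and `common_frames` are invented for illustration.

```python
import re

# One stack frame line, e.g. "[   49.334183]  follow_phys+0x49/0x110".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return function names of frames not marked '?' that follow 'Call Trace:'."""
    frames = []
    in_trace = False
    for line in trace_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        if "end trace" in line:
            break
        match = FRAME_RE.match(line.strip())
        if match and not match.group("unreliable"):
            frames.append(match.group("func"))
    return frames


def common_frames(trace_a, trace_b):
    """Frames of trace_a (in order) that also appear in trace_b."""
    in_b = set(reliable_frames(trace_b))
    return [name for name in reliable_frames(trace_a) if name in in_b]
```

Feeding two of the pasted traces to `common_frames` lists the functions both reports share, which makes it easier to see whether they follow the same path into `follow_pte`.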

We have seen a similar call trace internally and are currently investigating the issue.
We will update this thread once there is further feedback from the engineering team.

I have just tried the nvidia-open 560.31.02 beta and still see the same issue:

[   51.923561] Hardware name: Dell Inc. XPS 15 9530/09J5GK, BIOS 1.13.0 04/11/2024
[   51.923562] RIP: 0010:follow_pte+0x1de/0x200
[   51.923566] Code: cc cc cc 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 b2 f5 ff ff 48 8b 35 6b e3 5c 01 48 81 e6 00 00 00 c0 eb 8d <0f> 0b 48 3b 1f 0f 83 50 fe ff ff bd ea ff ff ff eb b6 49 8b 3c 24
[   51.923569] RSP: 0018:ffffaf02851efb48 EFLAGS: 00010246
[   51.923571] RAX: 0000000000000000 RBX: 000073ff65418000 RCX: ffffaf02851efb88
[   51.923573] RDX: ffffaf02851efb80 RSI: 000073ff65418000 RDI: ffff9a317dbd0b80
[   51.923575] RBP: ffffaf02851efbc8 R08: ffffaf02851efd20 R09: 0000000000000000
[   51.923576] R10: 0000000000000200 R11: 0000000000000003 R12: ffffaf02851efb88
[   51.923578] R13: ffffaf02851efb80 R14: ffff9a31418fcd00 R15: 0000000000000000
[   51.923579] FS:  0000000000000000(0000) GS:ffff9a38aee00000(0000) knlGS:0000000000000000
[   51.923581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.923583] CR2: 0000754bb07cd840 CR3: 00000004df820000 CR4: 0000000000f50ef0
[   51.923585] PKRU: 55555554
[   51.923586] Call Trace:
[   51.923587]  <TASK>
[   51.923588]  ? follow_pte+0x1de/0x200
[   51.923591]  ? __warn.cold+0x8e/0xe8
[   51.923594]  ? follow_pte+0x1de/0x200
[   51.923598]  ? report_bug+0xff/0x140
[   51.923601]  ? handle_bug+0x3c/0x80
[   51.923603]  ? exc_invalid_op+0x17/0x70
[   51.923606]  ? asm_exc_invalid_op+0x1a/0x20
[   51.923611]  ? follow_pte+0x1de/0x200
[   51.923614]  follow_phys+0x49/0x110
[   51.923620]  untrack_pfn+0x55/0x120
[   51.923622]  unmap_single_vma+0xa6/0xe0
[   51.923627]  zap_page_range_single+0x122/0x1d0
[   51.923633]  unmap_mapping_range+0x116/0x140
[   51.923637]  ? __pfx__main_loop+0x10/0x10 [nvidia f48f9848f2ec487185c7ef39a7b041212a2125b8]
[   51.923794]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia f48f9848f2ec487185c7ef39a7b041212a2125b8]
[   51.923923]  RmHandleIdleSustained+0x3b/0x140 [nvidia f48f9848f2ec487185c7ef39a7b041212a2125b8]
[   51.924172]  ? gpumgrGetGpu+0x69/0xa0 [nvidia f48f9848f2ec487185c7ef39a7b041212a2125b8]
[   51.924444]  rm_execute_work_item+0xda/0x150 [nvidia f48f9848f2ec487185c7ef39a7b041212a2125b8]
[   51.924703]  _main_loop+0x95/0x150 [nvidia f48f9848f2ec487185c7ef39a7b041212a2125b8]
[   51.924843]  kthread+0xcf/0x100
[   51.924847]  ? __pfx_kthread+0x10/0x10
[   51.924850]  ret_from_fork+0x31/0x50
[   51.924854]  ? __pfx_kthread+0x10/0x10
[   51.924856]  ret_from_fork_asm+0x1a/0x30
[   51.924860]  </TASK>
[   51.924861] ---[ end trace 0000000000000000 ]---

Also, after the crash I am seeing the following kernel messages:

[10078.692871] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to previous error (AE_AML_LOOP_TIMEOUT) (20240322/psparse-529)
[10078.692901] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20240322/psparse-529)
[10079.722966] pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[10080.729536] pcieport 0000:00:01.0: retraining failed
[10081.962979] pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[10082.962880] pcieport 0000:00:01.0: retraining failed
[10082.962893] nvidia 0000:01:00.0: not ready 1023ms after resume; waiting
[10084.016214] nvidia 0000:01:00.0: not ready 2047ms after resume; waiting
[10086.202962] nvidia 0000:01:00.0: not ready 4095ms after resume; waiting
[10090.469677] nvidia 0000:01:00.0: not ready 8191ms after resume; waiting
[10098.789785] nvidia 0000:01:00.0: not ready 16383ms after resume; waiting
[10115.216385] nvidia 0000:01:00.0: not ready 32767ms after resume; waiting
[10149.350192] nvidia 0000:01:00.0: not ready 65535ms after resume; giving up
[10149.350252] nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[10149.410369] NVRM: GPU at PCI:0000:01:00: GPU-2ca72de5-99b6-705d-0904-458a7021df04
[10149.410372] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[10149.410376] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[10149.410562] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[10149.410565] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[10149.410586] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[10149.410587] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[10149.410589] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[10149.410590] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[10149.410604] NVRM: RmLogGpuCrash: RmLogGpuCrash: failed to save GPU crash data
[10149.410618] NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
[10149.410623] NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
[10149.410624] NVRM: s_dmaPoll_GA102: Error while waiting for Falcon DMA; mode: 0, status: 0x00000065
[10149.410625] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from s_dmaTransfer_GA102(pGpu, pKernelFlcn, pUcode->imemPa, pUcode->imemVa, srcPhysAddr, pUcode->imemSize, dmaCmd) @ kernel_gsp_falcon_ga102.c:218
[10149.410627] NVRM: s_executeBooterUcode_TU102: failed to execute Booter: status 0x65, mailbox 0x83298000
[10149.410628] NVRM: kgspExecuteBooterLoad_TU102: failed to execute Booter Load: 0x65
[10149.410629] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspExecuteBooterLoad_HAL(pGpu, pKernelGsp, memdescGetPhysAddr(pKernelGsp->pSRMetaDescriptor, AT_GPU,0)) @ kernel_gsp_tu102.c:1121
[10149.414867] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspRestorePowerMgmtState_HAL(pGpu, pKernelGsp) @ gpu_suspend.c:195
[10149.423626] NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
[10149.423630] NVRM: nvAssertFailedNoLog: Assertion failed: progress == 1 @ mmu_walk.c:1732
[10149.423642] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
[10149.423643] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00032; hObject=0xfade0002; paramsStatus=0x00000000; status=0x0000000f
[10149.423644] NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ vaspace_api.c:529
[10149.423650] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
[10149.423651] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00032; hObject=0xfade0001; paramsStatus=0x00000000; status=0x0000000f
[10154.423096] NVRM: Error in service of callback 
[10154.423102] NVRM: Error in service of callback 
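
For anyone who wants to check their own logs for the same failure, the Xid line above can be picked out of saved `dmesg` output with a small script. This is a minimal sketch under the assumption that the Xid line has the format shown in this log. The function names are hypothetical and not part of any NVIDIA tool.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def find_xid_events(log_text):
    """Return (pci_address, xid_number) tuples for every Xid line in the log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


def gpu_fell_off_bus(log_text):
    """True if the log reports Xid 79 or the 'fallen off the bus' message."""
    if any(xid == 79 for _, xid in find_xid_events(log_text)):
        return True
    return "GPU has fallen off the bus" in log_text
```

Passing the text of a saved kernel log to `gpu_fell_off_bus` gives a quick yes/no answer, and `find_xid_events` also shows which PCI address reported the Xid.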

mario156090 said in another thread (Multiple kernel oopses before suspending caused by nvidia-sleep.sh, Linux 6.10 regression? WARNING: CPU: PID: at include/linux/rwsem.h:80 follow_pte - #2 by mario156090) that the problem only occurs on kernel 6.10 and later. In my experience, there’s no call trace when I’m using the 6.6 LTS kernel, so he may be right.
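
When collecting reports, it may be useful to record whether a machine runs a kernel in the range suspected here. This is a small sketch that assumes the release string starts with `major.minor`, as in the logs in this thread. The `(6, 10)` threshold only reflects the observation above and is not a confirmed boundary, and the function names are invented for illustration.

```python
import platform
import re


def kernel_version_tuple(release):
    """Parse the leading 'major.minor' from a release string like '6.10.2-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_reported_range(release, first_affected=(6, 10)):
    """True if the release is at or above the first kernel reported as affected.

    first_affected defaults to (6, 10) based only on reports in this thread.
    """
    return kernel_version_tuple(release) >= first_affected


def running_kernel_in_reported_range():
    """Apply the check to the currently running kernel."""
    return in_reported_range(platform.release())
```

Calling `running_kernel_in_reported_range()` on the affected machine and again after booting an LTS kernel gives a simple record to attach to a report.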

Just tried the latest nvidia-open-560.35.03; the issue is still there :(

Here is the latest kernel log.

[   47.050181] WARNING: CPU: 3 PID: 2097 at include/linux/rwsem.h:80 follow_pte+0x1de/0x200
[   47.050187] Modules linked in: hid_logitech_hidpp usbhid overlay ccm uhid snd_seq_dummy snd_hrtimer rfcomm snd_seq snd_seq_device nft_masq nft_reject_ipv4 nf_nat_tftp nf_conntrack_tftp tun cmac algif_hash algif_skcipher af_alg bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables qrtr snd_ctl_led snd_soc_skl_hda_dsp snd_sof_probes snd_soc_hdac_hdmi snd_soc_intel_hda_dsp_common bnep hid_sensor_als hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio hid_sensor_custom joydev snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_uncore_frequency intel_uncore_frequency_common dell_laptop snd_hda_scodec_component snd_soc_dmic snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp
[   47.050233]  snd_sof snd_sof_utils snd_soc_hdac_hda snd_soc_acpi_intel_match soundwire_generic_allocation snd_soc_acpi soundwire_bus x86_pkg_temp_thermal intel_powerclamp snd_soc_avs coretemp snd_soc_hda_codec snd_hda_ext_core snd_soc_core kvm_intel snd_compress iwlmvm ac97_bus uvcvideo snd_pcm_dmaengine videobuf2_vmalloc kvm snd_hda_intel uvc btusb videobuf2_memops snd_intel_dspcfg mac80211 snd_intel_sdw_acpi videobuf2_v4l2 btrtl hid_sensor_hub btintel snd_hda_codec processor_thermal_device_pci iTCO_wdt rapl videodev hid_multitouch dell_wmi spi_pxa2xx_platform btbcm processor_thermal_device libarc4 snd_hda_scodec_cs35l41_spi intel_pmc_bxt ptp snd_hda_core intel_ishtp_hid processor_thermal_wt_hint btmtk dw_dmac vfat mousedev dell_smbios snd_hda_scodec_cs35l41_i2c hid_generic iTCO_vendor_support intel_cstate fat mei_hdcp mei_pxp intel_rapl_msr dcdbas pps_core intel_uncore dell_smm_hwmon dell_wmi_ddv dell_wmi_sysman bluetooth psmouse pcspkr iwlwifi firmware_attributes_class spi_nor snd_hwdep snd_hda_scodec_cs35l41
[   47.050283]  processor_thermal_rfim videobuf2_common i2c_i801 processor_thermal_rapl nvidia_drm(OE) intel_rapl_common dell_wmi_descriptor wmi_bmof crc16 cfg80211 mc mei_me snd_pcm mtd snd_hda_cs_dsp_ctls i2c_smbus processor_thermal_wt_req intel_lpss_pci ucsi_acpi snd_soc_cs_amp_lib mei snd_timer i2c_mux nvidia_modeset(OE) typec_ucsi intel_lpss cs_dsp intel_ish_ipc thunderbolt processor_thermal_power_floor rfkill intel_ishtp typec idma64 snd_soc_cs35l41_lib processor_thermal_mbox roles igen6_edac intel_pmc_core snd int3403_thermal soundcore intel_vsec int340x_thermal_zone dptf_power int3400_thermal i2c_hid_acpi pmt_telemetry intel_hid serial_multi_instantiate acpi_thermal_rel i2c_hid pmt_class acpi_tad sparse_keymap acpi_pad pinctrl_tigerlake mac_hid nvidia_uvm(OE) nvidia(OE) sg loop crypto_user nfnetlink zram ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec dm_mod
[   47.050338]  crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel rtsx_pci_sdmmc mmc_core serio_raw sha512_ssse3 sha256_ssse3 atkbd sha1_ssse3 libps2 nvme aesni_intel spi_intel_pci vivaldi_fmap crypto_simd nvme_core xhci_pci i8042 cryptd spi_intel nvme_auth rtsx_pci xhci_pci_renesas serio i915 i2c_algo_bit drm_buddy video wmi ttm drm_display_helper cec intel_agp intel_gtt
[   47.050365] CPU: 3 PID: 2097 Comm: nv_queue Tainted: G     U  W  OE      6.10.6-arch1-1 #1 703d152c24f1971e36f16e505405e456fc9e23f8
[   47.050368] Hardware name: Dell Inc. XPS 15 9530/09J5GK, BIOS 1.13.0 04/11/2024
[   47.050370] RIP: 0010:follow_pte+0x1de/0x200
[   47.050374] Code: cc cc cc 48 81 e2 00 00 00 c0 48 09 c2 48 f7 d2 48 85 fa 75 20 e8 b2 f5 ff ff 48 8b 35 6b f1 5c 01 48 81 e6 00 00 00 c0 eb 8d <0f> 0b 48 3b 1f 0f 83 50 fe ff ff bd ea ff ff ff eb b6 49 8b 3c 24
[   47.050376] RSP: 0018:ffffaa920550bb48 EFLAGS: 00010246
[   47.050379] RAX: 0000000000000000 RBX: 00007dd51a993000 RCX: ffffaa920550bb88
[   47.050381] RDX: ffffaa920550bb80 RSI: 00007dd51a993000 RDI: ffff8bc3e9177c08
[   47.050382] RBP: ffffaa920550bbc8 R08: ffffaa920550bd20 R09: 0000000000000000
[   47.050384] R10: 0000000000000200 R11: 0000000000000003 R12: ffffaa920550bb88
[   47.050386] R13: ffffaa920550bb80 R14: ffff8bc3cfea4d00 R15: 0000000000000000
[   47.050387] FS:  0000000000000000(0000) GS:ffff8bcb2ef80000(0000) knlGS:0000000000000000
[   47.050389] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.050391] CR2: 0000722a31ec2000 CR3: 000000010378c000 CR4: 0000000000f50ef0
[   47.050393] PKRU: 55555554
[   47.050394] Call Trace:
[   47.050396]  <TASK>
[   47.050397]  ? follow_pte+0x1de/0x200
[   47.050401]  ? __warn.cold+0x8e/0xe8
[   47.050403]  ? follow_pte+0x1de/0x200
[   47.050407]  ? report_bug+0xff/0x140
[   47.050412]  ? handle_bug+0x3c/0x80
[   47.050414]  ? exc_invalid_op+0x17/0x70
[   47.050417]  ? asm_exc_invalid_op+0x1a/0x20
[   47.050423]  ? follow_pte+0x1de/0x200
[   47.050428]  follow_phys+0x49/0x110
[   47.050433]  untrack_pfn+0x55/0x120
[   47.050436]  unmap_single_vma+0xa6/0xe0
[   47.050441]  zap_page_range_single+0x122/0x1d0
[   47.050448]  unmap_mapping_range+0x116/0x140
[   47.050454]  ? __pfx__main_loop+0x10/0x10 [nvidia dab7b05891cf4c2f2f23ddc5e12cc36b9c220682]
[   47.050630]  nv_revoke_gpu_mappings+0x67/0xb0 [nvidia dab7b05891cf4c2f2f23ddc5e12cc36b9c220682]
[   47.050777]  RmHandleIdleSustained+0x3b/0x140 [nvidia dab7b05891cf4c2f2f23ddc5e12cc36b9c220682]
[   47.051066]  ? gpumgrGetGpu+0x69/0xa0 [nvidia dab7b05891cf4c2f2f23ddc5e12cc36b9c220682]
[   47.051373]  rm_execute_work_item+0xda/0x150 [nvidia dab7b05891cf4c2f2f23ddc5e12cc36b9c220682]
[   47.051660]  _main_loop+0x95/0x150 [nvidia dab7b05891cf4c2f2f23ddc5e12cc36b9c220682]
[   47.051814]  kthread+0xcf/0x100
[   47.051818]  ? __pfx_kthread+0x10/0x10
[   47.051822]  ret_from_fork+0x31/0x50
[   47.051826]  ? __pfx_kthread+0x10/0x10
[   47.051829]  ret_from_fork_asm+0x1a/0x30
[   47.051833]  </TASK>
[   47.051834] ---[ end trace 0000000000000000 ]---
[  940.551483] usb 3-9: reset full-speed USB device number 3 using xhci_hcd
[  940.875180] usb 3-9: reset full-speed USB device number 3 using xhci_hcd
[  942.075059] usb 3-9: reset full-speed USB device number 3 using xhci_hcd
[  973.970941] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to previous error (AE_AML_LOOP_TIMEOUT) (20240322/psparse-529)
[  973.970973] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20240322/psparse-529)
[  974.997575] pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  975.997705] pcieport 0000:00:01.0: retraining failed
[  977.241076] pcieport 0000:00:01.0: broken device, retraining non-functional downstream link at 2.5GT/s
[  978.244333] pcieport 0000:00:01.0: retraining failed
[  978.244345] nvidia 0000:01:00.0: not ready 1023ms after resume; waiting
[  979.294197] nvidia 0000:01:00.0: not ready 2047ms after resume; waiting
[  981.454286] nvidia 0000:01:00.0: not ready 4095ms after resume; waiting
[  985.650343] warning: `ThreadPoolForeg' uses wireless extensions which will stop working for Wi-Fi 7 hardware; use nl80211
[  985.720765] nvidia 0000:01:00.0: not ready 8191ms after resume; waiting
[  994.041026] nvidia 0000:01:00.0: not ready 16383ms after resume; waiting
[ 1011.750842] nvidia 0000:01:00.0: not ready 32767ms after resume; waiting
[ 1045.882095] nvidia 0000:01:00.0: not ready 65535ms after resume; giving up
[ 1045.882142] nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 1045.942256] NVRM: GPU at PCI:0000:01:00: GPU-2ca72de5-99b6-705d-0904-458a7021df04
[ 1045.942258] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 1045.942262] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 1045.942410] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[ 1045.942412] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[ 1045.942433] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[ 1045.942434] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[ 1045.942436] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[ 1045.942437] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[ 1045.942451] NVRM: RmLogGpuCrash: RmLogGpuCrash: failed to save GPU crash data
[ 1045.948788] NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
[ 1045.948795] NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
[ 1045.948796] NVRM: s_dmaPoll_GA102: Error while waiting for Falcon DMA; mode: 0, status: 0x00000065
[ 1045.948797] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from s_dmaTransfer_GA102(pGpu, pKernelFlcn, pUcode->imemPa, pUcode->imemVa, srcPhysAddr, pUcode->imemSize, dmaCmd) @ kernel_gsp_falcon_ga102.c:218
[ 1045.948799] NVRM: s_executeBooterUcode_TU102: failed to execute Booter: status 0x65, mailbox 0xec5b000
[ 1045.948800] NVRM: kgspExecuteBooterLoad_TU102: failed to execute Booter Load: 0x65
[ 1045.948801] NVRM: nvAssertOkFailedNoLog: Assertion failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspExecuteBooterLoad_HAL(pGpu, pKernelGsp, memdescGetPhysAddr(pKernelGsp->pSRMetaDescriptor, AT_GPU,0)) @ kernel_gsp_tu102.c:1121
[ 1045.952445] NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from kgspRestorePowerMgmtState_HAL(pGpu, pKernelGsp) @ gpu_suspend.c:195
[ 1050.960829] NVRM: Error in service of callback 
[ 1050.960850] NVRM: Error in service of callback