RTX4090, torch, kernel tried to execute NX-protected page

My setup is:

ProArt Z790-CREATOR WIFI, BIOS 2102 03/15/2024
Intel® Core™ i9-14900KF
RAM: 64GB 5200MHz (4x 32 GB ) Kingston FURY Beast
NVIDIA GeForce RTX 4090

I tried drivers 535/545/550, latest 550.67.
The OS is Ubuntu 23.10. I tried reinstalling it from scratch, and I also tried the 22.04 server version.
I don’t have a monitor on this box.

The issue is reproducible in my case even when I run the simple torch MNIST example (examples/mnist in the pytorch/examples GitHub repository) in a fresh conda env with only torch/torchvision installed.
It takes anywhere from a few minutes to 1-2 hours to hit the error. Afterwards the computer sometimes hangs completely, sometimes not.

Issue:

2024-04-05T14:50:11.174555+02:00 sergii kernel: [ 1222.044216] kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
2024-04-05T14:50:11.174566+02:00 sergii kernel: [ 1222.044221] BUG: unable to handle page fault for address: ffffafb80805ff58
2024-04-05T14:50:11.174567+02:00 sergii kernel: [ 1222.044222] #PF: supervisor instruction fetch in kernel mode
2024-04-05T14:50:11.174568+02:00 sergii kernel: [ 1222.044223] #PF: error_code(0x0011) - permissions violation
2024-04-05T14:50:11.174568+02:00 sergii kernel: [ 1222.044224] PGD 100000067 P4D 100000067 PUD 100211067 PMD 11fb3f067 PTE 8000000111b22163
2024-04-05T14:50:11.174568+02:00 sergii kernel: [ 1222.044227] Oops: 0011 [#1] PREEMPT SMP NOPTI
2024-04-05T14:50:11.174568+02:00 sergii kernel: [ 1222.044229] CPU: 11 PID: 2915 Comm: pt_autograd_0 Tainted: P           OE      6.5.0-26-generic #26-Ubuntu
2024-04-05T14:50:11.174569+02:00 sergii kernel: [ 1222.044231] Hardware name: ASUS System Product Name/ProArt Z790-CREATOR WIFI, BIOS 2102 03/15/2024
2024-04-05T14:50:11.174569+02:00 sergii kernel: [ 1222.044232] RIP: 0010:0xffffafb80805ff58
2024-04-05T14:50:11.174569+02:00 sergii kernel: [ 1222.044254] Code: ff ff c8 bd 33 82 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 40 82 ff ff ff ff <01> 00 00 00 00 00 00 00 b0 2a f9 07 00 00 00 00 d0 49 06 08 00 00
2024-04-05T14:50:11.174570+02:00 sergii kernel: [ 1222.044255] RSP: 0018:ffffafb80805ff08 EFLAGS: 00010046
2024-04-05T14:50:11.174570+02:00 sergii kernel: [ 1222.044256] RAX: 0000000000000000 RBX: ffffafb80805ff58 RCX: 0000000000000000
2024-04-05T14:50:11.174570+02:00 sergii kernel: [ 1222.044257] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2024-04-05T14:50:11.174570+02:00 sergii kernel: [ 1222.044258] RBP: ffffafb80805ff08 R08: 0000000000000000 R09: 0000000000000000
2024-04-05T14:50:11.174570+02:00 sergii kernel: [ 1222.044258] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-04-05T14:50:11.174571+02:00 sergii kernel: [ 1222.044259] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2024-04-05T14:50:11.174571+02:00 sergii kernel: [ 1222.044259] FS:  00007f063bfff6c0(0000) GS:ffff97c3ffac0000(0000) knlGS:0000000000000000
2024-04-05T14:50:11.174571+02:00 sergii kernel: [ 1222.044260] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-04-05T14:50:11.174571+02:00 sergii kernel: [ 1222.044261] CR2: ffffafb80805ff58 CR3: 000000015bdf4000 CR4: 0000000000750ee0
2024-04-05T14:50:11.174571+02:00 sergii kernel: [ 1222.044261] PKRU: 55555554
2024-04-05T14:50:11.174572+02:00 sergii kernel: [ 1222.044262] Call Trace:
2024-04-05T14:50:11.174572+02:00 sergii kernel: [ 1222.044263]  <TASK>
2024-04-05T14:50:11.174572+02:00 sergii kernel: [ 1222.044264]  ? show_regs+0x6d/0x80
2024-04-05T14:50:11.174572+02:00 sergii kernel: [ 1222.044267]  ? __die+0x24/0x80
2024-04-05T14:50:11.174572+02:00 sergii kernel: [ 1222.044268]  ? page_fault_oops+0x99/0x1b0
2024-04-05T14:50:11.174572+02:00 sergii kernel: [ 1222.044270]  ? kernelmode_fixup_or_oops+0xb2/0x140
2024-04-05T14:50:11.174573+02:00 sergii kernel: [ 1222.044271]  ? __bad_area_nosemaphore+0x1a5/0x2c0
2024-04-05T14:50:11.174573+02:00 sergii kernel: [ 1222.044272]  ? bad_area_nosemaphore+0x16/0x30
2024-04-05T14:50:11.174573+02:00 sergii kernel: [ 1222.044273]  ? do_kern_addr_fault+0x7b/0xa0
2024-04-05T14:50:11.174573+02:00 sergii kernel: [ 1222.044274]  ? exc_page_fault+0x1a4/0x1b0
2024-04-05T14:50:11.174573+02:00 sergii kernel: [ 1222.044276]  ? asm_exc_page_fault+0x27/0x30
2024-04-05T14:50:11.174573+02:00 sergii kernel: [ 1222.044279]  do_syscall_64+0x68/0x90
2024-04-05T14:50:11.174574+02:00 sergii kernel: [ 1222.044280]  ? do_syscall_64+0x68/0x90
2024-04-05T14:50:11.174574+02:00 sergii kernel: [ 1222.044281]  ? do_syscall_64+0x68/0x90
2024-04-05T14:50:11.174574+02:00 sergii kernel: [ 1222.044282]  ? do_syscall_64+0x68/0x90
2024-04-05T14:50:11.174574+02:00 sergii kernel: [ 1222.044283]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
2024-04-05T14:50:11.174574+02:00 sergii kernel: [ 1222.044285] RIP: 0033:0x7f08b450df2b
2024-04-05T14:50:11.174574+02:00 sergii kernel: [ 1222.044287] Code: 73 01 c3 48 8b 0d ed fe 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 18 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd fe 0e 00 f7 d8 64 89 01 48
2024-04-05T14:50:11.174575+02:00 sergii kernel: [ 1222.044287] RSP: 002b:00007f063bff9578 EFLAGS: 00000246 ORIG_RAX: 0000000000000018
2024-04-05T14:50:11.174575+02:00 sergii kernel: [ 1222.044288] RAX: 0000000000000000 RBX: 00000000081fe280 RCX: 00007f08b450df2b
2024-04-05T14:50:11.174575+02:00 sergii kernel: [ 1222.044289] RDX: 00000000000001b4 RSI: 00000000000003ff RDI: 0000000008274190
2024-04-05T14:50:11.174575+02:00 sergii kernel: [ 1222.044289] RBP: 00007f063bff9610 R08: 0000000000000000 R09: 0000000000000000
2024-04-05T14:50:11.174575+02:00 sergii kernel: [ 1222.044290] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000018000
2024-04-05T14:50:11.174576+02:00 sergii kernel: [ 1222.044290] R13: 00000000080649d0 R14: 0000000007f92ab0 R15: 0000000000000001
2024-04-05T14:50:11.174576+02:00 sergii kernel: [ 1222.044291]  </TASK>
2024-04-05T14:50:11.174576+02:00 sergii kernel: [ 1222.044292] Modules linked in: nvidia_uvm(POE) ccm rfcomm snd_seq_dummy snd_hrtimer cmac algif_hash algif_skcipher af_alg bnep binfmt_misc input_leds joydev btusb btrtl btbcm btintel btmtk bluetooth ecdh_generic apple_mfi_fastcharge ecc hid_generic usbhid hid nls_iso8859_1 intel_rapl_msr intel_rapl_common intel_uncore_frequency nvidia_drm(POE) intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp nvidia_modeset(POE) coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence nvidia(POE) snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp kvm snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core irqbypass snd_soc_acpi_intel_match crct10dif_pclmul snd_soc_acpi crc32_pclmul soundwire_generic_allocation polyval_clmulni polyval_generic soundwire_bus ghash_clmulni_intel iwlmvm sha256_ssse3 snd_soc_core sha1_ssse3 aesni_intel snd_hda_codec_realtek snd_compress crypto_simd tps6598x cryptd snd_hda_codec_generic snd_hda_codec_hdmi
2024-04-05T14:50:11.174576+02:00 sergii kernel: [ 1222.044316]  ac97_bus mac80211 snd_pcm_dmaengine snd_hda_intel nouveau snd_intel_dspcfg rapl snd_intel_sdw_acpi snd_hda_codec libarc4 snd_hda_core snd_hwdep snd_pcm mxm_wmi drm_ttm_helper snd_seq_midi snd_seq_midi_event ttm snd_rawmidi drm_display_helper iwlwifi cmdlinepart asus_nb_wmi mfd_aaeon eeepc_wmi cec asus_wmi snd_seq spi_nor ledtrig_audio rc_core pmt_telemetry sparse_keymap drm_kms_helper snd_seq_device mtd platform_profile pmt_class wmi_bmof cfg80211 thunderbolt i2c_algo_bit snd_timer intel_cstate atlantic video igc macsec snd i2c_i801 intel_lpss_pci spi_intel_pci spi_intel ahci intel_lpss soundcore i2c_smbus ucsi_acpi idma64 libahci intel_vsec typec_ucsi typec serial_multi_instantiate wmi acpi_tad acpi_pad msr parport_pc ppdev lp drm parport efi_pstore dmi_sysfs ip_tables x_tables autofs4 nvme mei_me xhci_pci nvme_core xhci_pci_renesas mei nvme_common vmd pinctrl_alderlake mac_hid
2024-04-05T14:50:11.174577+02:00 sergii kernel: [ 1222.044342] CR2: ffffafb80805ff58
2024-04-05T14:50:11.174577+02:00 sergii kernel: [ 1222.044343] ---[ end trace 0000000000000000 ]---
2024-04-05T14:50:11.174577+02:00 sergii kernel: [ 1222.120989] RIP: 0010:0xffffafb80805ff58
2024-04-05T14:50:11.174577+02:00 sergii kernel: [ 1222.121000] Code: ff ff c8 bd 33 82 ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 40 82 ff ff ff ff <01> 00 00 00 00 00 00 00 b0 2a f9 07 00 00 00 00 d0 49 06 08 00 00
2024-04-05T14:50:11.174577+02:00 sergii kernel: [ 1222.121001] RSP: 0018:ffffafb80805ff08 EFLAGS: 00010046
2024-04-05T14:50:11.174578+02:00 sergii kernel: [ 1222.121002] RAX: 0000000000000000 RBX: ffffafb80805ff58 RCX: 0000000000000000
2024-04-05T14:50:11.174578+02:00 sergii kernel: [ 1222.121003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2024-04-05T14:50:11.174578+02:00 sergii kernel: [ 1222.121004] RBP: ffffafb80805ff08 R08: 0000000000000000 R09: 0000000000000000
2024-04-05T14:50:11.174578+02:00 sergii kernel: [ 1222.121004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-04-05T14:50:11.174578+02:00 sergii kernel: [ 1222.121005] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2024-04-05T14:50:11.174578+02:00 sergii kernel: [ 1222.121005] FS:  00007f063bfff6c0(0000) GS:ffff97c3ffac0000(0000) knlGS:0000000000000000
2024-04-05T14:50:11.174579+02:00 sergii kernel: [ 1222.121006] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-04-05T14:50:11.174579+02:00 sergii kernel: [ 1222.121007] CR2: ffffafb80805ff58 CR3: 000000015bdf4000 CR4: 0000000000750ee0
2024-04-05T14:50:11.174579+02:00 sergii kernel: [ 1222.121008] PKRU: 55555554
2024-04-05T14:50:11.174579+02:00 sergii kernel: [ 1222.121008] note: pt_autograd_0[2915] exited with irqs disabled

Depending on the driver version, the issue can instead show up as:

BUG: kernel NULL pointer dereference, address: 0000000000000001
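
For reading the oops above, here is a minimal sketch that decodes the `error_code(0x0011)` value and checks the NX bit of the PTE printed in the log. The bit meanings follow the standard x86-64 page-fault error code layout; the function names are made up for illustration and are not part of any kernel tool.

```python
# x86 page-fault error code bits (standard x86-64 layout).
PF_BITS = {
    0: "protection violation (page was present)",
    1: "write access",
    2: "user-mode access",
    3: "reserved bit set in page table",
    4: "instruction fetch",
    5: "protection-key violation",
}

# Bit 63 of an x86-64 page-table entry is the execute-disable (NX) bit.
NX_BIT = 1 << 63


def decode_pf_error(code: int) -> list[str]:
    """Return the meaning of each bit that is set in a #PF error code."""
    return [name for bit, name in PF_BITS.items() if code & (1 << bit)]


def pte_is_nx(pte: int) -> bool:
    """True if the execute-disable bit is set in the page-table entry."""
    return bool(pte & NX_BIT)


if __name__ == "__main__":
    # Values copied from the log above.
    print(decode_pf_error(0x0011))
    print(pte_is_nx(0x8000000111B22163))
```

For the values in the log, this reports an instruction fetch from a present page whose PTE has NX set. Note also that RIP (ffffafb80805ff58) is only 0x50 bytes above RSP (ffffafb80805ff08), so the CPU appears to have jumped to an address in the stack region instead of into code.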

Hello, did you solve the problem?
I have the same error as you. I tested different Ubuntu and NVIDIA driver versions and still got the same issue.

I also have a similar setup:

PRIME Z790-P, BIOS 1656 04/18/2024
Intel(R) Core(TM) i9-14900K
RAM: 64GB 5200MHz Kingston FURY
NVIDIA GeForce RTX 4090

I used Python environments and conda.

Log:

may 10 11:23:58 upcgaia kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
may 10 11:23:58 upcgaia kernel: BUG: unable to handle page fault for address: ffffbc7743073f58
may 10 11:23:58 upcgaia kernel: #PF: supervisor instruction fetch in kernel mode
may 10 11:23:58 upcgaia kernel: #PF: error_code(0x0011) - permissions violation
may 10 11:23:58 upcgaia kernel: PGD 100000067 P4D 100000067 PUD 100211067 PMD 13479a067 PTE 8000000285487163
may 10 11:23:58 upcgaia kernel: Oops: 0011 [#1] PREEMPT SMP NOPTI
may 10 11:23:58 upcgaia kernel: CPU: 4 PID: 8939 Comm: python Tainted: P           OE      6.5.0-28-generic #29~22.04.1-Ubuntu
may 10 11:23:58 upcgaia kernel: Hardware name: ASUS System Product Name/PRIME Z790-P, BIOS 1656 04/18/2024
may 10 11:23:58 upcgaia kernel: RIP: 0010:0xffffbc7743073f58
may 10 11:23:58 upcgaia kernel: Code: ff ff f7 9c b1 9f ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 c0 9f ff ff ff ff <00> 00 00 00 00 00 00 00 00 41 6e 14 97 5d 00 00 b0 3a 6f 14 97 5d
may 10 11:23:58 upcgaia kernel: RSP: 0018:ffffbc7743073ed8 EFLAGS: 00010046
may 10 11:23:58 upcgaia kernel: RAX: 0000000000000000 RBX: ffffbc7743073f58 RCX: 0000000000000000
may 10 11:23:58 upcgaia kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
may 10 11:23:58 upcgaia kernel: RBP: ffffbc7743073ed8 R08: 0000000000000000 R09: 0000000000000000
may 10 11:23:58 upcgaia kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
may 10 11:23:58 upcgaia kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
may 10 11:23:58 upcgaia kernel: FS:  0000730474ff9640(0000) GS:ffff934cbf100000(0000) knlGS:0000000000000000
may 10 11:23:58 upcgaia kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
may 10 11:23:58 upcgaia kernel: CR2: ffffbc7743073f58 CR3: 000000011b9b2000 CR4: 0000000000750ee0
may 10 11:23:58 upcgaia kernel: PKRU: 55555554
may 10 11:23:58 upcgaia kernel: Call Trace:
may 10 11:23:58 upcgaia kernel:  <TASK>
may 10 11:23:58 upcgaia kernel:  ? show_regs+0x6d/0x80
may 10 11:23:58 upcgaia kernel:  ? __die+0x24/0x80
may 10 11:23:58 upcgaia kernel:  ? page_fault_oops+0x99/0x1b0
may 10 11:23:58 upcgaia kernel:  ? kernelmode_fixup_or_oops+0xb2/0x140
may 10 11:23:58 upcgaia kernel:  ? __bad_area_nosemaphore+0x1a5/0x2c0
may 10 11:23:58 upcgaia kernel:  ? bad_area_nosemaphore+0x16/0x30
may 10 11:23:58 upcgaia kernel:  ? do_kern_addr_fault+0x7b/0xa0
may 10 11:23:58 upcgaia kernel:  ? exc_page_fault+0x10d/0x1b0
may 10 11:23:58 upcgaia kernel:  ? asm_exc_page_fault+0x27/0x30
may 10 11:23:58 upcgaia kernel:  do_syscall_64+0x67/0x90
may 10 11:23:58 upcgaia kernel:  ? do_syscall_64+0x67/0x90
may 10 11:23:58 upcgaia kernel:  ? do_syscall_64+0x67/0x90
may 10 11:23:58 upcgaia kernel:  ? syscall_exit_to_user_mode+0x37/0x60
may 10 11:23:58 upcgaia kernel:  ? do_syscall_64+0x67/0x90
may 10 11:23:58 upcgaia kernel:  ? do_syscall_64+0x67/0x90
may 10 11:23:58 upcgaia kernel:  ? do_syscall_64+0x67/0x90
may 10 11:23:58 upcgaia kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
may 10 11:23:58 upcgaia kernel: RIP: 0033:0x730704f08c9b
may 10 11:23:58 upcgaia kernel: Code: 73 01 c3 48 8b 0d 95 11 11 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 18 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 11 11 00 f7 d8 64 89 01 48
may 10 11:23:58 upcgaia kernel: RSP: 002b:0000730474ff70d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000018
may 10 11:23:58 upcgaia kernel: RAX: 0000000000000000 RBX: 00005d9714679000 RCX: 0000730704f08c9b
may 10 11:23:58 upcgaia kernel: RDX: 0000000000000000 RSI: 00000000000003ff RDI: 00005d97146e5570
may 10 11:23:58 upcgaia kernel: RBP: 0000730474ff7130 R08: 0000000000000000 R09: 0000000000000000
may 10 11:23:58 upcgaia kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000018000
may 10 11:23:58 upcgaia kernel: R13: 00005d97146f3ab0 R14: 00005d97146e4100 R15: 0000000000000000
may 10 11:23:58 upcgaia kernel:  </TASK>
may 10 11:23:58 upcgaia kernel: Modules linked in: binfmt_misc nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci s>
may 10 11:23:58 upcgaia kernel:  intel_cstate i2c_i801 pmt_class intel_lpss mei_me sparse_keymap realtek platform_profile mei idma64 spi_intel_pci spi_intel i2c_smbus intel_vsec wmi_bmof mac_hid nvidia_uvm(POE) acpi_pad acpi_tad sch_fq_codel efi_pstore ip_t>
may 10 11:23:58 upcgaia kernel: CR2: ffffbc7743073f58
may 10 11:23:58 upcgaia kernel: ---[ end trace 0000000000000000 ]---
may 10 11:23:58 upcgaia kernel: RIP: 0010:0xffffbc7743073f58
may 10 11:23:58 upcgaia kernel: Code: ff ff f7 9c b1 9f ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e6 00 c0 9f ff ff ff ff <00> 00 00 00 00 00 00 00 00 41 6e 14 97 5d 00 00 b0 3a 6f 14 97 5d
may 10 11:23:58 upcgaia kernel: RSP: 0018:ffffbc7743073ed8 EFLAGS: 00010046
may 10 11:23:58 upcgaia kernel: RAX: 0000000000000000 RBX: ffffbc7743073f58 RCX: 0000000000000000
may 10 11:23:58 upcgaia kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
may 10 11:23:58 upcgaia kernel: RBP: ffffbc7743073ed8 R08: 0000000000000000 R09: 0000000000000000
may 10 11:23:58 upcgaia kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
may 10 11:23:58 upcgaia kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
may 10 11:23:58 upcgaia kernel: FS:  0000730474ff9640(0000) GS:ffff934cbf100000(0000) knlGS:0000000000000000
may 10 11:23:58 upcgaia kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
may 10 11:23:58 upcgaia kernel: CR2: ffffbc7743073f58 CR3: 000000011b9b2000 CR4: 0000000000750ee0
may 10 11:23:58 upcgaia kernel: PKRU: 55555554
may 10 11:23:58 upcgaia kernel: note: python[8939] exited with irqs disabled

No, the issue is not solved, but I have narrowed it down. I think I had two different issues:

  • kernel tried to execute NX-protected page
  • BUG: kernel NULL pointer dereference, address: 0000000000000000

I think neither issue is related to the GPU hardware: I had both of them on an RTX4090 and on a GTX1090Ti. Your hardware seems very similar to mine, so the cause is somewhere there: motherboard/CPU/RAM/NVMe, or the compatibility of those with Linux.

What I did is:

  • downgraded Linux from 23.10 to 22.04 - this helped to mitigate kernel tried to execute NX-protected page - I still see it in the logs, but it does not hang the computer any more
  • I narrowed down BUG: kernel NULL pointer dereference, address: 0000000000000000, which is what actually causes the computer to freeze: it comes from some Hugging Face packages (peft / transformers / trl). If I train a model in pure torch, there is no issue.

What is the problem in your case? Does your computer crash/hang? What packages do you use?

I tried on Ubuntu 22.04, 24.06, and also on Windows 11. I also tried different NVIDIA Driver Versions such as 535, 545, and 550, and also updated the MB BIOS.

I used Python 3.10 and 3.12 with TensorFlow/Keras (2.15, 2.16), in a Python environment or conda, to run different models (ConvLSTM, CNN, LogisticRegression, among others) with GPU/CPU training. In the middle of training, Ubuntu froze or the Python process just hung. On Windows too, in the middle of training, Python crashes or Windows gets a blue screen.

I think it is more of a CPU/MB issue in combination with TF: every time I use only the CPU, it crashes within a short time, but when the GPU is used, the training sometimes finishes properly.

Main issues that I find on Ubuntu:

  • kernel tried to execute NX-protected page - exploit attempt?
  • BUG: kernel NULL pointer dereference, address: 0000000000000000
  • watchdog: Watchdog detected hard LOCKUP on cpu 4
  • watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:1:7415]
  • rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 4-...0: (17 ticks this GP) idle=f6b4/1/0x4000000000000000 softirq=120486/120489 fqs=2962
  • rcu: rcu_preempt kthread starved for 7501 jiffies! g176097 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=22
  • rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
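
To see how often each of these shows up, a small sketch like the following can count them in a kernel log. The patterns are taken from the messages listed above; the category names and the script name in the usage line are made up for illustration.

```python
import re
import sys
from collections import Counter

# Signatures copied from the kernel messages quoted in this thread.
SIGNATURES = {
    "nx_exec": re.compile(r"kernel tried to execute NX-protected page"),
    "null_deref": re.compile(r"BUG: kernel NULL pointer dereference"),
    "hard_lockup": re.compile(r"Watchdog detected hard LOCKUP"),
    "soft_lockup": re.compile(r"BUG: soft lockup"),
    "rcu_stall": re.compile(r"rcu_preempt detected stalls"),
    "rcu_starved": re.compile(r"rcu_preempt kthread starved"),
}


def count_signatures(lines):
    """Count how many log lines match each known crash signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Reads log lines from stdin and prints one count per signature seen.
    for name, n in count_signatures(sys.stdin).items():
        print(f"{name}: {n}")
```

Usage, assuming the script is saved as scan_oops.py (a hypothetical name): `journalctl -k | python scan_oops.py`.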

Are you guys sure this is a GPU issue?
Since you use an Intel i9-14900KF, it may be related to Intel bugs in those CPUs.

Try running a CPU/RAM-only task and see if you get errors within a few hours of testing.
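
A minimal sketch of such a CPU-only test, assuming nothing beyond the Python standard library: every core runs the same deterministic hashing job, so on a stable machine all results must be identical. The job, the round count, and the 2-hour duration are arbitrary choices for illustration; a dedicated stress tool will load the CPU and RAM much harder.

```python
import hashlib
import multiprocessing as mp
import time


def work(seed: int, rounds: int = 200_000) -> str:
    """Deterministic CPU-bound job: repeated SHA-256 over a running digest."""
    digest = seed.to_bytes(8, "little")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def run_once(workers: int, rounds: int = 200_000) -> bool:
    """Run the same job on every worker and check that all results agree."""
    with mp.Pool(workers) as pool:
        results = pool.starmap(work, [(42, rounds)] * workers)
    return len(set(results)) == 1


if __name__ == "__main__":
    workers = mp.cpu_count()
    deadline = time.monotonic() + 2 * 60 * 60  # assumed duration: 2 hours
    passes = 0
    while time.monotonic() < deadline:
        if not run_once(workers):
            print(f"MISMATCH after {passes} passes: results differ between workers")
            break
        passes += 1
    else:
        print(f"{passes} passes without a mismatch")
```

If the machine oopses or freezes while only this is running, with no GPU work at all, that points away from the GPU and its driver.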


Yeah, that was a good direction to search in, thank you for the suggestion. It seems like people have lots of issues with the 14900 out there.

I did not try the solutions mentioned here https://community.intel.com/t5/Processors/Solved-Stability-issue-with-proc-I9-14900K-crash-BSOD/m-p/1574516#M69747 but disabling Hyper-Threading in BIOS seems to have solved my issue.

@ManuelMD can you try disabling Hyper-Threading in BIOS and see if this helps in your case?
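
To confirm from Linux that the BIOS change took effect, the kernel exposes the SMT (Hyper-Threading) state in sysfs. A minimal sketch, assuming a kernel that provides /sys/devices/system/cpu/smt; the function name is made up for illustration:

```python
from pathlib import Path

# Standard sysfs location of the SMT state on Linux.
SMT_DIR = Path("/sys/devices/system/cpu/smt")


def smt_state(smt_dir: Path = SMT_DIR) -> dict:
    """Read the 'active' and 'control' files; None if a file is missing."""
    state = {}
    for name in ("active", "control"):
        f = smt_dir / name
        state[name] = f.read_text().strip() if f.exists() else None
    return state


if __name__ == "__main__":
    s = smt_state()
    print(f"SMT active:  {s['active']}")   # "0" means no sibling threads are in use
    print(f"SMT control: {s['control']}")  # e.g. "on", "off", "forceoff", "notsupported"
```

With Hyper-Threading disabled, `active` should read 0.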


It seems that disabling Hyper-Threading solved the issue. I have done some testing and so far it is working well.

I saw some posts about this issue, and it seems that the cores don't get enough voltage. The CPU worked properly for the first month of use; then the issues started, first after hours of running, and eventually it took only a few minutes to crash. This probably means that the CPU is not stable at the core voltage (Vcore) it gets.

Thanks a lot! :)

There seem to be BIOS updates popping up that include an Intel Baseline Profile, which apparently solves the issue:

PROART Z790-CREATOR WIFI BIOS 2202
Version 2202
13.43 MB
2024/04/19
"The update introduces the Intel Baseline Profile option, allowing users to revert to Intel factory default settings for basic functionality, lower power limits, and improving stability in certain games."

It does not seem to help in my case, though.

UPDATE: enabling the Intel Baseline Profile and/or underclocking the CPU to 57x worked for me with HT enabled.