Bug report: 455.23.04 - Kernel Panic due to NULL pointer dereference

Can you test whether the patch from “455.23.04: Page allocation failure in kernel module at random points” helps?

Thanks for your report. It helps. I downgraded the driver but not the kernel. I will keep you informed.

Man, I don’t get it. Why are you so heated in this convo? There is no need for this sort of language when talking to me or anyone else on the internet. Nvidia is not being nice to you and you are still defending them.

I don’t want to drag this convo on any longer than it needs to be, as you are already too heated for any reasonable convo to happen.

Just saying, stuff like this would never fly on Windows. I could not imagine any driver on Windows forcing a user to keep an unsafe older kernel, with people like you still defending it. Like I said before, and I’ll say it again, Nvidia does not care about anything about you except your money. If any of the devs would like to provide a public statement I’ll be happy to listen, but your crudeness is just uncalled for when I thought I had a well-structured argument and even provided references.

If you do work for Nvidia then I’m sorry if I offended you in any way, shape or form, but it’s not you that I’m annoyed at. Like I said in my first post, Nvidia treats Linux users like crap and they will continue to. I simply felt like this was not voiced at all in this thread.

I’m not gonna fire back on any of your statements even though they are quite self-absorbed and give off a massive “idgaf and neither should you” vibe. You can take this win cus, like @kerberizer has said, it’s uncivil and unneeded. I didn’t feel the need to insult you and you shouldn’t have felt the need to insult me either. Reflect on why you felt that and why you felt like saying those things to me; I think it’ll make you a better person in the long run. Don’t mean to offend you or anyone else here.

Again, nvidia like AMD (and any other corp) doesn’t care about anything about you except your money. I’m not an AMD shill like you have tried to paint me, I’m trying to voice my frustration at a company that doesn’t care about the consumer. Enjoy your nvidia GPU and the rest of your day. I’m out.

This patch seems to work for me on Linux 5.9.10-zen1-1-zen for a while. I have had a freeze after leaving my PC playing a YouTube video on Firefox so I don’t think it’s fully stable but definitely more stable than just running 455.45.01.

Hi Yuannan,

Please capture an nvidia bug report once you hit the issue again and share it with us.

Requesting others to also create and share an nvidia bug report when the issue is triggered.
We already have a few bug reports in this forum, but we need a few more to understand the issue, as we are having trouble reproducing it locally.


Thank you amrits,

I’ll try and set up SSH as some users have reported it still working even after the freeze. I’ll try and get journalctl logs as well if they are written before the hard freeze. Sorry, didn’t mean to just provide an empty report but I literally couldn’t in this case.

It does seem a bit weird that you are having trouble reproducing this. I have some experience in operating systems programming, though not at this level. From what I can gather, and given how the patch has helped, it might be a stack/buffer overflow corrupting some essential call stacks. I wonder if there is a safe and sanitary way to dump the memory as well as the register logs in order to assist in debugging. The current logs do provide registers, but that’s not much help if you don’t know how execution got there. It would be extremely invasive for the drivers to do so by default, but I’d be happy to run a logging daemon for the purposes of squashing this bug. Again, I’m not sure how good an idea this is or how helpful it would be, considering how invasive it is.

Also don’t know if this is related as I’ve been ignoring it for a while now.
In my music player (Tauon Music Player) there are multiple devices listed under “HDA Nvidia at 0xf6080000 irq 71” despite me setting the pulseaudio profile to off, essentially disabling it. I don’t see any Nvidia devices in pavucontrol or under “pacmd list-sinks” once I have disabled this. Clicking one also selects all of them and freezes the program until I force kill it. Tauon is fully FOSS AFAIK, so it might be worth looking into its code, as this bug does seem to be triggered by audio/media/VDPAU. I don’t know enough to fully comment on the implications but I thought I would bring it up.

I’m sorry if any of my previous posts came off in the wrong way, I do greatly appreciate your work and don’t mean to attack anyone in particular. There are dozens of us using it everyday! Dozens! However I do still feel Nvidia does not treat Linux users with anywhere near the same amount of respect and stability as Windows users at all.

Thank you :), lmk if there is anything else I can do to help.

This is what I get now before a freeze on 455.45.01-4:

nvidia-modeset/: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 2 PID: 666 Comm: nvidia-modeset/ Tainted: P           OE     5.9.11-arch2-1 #1
Hardware name: ASUS All Series/Z87I-DELUXE, BIOS 1104 05/29/2014
Call Trace:
 dump_stack+0x6b/0x83
 warn_alloc.cold+0x78/0xdc
 ? __alloc_pages_direct_compact+0x140/0x160
 __alloc_pages_slowpath.constprop.0+0xcdd/0xd10
 ? _nv002242kms+0x380/0x6f0 [nvidia_modeset]
 __alloc_pages_nodemask+0x2f2/0x320
 kmalloc_order+0x28/0x80
 kmalloc_order_trace+0x1d/0xb0
 __kmalloc+0x266/0x2a0
 nvkms_alloc+0x20/0x50 [nvidia_modeset]
 _nv002718kms+0x16/0x30 [nvidia_modeset]
 ? _nv002593kms+0x4e/0x1610 [nvidia_modeset]
 ? _nv002426kms+0x40/0x40 [nvidia_modeset]
 ? _nv000550kms+0x365/0x3c0 [nvidia_modeset]
 ? _nv002680kms+0x309/0x3c0 [nvidia_modeset]
 ? _nv002698kms+0x29f/0x540 [nvidia_modeset]
 ? schedule+0x50/0xf0
 ? schedule_timeout+0x12d/0x170
 ? preempt_count_add+0x68/0xa0
 ? _raw_spin_lock_irq+0x1a/0x40
 ? __down_interruptible+0x94/0x100
 ? _nv000528kms+0x71/0x80 [nvidia_modeset]
 ? nvkms_kthread_q_callback+0x7c/0xd0 [nvidia_modeset]
 ? _main_loop+0x83/0x130 [nvidia_modeset]
 ? nvkms_sema_up+0x10/0x10 [nvidia_modeset]
 ? kthread+0x142/0x160
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x22/0x30
Mem-Info:
active_anon:464077 inactive_anon:712988 isolated_anon:0
 active_file:181075 inactive_file:179078 isolated_file:0
 unevictable:40 dirty:25006 writeback:0
 slab_reclaimable:31846 slab_unreclaimable:59060
 mapped:234022 shmem:111527 pagetables:21484 bounce:0
 free:62549 free_pcp:315 free_cma:0
Node 0 active_anon:1856308kB inactive_anon:2851952kB active_file:724300kB inactive_file:716312kB unevictable:160kB isolated(anon):0kB isolated(file):0kB mapped:936088kB dirty:100024kB writeback:0kB shmem:446>
Node 0 DMA free:14872kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:159>
lowmem_reserve[]: 0 3407 7858 7858 7858
Node 0 DMA32 free:95564kB min:95056kB low:102368kB high:109680kB reserved_highatomic:0KB active_anon:789856kB inactive_anon:1303888kB active_file:298044kB inactive_file:378464kB unevictable:16kB writepending>
lowmem_reserve[]: 0 0 4450 4450 4450
Node 0 Normal free:139760kB min:124144kB low:133692kB high:143240kB reserved_highatomic:0KB active_anon:1065936kB inactive_anon:1548064kB active_file:425792kB inactive_file:337976kB unevictable:144kB writepe>
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14872kB
Node 0 DMA32: 8219*4kB (UME) 4378*8kB (UME) 1122*16kB (UME) 311*32kB (UME) 2*64kB (ME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 96060kB
Node 0 Normal: 21713*4kB (UME) 4375*8kB (UME) 1037*16kB (UME) 26*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 139276kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
565730 total pagecache pages
116826 pages in swap cache
Swap cache stats: add 2099453, delete 1982869, find 531471/860277
Free swap  = 3767696kB
Total swap = 6835600kB
2080339 pages RAM
0 pages HighMem/MovableOnly
59797 pages reserved
0 pages hwpoisoned
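
A note on reading the dump above: `order:4` is a request for 2^4 = 16 physically contiguous pages, which is 64 kB with 4 kB pages. The per-zone lines such as `Node 0 Normal: 21713*4kB ... 0*64kB ...` list the free blocks by size, so they show whether a zone had any free block that large. Below is a minimal Python sketch of that check. The function names are made up for illustration, 4 kB pages are assumed, and it ignores watermarks and zone fallback, which the real allocator also takes into account.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as is usual on x86_64


def parse_buddy_line(line):
    """Return {block_size_kB: free_count} from a 'Node 0 <zone>: N*SkB ...' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def largest_order_available(free_blocks):
    """Highest allocation order that has at least one free block, or None."""
    orders = [(size // PAGE_KB).bit_length() - 1
              for size, count in free_blocks.items() if count > 0]
    return max(orders) if orders else None


def can_satisfy(free_blocks, order):
    """True if some free block is at least 2**order pages in size."""
    needed_kb = PAGE_KB << order
    return any(count > 0 and size >= needed_kb
               for size, count in free_blocks.items())


# The Normal zone line copied from the log above.
normal = ("Node 0 Normal: 21713*4kB (UME) 4375*8kB (UME) 1037*16kB (UME) "
          "26*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB "
          "0*4096kB = 139276kB")
blocks = parse_buddy_line(normal)
print("order-4 block free in Normal zone:", can_satisfy(blocks, 4))
print("largest free order in Normal zone:", largest_order_available(blocks))
```

For the Normal zone in this log the largest free block is 32 kB (order 3), so plenty of memory is free but it is too fragmented for a 64 kB contiguous request. The DMA32 line still lists a couple of 64 kB blocks, but that zone was sitting close to its min watermark (free:95564kB against min:95056kB), which may be why it was not used.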

About a month ago, I posted here that I was going to give the LTS kernel (5.4) a try.

So, I’ve been using it for this one month period, and I haven’t experienced a single crash yet. I’ve also been updating the NVIDIA driver like always (currently using 455.45.01).

But since some people have said they were experiencing the crash even with the LTS kernel, I might have been lucky…


linux 5.9.9.arch1-1
nvidia 455.45.01

Dec 01 15:54:04 hostname kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Dec 01 15:54:04 hostname kernel: #PF: supervisor read access in kernel mode
Dec 01 15:54:04 hostname kernel: #PF: error_code(0x0000) - not-present page
Dec 01 15:54:04 hostname kernel: PGD 80000001e8601067 P4D 80000001e8601067 PUD 0 
Dec 01 15:54:04 hostname kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Dec 01 15:54:04 hostname kernel: CPU: 2 PID: 632 Comm: irq/38-nvidia Tainted: P           OE     5.9.9-arch1-1 #1
Dec 01 15:54:04 hostname kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97M Pro4, BIOS P1.90A 06/25/2015
Dec 01 15:54:04 hostname kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebc00 EFLAGS: 00010202
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 01 15:54:04 hostname kernel: RDX: ffff9de38a28bdc8 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 01 15:54:04 hostname kernel: RBP: ffff9de6186ea9d0 R08: ffffffffc2c1a530 R09: ffff9de6186ea9b0
Dec 01 15:54:04 hostname kernel: R10: ffffffffc1865820 R11: ffff9de614809008 R12: 0000000000000020
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: ffff9de6186eab38 R15: ffff9de6186eac78
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000020 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: Call Trace:
Dec 01 15:54:04 hostname kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv036719rm+0xc3/0x350 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv036718rm+0x5c/0x70 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv011615rm+0x78/0xd0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv011615rm+0x1a/0xd0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv024757rm+0x251/0x3e0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv024706rm+0x25/0x150 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv015453rm+0x9b/0x270 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv026077rm+0x290/0x290 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv027734rm+0x273/0xdc0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? disable_irq_nosync+0x10/0x10
Dec 01 15:54:04 hostname kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? irq_thread_fn+0x20/0x60
Dec 01 15:54:04 hostname kernel:  ? irq_thread+0xf5/0x1a0
Dec 01 15:54:04 hostname kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Dec 01 15:54:04 hostname kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Dec 01 15:54:04 hostname kernel:  ? kthread+0x142/0x160
Dec 01 15:54:04 hostname kernel:  ? __kthread_bind_mask+0x60/0x60
Dec 01 15:54:04 hostname kernel:  ? ret_from_fork+0x22/0x30
Dec 01 15:54:04 hostname kernel: Modules linked in: fuse xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE xt_multiport nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo ip6table_filter ip6_tables xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrac>
Dec 01 15:54:04 hostname kernel:  snd_compress ff_memless input_leds mc joydev ac97_bus snd_hda_core mousedev drm_kms_helper cec snd_hwdep snd_pcm_dmaengine mei_me i2c_nvidia_gpu rc_core e1000e snd_pcm intel_gtt syscopyarea sysfillrect sysimgblt fb_sys_f>
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000020
Dec 01 15:54:04 hostname kernel: ---[ end trace 44dde195cb4c28e5 ]---
Dec 01 15:54:04 hostname kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebc00 EFLAGS: 00010202
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 01 15:54:04 hostname kernel: RDX: ffff9de38a28bdc8 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 01 15:54:04 hostname kernel: RBP: ffff9de6186ea9d0 R08: ffffffffc2c1a530 R09: ffff9de6186ea9b0
Dec 01 15:54:04 hostname kernel: R10: ffffffffc1865820 R11: ffff9de614809008 R12: 0000000000000020
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: ffff9de6186eab38 R15: ffff9de6186eac78
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000020 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: BUG: kernel NULL pointer dereference, address: 0000000000000930
Dec 01 15:54:04 hostname kernel: #PF: supervisor write access in kernel mode
Dec 01 15:54:04 hostname kernel: #PF: error_code(0x0002) - not-present page
Dec 01 15:54:04 hostname kernel: PGD 80000001e8601067 P4D 80000001e8601067 PUD 0 
Dec 01 15:54:04 hostname kernel: Oops: 0002 [#2] PREEMPT SMP PTI
Dec 01 15:54:04 hostname kernel: CPU: 2 PID: 632 Comm: irq/38-nvidia Tainted: P      D    OE     5.9.9-arch1-1 #1
Dec 01 15:54:04 hostname kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97M Pro4, BIOS P1.90A 06/25/2015
Dec 01 15:54:04 hostname kernel: RIP: 0010:mutex_lock+0x10/0x20
Dec 01 15:54:04 hostname kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 61 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebe30 EFLAGS: 00010246
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Dec 01 15:54:04 hostname kernel: RDX: ffff9de6186fdb80 RSI: 0000000000000000 RDI: 0000000000000930
Dec 01 15:54:04 hostname kernel: RBP: 0000000000000930 R08: 000000000000000f R09: 0000000000000000
Dec 01 15:54:04 hostname kernel: R10: ffff9de618685c00 R11: ffffae2c00aeb801 R12: ffff9de6186fe34c
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffff9de6186fdb80
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000930 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: Call Trace:
Dec 01 15:54:04 hostname kernel:  perf_event_exit_task+0x30/0x440
Dec 01 15:54:04 hostname kernel:  ? put_cpu_partial+0x92/0x140
Dec 01 15:54:04 hostname kernel:  ? kfree+0x40f/0x440
Dec 01 15:54:04 hostname kernel:  do_exit+0x37f/0xaa0
Dec 01 15:54:04 hostname kernel:  ? task_work_run+0x5c/0x90
Dec 01 15:54:04 hostname kernel:  ? do_exit+0x36f/0xaa0
Dec 01 15:54:04 hostname kernel:  ? kthread+0x142/0x160
Dec 01 15:54:04 hostname kernel:  ? rewind_stack_do_exit+0x17/0x17
Dec 01 15:54:04 hostname kernel: Modules linked in: fuse xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE xt_multiport nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo ip6table_filter ip6_tables xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrac>
Dec 01 15:54:04 hostname kernel:  snd_compress ff_memless input_leds mc joydev ac97_bus snd_hda_core mousedev drm_kms_helper cec snd_hwdep snd_pcm_dmaengine mei_me i2c_nvidia_gpu rc_core e1000e snd_pcm intel_gtt syscopyarea sysfillrect sysimgblt fb_sys_f>
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000930
Dec 01 15:54:04 hostname kernel: ---[ end trace 44dde195cb4c28e6 ]---
Dec 01 15:54:04 hostname kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebc00 EFLAGS: 00010202
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 01 15:54:04 hostname kernel: RDX: ffff9de38a28bdc8 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 01 15:54:04 hostname kernel: RBP: ffff9de6186ea9d0 R08: ffffffffc2c1a530 R09: ffff9de6186ea9b0
Dec 01 15:54:04 hostname kernel: R10: ffffffffc1865820 R11: ffff9de614809008 R12: 0000000000000020
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: ffff9de6186eab38 R15: ffff9de6186eac78
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000930 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: Fixing recursive fault but reboot is needed!
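
For anyone comparing their own logs with the one above, the `#PF: error_code(...)` value is a bit field defined by the x86 architecture, and the kernel's "supervisor read access" / "not-present page" lines are just its decoded form. Here is a small Python sketch of that decoding. The table and function name are my own for illustration and only the low five bits are covered.

```python
# Low bits of the x86 page-fault error code:
# (mask, meaning when set, meaning when clear or None)
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code):
    """Return a list of human-readable properties for a #PF error code."""
    properties = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            properties.append(if_set)
        elif if_clear is not None:
            properties.append(if_clear)
    return properties


# The two error codes that appear in the log above.
for code in (0x0000, 0x0002):
    print(f"error_code({code:#06x}):", ", ".join(decode_pf_error_code(code)))
```

In the log, the first oops (a read at address 0x20 inside the nvidia module) is the interesting one. The second one (a write at 0x930 in mutex_lock, reached from do_exit) appears to be a follow-on fault while the kernel was tearing down the already crashed IRQ thread, which fits the final "Fixing recursive fault" message.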

Hi All,
Please share a complete nvidia bug report captured while the system is in the repro state.


As a reminder, a bug report can be collected by running nvidia-bug-report.sh and attaching the generated nvidia-bug-report.log.gz here. You may have to add the --safe-mode argument if the script hangs.
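
Since several people in this thread report the script hanging, one option is to run it under a timeout so that an SSH session is not blocked forever. The following is only a rough Python sketch: the helper names and the 300 second default are arbitrary choices of mine, and the only flags used are the ones already mentioned in this thread (`--safe-mode` and `--extra-system-data`).

```python
import subprocess


def bug_report_command(safe_mode=False, extra_system_data=False):
    """Build the argv for nvidia-bug-report.sh from flags mentioned in this thread."""
    cmd = ["nvidia-bug-report.sh"]
    if safe_mode:
        cmd.append("--safe-mode")
    if extra_system_data:
        cmd.append("--extra-system-data")
    return cmd


def run_with_timeout(cmd, timeout_s=300):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example (needs root, so start the Python script itself with sudo):
#   status = run_with_timeout(bug_report_command(safe_mode=True))
#   print("timed out" if status is None else f"exit code {status}")
```

One caveat: if the script is stuck inside the kernel, it may not react to being killed, in which case the timeout will not help either. Whatever partial nvidia-bug-report.log.gz was written may still be worth attaching.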

Just to say that after downgrading the driver to 450.80.02 and keeping the kernel “5.9.11-2-MANJARO” I have no crash anymore.

To be fair, when it was happening to me, the script would hang even after running with --safe-mode. And keep in mind that I had to SSH from my Android phone into my machine in order to run the script, because when this crash happens, it freezes literally everything and you can’t even switch TTYs.

The audio device list looks normal; the player probably lists the devices supported by the ALSA drivers.

The crash is still happening with the patched drivers from “455.23.04: Page allocation failure in kernel module at random points”.
I can confirm it happening at least 2 more times with the patched drivers. I just hard crashed with these drivers and could not SSH in or switch to a TTY.

I have recently fully updated to nvidia-dkms 455.45.01-1 on Arch Linux 5.9.12-zen1-1-zen in good faith. As of 2020/12/08 it’s been 75 days since the first official report on 2020/09/24 in “455.23.04: Page allocation failure in kernel module at random points”.

This issue is definitely still not fixed and still affecting people. The freezes today are not fully hard and I could go out to a TTY. In fact while typing this message I’ve had it freeze another 2 times both recoverable from a TTY.

I have included 3 bug reports:
0: right after my display froze and I went to TTY2 to run “sudo nvidia-bug-report.sh”
nvidia-bug-report0.log.gz (340.8 KB)

1: while typing up v1 (good thing the website saves posts) of this post I had the screen freeze again. I went to TTY2 to capture this. Then “systemctl restart sddm”
nvidia-bug-report1.log.gz (502.4 KB)

2: Freeze again while typing this current post, I didn’t have to restart sddm but it just worked right after switching back to TTY7 from TTY2.
nvidia-bug-report2.log.gz (555.0 KB)

For some reason, once it does happen it likes to keep happening. I’ve just experienced another 2 soft freezes on top of the 3 previous reports (recovered from TTY2 with no sddm restart) while finishing up this post’s formatting.

nvidia-bug-report3.log.gz (608.5 KB)

Maybe do something with the record $3B gross this quarter (enough to hire 15000*4 devs at 200k per year) and actually fix some stuff?

And to top off my post I’ve had 2 other soft freezes while typing out my napkin maths.
nvidia-bug-report4.log.gz (660.9 KB)

fix. your. drivers.

On 455.45.01, the freeze is triggered by watching DVB TV through a TV tuner in VLC with VDPAU enabled. Without VDPAU, e.g. using OpenGL video output in VLC, no freeze is observed. Many Bothans died of boredom while being forced to watch US daytime TV to bring you this information.

This bug report was gathered after a freeze but before a reboot, but even running with the additional args sudo nvidia-bug-report.sh --safe-mode --extra-system-data, a few lines of the bug report script had to be commented out for the report to complete.

nvidia-bug-report.log.gz (90.9 KB)

The GTX 760 at PCI 28:00.0 is being passed through via VFIO, so isn’t being handled by the nvidia driver.

@anon52993935 I peeked at your nvidia-bug-report0.log.gz and it is not a NULL pointer dereference bug, it’s a page allocation failure bug. Which leads me to suspect maybe you didn’t apply the patch when you compiled the kernel module. Did DKMS display a message that it is applying the patch? Or better yet, look at nvkms_alloc disassembly to verify that the allocation size is compared against 4096 (which is what the patch changed):

  1. Find the compiled nvidia-modeset kernel module. On my system it is here: /lib/modules/<kernel-version>/updates/dkms/nvidia-modeset.ko. Note that the kernel version must match the kernel you are running.
  2. Disassemble it with objdump -S nvidia-modeset.ko >nvidia-modeset.S
  3. In nvidia-modeset.S, search for nvkms_alloc function.
  4. In its initial instructions, there will be a cmp $0x1000,%rdi. Here, 0x1000 is 4096, so the patch is applied. The %rdi register may be different, if the compiler generated the code differently in your case. If it says 0x20000 then the patch is not applied.
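
The check in steps 3 and 4 can also be scripted against the nvidia-modeset.S file produced in step 2. This is a sketch under the assumptions stated in the post (0x1000 means patched, 0x20000 means unpatched); the function names and the sample excerpt are hypothetical, and real objdump output will contain many more lines.

```python
import re


def first_cmp_immediate(disassembly, function="nvkms_alloc"):
    """Return the immediate of the first `cmp $imm,...` inside `function`, or None.

    `disassembly` is objdump output in AT&T syntax (objdump's default on x86).
    """
    in_function = False
    for line in disassembly.splitlines():
        header = re.match(r"^[0-9a-f]+ <([^>]+)>:", line)
        if header:
            if in_function:
                break  # reached the next function without finding a cmp
            in_function = header.group(1) == function
            continue
        if in_function:
            match = re.search(r"\bcmp[bwlq]?\s+\$0x([0-9a-f]+),", line)
            if match:
                return int(match.group(1), 16)
    return None


def patch_status(immediate):
    """Map the compared size to the interpretation given in the post above."""
    if immediate == 0x1000:
        return "patched"
    if immediate == 0x20000:
        return "unpatched"
    return "unknown"


# Hypothetical, hand-written excerpt just to exercise the parser.
SAMPLE = """
0000000000000000 <nvkms_alloc>:
   0:   e8 00 00 00 00          call   5 <nvkms_alloc+0x5>
   5:   48 81 ff 00 10 00 00    cmp    $0x1000,%rdi
"""
print(patch_status(first_cmp_immediate(SAMPLE)))
```

To run it on a real module, read the file first, e.g. `first_cmp_immediate(open("nvidia-modeset.S").read())`. An "unknown" result means the compiler generated something different and the disassembly needs to be inspected by hand.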

I am now having this bug with 455.45.01 with the patch posted on the other thread applied and confirmed as Lastique has outlined above in the kernel module. It is triggered by watching kodi with vdpau. It also happened with the previous 450.80.02 driver.

I have attached my bug report log but as others have said it simply would not complete without commenting out certain lines of the script. Also I have most debug and coredump functionality disabled in kernel so may not be much help, I don’t know.

What I do find interesting is that I was using 450.80.02 with kernel 5.9.0 for a couple of months without hitting the bug once. Then, one day after installing a slew of updates to my system (excluding the kernel and nvidia, which I had left unchanged at that point), I first hit this bug. That leads me to think it wasn’t directly related to either of those but to some other package I had updated. I could not see any likely candidates, but then I am no expert and this could all have just been coincidental.

Anyway now with 450.80.02 and 455.45.01 with and without the patch and any 5.9.x kernel this bug is recurring for me. I will have to restore a backup and/or go back to the LTS kernel and do some more testing. If there is any other way to help diagnose and get this resolved I would be happy to hear any suggestion.

nvidia-bug-report.log.gz (68.8 KB)

The nvidia-bug-report.sh hangs with any possible options, but generates some logs anyway.
nvidia-bug-report.log.gz (50.7 KB) nvidia-bug-report2.log.gz (1.2 KB)

Just had my system crash because of what I believe to be this issue. Also had a similar crash yesterday. Both times I had my browser (Brave, a Chromium derivative) open and I was unable to switch to a TTY. Had to reboot the system by raising the elephant (SysRq). My card is a GeForce GTX 1060 3GB from Gigabyte.

This is the output of uname -a:

Linux jonasdesktop 5.9.14-arch1-1 #1 SMP PREEMPT Sat, 12 Dec 2020 14:37:12 +0000 x86_64 GNU/Linux

Here’s the error message from the kernel:

Dez 20 11:31:10 jonasdesktop kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: #PF: supervisor read access in kernel mode
Dez 20 11:31:10 jonasdesktop kernel: #PF: error_code(0x0000) - not-present page
Dez 20 11:31:10 jonasdesktop kernel: PGD 800000064ff43067 P4D 800000064ff43067 PUD 0 
Dez 20 11:31:10 jonasdesktop kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Dez 20 11:31:10 jonasdesktop kernel: CPU: 0 PID: 571 Comm: irq/127-nvidia Tainted: P           OE     5.9.14-arch1-1 #1
Dez 20 11:31:10 jonasdesktop kernel: Hardware name: MSI MS-7982/B150M PRO-VDH (MS-7982), BIOS 3.H0 07/10/2018
Dez 20 11:31:10 jonasdesktop kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dez 20 11:31:10 jonasdesktop kernel: RSP: 0000:ffffa52f40b07be0 EFLAGS: 00010202
Dez 20 11:31:10 jonasdesktop kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dez 20 11:31:10 jonasdesktop kernel: RDX: ffff9220953e2808 RSI: ffffffffffffffff RDI: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: RBP: ffff9220a927d940 R08: ffffffffc276d530 R09: ffff9220a927d920
Dez 20 11:31:10 jonasdesktop kernel: R10: ffffffffc13b8820 R11: ffff9220d01ab808 R12: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: R13: 0000000000000000 R14: ffff9220a927daa8 R15: ffff9220a927dbb0
Dez 20 11:31:10 jonasdesktop kernel: FS:  0000000000000000(0000) GS:ffff9220d5c00000(0000) knlGS:0000000000000000
Dez 20 11:31:10 jonasdesktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 20 11:31:10 jonasdesktop kernel: CR2: 0000000000000020 CR3: 000000064876c005 CR4: 00000000003706f0
Dez 20 11:31:10 jonasdesktop kernel: Call Trace:
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv011619rm+0xff/0x180 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv018449rm+0x1af/0x210 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv018389rm+0xd9a/0xe90 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv018390rm+0xde/0x260 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv018356rm+0x72/0xc0 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv018370rm+0x235/0x2d0 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv026076rm+0x10/0x10 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv018403rm+0xac/0xe0 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv027734rm+0x820/0xdc0 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? disable_irq_nosync+0x10/0x10
Dez 20 11:31:10 jonasdesktop kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel:  ? irq_thread_fn+0x20/0x60
Dez 20 11:31:10 jonasdesktop kernel:  ? irq_thread+0xf5/0x1a0
Dez 20 11:31:10 jonasdesktop kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Dez 20 11:31:10 jonasdesktop kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Dez 20 11:31:10 jonasdesktop kernel:  ? kthread+0x142/0x160
Dez 20 11:31:10 jonasdesktop kernel:  ? __kthread_bind_mask+0x60/0x60
Dez 20 11:31:10 jonasdesktop kernel:  ? ret_from_fork+0x22/0x30
Dez 20 11:31:10 jonasdesktop kernel: Modules linked in: rfcomm veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp>
Dez 20 11:31:10 jonasdesktop kernel:  mdio_devres glue_helper rapl snd_hda_core ecdh_generic intel_cstate of_mdio fixed_phy intel_uncore snd_hwdep rfkill pcspkr drm_kms_helper ecc i2c_i801 libphy i2c_smbus snd_pcm cec tpm_crb intel_lpss_pci snd_timer rc_core snd sysco>
Dez 20 11:31:10 jonasdesktop kernel: CR2: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: ---[ end trace 27edec6ea959a89f ]---
Dez 20 11:31:10 jonasdesktop kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dez 20 11:31:10 jonasdesktop kernel: RSP: 0000:ffffa52f40b07be0 EFLAGS: 00010202
Dez 20 11:31:10 jonasdesktop kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dez 20 11:31:10 jonasdesktop kernel: RDX: ffff9220953e2808 RSI: ffffffffffffffff RDI: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: RBP: ffff9220a927d940 R08: ffffffffc276d530 R09: ffff9220a927d920
Dez 20 11:31:10 jonasdesktop kernel: R10: ffffffffc13b8820 R11: ffff9220d01ab808 R12: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: R13: 0000000000000000 R14: ffff9220a927daa8 R15: ffff9220a927dbb0
Dez 20 11:31:10 jonasdesktop kernel: FS:  0000000000000000(0000) GS:ffff9220d5c00000(0000) knlGS:0000000000000000
Dez 20 11:31:10 jonasdesktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 20 11:31:10 jonasdesktop kernel: CR2: 0000000000000020 CR3: 000000064876c005 CR4: 00000000003706f0
Dez 20 11:31:10 jonasdesktop kernel: BUG: kernel NULL pointer dereference, address: 0000000000000930
Dez 20 11:31:10 jonasdesktop kernel: #PF: supervisor write access in kernel mode
Dez 20 11:31:10 jonasdesktop kernel: #PF: error_code(0x0002) - not-present page
Dez 20 11:31:10 jonasdesktop kernel: PGD 800000064ff43067 P4D 800000064ff43067 PUD 0
Dez 20 11:31:10 jonasdesktop kernel: Oops: 0002 [#2] PREEMPT SMP PTI
Dez 20 11:31:10 jonasdesktop kernel: CPU: 0 PID: 571 Comm: irq/127-nvidia Tainted: P      D    OE     5.9.14-arch1-1 #1
Dez 20 11:31:10 jonasdesktop kernel: Hardware name: MSI MS-7982/B150M PRO-VDH (MS-7982), BIOS 3.H0 07/10/2018
Dez 20 11:31:10 jonasdesktop kernel: RIP: 0010:mutex_lock+0x10/0x20
Dez 20 11:31:10 jonasdesktop kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 61 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Dez 20 11:31:10 jonasdesktop kernel: RSP: 0000:ffffa52f40b07e30 EFLAGS: 00010246
Dez 20 11:31:10 jonasdesktop kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Dez 20 11:31:10 jonasdesktop kernel: RDX: ffff9220cd2b0000 RSI: 0000000000000000 RDI: 0000000000000930
Dez 20 11:31:10 jonasdesktop kernel: RBP: 0000000000000930 R08: 000000000000000f R09: 0000000000000000
Dez 20 11:31:10 jonasdesktop kernel: R10: ffff9220a9c5d800 R11: ffffa52f40b07801 R12: ffff9220cd2b07cc
Dez 20 11:31:10 jonasdesktop kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffff9220cd2b0000
Dez 20 11:31:10 jonasdesktop kernel: FS:  0000000000000000(0000) GS:ffff9220d5c00000(0000) knlGS:0000000000000000
Dez 20 11:31:10 jonasdesktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 20 11:31:10 jonasdesktop kernel: CR2: 0000000000000930 CR3: 000000064876c005 CR4: 00000000003706f0
Dez 20 11:31:10 jonasdesktop kernel: Call Trace:
Dez 20 11:31:10 jonasdesktop kernel:  perf_event_exit_task+0x30/0x440
Dez 20 11:31:10 jonasdesktop kernel:  ? put_cpu_partial+0x92/0x140
Dez 20 11:31:10 jonasdesktop kernel:  ? kfree+0x40f/0x440
Dez 20 11:31:10 jonasdesktop kernel:  do_exit+0x37f/0xaa0
Dez 20 11:31:10 jonasdesktop kernel:  ? task_work_run+0x5c/0x90
Dez 20 11:31:10 jonasdesktop kernel:  ? do_exit+0x36f/0xaa0
Dez 20 11:31:10 jonasdesktop kernel:  ? kthread+0x142/0x160
Dez 20 11:31:10 jonasdesktop kernel:  ? rewind_stack_do_exit+0x17/0x17
Dez 20 11:31:10 jonasdesktop kernel: Modules linked in: rfcomm veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp>
Dez 20 11:31:10 jonasdesktop kernel:  mdio_devres glue_helper rapl snd_hda_core ecdh_generic intel_cstate of_mdio fixed_phy intel_uncore snd_hwdep rfkill pcspkr drm_kms_helper ecc i2c_i801 libphy i2c_smbus snd_pcm cec tpm_crb intel_lpss_pci snd_timer rc_core snd sysco>
Dez 20 11:31:10 jonasdesktop kernel: CR2: 0000000000000930
Dez 20 11:31:10 jonasdesktop kernel: ---[ end trace 27edec6ea959a8a0 ]---
Dez 20 11:31:10 jonasdesktop kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dez 20 11:31:10 jonasdesktop kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dez 20 11:31:10 jonasdesktop kernel: RSP: 0000:ffffa52f40b07be0 EFLAGS: 00010202
Dez 20 11:31:10 jonasdesktop kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dez 20 11:31:10 jonasdesktop kernel: RDX: ffff9220953e2808 RSI: ffffffffffffffff RDI: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: RBP: ffff9220a927d940 R08: ffffffffc276d530 R09: ffff9220a927d920
Dez 20 11:31:10 jonasdesktop kernel: R10: ffffffffc13b8820 R11: ffff9220d01ab808 R12: 0000000000000020
Dez 20 11:31:10 jonasdesktop kernel: R13: 0000000000000000 R14: ffff9220a927daa8 R15: ffff9220a927dbb0
Dez 20 11:31:10 jonasdesktop kernel: FS:  0000000000000000(0000) GS:ffff9220d5c00000(0000) knlGS:0000000000000000
Dez 20 11:31:10 jonasdesktop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dez 20 11:31:10 jonasdesktop kernel: CR2: 0000000000000930 CR3: 000000064876c005 CR4: 00000000003706f0
Dez 20 11:31:10 jonasdesktop kernel: Fixing recursive fault but reboot is needed!
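
A side note on the `Code:` lines in these reports: the byte wrapped in `<..>` is the first byte of the faulting instruction, so the line can be split there to see what was executing. Below is a short Python sketch of that (the function name is my own). Decoding the bytes into instructions still needs a disassembler; the comments in the note after the code are my own reading of the encodings.

```python
def parse_code_line(line):
    """Split an oops 'Code:' line into (bytes_before_fault, bytes_from_fault).

    The token wrapped in <..> marks the first byte of the faulting instruction.
    """
    tokens = line.split("Code:", 1)[1].split()
    fault_index = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    raw = bytes.fromhex("".join(tok.strip("<>") for tok in tokens))
    return raw[:fault_index], raw[fault_index:]


# The Code: line from the first oops in the log above.
code_line = ("kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf "
             "66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff "
             "74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 "
             "48 85 d2")
before, at_fault = parse_code_line(code_line)
print("bytes before the fault:", len(before))
print("faulting instruction starts with:", at_fault[:3].hex(" "))
```

The faulting bytes `48 8b 17` encode `mov (%rdi),%rdx`, and the `48 85 ff 74 57` right before them is a `test %rdi,%rdi` followed by `je`, i.e. a check for a NULL pointer. With RDI = 0x20 that check passes, and the load then faults at address 0x20. This report and the earlier 5.9.9 one show the same RIP (`_nv027527rm+0x9/0x90`) and identical code bytes, which suggests both machines are hitting the same fault.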

I am also attaching the bug report log, but the error message is not included since I created the file after the reboot and the script seems to only include the events since last boot.
nvidia-bug-report.log.gz (297.2 KB)

EDIT: Code block contained wrong error message. (Just noticed that I have many occurrences of the NVIDIA driver causing null pointer dereference errors in my log; the oldest one is from October 18th.)
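
Because the bug report script only covers the current boot, the oops from a crashed session usually has to come from the journal instead, e.g. `journalctl -k -b -1` for the kernel messages of the previous boot (this needs a persistent journal). To get an overview of how often the NULL dereference shows up, the journal text can be scanned for the oops header. This is a minimal Python sketch; the function names are mine and it assumes journalctl's default short output format as seen in the logs in this thread.

```python
import re
from collections import Counter

OOPS_RE = re.compile(
    r"kernel: BUG: kernel NULL pointer dereference, address: ([0-9a-f]+)")


def find_null_derefs(lines):
    """Yield (timestamp, fault_address) for each NULL-dereference header line."""
    for line in lines:
        match = OOPS_RE.search(line)
        if match:
            # The default short format starts with 'Mon DD HH:MM:SS'.
            stamp = " ".join(line.split()[:3])
            yield stamp, int(match.group(1), 16)


def count_by_day(lines):
    """Count NULL-dereference headers per 'Mon DD' day."""
    return Counter(stamp.rsplit(" ", 1)[0]
                   for stamp, _ in find_null_derefs(lines))


# Two header lines taken from the log above.
sample = [
    "Dez 20 11:31:10 jonasdesktop kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020",
    "Dez 20 11:31:10 jonasdesktop kernel: #PF: supervisor read access in kernel mode",
    "Dez 20 11:31:10 jonasdesktop kernel: BUG: kernel NULL pointer dereference, address: 0000000000000930",
]
for stamp, address in find_null_derefs(sample):
    print(stamp, hex(address))
print(dict(count_by_day(sample)))
```

Note that a single crash typically logs two headers, the original fault and the follow-on one during thread exit, so the count per day is an upper bound on the number of crashes.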