Bug report: 455.23.04 - Kernel Panic due to NULL pointer dereference

Enterprise customers mostly use NVIDIA for CUDA-related workflows (irrelevant for this issue as CUDA doesn’t even need the Xorg server to be running) or for CAD (unlikely to be affected as most people, including me, only get this issue while using Google Chrome).

I welcome you to inspect: [FreeDesktop Bugzilla] and [kernel bugzilla].

Never said it was bug free at any point. Please don’t be so presumptuous and put words in my mouth to further your own argument. My point was that AMD at least has open sourced drivers so in cases like this where the dev doesn’t fix it the community can. Does that always happen or even often? No, ofc not. But at least there are drivers. Please let me remind everyone here again that this page exists.
https://nouveau.freedesktop.org/wiki/PowerManagement

and this lovely warning on the arch wiki
**Warning:** The support for reclocking is highly experimental. Manually setting the power state may hang your system, cause corruption or overheat your card.

Fills me with confidence that I’m a valued customer.
We don’t even have power management support. You don’t need to be a genius to see that AMD is clearly superior in open source driver support no matter how buggy its drivers are; at least they exist in some usable form. We literally have no alternative. The Nouveau drivers are absolute garbage and basically unusable, not that the proprietary ones are that much better. Case in point: this bug right now and how Nvidia has handled it.

“Hey why don’t you just use an old version, get a life and stop complaining!!!”

The NVIDIA driver package from the Arch repos is only built for one kernel version, and the DKMS build fails on sufficiently old kernels. At some point this will stop being an nvidia-driver-only problem and turn into a full-blown security risk.

Have a casual browse of the links below:
https://security.archlinux.org/package/linux
https://security.archlinux.org/CVE-2020-16119
→ Hadar Manor reported that by reusing a DCCP socket with an attached dccps_hc_tx_ccid as a listener, in Linux <= 5.9, it will be used after being released, leading to a denial of service or possibly code execution.

Oh, like literally this second. Where the 450.66 build only has support for 5.8.10-arch1-1. Yes, there are patches and workarounds that can be run, but if you are still trying to defend Nvidia at this point you are really grasping at straws and need to reassess your life to see why you are defending a faceless billion $ corporation that has no interest in you other than your money.

Quote:
One of my coworkers added a fix to avoid memory allocations in critical code paths, but it was fairly invasive and considered too risky for the release branch.
However, part of the change can be applied as a patch to existing drivers. While it’s not considered a complete fix, it might be worth trying.

OFC the devs at Nvidia are not dumb, I’m pretty sure they would be smarter than most people myself included. But “considered too risky for the release branch.”, so the last 2+ months of drivers were not?

I feel like such a valued customer right now after being filled with such confidence. Not being told one-to-one over email or phone support like any other normal company would provide, but instead crammed into a forum for developers and getting a message once in a blue moon because we can’t use the normal one. I made an account just to voice my disapproval of this situation; I’m certain there are lurkers without an account following this thread.

You cannot imagine how much value is in there and how diligently NVIDIA attends to their needs.

I can, and that is precisely why I brought it up, thinking how incredibly stupid it is for them to do such a thing.

Anyhow, I’ve researched this a bit more and you are correct.
@abelits you may also be interested to know: datacenter customers have an older driver. That is assuming that they actually do run these drivers and are not getting the latest unstable version from their package manager like a normal user.

Nvidia has datacentre drivers 450.51.05 released 2020.7.7.

Which begs a different question. Are we just guinea pigs for the data center customers? If not, why don’t we get a stable release branch like them when it’s the same damn kernel, and instead have had an unstable mainline driver for 2+ months now? There is no logical reason for this, which implies that this is an artificial limitation imposed by NVIDIA for whatever reason. Conspiracy theories aside, this is a real issue that still has not been addressed. The driver packages are still in the repos and any unlucky, unknowing customers will experience a hard lock after their driver update and will lose data like everyone else here already has.

My point was that AMD at least has open sourced drivers so in cases like this where the dev doesn’t fix it the community can. Does that always happen or even often? No, ofc not. But at least there are drivers.

Graphics drivers and power management are currently the most complex areas of software development. Open Source drivers mean crap when only full-time talented developers can solve issues and that’s exactly what’s happening with Open Source Intel and AMD drivers: close to 99.9% of code in them is written by Intel/AMD employees, not your imaginary users. I’m absolutely sure you haven’t submitted a single patch for the said drivers. The f*** you are talking about?!

It’s all just f___ing posturing to vilify NVIDIA and praise AMD. If AMD open source drivers are so f___ing great, what the f___ are you even doing on this website? Go sell your NVIDIA card and enjoy the 5700XT which will be faster and greater.

I feel like such a valued customer right now after being filled with such confidence.

You’re using a God-forsaken OS and expecting too much from it. Stop. If you really cared about your productivity and stability, you wouldn’t be using Linux in the first place. Install Windows 10 LTSC and enjoy the world of high-quality, relatively bug-free software with tons of features and software.

Nvidia has datacentre drivers 450.51.05 released 2020.7.7.

Nope, it’s 450.80.02

Which begs a different question. Are we just guinea pigs for the data center customers?

In some ways we surely are because datacenter customers are normally using proven enterprise Linux kernels, e.g. RHEL, not some mainline crap which no one can vouch for and which normally contains tons of regressions which take months, sometimes years to be resolved.

I mean Linux kernel 5.4 was released with an egregious regression which made booting on a wide range of devices impossible for f’s sake. I refused to use this kernel for 2 months but seeing that no one f___ing cared I went ahead and hammered my laptop to find the regression. Took me 5 hours and over 30 reboots to find it. Again, literally hundreds of thousands of affected devices and no one gave a f___. What kind of quality can you expect from an OS which rejects the whole notion of QA/QC?

You want to really use your hardware and not f*** around? Give up on a cesspool called Linux if you value your time and nerves.

Could we please keep this civilized and on topic. I understand very well how people can be frustrated by different things—and rightfully so—and I’m sure many people here, including me, sympathize with these feelings, at the very least because we are all human beings. But let’s also show respect for one another for the very same reason and not burden other people with our emotions, unless it’s absolutely necessary and unavoidable. In any case, emotions and, for that matter, off-topic discussions, don’t help in fixing bugs—it’s rather the opposite.

I experience the same problem as the people above, with a Quadro P4000 and the current 455 driver series: 455.28, 455.38 and 455.45.01.

The problem however does not occur if I use the 450.80.02 LLB driver from September 30th, 2020.

With the 455 drivers, this happens in a variety of situations:

  • using the GPU accelerated terminal emulator alacritty
  • using the video player mpv (with full GPU hardware decoding and image treatment shaders)
  • using Firefox
  • using 3D graphics software such as Blender
  • performing CUDA computations

I do not have specific steps to reproduce it; it happens at random, in many different situations.

However, I can relate the following:

  • the crashes are less frequent with 455.45.01 than with the earlier 455 drivers
  • with the 455.45.01 driver, the crashes do not necessarily happen early after boot, but when they start, they get more frequent, and only a reboot makes the system usable for some time
  • the first crashes can be recovered from by switching to a virtual console and back to the X server, but after some time, that trick doesn’t work anymore, and a reboot is needed
  • however, the kernel is still running: I can use the SysRq sequences (e.g. SysRq+REISUB) to reboot the system cleanly, and I can ssh into the machine
  • when the trick of switching to a virtual console and back works, the affected programs respond in different ways:
    • mpv needs to be shut down and restarted
    • Firefox usually hangs for a few dozen seconds, and then starts working again. However, sometimes it starts behaving badly
    • alacritty sometimes requires detaching my tmux session and reattaching it in a new alacritty session
    • other programs experience graphical problems, or recover normally

The configuration of my system is as follows:

  • archlinux as a distribution
  • the linux LTS kernel (currently 5.4.79-1-lts)
  • the awesomewm tiling window manager
  • xorg
  • a 1440p display connected to the P4000 via displayport

To those affected by the problem:

  • with the linux LTS kernel, I experience no crashes using the 450.80.02 LLB driver.

To those using arch linux:

  • I use a pacman hook to recompile the 450.80.02 driver every time the kernel is updated, with a modified PKGBUILD derived from the nvidia-lts PKGBUILD (that can be obtained via ‘asp checkout nvidia-lts’). See the arch wiki on how to write pacman hooks and PKGBUILD files.
  • Also, note that only the ‘nvidia-lts’ package has to be recompiled with each new kernel version (which does not take very long on my machine). The packages ‘nvidia-settings’, ‘nvidia-utils’ and ‘opencl-nvidia’ only have to be built once. The kernel specific parts are in ‘nvidia-lts’.
  • However, you need to compile them once, and I did so using a PKGBUILD derived from ‘nvidia-utils’ (asp checkout nvidia-utils). Note that to build the 3 invariant packages, you only need to build ‘nvidia-utils’, as the PKGBUILD builds the 2 other packages at the same time.
  • Finally, to install the 4 packages when you have built them, you need to install them at the same time using a ‘pacman -U’ command for instance, to replace the 4 packages from the distribution without conflicts. After the initial installation, you only need to update (pacman -U) the ‘nvidia-lts’ package.

I do not know if that works with a non LTS linux kernel.

I hope that can help others stuck in that unpleasant situation.

Can you test if the patch from 455.23.04: Page allocation failure in kernel module at random points - #55 by aplattner helps?

Thanks for your report. It helps. I downgraded the driver but not the kernel. I will keep you informed.

Man I don’t get it, why are you so heated on this convo, there is no need for this sort of language when talking to me or anyone else on the internet. Nvidia is not being nice to you and you are still defending them.

I don’t want to drag this convo on any longer than it needs to be, as you are already too heated for any reasonable convo to happen.

Just saying, stuff like this would never fly on Windows. I could not imagine any drivers on Windows forcing a user to keep an unsafe older kernel and people like you still defending them. Like I said before and I’ll say it again, nvidia does not care about anything about you except your money. If any of the devs would like to provide a public statement I’ll be happy to listen, but your crudeness is just uncalled for when I thought I had a well structured and even referenced argument.

If you do work for nvidia then I’m sorry if I offended you in any way, shape or form, but it’s not you that I’m annoyed at. Like I said in my first post, Nvidia treats Linux users like crap and they will continue to. I simply felt like this was not voiced at all in this thread.

I’m not gonna fire back on any of your statements even though they are quite self absorbed and give off a massive “idgaf and neither should you” vibe. You can take this win cus, like @kerberizer has said, it’s uncivil and unneeded. I didn’t feel the need to insult you and you shouldn’t have felt the need to insult me either; reflect on why you felt that and why you felt like saying those things to me. I think it’ll make you a better person in the long run. Don’t mean to offend you or anyone else here.

Again, nvidia like AMD (and any other corp) doesn’t care about anything about you except your money. I’m not an AMD shill like you have tried to paint me, I’m trying to voice my frustration at a company that doesn’t care about the consumer. Enjoy your nvidia GPU and the rest of your day. I’m out.

This patch seems to work for me on Linux 5.9.10-zen1-1-zen for a while. I have had a freeze after leaving my PC playing a YouTube video on Firefox so I don’t think it’s fully stable but definitely more stable than just running 455.45.01.

Hi Yuannan,

Please capture an nvidia bug report once you hit the issue again and share it with us.

We request that others also create and share an nvidia bug report when the issue is triggered.
We already have a few bug reports in this forum, but we need a few more to understand the issue, as we are having trouble reproducing it locally.

Thank you amrits,

I’ll try and set up SSH as some users have reported it still working even after the freeze. I’ll try and get journalctl logs as well if they are written before the hard freeze. Sorry, didn’t mean to just provide an empty report but I literally couldn’t in this case.

It does seem a bit weird that you are having trouble reproducing this. I have some experience in operating systems programming, though not at this level, but from what I can gather and from how the patch has helped, it might be a stack/buffer overflow corrupting some essential call stacks. I wonder if there is a safe and sanitary way to dump the memory as well as the register logs to assist in debugging. The current logs do provide registers, but that’s not much help if you don’t know how execution got there. It would be extremely invasive for the drivers to do this by default, but I’d be happy to run a logging daemon for the purposes of squashing this bug. Again, I’m not sure how good an idea this is or how helpful it would be, considering how invasive it is.

Also don’t know if this is related as I’ve been ignoring it for a while now.
In my music player (Tauon Music Player) there are multiple devices listed under “HDA Nvidia at 0xf6080000 irq 71” despite me setting the pulseaudio profile to off, essentially disabling it. I don’t see any Nvidia devices in pavucontrol or under “pacmd list-sinks” once I have disabled this. Clicking one also selects all of them and in doing so freezes the program until I force kill it. Tauon is fully FOSS afaik, so it might be worth looking into its code as this bug does seem to be triggered by audio/media/vdpau. I don’t know enough to fully comment on the implications but I thought I would bring it up.

I’m sorry if any of my previous posts came off in the wrong way. I do greatly appreciate your work and don’t mean to attack anyone in particular. There are dozens of us using it every day! Dozens! However, I do still feel Nvidia does not treat Linux users with anywhere near the same amount of respect and stability as Windows users.

Thank you :), lmk if there is anything else I can do to help.

This is what I get now before a freeze on 455.45.01-4:

nvidia-modeset/: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 2 PID: 666 Comm: nvidia-modeset/ Tainted: P           OE     5.9.11-arch2-1 #1
Hardware name: ASUS All Series/Z87I-DELUXE, BIOS 1104 05/29/2014
Call Trace:
 dump_stack+0x6b/0x83
 warn_alloc.cold+0x78/0xdc
 ? __alloc_pages_direct_compact+0x140/0x160
 __alloc_pages_slowpath.constprop.0+0xcdd/0xd10
 ? _nv002242kms+0x380/0x6f0 [nvidia_modeset]
 __alloc_pages_nodemask+0x2f2/0x320
 kmalloc_order+0x28/0x80
 kmalloc_order_trace+0x1d/0xb0
 __kmalloc+0x266/0x2a0
 nvkms_alloc+0x20/0x50 [nvidia_modeset]
 _nv002718kms+0x16/0x30 [nvidia_modeset]
 ? _nv002593kms+0x4e/0x1610 [nvidia_modeset]
 ? _nv002426kms+0x40/0x40 [nvidia_modeset]
 ? _nv000550kms+0x365/0x3c0 [nvidia_modeset]
 ? _nv002680kms+0x309/0x3c0 [nvidia_modeset]
 ? _nv002698kms+0x29f/0x540 [nvidia_modeset]
 ? schedule+0x50/0xf0
 ? schedule_timeout+0x12d/0x170
 ? preempt_count_add+0x68/0xa0
 ? _raw_spin_lock_irq+0x1a/0x40
 ? __down_interruptible+0x94/0x100
 ? _nv000528kms+0x71/0x80 [nvidia_modeset]
 ? nvkms_kthread_q_callback+0x7c/0xd0 [nvidia_modeset]
 ? _main_loop+0x83/0x130 [nvidia_modeset]
 ? nvkms_sema_up+0x10/0x10 [nvidia_modeset]
 ? kthread+0x142/0x160
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x22/0x30
Mem-Info:
active_anon:464077 inactive_anon:712988 isolated_anon:0
 active_file:181075 inactive_file:179078 isolated_file:0
 unevictable:40 dirty:25006 writeback:0
 slab_reclaimable:31846 slab_unreclaimable:59060
 mapped:234022 shmem:111527 pagetables:21484 bounce:0
 free:62549 free_pcp:315 free_cma:0
Node 0 active_anon:1856308kB inactive_anon:2851952kB active_file:724300kB inactive_file:716312kB unevictable:160kB isolated(anon):0kB isolated(file):0kB mapped:936088kB dirty:100024kB writeback:0kB shmem:446>
Node 0 DMA free:14872kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:159>
lowmem_reserve[]: 0 3407 7858 7858 7858
Node 0 DMA32 free:95564kB min:95056kB low:102368kB high:109680kB reserved_highatomic:0KB active_anon:789856kB inactive_anon:1303888kB active_file:298044kB inactive_file:378464kB unevictable:16kB writepending>
lowmem_reserve[]: 0 0 4450 4450 4450
Node 0 Normal free:139760kB min:124144kB low:133692kB high:143240kB reserved_highatomic:0KB active_anon:1065936kB inactive_anon:1548064kB active_file:425792kB inactive_file:337976kB unevictable:144kB writepe>
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14872kB
Node 0 DMA32: 8219*4kB (UME) 4378*8kB (UME) 1122*16kB (UME) 311*32kB (UME) 2*64kB (ME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 96060kB
Node 0 Normal: 21713*4kB (UME) 4375*8kB (UME) 1037*16kB (UME) 26*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 139276kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
565730 total pagecache pages
116826 pages in swap cache
Swap cache stats: add 2099453, delete 1982869, find 531471/860277
Free swap  = 3767696kB
Total swap = 6835600kB
2080339 pages RAM
0 pages HighMem/MovableOnly
59797 pages reserved
0 pages hwpoisoned
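
For readers trying to interpret the dump above: `order:4` means the driver asked for 2^4 = 16 physically contiguous pages, which is 64 kB assuming the usual 4 kB page size on x86_64. The per-zone free lists near the end of the dump show how many free blocks of each size remained. The Python sketch below counts the blocks large enough for that request (the helper names `free_blocks` and `blocks_at_least` are made up for illustration). It ignores watermarks and `lowmem_reserve`, so it only hints at fragmentation and does not prove why the allocation failed.

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages (x86_64 default)

def free_blocks(line):
    """Parse one 'Node N ZONE: count*sizekB ...' line into {size_kb: count}."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}

def blocks_at_least(line, order):
    """Count free blocks big enough for an allocation of the given order."""
    need_kb = PAGE_KB << order  # order 4 -> 64 kB
    return sum(n for size, n in free_blocks(line).items() if size >= need_kb)

# Free-list lines copied verbatim from the dump above.
normal = ("Node 0 Normal: 21713*4kB (UME) 4375*8kB (UME) 1037*16kB (UME) "
          "26*32kB (UME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB "
          "0*4096kB = 139276kB")
dma32 = ("Node 0 DMA32: 8219*4kB (UME) 4378*8kB (UME) 1122*16kB (UME) "
         "311*32kB (UME) 2*64kB (ME) 1*128kB (E) 0*256kB 0*512kB 0*1024kB "
         "0*2048kB 0*4096kB = 96060kB")

print(blocks_at_least(normal, 4))  # 0
print(blocks_at_least(dma32, 4))   # 3
```

On this dump the Normal zone has roughly 136 MB free but not a single free block of 64 kB or larger, which would be consistent with memory fragmentation rather than a plain shortage of free memory.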

About a month ago, I posted here that I was going to give the LTS kernel (5.4) a try.

So, I’ve been using it for this one month period, and I haven’t experienced a single crash yet. I’ve been also updating the NVIDIA driver like always (currently using 455.45.01).

But since some people have said they were experiencing the crash even with the LTS kernel, I might have been lucky…

linux 5.9.9.arch1-1
nvidia 455.45.01

Dec 01 15:54:04 hostname kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Dec 01 15:54:04 hostname kernel: #PF: supervisor read access in kernel mode
Dec 01 15:54:04 hostname kernel: #PF: error_code(0x0000) - not-present page
Dec 01 15:54:04 hostname kernel: PGD 80000001e8601067 P4D 80000001e8601067 PUD 0 
Dec 01 15:54:04 hostname kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Dec 01 15:54:04 hostname kernel: CPU: 2 PID: 632 Comm: irq/38-nvidia Tainted: P           OE     5.9.9-arch1-1 #1
Dec 01 15:54:04 hostname kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97M Pro4, BIOS P1.90A 06/25/2015
Dec 01 15:54:04 hostname kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebc00 EFLAGS: 00010202
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 01 15:54:04 hostname kernel: RDX: ffff9de38a28bdc8 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 01 15:54:04 hostname kernel: RBP: ffff9de6186ea9d0 R08: ffffffffc2c1a530 R09: ffff9de6186ea9b0
Dec 01 15:54:04 hostname kernel: R10: ffffffffc1865820 R11: ffff9de614809008 R12: 0000000000000020
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: ffff9de6186eab38 R15: ffff9de6186eac78
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000020 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: Call Trace:
Dec 01 15:54:04 hostname kernel:  ? _nv029950rm+0x1b/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv025474rm+0x18/0x60 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv011691rm+0x13d/0x1c0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv036719rm+0xc3/0x350 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv036718rm+0x5c/0x70 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv011615rm+0x78/0xd0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv011615rm+0x1a/0xd0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv024757rm+0x251/0x3e0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv024706rm+0x25/0x150 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv015453rm+0x9b/0x270 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv026077rm+0x290/0x290 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv027734rm+0x273/0xdc0 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv007566rm+0x155/0x270 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv027742rm+0x8d/0x180 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? _nv000712rm+0xa9/0x200 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? disable_irq_nosync+0x10/0x10
Dec 01 15:54:04 hostname kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Dec 01 15:54:04 hostname kernel:  ? irq_thread_fn+0x20/0x60
Dec 01 15:54:04 hostname kernel:  ? irq_thread+0xf5/0x1a0
Dec 01 15:54:04 hostname kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Dec 01 15:54:04 hostname kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Dec 01 15:54:04 hostname kernel:  ? kthread+0x142/0x160
Dec 01 15:54:04 hostname kernel:  ? __kthread_bind_mask+0x60/0x60
Dec 01 15:54:04 hostname kernel:  ? ret_from_fork+0x22/0x30
Dec 01 15:54:04 hostname kernel: Modules linked in: fuse xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE xt_multiport nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo ip6table_filter ip6_tables xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrac>
Dec 01 15:54:04 hostname kernel:  snd_compress ff_memless input_leds mc joydev ac97_bus snd_hda_core mousedev drm_kms_helper cec snd_hwdep snd_pcm_dmaengine mei_me i2c_nvidia_gpu rc_core e1000e snd_pcm intel_gtt syscopyarea sysfillrect sysimgblt fb_sys_f>
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000020
Dec 01 15:54:04 hostname kernel: ---[ end trace 44dde195cb4c28e5 ]---
Dec 01 15:54:04 hostname kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebc00 EFLAGS: 00010202
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 01 15:54:04 hostname kernel: RDX: ffff9de38a28bdc8 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 01 15:54:04 hostname kernel: RBP: ffff9de6186ea9d0 R08: ffffffffc2c1a530 R09: ffff9de6186ea9b0
Dec 01 15:54:04 hostname kernel: R10: ffffffffc1865820 R11: ffff9de614809008 R12: 0000000000000020
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: ffff9de6186eab38 R15: ffff9de6186eac78
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000020 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: BUG: kernel NULL pointer dereference, address: 0000000000000930
Dec 01 15:54:04 hostname kernel: #PF: supervisor write access in kernel mode
Dec 01 15:54:04 hostname kernel: #PF: error_code(0x0002) - not-present page
Dec 01 15:54:04 hostname kernel: PGD 80000001e8601067 P4D 80000001e8601067 PUD 0 
Dec 01 15:54:04 hostname kernel: Oops: 0002 [#2] PREEMPT SMP PTI
Dec 01 15:54:04 hostname kernel: CPU: 2 PID: 632 Comm: irq/38-nvidia Tainted: P      D    OE     5.9.9-arch1-1 #1
Dec 01 15:54:04 hostname kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97M Pro4, BIOS P1.90A 06/25/2015
Dec 01 15:54:04 hostname kernel: RIP: 0010:mutex_lock+0x10/0x20
Dec 01 15:54:04 hostname kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 61 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebe30 EFLAGS: 00010246
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Dec 01 15:54:04 hostname kernel: RDX: ffff9de6186fdb80 RSI: 0000000000000000 RDI: 0000000000000930
Dec 01 15:54:04 hostname kernel: RBP: 0000000000000930 R08: 000000000000000f R09: 0000000000000000
Dec 01 15:54:04 hostname kernel: R10: ffff9de618685c00 R11: ffffae2c00aeb801 R12: ffff9de6186fe34c
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffff9de6186fdb80
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000930 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: Call Trace:
Dec 01 15:54:04 hostname kernel:  perf_event_exit_task+0x30/0x440
Dec 01 15:54:04 hostname kernel:  ? put_cpu_partial+0x92/0x140
Dec 01 15:54:04 hostname kernel:  ? kfree+0x40f/0x440
Dec 01 15:54:04 hostname kernel:  do_exit+0x37f/0xaa0
Dec 01 15:54:04 hostname kernel:  ? task_work_run+0x5c/0x90
Dec 01 15:54:04 hostname kernel:  ? do_exit+0x36f/0xaa0
Dec 01 15:54:04 hostname kernel:  ? kthread+0x142/0x160
Dec 01 15:54:04 hostname kernel:  ? rewind_stack_do_exit+0x17/0x17
Dec 01 15:54:04 hostname kernel: Modules linked in: fuse xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE xt_multiport nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo ip6table_filter ip6_tables xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrac>
Dec 01 15:54:04 hostname kernel:  snd_compress ff_memless input_leds mc joydev ac97_bus snd_hda_core mousedev drm_kms_helper cec snd_hwdep snd_pcm_dmaengine mei_me i2c_nvidia_gpu rc_core e1000e snd_pcm intel_gtt syscopyarea sysfillrect sysimgblt fb_sys_f>
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000930
Dec 01 15:54:04 hostname kernel: ---[ end trace 44dde195cb4c28e6 ]---
Dec 01 15:54:04 hostname kernel: RIP: 0010:_nv027527rm+0x9/0x90 [nvidia]
Dec 01 15:54:04 hostname kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 01 15:54:04 hostname kernel: RSP: 0018:ffffae2c00aebc00 EFLAGS: 00010202
Dec 01 15:54:04 hostname kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 01 15:54:04 hostname kernel: RDX: ffff9de38a28bdc8 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 01 15:54:04 hostname kernel: RBP: ffff9de6186ea9d0 R08: ffffffffc2c1a530 R09: ffff9de6186ea9b0
Dec 01 15:54:04 hostname kernel: R10: ffffffffc1865820 R11: ffff9de614809008 R12: 0000000000000020
Dec 01 15:54:04 hostname kernel: R13: 0000000000000000 R14: ffff9de6186eab38 R15: ffff9de6186eac78
Dec 01 15:54:04 hostname kernel: FS:  0000000000000000(0000) GS:ffff9de670300000(0000) knlGS:0000000000000000
Dec 01 15:54:04 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 15:54:04 hostname kernel: CR2: 0000000000000930 CR3: 00000001e8e4a006 CR4: 00000000001706e0
Dec 01 15:54:04 hostname kernel: Fixing recursive fault but reboot is needed!
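
For anyone reading the oops above: the `error_code(...)` value on the `#PF` lines is the x86 page-fault error code, and its low bits say what kind of access faulted. The Python sketch below decodes those bits (the function names `decode_pf_error` and `describe` are illustrative only, and only the five lowest bits are handled). The faulting addresses 0x20 and 0x930 are tiny offsets from address zero, which is why the kernel labels both faults as NULL pointer dereferences.

```python
def decode_pf_error(code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 0x01),            # 0 = page not present
        "write": bool(code & 0x02),              # 0 = read, 1 = write
        "user": bool(code & 0x04),               # 0 = supervisor access
        "reserved_bit": bool(code & 0x08),       # reserved bit set in page table
        "instruction_fetch": bool(code & 0x10),  # fault on instruction fetch
    }

def describe(code):
    """Build a short description similar to the kernel's #PF lines."""
    f = decode_pf_error(code)
    who = "user" if f["user"] else "supervisor"
    if f["instruction_fetch"]:
        what = "instruction fetch"
    elif f["write"]:
        what = "write access"
    else:
        what = "read access"
    why = "protection violation" if f["present"] else "not-present page"
    return f"{who} {what}, {why}"

print(describe(0x0000))  # supervisor read access, not-present page
print(describe(0x0002))  # supervisor write access, not-present page
```

The first line matches the fault inside the nvidia module (a read through a near-NULL pointer) and the second matches the follow-up fault in `mutex_lock` while the crashed thread was exiting.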

Hi All,
Please share a complete nvidia bug report captured while the system is in the reproduced state.

As a reminder, a bug report can be collected by running nvidia-bug-report.sh and attaching the generated nvidia-bug-report.log.gz here. You may have to add the --safe-mode argument if the script hangs.
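
If you are collecting the report over SSH, the retry described above can be automated. The following is a minimal Python sketch, not an official tool: the function name `collect_bug_report`, the 300-second timeout, and the injectable `run` parameter are all assumptions made for illustration. Only the script name and the `--safe-mode` flag come from this thread.

```python
import subprocess

def collect_bug_report(timeout_s=300, run=subprocess.run):
    """Run nvidia-bug-report.sh, retrying with --safe-mode if it hangs.

    Returns the command that completed, or None if both attempts timed out.
    The timeout value is an arbitrary assumption; tune it for your machine.
    """
    base = ["nvidia-bug-report.sh"]
    for cmd in (base, base + ["--safe-mode"]):
        try:
            run(cmd, check=True, timeout=timeout_s)
            return cmd
        except subprocess.TimeoutExpired:
            # Treat a timeout as "the script hung" and try the next variant.
            continue
    return None
```

Calling `collect_bug_report()` from a root shell should leave nvidia-bug-report.log.gz in the current directory if either attempt finishes. As noted elsewhere in this thread, the script can hang even with --safe-mode, in which case the function returns None.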

Just to say that after downgrading the driver to 450.80.02 and keeping the kernel “5.9.11-2-MANJARO” I no longer get any crashes.

To be fair, when it was happening to me, the script would hang even after running with --safe-mode. And keep in mind that I had to SSH from my Android phone into my machine in order to run the script, because when this crash happens, it freezes literally everything and you can’t even switch TTYs.

Audio devices list looks normal, player probably lists devices supported by ALSA drivers.

Crash still happening with the patched drivers at 455.23.04: Page allocation failure in kernel module at random points - #63 by michaelmberlinger
Can confirm it happening at least 2 more times with the patched drivers. I just hard crashed with these drivers and could not ssh or go into a TTY.

I recently fully updated, in good faith, to nvidia-dkms 455.45.01-1 on Arch Linux 5.9.12-zen1-1-zen. As of 2020/12/08 it’s been 75 days since the first official report on 2020/09/24 @ 455.23.04: Page allocation failure in kernel module at random points.

This issue is definitely still not fixed and still affecting people. The freezes today are not fully hard and I could go out to a TTY. In fact while typing this message I’ve had it freeze another 2 times both recoverable from a TTY.

I have included 3 bug reports:
0: right after my display froze and I went to TTY2 to run “sudo nvidia-bug-report.sh”
nvidia-bug-report0.log.gz (340.8 KB)

1: while typing up v1 (good thing the website saves posts) of this post I had the screen freeze again. I went to TTY2 to capture this. Then “systemctl restart sddm”
nvidia-bug-report1.log.gz (502.4 KB)

2: Freeze again while typing this current post, I didn’t have to restart sddm but it just worked right after switching back to TTY7 from TTY2.
nvidia-bug-report2.log.gz (555.0 KB)

For some reason, once it does happen it likes to keep happening. I’ve just experienced another 2 soft freezes on top of the 3 previous reports (recovered from TTY2 with no sddm restart) while finishing up this post’s formatting.

nvidia-bug-report3.log.gz (608.5 KB)

Maybe do something with the record $3B gross this quarter (enough to hire 15000*4 devs at 200k per year) and actually fix some stuff?

And to top off my post I’ve had 2 other soft freezes while typing out my napkin maths.
nvidia-bug-report4.log.gz (660.9 KB)

fix. your. drivers.