Bug report: 455.23.04 - Kernel Panic due to NULL pointer dereference

This is not an nVidia issue.
This is not an nVidia issue.
This is not an nVidia issue.

It is a confluence of DisplayPort, xHCI, Renesas, browser, configuration, and integration issues.

This isn’t about fixing your specific issue, but rather about this entire thread.

I’ve just read every post, bug report and log extract on this thread.
This is a super easy fix.
Firstly, to clarify:
Linux is not supposed to work out of the box.
That’s a Closed Source Market Standard.
The Open Source End User is “FREE”
to “finish” the Open Source product to a Closed Source Market definition of “state of finish”.

So the near total majority of posts are:

  • nVidia Driver 455.
  • Arch Linux,
  • Kernel 5.8+ to 5.9.1 Lowlatency / Pre-emptive
  • Intel LGA 1155, 1151, 1151v2, 1150
  • kernel NULL pointer dereference, address: 0000000000000020
  • kernel NULL pointer dereference, address: 0000000000000027
  • Chromium and Chromium-based browsers: Chrome, Opera, Falkon
  • Firefox

“Extract from Chromium ArchLinux Wiki”

Hardware video acceleration

  • There is no official support from Chromium or Arch Linux for this feature (Chromium Docs - VA-API), but you may ask for help in the dedicated forum thread.

  • chromium from official repositories is compiled with VA-API support.

  • For proprietary NVIDIA support, installing libva-vdpau-driver-chromium (AUR) or libva-vdpau-driver-vp9-git (AUR) is required.

  • Wayland is not supported.

  • To use VA-API on XWayland, use the --use-gl=egl flag. Currently exhibits choppiness FS#67035. It could be solved by enabling #Native Wayland support.

  • To use VA-API on Xorg, use the --use-gl=desktop flag.

  • Starting in Chromium 86, there will be support for VA-API when using the ANGLE gl renderer. Use the --enable-accelerated-video-decode flag to enable it on an Intel GPU.

BTW, how’s Arch working out for ya!?

If your system isn’t configured and integrated as per The Book, and the above browsers aren’t jury-rigged with workarounds, then this WILL exponentially exacerbate and exploit the poor system integration and configuration, coupled with the lack of support in the kernel or other such issues.

Correct BIOS settings are critical.
Correct kernel parameters are critical.

I also saw multiple posts using PowerSave in some form as well.
This affects the nVidia driver too. It wants to ramp up and is getting choked.
Disable all power management for PCI Express.

The biggest issue is the BIOS DMA buffer / VM / IOMMU and xHCI settings and support.
Is xHCI handover still enabled in the BIOS?
USB 2 / PS/2 legacy support uses VM in the BIOS.

Kernel 5.8
In Arch Linux and Manjaro, the 5.8+ kernel has issues with Renesas USB controllers due to a firmware version check issue.

Kernel EDID patch: 20201203
It removes the Skylake/Kabylake platform detection logic and makes the EDID function work on all platforms. Regardless, with the patch, a kernel oops occurs in the function intel_vgpu_reg_rw_edid in drivers/drm/i915/kvmgt.c.

outtatime

So … the bug began with nvidia driver 450, it is not limited to Arch Linux, kernels 5.4 and 4.19 are also hit, and in some cases the crash happened without any browser (VLC, for example).

As for hardware video acceleration, the crash happened without this feature too (yes, I tried again and again; now I just compile the 435 driver for my kernel and I don’t have any crashes).

I don’t use PowerSave (because of nvidia; that problem has been around for a few years now), and the firmware version problem was solved a few months ago (and to clarify, previous kernels have the same bug with the nvidia driver).

Why didn’t this bug hit nvidia drivers before 450, and why does it still happen without Chrome or anything else using hardware acceleration?

My questions are not really questions; they are just to compare with the post above @abelits

PS: thanks for all the instructive links

Please stop. The kernel crash on dereferencing a NULL pointer in a driver’s function is probably the most conclusive and unambiguous indication that a bug is in the driver.

Most likely because it was introduced in that version.

Because a “not using hardware acceleration” option in one userspace program does not reliably prevent any particular piece of the driver’s functionality from being used, especially on a modern desktop that uses compositing for everything. Also, the problem is probably in the implementation of some basic functionality, most likely a race condition in something very common. The number of calls may affect the likelihood of a crash, but reducing it can’t eliminate the crash entirely.

If a guess made by @generix is correct, and preemption is either the necessary condition or it greatly increases the probability of a crash being triggered, it would strongly indicate a race condition.
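
To illustrate why a race of this kind crashes only some of the time, here is a small, deterministic Python sketch. It does not model the NVIDIA driver in any way; the two "threads" and their steps (check, use, work, free) are hypothetical. It enumerates every interleaving of a check-then-use reader and a writer that clears the shared pointer, and counts how many orderings end in a NULL dereference.

```python
from itertools import combinations

# Hypothetical steps, not taken from any real driver:
# the reader checks a shared pointer and later uses it,
# the writer does some work and then clears the pointer.
READER = ["check", "use"]
WRITER = ["work", "free"]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]


def crashes(schedule):
    """Return True if this ordering dereferences a cleared pointer."""
    pointer_valid = True
    checked_ok = False
    for step in schedule:
        if step == "check":
            checked_ok = pointer_valid
        elif step == "free":
            pointer_valid = False
        elif step == "use":
            # The reader trusts its earlier check; if the pointer was
            # cleared in between, this is the NULL dereference.
            if checked_ok and not pointer_valid:
                return True
    return False


schedules = list(interleavings(READER, WRITER))
bad = [s for s in schedules if crashes(s)]
print(f"{len(bad)} of {len(schedules)} orderings crash")
```

Only the orderings in which the writer's free lands between the reader's check and use go wrong. Anything that makes a switch at that point more likely, such as kernel preemption, would raise the odds of hitting them, which fits the guess above.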

Following recent posts here I have been testing today running kernel 5.10.18 compiled with CONFIG_PREEMPT_NONE=y set, otherwise default config, and the latest 460.39 driver.
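
For anyone wanting to confirm which preemption model their own kernel was built with, a minimal Python sketch follows. It assumes the kernel config is readable from /proc/config.gz (which needs CONFIG_IKCONFIG_PROC) or from /boot/config-<release>; both locations depend on the distribution, and the function names are made up for this example. Newer kernels with dynamic preemption use additional options that this sketch does not cover.

```python
import gzip
import os
import platform

# The three classic preemption models selectable at build time.
PREEMPT_OPTIONS = ("CONFIG_PREEMPT_NONE", "CONFIG_PREEMPT_VOLUNTARY", "CONFIG_PREEMPT")


def preemption_model(config_text):
    """Return the name of the preemption option set to 'y', or None."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_PREEMPT is not set" are comments.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value == "y":
            enabled.add(key)
    for option in PREEMPT_OPTIONS:
        if option in enabled:
            return option
    return None


def read_kernel_config():
    """Try common config locations; return the text or None if not found."""
    candidates = ["/proc/config.gz", f"/boot/config-{platform.release()}"]
    for path in candidates:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            return handle.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    print(preemption_model(text) if text else "kernel config not found")
```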

So far, in around 6 hours, I have been unable to reproduce the crash whilst watching video in Kodi (for me always the trigger of the crash).

Of course, that is not a reliable test in such a time frame. Having said that, previously I could not get past 3 or 4 days of uptime whilst using Kodi each day before getting the crash. So I will see how it goes and report back if I run into it in the coming days.

We have a fix available for a similar kind of issue. It is included in our latest release, which is available to download at the link below.

Please test with the above driver and share your feedback.

“We have a fix available for a similar kind of issue”

What issue do you refer to as “similar”?

What in particular was done to fix it?

Have installed driver 460.56 on Manjaro. Will give it a few days to see if it fixes the issue and report back.

I installed 460.56 on Manjaro almost immediately after the notification about the post here and am still running it. No freezes yet. Will report here in a few days.

I have reproduced this issue by playing a full-screen video in Kodi on HDMI-0 output.
This time I couldn’t collect logs (ssh didn’t work).

xrandr output:

HDMI-0 connected 2560x1440+5120+0 (normal left inverted right x axis y axis) 608mm x 345mm
   3840x2160     30.00 +  29.97    25.00    23.98  
   2560x1440     59.95* 
   1920x1080     60.00    59.94    50.00    29.97    23.98  
   1680x1050     59.95  
   1600x900      60.00  
   1440x900      59.89  
   1280x1024     75.02    60.02  
   1280x800      59.81  
   1280x720      60.00    59.94    50.00  
   1152x864      75.00  
   1024x768      75.03    70.07    60.00  
   800x600       75.00    72.19    60.32    56.25  
   720x576       50.00  
   720x480       59.94  
   640x480       75.00    72.81    59.94  
DP-0 connected primary 5120x1440+0+0 (normal left inverted right x axis y axis) 1mm x 1mm
   3840x1080     99.96 +  59.97  
   5120x1440    100.00*   59.98  
   2560x1440     59.95  
   2560x1080    100.00    60.00    59.94  
   1920x1080    100.00    60.00    59.94  
   1680x1050     59.95  
   1600x900      60.00  
   1440x900      59.89  
   1280x1024     75.02    60.02  
   1280x800      59.81  
   1280x720      60.00  
   1152x864      75.00  
   1024x768      75.03    70.07    60.00  
   800x600       75.00    72.19    60.32    56.25  
   640x480       75.00    72.81    59.94 

OS: ArchLinux
Nvidia drivers: 460.56
Kernel: 5.11.1-zen1-1-zen kernel (Arch Linux).

No freezes yet.
Using kernel 5.11.1-arch1-1
(obviously with 460.56).

Hi, I use Arch Linux with an Intel i5-3450 CPU and a GTX 650 Ti. I know I am almost at the fucking cut-off line of support for the driver… But I reinstalled my Arch Linux last week after a fuck-up with BTRFS. My PC was running fine, updating every day while also hibernating for the night, until I got an issue with Chrome freezing (it always happens with as many tabs as I have). So I applied more updates and rebooted… Now, with the latest update, I opened Chrome and noticed everything started to fuck up… YouTube was using 80% of the GPU with memory leaking. I soon realised Discord, Steam and all video players were also suffering intensely. It seems the video acceleration issue started with the update… in the last 10 days or so… So of course it was driving me crazy. I downgraded a shitload of video packages and everything’s fine now. All I could find on Google was this thread… I use chaotic-aur’s TKG suite of kernel/Nvidia driver/Mesa and more… So of course when I was reading this thread I realised YOU GUYS FUCKED UP THE PATCH FOR THAT ISSUE… making people who had no issue with the video driver get thrown in with everyone else having issues… Of course it’s an upstream issue. I am fairly sure it’s not TKG, since all they do is reapply the patch from your source that had been working fine for months… Of course I am in dire need of upgrading my GPU; I have a lot of vintage keyboards I plan to trade for PC parts, including a GPU… But idk, I might just go AMD, if they were not also having issues once in a while. I kinda just wanna game too… One of my friends offered me a more recent nvidia GPU, but I am very unsure if I wanna bite the bullet now with that upgrade. Since I have downgraded, I cannot reproduce the issue without upgrading back, but since I am moving on Monday, I need to have my PC working, so I will abstain from doing it.

TL;DR: You guys introduced a video acceleration issue on my GTX 650 Ti running Arch Linux.

For the sake of everything that is, was, will be or might be sacred, instead of this wall of text, post:

  1. Nvidia driver version number.
  2. Kernel version number (or better, the output of uname -a).
  3. Types of failures (graphics distortion, uneven video playback speed, slowdown, high CPU load, graphics or full computer lockup, kernel panic if kernel messages or logs were collected).
  4. Software used.

Right, Nvidia’s recommendations are not very useful and their collection scripts are often not accessible at the time of failure. Nevertheless, please post something that qualifies as a bug report.
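
As a rough sketch of gathering the first two items automatically, the Python below reads the kernel details via platform.uname() and parses the driver version out of /proc/driver/nvidia/version. The regular expression assumes the usual "NVRM version: ... Kernel Module  460.56  ..." layout of that file for the proprietary module, and the function names are invented for this example.

```python
import platform
import re
from pathlib import Path

NVIDIA_VERSION_FILE = Path("/proc/driver/nvidia/version")


def parse_driver_version(text):
    """Extract the driver version from the NVRM line, or return None."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # Assumed layout: "... Kernel Module  <version>  <build date>"
            match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", line)
            if match:
                return match.group(1)
    return None


def report_header():
    """Build the driver and kernel lines of a bug report."""
    if NVIDIA_VERSION_FILE.exists():
        driver = parse_driver_version(NVIDIA_VERSION_FILE.read_text())
    else:
        driver = None
    u = platform.uname()
    return "\n".join([
        f"Nvidia driver: {driver or 'unknown (module not loaded?)'}",
        f"Kernel: {u.system} {u.node} {u.release} {u.version} {u.machine}",
    ])


if __name__ == "__main__":
    print(report_header())
```

The failure types and the software in use (items 3 and 4) still have to be written by hand.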

“Linux HNT-Quad-ROS 5.10.15-120-tkg-bmq #1 TKG SMP PREEMPT Mon, 15 Feb 2021 02:15:43 +0000 x86_64 GNU/Linux”, which is the Ivy Bridge build of tkg-bmq.

Drivers are chaotic-nvidia-dkms-tkg-460.39.6 (the time the update was posted matches when the issue started).

The issue was video acceleration glitching, low frame rate on video, and high CPU load with memory leaking in Chrome when video acceleration was on; I had to disable it for things to be fine.

Software: everything using video acceleration (Chrome, Steam store video, Discord, VLC and other media players).

Downgrading many packages related to the video drivers seems to fix it.

Looks like a problem with 460.39 support of older GPUs. It should be reported separately with this information and last working driver version.

This thread is about a different problem – one that seems to affect all GPUs and causes a kernel panic, is present in 460.39, and might be fixed in 460.56.
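
To check quickly whether a saved kernel log shows the signature discussed in this thread, a small Python sketch like the following could be used. The pattern matches the message text quoted near the top of the thread; the function name and the sample log lines are made up for illustration, and real log text would have to be supplied by the reader (for example from a saved journal of the crashed boot).

```python
import re

# Matches the message text quoted earlier in this thread.
NULL_DEREF = re.compile(r"kernel NULL pointer dereference, address: ([0-9a-fA-F]+)")


def null_deref_addresses(log_text):
    """Return the faulting addresses (as ints) found in a kernel log dump."""
    return [int(m.group(1), 16) for m in NULL_DEREF.finditer(log_text)]


# Made-up sample input, only to show the expected shape of the log.
sample_log = (
    "BUG: kernel NULL pointer dereference, address: 0000000000000020\n"
    "usb 1-1: new high-speed USB device number 2 using xhci_hcd\n"
)
print([hex(a) for a in null_deref_addresses(sample_log)])
```

A match alone does not prove it is the same bug; the call trace that follows the message in the log still needs to be compared.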

Two days in using driver 460.56 and so far no crashes. Won’t count my chickens just yet but it’s looking promising.

Fifth day of using the new fixed driver. Still no crashes, with the PC on almost all day. The bug is fixed, I suppose.
Manjaro, 5.10.18, Nvidia 460.56

I’m jealous. I’ve been reporting nvkms crashdumps in the ‘stable’ driver for over a -year- and you guys got NVidia to fix the issue in less than 5 months!

@amrits @aplattner

I had a second crash since I started using the newest drivers (460.56/5.11.1-arch1-1).

The crash happened when I was away from the keyboard for around 12 minutes.

This time I was able to connect through ssh and collect logs using:

nvidia-bug-report.sh --safe-mode --extra-system-data

nvidia-bug-report.log.gz (91.9 KB)

Just to follow up on my previous post: I have been running the older problematic driver 460.39 with a 5.10 kernel compiled with preemption disabled, and have not had the crash once in almost 7 days of uptime, doing the same activity that was causing a crash every day.

So I would say, from my albeit limited testing, you guys were quite probably correct here.

I am going to try the latest driver now with my normal kernel config with preemption. It looks good so far, based on the lack of reports here, so hopefully they fixed it this time.

I took a peek at your bug report and I don’t think that is the same problem; at least the log looks different from all the others in this thread.