DRM framebuffer memory corruption during presentation on Optimus (idle system)

I am experiencing massive framebuffer memory corruption of content presented via the Nvidia encoders. (I am using Linux kernel terminology here; see "Kernel Mode Setting (KMS)" in the Linux kernel documentation.)

20220605_160826

As evident from the photo, this is not a text console framebuffer, but the DRM framebuffer that goes through the Nvidia DRM connector.

This is a Dell Inspiron 16 7610, Tiger Lake-H, with an Intel iGPU and an Nvidia RTX 3060 dGPU, where the Nvidia GPU fully controls the encoders of the single USB-C output for presenting on external screens (and where the Intel GPU is connected to the laptop display). Connected to the USB-C output are a 4K screen via DisplayPort and a 2.5K screen (making for a total of three connected, active screens).
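For anyone who wants to double-check which GPU owns which connector on a similar machine, here is a minimal sketch that reads the standard DRM sysfs tree. The function name `list_connectors` is my own. It assumes that connectors appear as `cardN-<name>` directories with a `status` file, and that `cardN/device/driver` is a symlink to the kernel driver. The Nvidia connectors may only show up there when nvidia-drm runs with modesetting enabled; I have not verified that for every driver version.

```python
import os
import re

# Connector directories look like "card1-DP-1"; plain "card1" or "renderD128" are skipped.
CONNECTOR_RE = re.compile(r"^(card\d+)-(.+)$")


def list_connectors(root="/sys/class/drm"):
    """Return (card, driver, connector, status) tuples found under `root`."""
    results = []
    for entry in sorted(os.listdir(root)):
        match = CONNECTOR_RE.match(entry)
        if not match:
            continue
        card, connector = match.groups()

        # "status" normally holds "connected", "disconnected" or "unknown".
        try:
            with open(os.path.join(root, entry, "status")) as fh:
                status = fh.read().strip()
        except OSError:
            status = "unknown"

        # The driver symlink of the owning card names the kernel driver (e.g. i915).
        driver_link = os.path.join(root, card, "device", "driver")
        if os.path.exists(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
        else:
            driver = "unknown"

        results.append((card, driver, connector, status))
    return results


if __name__ == "__main__":
    if os.path.isdir("/sys/class/drm"):
        for card, driver, connector, status in list_connectors():
            print(f"{card} ({driver}): {connector} is {status}")
```

On a setup like mine I would expect the laptop panel to be listed under the Intel card and the external outputs under the Nvidia card, but the exact connector names depend on the driver.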

This framebuffer corruption makes it very hard to actually use the external screen; this corruption appears and disappears at random times, in random sizes, and at random places on the screen connected to the Nvidia GPU.

This happens even on a largely idle system, i.e. X11 is running, with KDE, and I perform very light work with Firefox or Chrome. It also happens with “heavier” work, e.g. scrolling in text editors.

One peculiar way of provoking the problem is slowly moving a window of Visual Studio Code (which is a Chrome / Electron app with GPU acceleration) from the laptop screen towards the external screen, then wiggling it a little bit.

This system is configured such that the Intel GPU is primary, while the Nvidia GPU is in the (default) PRIME offload configuration as shipped by rpmfusion for Fedora 36 (driver version 510; I will be updating to 515 once it becomes available).

I have attached the Nvidia bug report script output. In there, please note

  • the very aggressive X.org logging for maximum detail (nothing suggests erratic conditions, though)
  • no hacks or other unusual settings in the configuration

It might be worthwhile pointing out that there is considerable screen real estate present: 3072 x 1920 on the laptop, 3840 x 2160 + 2560 x 1440 externally, all at 60 Hz.
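To put a rough number on that, here is a back-of-the-envelope sketch of the pixel counts and raw scanout rate for the three resolutions above. The 4 bytes per pixel figure is an assumption on my part (a typical 32-bit format), not something I have confirmed for this setup, and the result ignores compositing, copies between the GPUs, and blanking intervals.

```python
# Resolutions and refresh rate as listed above: (width, height, refresh in Hz).
SCREENS = {
    "laptop (Intel)": (3072, 1920, 60),
    "external 4K (Nvidia)": (3840, 2160, 60),
    "external 2.5K (Nvidia)": (2560, 1440, 60),
}

# Assumption: 32-bit pixels; the real scanout format may differ.
BYTES_PER_PIXEL = 4


def pixels(width, height):
    """Number of pixels in one frame."""
    return width * height


def scanout_bytes_per_second(width, height, refresh_hz,
                             bytes_per_pixel=BYTES_PER_PIXEL):
    """Raw bytes read per second to scan out one screen."""
    return pixels(width, height) * bytes_per_pixel * refresh_hz


total_pixels = sum(pixels(w, h) for w, h, _ in SCREENS.values())
total_rate = sum(scanout_bytes_per_second(w, h, hz)
                 for w, h, hz in SCREENS.values())
external_rate = sum(scanout_bytes_per_second(w, h, hz)
                    for name, (w, h, hz) in SCREENS.items()
                    if "Nvidia" in name)

if __name__ == "__main__":
    print(f"total pixels:        {total_pixels:,}")
    print(f"total scanout:       {total_rate / 1e9:.2f} GB/s")
    print(f"external (Nvidia):   {external_rate / 1e9:.2f} GB/s")
```

Under that assumption this comes to about 17.9 million pixels and roughly 4.3 GB/s of raw scanout across all three screens, of which about 2.9 GB/s is for the two external screens.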

From a naive point of view, it would appear as if the corruption is due to bad synchronization; I wouldn’t rule out that some kind of pressure exposes timing issues / CPU races / data races that wouldn’t otherwise be seen.

nvidia-bug-report.log.gz (1.3 MB)