Nvidia X11 driver busy-polls kernel on clock_gettime in a tight loop

On my system, the Nvidia 515 release X11 driver keeps polling the Linux kernel for clock_gettime (through libc) in a tight busy loop:

On an otherwise idle system, this consumes up to 40% CPU in the Xorg process.
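For readers who cannot see the flamegraph: the hot path has the shape of a classic bounded busy wait. The sketch below is purely illustrative - all names are hypothetical and the actual driver code is closed source - it only shows the kind of loop that produces a profile dominated by clock_gettime.

#include <stdbool.h>
#include <time.h>

/* Illustrative sketch only (hypothetical names, not the actual driver code):
 * spin on some condition, re-reading the clock on every iteration to enforce
 * a timeout. Nothing sleeps or yields, so a CPU core stays busy for the
 * entire wait and the profile fills up with clock_gettime samples. */
static bool spin_wait(bool (*condition_met)(void), long long timeout_ns)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (!condition_met()) {
        clock_gettime(CLOCK_MONOTONIC, &now);   /* one call per loop iteration */
        long long elapsed_ns = (now.tv_sec - start.tv_sec) * 1000000000LL
                             + (now.tv_nsec - start.tv_nsec);
        if (elapsed_ns > timeout_ns)
            return false;                       /* give up after the timeout */
        /* no sleep, no yield: a pure busy wait */
    }
    return true;
}

If the condition being polled takes longer to become true (for example because the GPU is clocked down), a loop of this shape simply runs more iterations - and issues more clock_gettime calls - which matches the observations further down in this thread.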

This problem does not seem to be new; it has been reported before in "High CPU usage on xorg when the external monitor is plugged in", but the follow-up postings watered down the very solid technical opening post. I am therefore creating this more specific, fresh topic.

What you see above is a rendering of the Xorg process created by GitHub - janestreet/magic-trace: magic-trace collects and displays high-resolution traces of what a process is doing. This is essentially perf, but with a friendlier presentation in the form of a flamegraph (sudo magic-trace attach -pid $PID_OF_XORG; the output can then be rendered in multiple ways - I run a strictly local server for this, which is simple to set up).

This rendition shows that on my otherwise idle Tiger Lake-H 8-core system, almost all of the consumed CPU time is spent in Xorg, and within Xorg in what appears to be a tight loop inside the (closed-source) nvidia_drv, the Nvidia X11 driver module.

This RTX 3060 Optimus notebook is running Fedora 36 (latest kernel, latest Mesa, latest KDE, latest X) with the GPUs set up as follows:

  • Intel GPU serves (only) the internal display (and HDMI)
  • Nvidia GPU serves (only) the USB-C output (via DisplayPort Alternate Mode to a DisplayPort display)
  • Intel is the primary GPU, Nvidia is offloading

Notebook screen: 3072x1920 @ 60.14 Hz
External screen: 3840x2160 @ 60.00 Hz (no G-Sync)

This problem is reduced a little by forcing the GPU to prefer maximum performance; that clocks up everything, although I only seem to be consuming bandwidth for memory transfers to the Nvidia-controlled connector / crtc. And clocking up everything creates loads of heat and noise.

A good way to demonstrate this problem on my notebook is to run

sudo nvidia-smi --reset-memory-clocks
sudo nvidia-smi --lock-memory-clocks=100,100

to force the GPU into a lower power state (don't worry about the 100,100 - apparently nvidia-smi will auto-correct that). Things do get a little bit laggy, but suddenly the CPU consumption is even higher, 50+%, with many more clock_gettime calls (still in a busy loop).

So, looking at this from the outside, there is a strong correlation between low GPU memory clock speed and (very high) CPU utilization from busy-polling the Linux kernel's clock_gettime.

But why, and how can this be stopped please?

I only want the Nvidia driver to show the framebuffer content it was handed (crtc → port); in PRIME offload it does not produce anything on top of that unless I tell it to.


Now that the driver code is open source, maybe you could compile it with debug info enabled and re-run your experiment? That would probably allow you to pinpoint the root cause of this issue. NVIDIA doesn't seem to be very interested in investigating this one…

The Nvidia graphics stack for X11 comprises two components (a gross oversimplification): the Linux kernel graphics interface (nvidia_drm / nvidia_modeset, open-sourced at GitHub - NVIDIA/open-gpu-kernel-modules: NVIDIA Linux open GPU kernel module source) and the X11 graphics driver itself. The X11 graphics driver, "nvidia", remains closed (and is huge):

lsmod | grep nvidia

nvidia_drm             73728  0
nvidia_modeset       1146880  1 nvidia_drm
nvidia_uvm           1286144  0
nvidia              40849408  74 nvidia_uvm,nvidia_modeset

The second column is the module size in bytes. All of that polling and calling of clock_gettime happens in loops inside nvidia (the X11 component).

I believe Nvidia has no plans to open up the X11 part (and reviewing something that ends up as a 40 MB binary for open-sourcing is not exactly something that would offer terrific business value).

In all honesty, I do not have the energy to trace through compiled code that has been stripped of almost all supportive metadata (ELF symbols, debug symbols, …) - and that lack of metadata also means that any tooling one might employ has its hands tied behind its back.

The only recourse is for someone with knowledge of the overall driver architecture and with access to the source code to root cause this. I fear the end result might be a simple "works as designed" (where the design constraint might stem from non-technical issues).

I managed to reduce the CPU usage by forcing the system to use only the Nvidia GPU, following the instructions at:

The X11 process went from a constant 20%-40% CPU to 0%, and the Nvidia GPU is fine.

Great analysis! Obviously, the driver is polling for something in a busy loop, checking the clock on each iteration for a timeout. Whatever it is waiting for probably takes longer at low clock speeds - is that why the CPU consumption is higher when the GPU clock speed is low? What could it be? A memory transfer between system memory and the GPU?


In a totally different context: I've noticed that the NVIDIA driver hammers clock_gettime when I'm running a MediaPipe task. I discovered this via libprofiler and pprof. See https://github.com/google/mediapipe/issues/4650#issuecomment-1694154373 for more details. I don't have much insight to offer as I'm not a serious dev - my guess is that there are some (unintentional?) spinlocks lying around.

Great analysis! The reason is the same: the driver waits for a memory transfer between system memory and the GPU in a busy loop, calling clock_gettime to check for a timeout. NVIDIA engineers could fix it if they paid attention to this forum. They don't.

Thanks @vaihoheso - I think you are right about where the issue occurs.

I understand the frustration; however, I would also say I am really impressed with how well everything works these days! I play lots of (modern) games without issue on my laptop with Ubuntu and an RTX 2060 (mostly thanks to the Proton/Wine devs, of course, but also NVIDIA).

I hope they eventually find a way to solve this issue via some kind of yield + interrupt. I'm sure they have to triage issues depending on how many users are affected and how troubling the issue is.

Yes. According to TeamBlind, NVIDIA engineers spend most of their time implementing new features and writing demos for Jensen's talks. One has to create a huge amount of noise to get a bug fixed. Also, it can take several years of back and forth.

If it were me, I would have two modes of waiting: for low-performance P-states, a condition variable would do; for high-performance P-states, a busy loop would be better.
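To make that concrete, here is a minimal sketch of such a two-mode wait using plain pthreads. Everything here is hypothetical - is_high_perf_pstate(), wait_ctx and mark_complete() are invented for the illustration and are not NVIDIA APIs:

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical two-mode wait: spin only when the GPU is in a high-performance
 * P-state (the wait is expected to be short), otherwise block on a condition
 * variable so the CPU can sleep until the work is signalled as done. */
struct wait_ctx {
    pthread_mutex_t lock;
    pthread_cond_t  done_cond;
    bool            done;
};

/* Stub standing in for a real P-state query. */
static bool is_high_perf_pstate(void) { return true; }

void wait_for_completion(struct wait_ctx *ctx)
{
    if (is_high_perf_pstate()) {
        /* Short expected wait: a bounded busy loop keeps latency minimal. */
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (;;) {
            pthread_mutex_lock(&ctx->lock);
            bool finished = ctx->done;
            pthread_mutex_unlock(&ctx->lock);
            if (finished)
                return;
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (now.tv_sec - start.tv_sec > 1)
                break;              /* spin took too long, fall back to blocking */
        }
    }
    /* Long expected wait (low P-state) or the spin timed out: sleep on the
     * condition variable instead of burning a CPU core. */
    pthread_mutex_lock(&ctx->lock);
    while (!ctx->done)
        pthread_cond_wait(&ctx->done_cond, &ctx->lock);
    pthread_mutex_unlock(&ctx->lock);
}

/* Called by whatever finishes the work (in a real driver, e.g. a completion
 * interrupt handler). */
void mark_complete(struct wait_ctx *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    ctx->done = true;
    pthread_cond_signal(&ctx->done_cond);
    pthread_mutex_unlock(&ctx->lock);
}

Whether the switch should key off the P-state or off an estimate of the expected wait time is a design detail; the point is only that the low-clock / long-wait case does not need to keep a CPU core spinning.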


According to TeamBlind, NVIDIA engineers spend most of their time implementing new features and writing demos for Jensen's talks.

@vaihoheso That might be true - but on the other hand, I think the closed source is the bigger issue: it prevents easy wins like this from being patched up. And yes, while I am impressed with how well things work, that doesn't mean there isn't room for improvement.

Hi, I've got a repro of this without X11 or Wayland, using kmscube. I'm using the open-source drivers (nvidia-open). Using perf, I can see that libnvidia-eglcore.so.545.29.06 appears to be polling clock_gettime, whereas for the OP it was nvidia_drv.so.

On their GitHub, NVIDIA says the open kernel modules must be used with "user-space NVIDIA GPU driver components". I'm assuming libnvidia-eglcore.so.545.29.06 is part of the closed-source user-space components? Since I'm not using X11, does anyone know whether it's at all possible to fix this issue within the open-source code?
On their GitHub NVIDIA says the open kernel modules must be used with ā€œuser-space NVIDIA GPU driver componentsā€. Iā€™m assuming libnvidia-eglcore.so.545.29.06 is part of the closed-source user-space components? Since Iā€™m not using X11 does anyone know if itā€™s at all possible to fix this issue within the open-source code?