Xorg nvidia module cpu utilization is high

Hi -

X is taking up a larger amount of CPU than normal. It usually sits around 50% (as seen in top), and will sometimes cap out at 100%. When it does so, keyboard repeats are slow (and maybe other things). Once I alt-tab, the utilization goes back to 50% or so and the keyboard repeats are fine. I need to repeat this process every 30 seconds or so.

From https://bugs.gentoo.org/633328.

I noticed this originally with 384.90. I still have the same problem on 387.22.

perf top suggests the hotspot is nvidia_drv.so at 0x86e4a and nearby locations:

.
    21.78%  nvidia_drv.so           <li> 0x0000000000086e4a
     6.11%  nvidia_drv.so           <li> 0x000000000008af5c
     4.70%  nvidia_drv.so           <li> 0x0000000000086e56

It looks like a busywait (down below. extra ASM for context. this is from 384.90):

.
   86e30:       48 85 f6                test   %rsi,%rsi
   86e33:       74 28                   je     86e5d <nvidiaAddDrawableHandler@@Base+0xbf0d>
   86e35:       48 8b 46 08             mov    0x8(%rsi),%rax
   86e39:       48 85 c0                test   %rax,%rax
   86e3c:       74 1d                   je     86e5b <nvidiaAddDrawableHandler@@Base+0xbf0b>
   86e3e:       80 78 20 00             cmpb   $0x0,0x20(%rax)
   86e42:       75 0e                   jne    86e52 <nvidiaAddDrawableHandler@@Base+0xbf02>
   86e44:       eb 1c                   jmp    86e62 <nvidiaAddDrawableHandler@@Base+0xbf12>


hot loop:

   86e46:       80 78 20 00             cmpb   $0x0,0x20(%rax)
   86e4a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
   86e50:       74 09                   je     86e5b <nvidiaAddDrawableHandler@@Base+0xbf0b>
   86e52:       48 8b 40 08             mov    0x8(%rax),%rax
   86e56:       48 85 c0                test   %rax,%rax
   86e59:       75 eb                   jne    86e46 <nvidiaAddDrawableHandler@@Base+0xbef6>


   86e5b:       f3 c3                   repz retq
   86e5d:       48 8b 07                mov    (%rdi),%rax
   86e60:       eb d7                   jmp    86e39 <nvidiaAddDrawableHandler@@Base+0xbee9>
   86e62:       f3 c3                   repz retq

Thanks!
nvidia-bug-report.log.gz (48.9 KB)

You should really remove your xorg.conf, that’s a mess.

The X driver polls the GPU when it’s waiting for things that shouldn’t take a long time to complete. If you’re seeing a significant amount of CPU time spent in there, then

a) the X server is processing some kind of request or event that requires waiting for the GPU, and
b) the GPU is being unusually slow at processing whatever it is.

Your bug report doesn’t have any output from nvidia-smi because it looks like it’s not installed. Can you please install it and see if it lists any processes other than the X server using a significant amount of GPU resources? Since you seem adept at profiling and disassembling the X driver, you could also try getting a backtrace from the X server to see what it was doing when it called into the driver.

Thanks for the response.

Here’s the SMI output. That third process is chrome.

.
# /opt/bin/nvidia-smi
Wed Nov  8 20:07:44 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.22                 Driver Version: 387.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950     Off  | 00000000:01:00.0  On |                  N/A |
| 33%   34C    P5    12W /  90W |    188MiB /  1996MiB |     33%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3855      G   /usr/bin/X                                    96MiB |
|    0      4062      G   /usr/lib64/firefox/firefox                     1MiB |
|    0      4874      G   ...-token=DB1DFF97AA93937B0E6F44FBF809F203    85MiB |
+-----------------------------------------------------------------------------+

From the pmon, X pops up higher on sm utilization, if I’m reading this correctly.

.
# /opt/bin/nvidia-smi pmon    
# gpu     pid  type    sm   mem   enc   dec   command
# Idx       #   C/G     %     %     %     %   name
    0    3855     G    48     3     0     0   X              
    0    4062     G     0     0     0     0   firefox        
    0    4874     G     0     0     0     0   chrome --type=g
    0    3855     G    43     2     0     0   X              
    0    4062     G     0     0     0     0   firefox        
    0    4874     G     4     0     0     0   chrome --type=g
    0    3855     G    42     3     0     0   X              
    0    4062     G     0     0     0     0   firefox        
    0    4874     G     8     0     0     0   chrome --type=g
    0    3855     G    44     3     0     0   X              
    0    4062     G     0     0     0     0   firefox        
    0    4874     G     4     0     0     0   chrome --type=g
    0    3855     G    51     3     0     0   X              
    0    4062     G     0     0     0     0   firefox        
    0    4874     G     0     0     0     0   chrome --type=g
    0    3855     G    49     3     0     0   X              
    0    4062     G     0     0     0     0   firefox        
    0    4874     G     3     0     0     0   chrome --type=g

I tried to get perf record to give me a backtrace that would point into X. The frame pointer and dwarf method didn’t work, most likely since it can’t backtrace out of the nvidia_drv.so. I tried the lbr method, and it gave me some info, though I’m not 100% on whether I got the right info or not.

Here’s the relevant snippets from perf record -g --call-graph lbr -p X_PID for 10 seconds, followed by perf report -g caller.

.
    79.24%     0.00%  X            nvidia_drv.so          <li> 0xffff802ee2e0ae12
            |
            ---0x4be12
               |
               |--57.14%--0x985d7
               |          |
               |          |--55.32%--0x980a6
               |          |          0x83000
               |          |
               |           --0.82%--0x97f70
               |
                --21.46%--0x9856b
                          |
                           --20.70%--0x98391
                                     |
                                     |--19.55%--0x87170
                                     |
                                      --0.58%--0x8731d
                                                |
                                                 --0.56%--0x98cf1

    78.76%     0.09%  X            Xorg                   <li> Dispatch
            |
             --78.67%--Dispatch
                       |
                       |--69.51%--ProcCreatePixmap
                       |          |
                       |           --68.32%--0x6011a8
                       |                     |
                       |                      --68.12%--0x5ff59a
                       |                                |
                       |                                 --68.09%--0x4be12
                       |                                           |

It looks like X is calling Dispatch->ProcCreatePixmap. Then we lose it for a bit, and we see 0x4be12. That’s the address that pops up in nvidia_drv.so. I haven’t tried the lbr method before, so I might be interpreting this data the wrong way.

Does ProcCreatePixmap ring any bells, especially something that might call into nvidiaAddDrawableHandler? I see a couple function pointers in there, which also lines up with what I see in xorg’s assembly (maybe).

Thanks

nvidiaAddDrawableHandler is a red herring, it’s just the nearest symbol so everything gets reported as an offset from that.

The fact that the server is calling ProcCreatePixmap a lot means that some client is sending that request frequently. My guess is Chrome. Does the Chrome task manager (press shift+escape to open it) show any particular tabs open that are using a lot of CPU time?

If that doesn’t pan out, I would suggest using xlsclients and then start killing clients one by one until you find the one that’s causing the high Xorg load.