Performance drop to slideshow after ~20 minutes of CSGO play with Vulkan

nvidia-bug-report.log.gz (293.4 KB)

When playing CSGO with Vulkan, performance degrades to a slideshow after about 20 minutes of gameplay.

Running nVIDIA RTX 3080Ti with NVIDIA-Linux-x86_64-515.48.07.
Linux cyberdeath-pc 5.18.0-1-rt11-MANJARO #1 SMP PREEMPT_RT Sat May 28 15:43:17 CEST 2022 x86_64 GNU/Linux

I have attached my bug report to this thread.

Related GitHub bug reports filed with Valve: 2901 & 2891

Seeing a lot of these in my logs:

Jun 15 01:32:01 cyberdeath-pc kernel: BUG: scheduling while atomic: irq/219-s-nvidi/685/0x00000003
Jun 15 01:32:01 cyberdeath-pc kernel: Modules linked in: nvidia_uvm(POE) joydev mousedev snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device mc usbhid intel_rapl_msr asus_nb_wmi eeepc_wmi iTCO_wdt intel_pmc_bxt iTCO_vendor_support>
Jun 15 01:32:01 cyberdeath-pc kernel:  snd_timer cec snd i2c_i801 thunderbolt soundcore i2c_smbus igc mei_me mei drm_kms_helper intel_lpss_pci syscopyarea sysfillrect intel_lpss sysimgblt fb_sys_fops idma64 pmt_telemetry agpgart pmt_cl>
Jun 15 01:32:01 cyberdeath-pc kernel: Preemption disabled at:
Jun 15 01:32:01 cyberdeath-pc kernel: [<0000000000000000>] 0x0
Jun 15 01:32:01 cyberdeath-pc kernel: CPU: 3 PID: 685 Comm: irq/219-s-nvidi Tainted: P        W  OE     5.18.0-1-rt11-MANJARO #1
Jun 15 01:32:01 cyberdeath-pc kernel: Hardware name: ASUS System Product Name/ROG MAXIMUS Z690 HERO, BIOS 1505 05/31/2022
Jun 15 01:32:01 cyberdeath-pc kernel: Call Trace:
Jun 15 01:32:01 cyberdeath-pc kernel:  <TASK>
Jun 15 01:32:01 cyberdeath-pc kernel:  dump_stack_lvl+0x44/0x58
Jun 15 01:32:01 cyberdeath-pc kernel:  __schedule_bug.cold+0x81/0x8e
Jun 15 01:32:01 cyberdeath-pc kernel:  __schedule+0xeb5/0x1240
Jun 15 01:32:01 cyberdeath-pc kernel:  ? push_rt_tasks+0x13/0x20
Jun 15 01:32:01 cyberdeath-pc kernel:  ? raw_spin_rq_unlock+0x17/0x60
Jun 15 01:32:01 cyberdeath-pc kernel:  ? rt_mutex_setprio+0x1be/0x480
Jun 15 01:32:01 cyberdeath-pc kernel:  schedule_rtlock+0x1e/0x40
Jun 15 01:32:01 cyberdeath-pc kernel:  rtlock_slowlock_locked+0x3c8/0xe90
Jun 15 01:32:01 cyberdeath-pc kernel:  rt_spin_lock+0x3f/0x60
Jun 15 01:32:01 cyberdeath-pc kernel:  ___slab_alloc.constprop.0+0x83/0x650
Jun 15 01:32:01 cyberdeath-pc kernel:  ? os_acquire_spinlock+0xe/0x20 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv034974rm+0xc/0x20 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _raw_spin_unlock_irqrestore+0x23/0x60
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv012124rm+0x40/0x90 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv039562rm+0x13/0x60 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv035680rm+0x19/0xb0 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  kmem_cache_alloc_trace+0x6e/0x1c0
Jun 15 01:32:01 cyberdeath-pc kernel:  nv_post_event+0x95/0x140 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  _nv034983rm+0x59/0x80 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv033036rm+0xab/0xc0 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv031255rm+0xf4/0x120 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  nv_post_event+0x95/0x140 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  _nv034983rm+0x59/0x80 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv033036rm+0xab/0xc0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv031255rm+0xf4/0x120 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv022962rm+0x6b/0xb0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv022962rm+0x76/0xb0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv023096rm+0xf47/0x11d0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv023567rm+0x8c/0x1a0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv026078rm+0x5e/0xc0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv010470rm+0x19f/0x310 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv026088rm+0x14f/0x1b0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv000648rm+0x10b/0x140 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? disable_irq_nosync+0x10/0x10
Jun 14 23:59:11 cyberdeath-pc kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_thread_fn+0x1c/0x60
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_thread+0xfd/0x1a0
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_finalize_oneshot.part.0+0xd0/0xd0
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Jun 14 23:59:11 cyberdeath-pc kernel:  ? kthread+0x107/0x130
Jun 14 23:59:11 cyberdeath-pc kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 14 23:59:11 cyberdeath-pc kernel:  ? ret_from_fork+0x1f/0x30
Jun 14 23:59:11 cyberdeath-pc kernel:  </TASK>

Looks like there’s an interrupt storm happening. Did this also happen with the 510 driver? Did you already try reseating the NVIDIA card in its PCIe slot?

@generix Thanks for the prompt reply and my apologies on the delay. I reinstalled Manjaro again with the latest build as of yesterday. It was less buggy but I also learned that I don’t see any interrupt storms until I install the rt kernel. However, when playing CSGO with vulkan, I’m still getting jitter after about 20 minutes of gameplay. I captured another debug log when it was happening in hopes that it will shed some light on what’s going on. Please see attached.

nvidia-bug-report.log.gz (275.0 KB)

As a quick update: I updated to the latest beta version available from the Vulkan page (515.49.05), which did not help; I still have stuttering during gameplay.

I also installed the RT kernel again and, while it initially didn’t throw interrupt storms, I now see that it’s throwing them often again.

uname -a
Linux cyberdeath-pc 5.18.0-1-rt11-MANJARO #1 SMP PREEMPT_RT Sat May 28 15:43:17 CEST 2022 x86_64 GNU/Linux

Including the following:

Let me know if any further information is needed.

The interrupt storm is also happening with the normal desktop preempt kernel; there are just no races happening, so you don’t see it in dmesg. Also, not only the NVIDIA card is affected but also USB.
After only 30 minutes of uptime, xhci is at 1.5M interrupts, nvidia at 4M interrupts. Please check /proc/interrupts and maybe disconnect usb accessories one after another to test if you can find some device triggering this.
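For anyone wanting to watch the counters more systematically than with `watch`, here is a minimal Python sketch that takes two snapshots of /proc/interrupts and ranks the IRQ lines by interrupts per second. The function names (`parse_interrupts`, `busiest`, `sample_rates`) are my own and purely illustrative, and the parser assumes the usual layout: a header row of CPU names followed by one row per IRQ with one count per CPU.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR/MIS carry fewer counters than there are CPUs,
        # so stop at the first non-numeric field.
        for tok in fields[:n_cpus]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        table[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return table


def busiest(before, after, seconds, top=5):
    """Rank IRQ lines by interrupts per second between two snapshots."""
    rates = []
    for label, (total, desc) in after.items():
        delta = total - before.get(label, (0, ""))[0]
        rates.append((delta / seconds, label, desc))
    return sorted(rates, reverse=True)[:top]


def sample_rates(interval=5.0, path="/proc/interrupts"):
    """Read the file twice, `interval` seconds apart, and return the top rates."""
    with open(path) as f:
        first = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_interrupts(f.read())
    return busiest(first, second, interval)
```

Calling `sample_rates()` once while idle and once while moving the mouse should make it obvious which lines only fire on input and which fire constantly.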

@generix I ran watch and specifically looked at the xhci as you mentioned. Even with leaving my keyboard and mouse connected, it was not incrementing. However, as soon as I hit a key or moved the mouse, the numbers incremented, particularly with the mouse. I know that the polling rate on this mouse is 1000Hz and I know it’s also high on the keyboard.

The nVIDIA interrupt is incrementing regardless of any activity on the system.

PCI devices:

                      TYPE            BUS   CLASS  VENDOR  DEVICE   CONFIGS

   Mass storage controller   0000:00:17.0    0106    8086    7ae2         0
                    Bridge   0000:00:1c.0    0604    8086    7ab8         0
     Serial bus controller   0000:00:15.1    0c80    8086    7acd         0
                    Bridge   0000:00:1f.0    0601    8086    7a84         0
   Mass storage controller   0000:02:00.0    0108    144d    a80a         0
                    Bridge   0000:00:01.0    0604    8086    460d         0
   Mass storage controller   0000:08:00.0    0108    144d    a80a         0
  Communication controller   0000:00:16.0    0780    8086    7ae8         0
        Display controller   0000:01:00.0    0300    10de    2208         5
                    Bridge   0000:00:1b.0    0604    8086    7ac0         0
     Serial bus controller   0000:00:1f.5    0c80    8086    7aa4         0
   Mass storage controller   0000:07:00.0    0108    144d    a80a         0
                    Bridge   0000:00:1c.3    0604    8086    7abb         0
     Multimedia controller   0000:00:1f.3    0403    8086    7ad0         0
                    Bridge   0000:00:00.0    0600    8086    4668         0
                    Bridge   0000:00:1c.1    0604    8086    7ab9         0
     Serial bus controller   0000:00:15.2    0c80    8086    7ace         0
                    Bridge   0000:00:1d.4    0604    8086    7ab4         0
        Network controller   0000:06:00.0    0200    8086    15f3         0
     Serial bus controller   0000:00:15.0    0c80    8086    7acc         0
                    Bridge   0000:00:06.0    0604    8086    464d         0
   Mass storage controller   0000:05:00.0    0106    1b21    0612         0
                    Bridge   0000:00:1d.0    0604    8086    7ab0         0
   Mass storage controller   0000:00:0e.0    0104    8086    467f         0
     Multimedia controller   0000:01:00.1    0403    10de    1aef         0
         Memory controller   0000:00:14.2    0500    8086    7aa7         0
     Serial bus controller   0000:00:14.0    0c03    8086    7ae0         0
     Serial bus controller   0000:00:1f.4    0c05    8086    7aa3         0
Signal processing controller   0000:00:0a.0    1180    8086    467d         0

USB devices:

                      TYPE            BUS   CLASS  VENDOR  DEVICE   CONFIGS

                     Mouse      1-4.4:1.0   10503    3057    0001         0
                       Hub        2-8:1.0   10a00    174c    3074         0
       Unclassified device      1-4.1:1.2    0000    1b1c    1b73         0
                       Hub       1-10:1.0   10a00    058f    6254         0
       Unclassified device        1-7:1.6    0000    0b05    1a27         0
     Multimedia controller      1-4.3:1.1    0401    00ff    ff00         0
                  Keyboard      1-4.1:1.0   10800    1b1c    1b73         0
       Unclassified device      1-4.2:1.0    0000    0f1b    1006         0
       Unclassified device      1-4.3:1.2    0000    00ff    ff00         0
                       Hub        1-0:1.0   10a00    1d6b    0002         0
                       Hub        1-4:1.0   10a00    058f    6254         0
                       Hub        1-8:1.0   10a00    174c    2074         0
       Unclassified device        1-5:1.2    0000    0b05    18f3         0
                       Hub        2-0:1.0   10a00    1d6b    0003         0

@generix Here’s another debug report in case that helps.

nvidia-bug-report.log.gz (1.5 MB)

I don’t really know where that’s coming from. Did you already try to reseat the nvidia board in its slot? Please also check with a 5.17 or earlier kernel, if possible.

@generix I did reseat the card and I also was running 5.15 previously. I’ll boot to 5.15 and if I get any different result, will edit this post. Otherwise, it didn’t help. :-(

[cyberdeath-pc cyberdeath]# cat /proc/interrupts | grep 'xhci\|nvidia'
16: 0 0 0 0 0 0 0 0 12601 0 0 0 IR-IO-APIC 16-fasteoi nvidia
189: 0 0 0 0 0 0 0 0 6234 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
[cyberdeath-pc cyberdeath]# uptime
00:44:58 up 1 min, 3 users, load average: 1.88, 0.86, 0.32
[cyberdeath-pc cyberdeath]# uname -a
Linux cyberdeath-pc 5.15.48-1-MANJARO #1 SMP PREEMPT Thu Jun 16 12:33:56 UTC 2022 x86_64 GNU/Linux

[cyberdeath-pc cyberdeath]# cat /proc/interrupts | grep 'xhci\|nvidia'
16: 0 0 0 0 0 0 0 0 0 0 43336 0 IR-IO-APIC 16-fasteoi nvidia
189: 0 0 0 0 0 0 0 0 13459 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
[cyberdeath-pc cyberdeath]# uptime
00:38:46 up 3 min, 3 users, load average: 0.44, 0.53, 0.25
[cyberdeath-pc cyberdeath]# uname -a
Linux cyberdeath-pc 5.15.44-1-rt46-MANJARO #1 SMP PREEMPT_RT Mon Jun 6 13:47:12 CEST 2022 x86_64 GNU/Linux

[cyberdeath-pc cyberdeath]# cat /proc/interrupts | grep 'xhci\|nvidia'
16: 0 0 0 0 0 0 0 0 0 336558 0 0 IR-IO-APIC 16-fasteoi nvidia
189: 0 0 0 0 0 0 0 0 97956 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
[cyberdeath-pc cyberdeath]# uptime
00:33:56 up 18 min, 3 users, load average: 0.31, 0.68, 0.58
[cyberdeath-pc cyberdeath]# uname -a
Linux cyberdeath-pc 5.18.0-1-rt11-MANJARO #1 SMP PREEMPT_RT Sat May 28 15:43:17 CEST 2022 x86_64 GNU/Linux
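To put the three outputs above on a common scale, here is a rough Python calculation of average interrupts per second since boot. The counts and uptimes are copied from the outputs above and from the earlier "30 minutes, 4M / 1.5M" figures in this thread; since `uptime` only reports whole minutes, these are ballpark numbers at best (the 1-minute sample especially).

```python
def per_second(count, uptime_minutes):
    """Average interrupts per second since boot (uptime is rounded to minutes)."""
    return count / (uptime_minutes * 60)


# (kernel, uptime in minutes, nvidia count, xhci_hcd count)
snapshots = [
    ("5.15.48-1-MANJARO (PREEMPT)", 1, 12601, 6234),
    ("5.15.44-1-rt46-MANJARO (PREEMPT_RT)", 3, 43336, 13459),
    ("5.18.0-1-rt11-MANJARO (PREEMPT_RT)", 18, 336558, 97956),
    ("earlier figures quoted in this thread", 30, 4_000_000, 1_500_000),
]

for kernel, minutes, nvidia, xhci in snapshots:
    print(
        f"{kernel:40s} nvidia {per_second(nvidia, minutes):8.1f}/s"
        f"  xhci {per_second(xhci, minutes):8.1f}/s"
    )
```

This only gives an average since boot, so it cannot show whether the rate changes once the slowdown starts; sampling the counters at intervals would be needed for that.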

Are there any other debug tools that would give more insight into the reason for the interrupts?

Unfortunately no, since interrupts are emitted by the hardware in general. I guess you’ll have to go the hard way, testing your components one-by-one, starting with the nvidia board. Please check if it works in another system.

@generix I tested with an RTX2060 in the same system and I get the same behavior. So it does not appear to be the graphics card itself, but still could be drivers/software. I installed the open source version as a test and also got the same behavior.

nvidia-bug-report.log.gz (1.6 MB)

Edit: I also ran GPU Burn on my 3080Ti:
gpu_burn.txt (3.0 KB)

Also, when I ran with the RT kernel, it was jittery in-game with both cards. Without the RT kernel, it was smooth (maybe with a little more latency) until about 20-25 minutes in, when it became like a slideshow.

Edit #2:

I ran vkmark for an extended period of time and performance became very poor after it had been running for a while. When run for a short period of time, it performed well:

Long run:

[vertex] device-local=true: FPS: 1 FrameTime: 1000.000 ms
[vertex] device-local=false: FPS: 1 FrameTime: 1000.000 ms
[texture] anisotropy=0: FPS: 1 FrameTime: 1000.000 ms
[texture] anisotropy=16: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=gouraud: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=blinn-phong-inf: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=phong: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=cel: FPS: 833 FrameTime: 1.200 ms
[effect2d] kernel=edge: FPS: 1616 FrameTime: 0.619 ms
[effect2d] kernel=blur: FPS: 1447 FrameTime: 0.691 ms
[desktop] : FPS: 1672 FrameTime: 0.598 ms

                               vkmark Score: 60

=======================================================
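In case it helps with comparing runs, here is a small Python sketch that parses vkmark result lines like the ones in the long run above and lists the scenes that collapsed. The regular expression is written against the line format shown here, and the 30 FPS cutoff and the names `parse_vkmark` / `collapsed` are arbitrary choices of mine, not anything vkmark defines.

```python
import re

# Matches lines such as "[vertex] device-local=true: FPS: 1 FrameTime: 1000.000 ms"
LINE = re.compile(
    r"^\[(\w+)\]\s*(.*?):\s*FPS:\s*(\d+)\s*FrameTime:\s*([\d.]+)\s*ms"
)

LONG_RUN = """\
[vertex] device-local=true: FPS: 1 FrameTime: 1000.000 ms
[vertex] device-local=false: FPS: 1 FrameTime: 1000.000 ms
[texture] anisotropy=0: FPS: 1 FrameTime: 1000.000 ms
[texture] anisotropy=16: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=gouraud: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=blinn-phong-inf: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=phong: FPS: 1 FrameTime: 1000.000 ms
[shading] shading=cel: FPS: 833 FrameTime: 1.200 ms
[effect2d] kernel=edge: FPS: 1616 FrameTime: 0.619 ms
[effect2d] kernel=blur: FPS: 1447 FrameTime: 0.691 ms
[desktop] : FPS: 1672 FrameTime: 0.598 ms
"""


def parse_vkmark(text):
    """Return a list of (scene, options, fps, frame_time_ms) tuples."""
    results = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            scene, options, fps, frame_ms = m.groups()
            results.append((scene, options, int(fps), float(frame_ms)))
    return results


def collapsed(results, min_fps=30):
    """Scenes whose FPS fell below min_fps (an arbitrary cutoff)."""
    return [f"{scene} {opts}".strip() for scene, opts, fps, _ in results if fps < min_fps]


results = parse_vkmark(LONG_RUN)
for name in collapsed(results):
    print("collapsed:", name)
```

On the long-run data this flags the first seven scenes, with the rest back at normal frame rates.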

Short/regular run:

=======================================================
vkmark 2017.08

Vendor ID:      0x10DE
Device ID:      0x1F06
Device Name:    NVIDIA GeForce RTX 2060 SUPER
Driver Version: 2160869696
Device UUID:    b6d58e5649bb414e24aeac1fd76647c2

=======================================================
[vertex] device-local=true: FPS: 1667 FrameTime: 0.600 ms
[vertex] device-local=false: FPS: 1598 FrameTime: 0.626 ms
[texture] anisotropy=0: FPS: 1631 FrameTime: 0.613 ms
[texture] anisotropy=16: FPS: 1631 FrameTime: 0.613 ms
[shading] shading=gouraud: FPS: 1647 FrameTime: 0.607 ms
[shading] shading=blinn-phong-inf: FPS: 1646 FrameTime: 0.608 ms
[shading] shading=phong: FPS: 1645 FrameTime: 0.608 ms
[shading] shading=cel: FPS: 1647 FrameTime: 0.607 ms
[effect2d] kernel=edge: FPS: 1595 FrameTime: 0.627 ms
[effect2d] kernel=blur: FPS: 1430 FrameTime: 0.699 ms
[desktop] : FPS: 1609 FrameTime: 0.622 ms
[cube] : FPS: 1669 FrameTime: 0.599 ms
[clear] : FPS: 1620 FrameTime: 0.617 ms

                               vkmark Score: 1618

=======================================================
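As a side note, the raw "Driver Version: 2160869696" in the header above can be turned back into a readable version. The sketch below assumes the bit layout that NVIDIA's Vulkan driver is commonly understood to use (10 bits major, 8 bits minor, 8 bits patch, 6 bits build); that layout is vendor-specific rather than part of the Vulkan spec, so treat it as an assumption. The function name is my own.

```python
def decode_nvidia_driver_version(raw):
    """Unpack a Vulkan driverVersion value using the assumed NVIDIA 10/8/8/6 bit layout."""
    major = (raw >> 22) & 0x3FF
    minor = (raw >> 14) & 0xFF
    patch = (raw >> 6) & 0xFF
    build = raw & 0x3F
    return major, minor, patch, build


major, minor, patch, build = decode_nvidia_driver_version(2160869696)
print(f"{major}.{minor}.{patch:02d}")  # 515.49.05
```

Under that assumption the value corresponds to 515.49.05, which matches the beta driver mentioned earlier in the thread.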

Edit #3: Tested with the 5.15 kernel and the nVIDIA 510 driver, but unfortunately experienced the same choppiness after ~20 minutes.
nvidia510_vkmark.txt (14.1 KB)
nvidia-bug-report.log.gz (1.2 MB)

I really think there’s something wrong with your mainboard. Did you already try using the second PCIe slot?
Have you already tried a different monitor cable/connector?

@generix The interesting thing is that my brother-in-law suffers from the same performance issues. I’m still seeing the constant interrupts on both my system and my brother-in-law’s. I switched the PCIe slot and got the same behavior. I also updated to the latest version that was just released, with the same behavior.

I was able to resolve the issue I was having with CSGO. I posted what I did to fix it on GitHub CSGO issue 2901.

Are there any hardware similarities between the systems (mainboard, NVIDIA GPU model/brand)?

@generix Not really because they are completely different generations. Both are Asus motherboards and Intel processors.

My brother-in-law: Mainboard is Asus, Intel i5-2500k, nVIDIA GPU is an Asus GTX 1650 OC Edition (DUAL-GTX1650-O4GD6-MINI)
Mine: Mainboard is Asus (Z690 Hero), Intel i7-12700KF, nVIDIA GPU is an eVGA RTX 3080Ti FTW3 ULTRA GAMING 12GB (12G-P5-3967-KB)