Performance drop to slideshow after ~20 minutes of CSGO play with Vulkan

nvidia-bug-report.log.gz (293.4 KB)

When playing CSGO using Vulkan, performance drops to slideshow levels after about 20 minutes of gameplay.

Running an NVIDIA RTX 3080 Ti with NVIDIA-Linux-x86_64-515.48.07.
Linux cyberdeath-pc 5.18.0-1-rt11-MANJARO #1 SMP PREEMPT_RT Sat May 28 15:43:17 CEST 2022 x86_64 GNU/Linux

I have attached my bug report to this thread.

Related GitHub bug reports filed with Valve: 2901 & 2891

Seeing a lot of these in my logs:

Jun 15 01:32:01 cyberdeath-pc kernel: BUG: scheduling while atomic: irq/219-s-nvidi/685/0x00000003
Jun 15 01:32:01 cyberdeath-pc kernel: Modules linked in: nvidia_uvm(POE) joydev mousedev snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device mc usbhid intel_rapl_msr asus_nb_wmi eeepc_wmi iTCO_wdt intel_pmc_bxt iTCO_vendor_support>
Jun 15 01:32:01 cyberdeath-pc kernel:  snd_timer cec snd i2c_i801 thunderbolt soundcore i2c_smbus igc mei_me mei drm_kms_helper intel_lpss_pci syscopyarea sysfillrect intel_lpss sysimgblt fb_sys_fops idma64 pmt_telemetry agpgart pmt_cl>
Jun 15 01:32:01 cyberdeath-pc kernel: Preemption disabled at:
Jun 15 01:32:01 cyberdeath-pc kernel: [<0000000000000000>] 0x0
Jun 15 01:32:01 cyberdeath-pc kernel: CPU: 3 PID: 685 Comm: irq/219-s-nvidi Tainted: P        W  OE     5.18.0-1-rt11-MANJARO #1
Jun 15 01:32:01 cyberdeath-pc kernel: Hardware name: ASUS System Product Name/ROG MAXIMUS Z690 HERO, BIOS 1505 05/31/2022
Jun 15 01:32:01 cyberdeath-pc kernel: Call Trace:
Jun 15 01:32:01 cyberdeath-pc kernel:  <TASK>
Jun 15 01:32:01 cyberdeath-pc kernel:  dump_stack_lvl+0x44/0x58
Jun 15 01:32:01 cyberdeath-pc kernel:  __schedule_bug.cold+0x81/0x8e
Jun 15 01:32:01 cyberdeath-pc kernel:  __schedule+0xeb5/0x1240
Jun 15 01:32:01 cyberdeath-pc kernel:  ? push_rt_tasks+0x13/0x20
Jun 15 01:32:01 cyberdeath-pc kernel:  ? raw_spin_rq_unlock+0x17/0x60
Jun 15 01:32:01 cyberdeath-pc kernel:  ? rt_mutex_setprio+0x1be/0x480
Jun 15 01:32:01 cyberdeath-pc kernel:  schedule_rtlock+0x1e/0x40
Jun 15 01:32:01 cyberdeath-pc kernel:  rtlock_slowlock_locked+0x3c8/0xe90
Jun 15 01:32:01 cyberdeath-pc kernel:  rt_spin_lock+0x3f/0x60
Jun 15 01:32:01 cyberdeath-pc kernel:  ___slab_alloc.constprop.0+0x83/0x650
Jun 15 01:32:01 cyberdeath-pc kernel:  ? os_acquire_spinlock+0xe/0x20 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv034974rm+0xc/0x20 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _raw_spin_unlock_irqrestore+0x23/0x60
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv012124rm+0x40/0x90 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv039562rm+0x13/0x60 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv035680rm+0x19/0xb0 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  kmem_cache_alloc_trace+0x6e/0x1c0
Jun 15 01:32:01 cyberdeath-pc kernel:  nv_post_event+0x95/0x140 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  _nv034983rm+0x59/0x80 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv033036rm+0xab/0xc0 [nvidia]
Jun 15 01:32:01 cyberdeath-pc kernel:  ? _nv031255rm+0xf4/0x120 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  nv_post_event+0x95/0x140 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  _nv034983rm+0x59/0x80 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv033036rm+0xab/0xc0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv031255rm+0xf4/0x120 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv022962rm+0x6b/0xb0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv022962rm+0x76/0xb0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv023096rm+0xf47/0x11d0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv023567rm+0x8c/0x1a0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv026078rm+0x5e/0xc0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv010470rm+0x19f/0x310 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv026088rm+0x14f/0x1b0 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? _nv000648rm+0x10b/0x140 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? disable_irq_nosync+0x10/0x10
Jun 14 23:59:11 cyberdeath-pc kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_thread_fn+0x1c/0x60
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_thread+0xfd/0x1a0
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_finalize_oneshot.part.0+0xd0/0xd0
Jun 14 23:59:11 cyberdeath-pc kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Jun 14 23:59:11 cyberdeath-pc kernel:  ? kthread+0x107/0x130
Jun 14 23:59:11 cyberdeath-pc kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 14 23:59:11 cyberdeath-pc kernel:  ? ret_from_fork+0x1f/0x30
Jun 14 23:59:11 cyberdeath-pc kernel:  </TASK>
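
To put a number on "a lot of these", here is a rough sketch that counts the `BUG: scheduling while atomic` lines per task in a chunk of kernel log text. It assumes the log has the `comm/pid/preempt_count` layout seen in the trace above; the function name `count_atomic_bugs` is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "BUG: scheduling while atomic: irq/219-s-nvidi/685/0x00000003".
# The task name (comm) may itself contain slashes, so the pid and the hex
# preempt count are anchored at the end of the match.
BUG_RE = re.compile(
    r"BUG: scheduling while atomic: (?P<comm>.+)/(?P<pid>\d+)/0x[0-9a-fA-F]+"
)

def count_atomic_bugs(log_text):
    """Count 'scheduling while atomic' reports per task name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = BUG_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts
```

One way to use it would be to pass in the text of `journalctl -k` for the current boot, for example via `count_atomic_bugs(sys.stdin.read())`, and print the resulting counter.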

Looks like there’s an interrupt storm happening. Did this also happen with the 510 driver? Did you already try reseating the NVIDIA card in its PCIe slot?

@generix Thanks for the prompt reply, and my apologies for the delay. I reinstalled Manjaro with the latest build as of yesterday. It was less buggy, and I also learned that I don’t see any interrupt storms until I install the RT kernel. However, when playing CSGO with Vulkan, I’m still getting jitter after about 20 minutes of gameplay. I captured another debug log while it was happening, in the hope that it sheds some light on what’s going on. Please see attached.

nvidia-bug-report.log.gz (275.0 KB)

As a quick update: I updated to the latest beta version available from the Vulkan page (515.49.05). It did not help; I still have stuttering during gameplay.

I also installed the RT kernel again and, while it initially didn’t throw interrupt storms, I now see that it’s throwing them often again.

uname -a
Linux cyberdeath-pc 5.18.0-1-rt11-MANJARO #1 SMP PREEMPT_RT Sat May 28 15:43:17 CEST 2022 x86_64 GNU/Linux

Including the following:

Let me know if any further information is needed.

The interrupt storm is also happening with the normal desktop preempt kernel; there are just no races happening, so you don’t see it in dmesg. Also, it is not only the NVIDIA card that is affected but USB as well.
After only 30 minutes of uptime, xhci is at 1.5M interrupts, nvidia at 4M interrupts. Please check /proc/interrupts and maybe disconnect usb accessories one after another to test if you can find some device triggering this.
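
A minimal sketch of one way to do that check: take two snapshots of `/proc/interrupts` a few seconds apart and turn the difference into a per-second rate, so you can see which line reacts when a device is unplugged. The helper names (`parse_interrupts`, `rates`, `sample`) are made up for illustration, and the parser assumes the usual layout of a `CPU0 CPU1 ...` header followed by one row per IRQ.

```python
import time

def parse_interrupts(text, keywords):
    """Return {"<irq> <description>": total count} for rows mentioning a keyword."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row lists CPU0, CPU1, ...
    totals = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for token in fields[:ncpu]:  # leading numeric columns are per-CPU counts
            if not token.isdigit():
                break
            counts.append(int(token))
        desc = " ".join(fields[len(counts):])
        if any(k in desc for k in keywords):
            totals[f"{irq.strip()} {desc}"] = sum(counts)
    return totals

def rates(before, after, seconds):
    """Interrupts per second for every row present in both snapshots."""
    return {k: (after[k] - before[k]) / seconds for k in after if k in before}

def sample(keywords=("xhci", "nvidia"), interval=5.0, path="/proc/interrupts"):
    """Read the file twice, `interval` seconds apart, and return the rates."""
    with open(path) as f:
        before = parse_interrupts(f.read(), keywords)
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read(), keywords)
    return rates(before, after, interval)
```

Calling `sample()` once with everything plugged in and again after removing one accessory should show whether the xhci rate changes; the nvidia row can be compared the same way.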

@generix I ran watch and specifically looked at the xhci line as you mentioned. With my keyboard and mouse left connected but idle, the count was not incrementing. However, as soon as I hit a key or moved the mouse, the numbers incremented, particularly with the mouse. I know that the polling rate on this mouse is 1000Hz and I know it’s also high on the keyboard.

The NVIDIA interrupt count keeps incrementing regardless of any activity on the system.

PCI devices:

                      TYPE            BUS   CLASS  VENDOR  DEVICE   CONFIGS

   Mass storage controller   0000:00:17.0    0106    8086    7ae2         0
                    Bridge   0000:00:1c.0    0604    8086    7ab8         0
     Serial bus controller   0000:00:15.1    0c80    8086    7acd         0
                    Bridge   0000:00:1f.0    0601    8086    7a84         0
   Mass storage controller   0000:02:00.0    0108    144d    a80a         0
                    Bridge   0000:00:01.0    0604    8086    460d         0
   Mass storage controller   0000:08:00.0    0108    144d    a80a         0
  Communication controller   0000:00:16.0    0780    8086    7ae8         0
        Display controller   0000:01:00.0    0300    10de    2208         5
                    Bridge   0000:00:1b.0    0604    8086    7ac0         0
     Serial bus controller   0000:00:1f.5    0c80    8086    7aa4         0
   Mass storage controller   0000:07:00.0    0108    144d    a80a         0
                    Bridge   0000:00:1c.3    0604    8086    7abb         0
     Multimedia controller   0000:00:1f.3    0403    8086    7ad0         0
                    Bridge   0000:00:00.0    0600    8086    4668         0
                    Bridge   0000:00:1c.1    0604    8086    7ab9         0
     Serial bus controller   0000:00:15.2    0c80    8086    7ace         0
                    Bridge   0000:00:1d.4    0604    8086    7ab4         0
        Network controller   0000:06:00.0    0200    8086    15f3         0
     Serial bus controller   0000:00:15.0    0c80    8086    7acc         0
                    Bridge   0000:00:06.0    0604    8086    464d         0
   Mass storage controller   0000:05:00.0    0106    1b21    0612         0
                    Bridge   0000:00:1d.0    0604    8086    7ab0         0
   Mass storage controller   0000:00:0e.0    0104    8086    467f         0
     Multimedia controller   0000:01:00.1    0403    10de    1aef         0
         Memory controller   0000:00:14.2    0500    8086    7aa7         0
     Serial bus controller   0000:00:14.0    0c03    8086    7ae0         0
     Serial bus controller   0000:00:1f.4    0c05    8086    7aa3         0
Signal processing controller   0000:00:0a.0    1180    8086    467d         0

USB devices:

                      TYPE            BUS   CLASS  VENDOR  DEVICE   CONFIGS

                     Mouse      1-4.4:1.0   10503    3057    0001         0
                       Hub        2-8:1.0   10a00    174c    3074         0
       Unclassified device      1-4.1:1.2    0000    1b1c    1b73         0
                       Hub       1-10:1.0   10a00    058f    6254         0
       Unclassified device        1-7:1.6    0000    0b05    1a27         0
     Multimedia controller      1-4.3:1.1    0401    00ff    ff00         0
                  Keyboard      1-4.1:1.0   10800    1b1c    1b73         0
       Unclassified device      1-4.2:1.0    0000    0f1b    1006         0
       Unclassified device      1-4.3:1.2    0000    00ff    ff00         0
                       Hub        1-0:1.0   10a00    1d6b    0002         0
                       Hub        1-4:1.0   10a00    058f    6254         0
                       Hub        1-8:1.0   10a00    174c    2074         0
       Unclassified device        1-5:1.2    0000    0b05    18f3         0
                       Hub        2-0:1.0   10a00    1d6b    0003         0

@generix Here’s another debug report in case that helps.

nvidia-bug-report.log.gz (1.5 MB)

I don’t really know where that’s coming from. Did you already try to reseat the nvidia board in its slot? Please also check with a 5.17 or earlier kernel, if possible.

@generix I did reseat the card and I also was running 5.15 previously. I’ll boot to 5.15 and if I get any different result, will edit this post. Otherwise, it didn’t help. :-(

[cyberdeath-pc cyberdeath]# cat /proc/interrupts | grep -E 'xhci|nvidia'
16: 0 0 0 0 0 0 0 0 12601 0 0 0 IR-IO-APIC 16-fasteoi nvidia
189: 0 0 0 0 0 0 0 0 6234 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
[cyberdeath-pc cyberdeath]# uptime
00:44:58 up 1 min, 3 users, load average: 1.88, 0.86, 0.32
[cyberdeath-pc cyberdeath]# uname -a
Linux cyberdeath-pc 5.15.48-1-MANJARO #1 SMP PREEMPT Thu Jun 16 12:33:56 UTC 2022 x86_64 GNU/Linux

[cyberdeath-pc cyberdeath]# cat /proc/interrupts | grep -E 'xhci|nvidia'
16: 0 0 0 0 0 0 0 0 0 0 43336 0 IR-IO-APIC 16-fasteoi nvidia
189: 0 0 0 0 0 0 0 0 13459 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
[cyberdeath-pc cyberdeath]# uptime
00:38:46 up 3 min, 3 users, load average: 0.44, 0.53, 0.25
[cyberdeath-pc cyberdeath]# uname -a
Linux cyberdeath-pc 5.15.44-1-rt46-MANJARO #1 SMP PREEMPT_RT Mon Jun 6 13:47:12 CEST 2022 x86_64 GNU/Linux

[cyberdeath-pc cyberdeath]# cat /proc/interrupts | grep -E 'xhci|nvidia'
16: 0 0 0 0 0 0 0 0 0 336558 0 0 IR-IO-APIC 16-fasteoi nvidia
189: 0 0 0 0 0 0 0 0 97956 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
[cyberdeath-pc cyberdeath]# uptime
00:33:56 up 18 min, 3 users, load average: 0.31, 0.68, 0.58
[cyberdeath-pc cyberdeath]# uname -a
Linux cyberdeath-pc 5.18.0-1-rt11-MANJARO #1 SMP PREEMPT_RT Sat May 28 15:43:17 CEST 2022 x86_64 GNU/Linux
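
To compare the three runs on an equal footing, the counts above can be turned into rough per-second rates. This is only a back-of-the-envelope sketch: it sums nothing and simply divides each reported total by the uptime, and since `uptime` only gives whole minutes here, the results are estimates (the 1-minute run especially).

```python
# (kernel, uptime in minutes, nvidia count, xhci_hcd count), copied from the
# three transcripts above.
samples = [
    ("5.15.48-1-MANJARO", 1, 12601, 6234),
    ("5.15.44-1-rt46-MANJARO", 3, 43336, 13459),
    ("5.18.0-1-rt11-MANJARO", 18, 336558, 97956),
]

def per_second(count, minutes):
    """Average interrupts per second over the given uptime."""
    return count / (minutes * 60)

for kernel, minutes, nvidia, xhci in samples:
    print(
        f"{kernel}: nvidia ~{per_second(nvidia, minutes):.0f}/s, "
        f"xhci ~{per_second(xhci, minutes):.0f}/s"
    )
```

By this estimate the nvidia line sits at a few hundred interrupts per second on all three kernels, including the non-RT 5.15 one.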

Are there any other debug tools that would give more insight into the reason for the interrupts?

Unfortunately no, since interrupts are generally raised by the hardware itself. I guess you’ll have to do it the hard way and test your components one by one, starting with the NVIDIA board. Please check whether it works in another system.