390.87 driver produces excessive IRQs with GeForce GT 730

I have a quad-core x86_64 Gentoo system with a GeForce GT 730 card, PCI ID 0f02, which must use the legacy 390.xx driver. The latest version in Portage, 390.87 (dated 1/16/19), appears to cause excessive IRQs. The symptom is that every three minutes htop shows CPU0 at 100% for about ten seconds, and the GUI freezes for that period. htop also reports that the kernel thread ksoftirqd/0 is responsible. Right now TIME+ in htop reports 4:42.69 for ksoftirqd/0, 0:00.74 for ksoftirqd/1, 0:00.87 for ksoftirqd/2, and 0:28.03 for ksoftirqd/3. (I’m not using the threadirqs boot option.)

Is this likely to be hardware (video card) failure or a bug in the 390.87 driver from portage?

Use
cat /proc/interrupts
to see where the IRQs are coming from.
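If the raw table is hard to read, a small script can rank the interrupt sources per CPU. The following is only a minimal sketch: the function names and the sample text are made up for illustration and are not output from the system in this thread.

```python
# Hypothetical sample in the layout of /proc/interrupts (not real data).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:         20          0          0          0   IO-APIC   2-edge      timer
 17:    9000000          0          0          0   IO-APIC  17-fasteoi   eth0
 29:          0     123456          0          0   PCI-MSI 524288-edge      nvidia
ERR:          0
"""


def parse_interrupts(text):
    """Return (cpu_names, rows); each row is (label, per_cpu_counts, description)."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header line: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters.
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        rows.append((label.strip(), counts, description))
    return cpus, rows


def busiest(rows, cpu_index, top=5):
    """Rows with the highest counter on the given CPU, largest first."""
    candidates = [r for r in rows if len(r[1]) > cpu_index]
    candidates.sort(key=lambda r: r[1][cpu_index], reverse=True)
    return candidates[:top]
```

On a live system, pass the contents of /proc/interrupts to parse_interrupts() instead of SAMPLE. The counters are cumulative since boot, so comparing two readings taken a few seconds apart shows the current rate better than a single reading.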
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Here’s my cat /proc/interrupts. Attached is my nvidia-bug-report.log.gz.
nvidia-bug-report.log.gz (114 KB)

The GPU is continuously running into errors:

[ 16051.968] (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
[ 16051.968] (EE) NVIDIA(0):     recover...
[ 16051.994] (II) NVIDIA(0): Error recovery was successful.

Kernel:

[13997.629960] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
[14005.823040] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
[14014.016125] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
[14022.209279] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
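To check how regularly these Xid messages occur, the kernel log lines can be parsed and the gaps between timestamps computed. This is a rough sketch; the helper names are made up, and the regular expression only assumes the line layout shown in the excerpt above.

```python
import re

# The four kernel log lines quoted above.
LOG = """\
[13997.629960] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
[14005.823040] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
[14014.016125] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
[14022.209279] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000001
"""

XID_LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s*NVRM: Xid \(([^)]+)\): (\d+)")


def parse_xid(log):
    """Return a list of (seconds_since_boot, device, xid_number) tuples."""
    events = []
    for line in log.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2), int(match.group(3))))
    return events


def gaps(events):
    """Seconds between consecutive Xid events."""
    times = [t for t, _, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

For the excerpt above, gaps(parse_xid(LOG)) gives three intervals of roughly 8.19 seconds each, so the error repeats at a steady rate rather than at random.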

Since this only started about four hours after boot, it looks like a thermal defect, i.e. broken hardware.

Many thanks. Off to buy a new card.

I bought a new card, which shows no errors in Xorg.0.log and no trace of NVRM in /var/log/messages. Clearly I needed a new card, but I still get GUI freezes when one of the four ksoftirqd threads pushes a core to 100%.

Any idea what could be causing this now?

Some ideas crossed my mind. Are you using suspend or hibernate? If so, do you keep the power plug connected when you hibernate? How old is your CMOS battery?

Regardless of the answer to the first question, IRQ problems can also occur if the card is not properly seated in the PCI slot or the system lost connection to the card at some point. How much power can your PSU supply and how much does your graphics card require?

Given that ksoftirqd/0 serves cpu#0 while the nvidia GPU is using MSI on cpu#1, I don’t see how the nvidia driver could be responsible for this. Looking at what sits on cpu#0, there are two of your NICs (even an old tulip as a bridge interface for VirtualBox?) using the APIC, which I don’t think is very efficient. Maybe take a look at those NICs first.
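Since ksoftirqd handles softirqs, /proc/softirqs can also help narrow down which kind of work is piling up on a given CPU. Below is a minimal sketch that compares two snapshots of that file. The snapshot numbers are invented for illustration (they are not from this system), and the function names are hypothetical.

```python
# Two hypothetical snapshots in the layout of /proc/softirqs (invented numbers).
BEFORE = """\
                    CPU0       CPU1       CPU2       CPU3
          HI:          0          0          0          0
       TIMER:     100000      90000      95000      92000
      NET_TX:        500         10         12          9
      NET_RX:    2000000        300        280        310
"""

AFTER = """\
                    CPU0       CPU1       CPU2       CPU3
          HI:          0          0          0          0
       TIMER:     100250      90250      95250      92250
      NET_TX:        510         10         12          9
      NET_RX:    2900000        305        282        311
"""


def parse_softirqs(text):
    """Return {softirq_name: [count_per_cpu, ...]}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header line: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        table[name.strip()] = [int(x) for x in rest.split()[:ncpus]]
    return table


def growth(before, after):
    """Per-CPU increase of each softirq counter between two snapshots."""
    return {
        name: [b - a for a, b in zip(before[name], after[name])]
        for name in after
        if name in before
    }


def hottest(delta, cpu_index):
    """Name of the softirq type that grew most on the given CPU."""
    return max(delta, key=lambda name: delta[name][cpu_index])
```

With the invented snapshots above, hottest(growth(parse_softirqs(BEFORE), parse_softirqs(AFTER)), 0) returns "NET_RX", which is the kind of result that would point at network interfaces rather than the GPU. On a real system, read /proc/softirqs twice, once before and once during a freeze.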

On the motherboard I had two NICs which died so I’m using one on a PCI card. The two dead NICs appeared in /proc/interrupts so I removed the driver as a module in the kernel and now they don’t. For reasons I don’t understand, this seems to have fixed the GUI freezes which were due to the four kernel threads ksoftirqd. I agree that the nvidia driver (with new video card) was not responsible for these freezes.

I’m grateful to generix and HussamT for their advice. Off to install nvidia-drivers-390.116!

Maybe check if you can completely disable the onboard NICs in the BIOS. Removing the driver obviously stopped the interrupt storm, but you never know what else they’re doing.

Many thanks. I never would have thought of this. Found them both and disabled them in the BIOS.