irq/201-nvidia occupies 100% of one CPU core when training AI models on Arch Linux (RTX 3080)

I’m training an AI model for one of my courses, but the process stops at a random epoch each time and the training just freezes. I can’t even stop it with Ctrl-C. The rest of the system doesn’t seem to be affected, because I’m using the Intel integrated GPU to render the GUI. When I run top, I find irq/201-nvidia running at 100% CPU.
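For anyone who wants to check the same thing on their machine: the thread name irq/201-nvidia refers to IRQ 201 as listed in /proc/interrupts. Below is a minimal Python sketch that samples that file twice and reports how many interrupts per second arrive on lines mentioning nvidia. The helper names (parse_interrupts, irq_rate) are made up for illustration, and the script assumes the usual /proc/interrupts layout with one count column per CPU. It only measures the interrupt rate; it does not diagnose the cause.

```python
import time


def parse_interrupts(text, needle="nvidia"):
    """Return {irq_label: total_count} for /proc/interrupts lines containing needle."""
    lines = text.splitlines()
    if not lines:
        return {}
    # The header row lists the CPUs (CPU0 CPU1 ...), one count column each.
    n_cpus = len(lines[0].split())
    totals = {}
    for line in lines[1:]:
        if needle not in line:
            continue
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Only the first n_cpus fields are per-CPU counters.
        counts = [int(f) for f in fields[:n_cpus] if f.isdigit()]
        totals[label.strip()] = sum(counts)
    return totals


def irq_rate(path="/proc/interrupts", interval=1.0, needle="nvidia"):
    """Sample the file twice and return interrupts per second for each matching IRQ."""
    with open(path) as f:
        before = parse_interrupts(f.read(), needle)
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read(), needle)
    return {irq: (after[irq] - before.get(irq, 0)) / interval for irq in after}


if __name__ == "__main__":
    # Hypothetical usage: run this while the training is frozen.
    for irq, rate in irq_rate().items():
        print(f"IRQ {irq}: {rate:.0f} interrupts/s")
```

Running it once while training works and once while it is frozen would show whether the GPU is actually raising interrupts during the hang, which might be useful information to attach to a report.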

System Specs:
OS: Arch Linux

Linux archlinux 5.15.85-1-lts #1 SMP Thu, 22 Dec 2022 06:22:00 +0000 x86_64 GNU/Linux

GPU: GeForce RTX 3080

VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
Subsystem: Lenovo Device 22e4
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
The machine is a Lenovo ThinkPad X1 Extreme Gen 4.

CUDA version: 11.7.0
CUDNN version: 8.3.0.98

>> pacman -Qs nvidia
local/cuda 11.7.0-1
NVIDIA’s GPU programming toolkit
local/cudnn 8.3.0.98-1
NVIDIA CUDA Deep Neural Network library
local/egl-wayland 2:1.1.11-2
EGLStream-based Wayland external platform
local/libvdpau 1.5-1
Nvidia VDPAU library
local/nvidia-dkms 525.60.11-1
NVIDIA drivers - module sources
local/nvidia-utils 525.60.11-1
NVIDIA drivers utilities
local/opencl-nvidia 525.60.11-1

P.S. When editing videos in Windows I usually get blue screens. The error names nvlddmkm.sys, or something containing TDR (in case that info helps).

I have been experiencing this for the last three months, wrestling with different versions of CUDA, cuDNN, and the NVIDIA drivers, with no result.

This is my first time asking for support on a forum. If there is any more information I should provide, please let me know.

Thanks in advance!!!

Same issue with a 4090: irq/201-nvidia goes up to 100%. Does anyone have an idea?