I’m training an AI model for one of my courses, but the training freezes at a random epoch on every run. I can’t even stop it with Ctrl-C. The rest of the system seems unaffected, because the Intel integrated GPU is rendering the GUI. When I check with top, I see irq/201-nvidia running at 100% CPU.
OS: Arch Linux
Linux archlinux 5.15.85-1-lts #1 SMP Thu, 22 Dec 2022 06:22:00 +0000 x86_64 GNU/Linux
GPU: GeForce RTX 3080
VGA compatible controller: NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)
Subsystem: Lenovo Device 22e4
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
The machine is a Lenovo ThinkPad X1 Extreme Gen 4
CUDA version: 11.7.0
CUDNN version: 184.108.40.206
>>pacman -Qs nvidia
NVIDIA’s GPU programming toolkit
NVIDIA CUDA Deep Neural Network library
EGLStream-based Wayland external platform
Nvidia VDPAU library
NVIDIA drivers - module sources
NVIDIA drivers utilities
P.S. When editing videos in Windows I usually get blue screens. The error mentions nvlddmkm.sys, or something containing TDR (in case that info helps).
I have been experiencing this for the last three months, wrestling with different versions of CUDA, cuDNN and the NVIDIA drivers, with no result.
This is my first time asking for support on a forum. If there is any more information I should provide, please let me know.
Thanks in advance!!!