My hardware:
PowerEdge C4130
2x Tesla M60 (NVIDIA-SMI 418.40.04, Driver Version: 418.40.04, CUDA Version: 10.1)
CentOS 7 with kernel 3.10.0-957.10.1.el7.x86_64
Currently 44 ffmpeg jobs are running
GPU load reported by nvidia-smi is 50-60%

Every 3 days the server reboots unexpectedly, without any error or warning. A few minutes before the restart, the load average climbs into the millions (the peak was 9.89 million), and then the server restarts. The problem keeps recurring after an update and after adding more ffmpeg encoding jobs.
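Since nothing shows up in the logs, one way to capture the state just before the reboot would be to write the load average and GPU status to disk every few seconds and force each line to be synced. The following is only a minimal sketch under my own assumptions: the function names, the log path and the interval are made up for illustration, and it assumes nvidia-smi is on the PATH.

```python
import os
import subprocess
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def gpu_snapshot():
    """Return one line of per-GPU utilization, temperature and power draw."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,temperature.gpu,power.draw",
        "--format=csv,noheader",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=10,
        )
        return result.stdout.strip().replace("\n", " | ")
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Keep logging even if nvidia-smi is missing or hangs.
        return "nvidia-smi failed: %r" % (exc,)


def format_record(timestamp, loads, gpu):
    """Build one log line from a timestamp, the three load averages and GPU info."""
    return "%d load=%.2f,%.2f,%.2f gpu=%s\n" % (
        timestamp, loads[0], loads[1], loads[2], gpu)


def run(path, interval=5, iterations=None):
    """Append a record every `interval` seconds; fsync so it survives a reboot.

    `path` and `interval` are hypothetical defaults chosen for illustration.
    With iterations=None the loop runs until the process is stopped.
    """
    count = 0
    with open(path, "a") as log:
        while iterations is None or count < iterations:
            with open("/proc/loadavg") as f:
                loads = parse_loadavg(f.read())
            log.write(format_record(time.time(), loads, gpu_snapshot()))
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible crash
            count += 1
            time.sleep(interval)
```

Calling something like run("/var/log/loadwatch.log") in the background should leave the last readings on disk after the next reboot, which might show whether the load spike or a GPU problem comes first.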

Is there any limit on the GPU, or does NVIDIA hardware have some kind of overload protection?

BTW: to me it looks like some kind of Linux oops protection, but I can't see any error…

Thanks for any inspiring answer :)

