When I put a heavy workload on my system, which uses three GPUs, the entire system sometimes becomes unresponsive and I get the following kernel message:
watchdog: BUG: soft lockup - CPU#17 stuck for 22s!
The only way to recover is to restart the entire system. The problem doesn't occur consistently, and I can't reliably reproduce it. The workload that triggers it involves real-time decoding of video streams from more than 50 cameras and running our inference model on each frame. I've attached the journalctl output and a bug report for reference.
nvidia-bug-report.log.gz (4.9 MB)
system_error.log (41.5 KB)
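
In case it helps anyone reading through the attached log, here is a minimal sketch of how the soft lockup messages could be tallied per CPU. It assumes the log lines contain the same message format as the one quoted above; the function name `summarize_lockups` is just an illustrative choice, not part of any tool.

```python
import re
from collections import Counter

# Matches the kernel message quoted in this post, e.g.
# "watchdog: BUG: soft lockup - CPU#17 stuck for 22s!"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")


def summarize_lockups(lines):
    """Return (message count per CPU, longest reported stall in seconds per CPU)."""
    counts = Counter()
    longest = {}
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match is None:
            continue  # not a soft lockup line
        cpu = int(match.group(1))
        seconds = int(match.group(2))
        counts[cpu] += 1
        longest[cpu] = max(longest.get(cpu, 0), seconds)
    return counts, longest


def print_summary(path):
    """Read a saved log file (for example journalctl output) and print a per-CPU summary."""
    with open(path, errors="replace") as log_file:
        counts, longest = summarize_lockups(log_file)
    for cpu in sorted(counts):
        print(f"CPU#{cpu}: {counts[cpu]} message(s), longest stall {longest[cpu]}s")
```

Calling `print_summary("system_error.log")` on a saved copy of the log would show whether the lockups always hit the same CPU or move around, which might help narrow things down.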