Error: NMI watchdog: BUG: soft lockup - CPU#33 stuck for 22s! [nvidia-smi:135300]

[Issue]
Around 00:37 on 2/14, the following event occurred on our server.
The system monitoring tool (Zabbix) detected an abnormal increase in CPU usage.
I was able to log in via SSH, but the operation was slow (due to high CPU usage?).
Running sudo reboot over SSH did not reboot the OS.
As a workaround, I forcibly powered off the hardware.
Since that reboot, the system has been operating normally.
I found that a process started at 00:09 never finished and remained in the “R” state.
It appears that Zabbix stopped reporting while waiting for this process to finish.
[Question]
Does anyone know why it happened so we can avoid it happening again?
Our server has been running for 1 year. This is the first time we have seen this error.
The error went away after restarting.
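
In case it helps anyone watching for the same message, here is a minimal Python sketch that extracts the CPU number, stuck time, command name and PID from a soft lockup line like the one in the title. The regular expression is based only on that single line, and the names `SoftLockup` and `parse_soft_lockup` are made up for this example, so other kernel versions may format the message differently.

```python
import re
from typing import NamedTuple, Optional


class SoftLockup(NamedTuple):
    cpu: int       # CPU number reported after "CPU#"
    seconds: int   # how long the CPU was reported stuck
    comm: str      # command name of the task, e.g. "nvidia-smi"
    pid: int       # PID of the task


# Pattern derived from the single log line quoted in this thread (assumption:
# other kernels may word the message differently).
_PATTERN = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


def parse_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the parsed fields, or None if the line is not a soft lockup."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    cpu, seconds, comm, pid = match.groups()
    return SoftLockup(int(cpu), int(seconds), comm, int(pid))
```

Feeding kernel log lines through `parse_soft_lockup` and alerting on any non-`None` result would flag the lockup itself rather than only the rise in CPU usage.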

Please enable nvidia-persistenced to start on boot, make sure it is running continuously, and check whether that resolves the issue.
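
A rough way to verify this from a script is sketched below in Python. It assumes the host uses systemd and that the unit is named `nvidia-persistenced`, which depends on how the driver was packaged. The helper names `systemctl_state` and `persistenced_ok` are made up for this example.

```python
import subprocess

UNIT = "nvidia-persistenced"  # assumed systemd unit name; packaging may differ


def systemctl_state(verb, unit=UNIT, run=subprocess.run):
    """Return the word systemctl prints for `is-enabled` or `is-active`."""
    try:
        result = run(
            ["systemctl", verb, unit],
            capture_output=True,
            text=True,
            check=False,  # these verbs exit non-zero for disabled/inactive units
        )
    except FileNotFoundError:
        # systemctl is not installed, so this host is probably not using systemd.
        return "unavailable"
    return result.stdout.strip()


def persistenced_ok(unit=UNIT, run=subprocess.run):
    """True only if the unit is both enabled at boot and currently active."""
    return (
        systemctl_state("is-enabled", unit, run) == "enabled"
        and systemctl_state("is-active", unit, run) == "active"
    )
```

If `persistenced_ok()` returns `False` on a systemd host, `sudo systemctl enable --now nvidia-persistenced` should both enable the unit at boot and start it immediately, again assuming that unit name exists on your system.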

Thank you. I got it.
But I can’t reproduce it.
So I can’t “check if that resolves the issue”.
Do you know the root cause and how to reproduce it?

When running headless without nvidia-persistenced, the driver will always init/deinit, with a chance of ending in a deadlock at some point. With nvidia-persistenced started, the irq 537 for MSI/MSI-X messages should be gone.
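
To compare the IRQ list before and after starting nvidia-persistenced, something like the following Python sketch could be used. It parses text in the `/proc/interrupts` layout and reports which entries mention `nvidia`. The IRQ number 537 is specific to this machine and may be different on another system or after a reboot. The function names are made up for this example.

```python
def parse_interrupts(text):
    """Map each IRQ label (e.g. "537", "NMI") to its trailing description."""
    irqs = {}
    for line in text.splitlines()[1:]:  # first line is the CPU column header
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Skip the leading per-CPU counters, which are plain integers.
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        irqs[label.strip()] = " ".join(fields[i:])
    return irqs


def nvidia_irqs(text):
    """Return only the IRQ entries whose description mentions nvidia."""
    return {
        label: desc
        for label, desc in parse_interrupts(text).items()
        if "nvidia" in desc.lower()
    }


def read_proc_interrupts(path="/proc/interrupts"):
    """Read the live interrupt table (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling `nvidia_irqs(read_proc_interrupts())` once without the daemon and once with it running shows whether the entry in question is still listed.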

Thank you. I got it.