When I run multi-GPU parallel training on a machine with four RTX 4090 cards, it hangs or triggers a CPU soft lockup

I have tested a variety of parallel PyTorch programs on many machines. Many of them crash, but some code does not crash on NGC 22.12-py3.
These are the driver, CUDA, and PyTorch versions I have tested:
Driver version: 525.25.60.13 / 525.78.01
CUDA version: 11.6 / 11.8 / 12.0
PyTorch version: 1.14 / 2.0 / the master branch of PyTorch compiled with CUDA 12
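
In case it helps anyone comparing setups, here is a rough sketch of how the versions listed above could be collected on each machine. The function name collect_versions is made up for illustration. It assumes nvidia-smi is on the PATH and treats PyTorch as optional, so it still runs on a box where either one is missing.

```python
import shutil
import subprocess


def collect_versions():
    """Return driver, GPU, PyTorch and CUDA build versions as a dict of strings."""
    info = {
        "driver": "unknown",
        "gpus": "unknown",
        "torch": "not installed",
        "torch_cuda": "unknown",
    }

    # nvidia-smi ships with the driver; skip quietly if it is not on PATH.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=driver_version,name",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=30, check=False,
            )
            # One CSV row per GPU: "<driver_version>, <name>"
            rows = [line.split(",", 1)
                    for line in result.stdout.strip().splitlines() if line]
            if result.returncode == 0 and rows:
                info["driver"] = rows[0][0].strip()
                info["gpus"] = "; ".join(r[1].strip() for r in rows if len(r) > 1)
        except (OSError, subprocess.TimeoutExpired):
            pass  # a hung or broken nvidia-smi leaves the defaults in place

    # PyTorch is optional here; report its version and the CUDA it was built with.
    try:
        import torch
        info["torch"] = str(torch.__version__)
        info["torch_cuda"] = str(torch.version.cuda)
    except (ImportError, OSError):
        pass

    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running it once per machine and pasting the output next to the crash report should make it easier to see which driver and CUDA combinations are affected.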




I’m not sure whether this is a driver issue or a library issue.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

nvidia-bug-report.log.gz (1.2 MB)
This is my bug report. Thanks a million.

Please help me.


The X server is crashing and restarting; please disable it: sudo systemctl disable display-manager
Then please install and set up nvidia-persistenced and check whether that resolves your issue.
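
To confirm that both changes took effect after a reboot, something like the following sketch could be used. The helper names run and check_setup are hypothetical. It assumes systemctl and nvidia-smi are on the PATH and returns None for anything it cannot query. It only reads state and changes nothing.

```python
import shutil
import subprocess


def run(cmd):
    """Run cmd and return its stripped stdout, or None if the tool is unavailable."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             timeout=30, check=False)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return out.stdout.strip()


def check_setup():
    """Report the display manager's unit state and per-GPU persistence mode."""
    # Whatever state systemctl reports for the unit (for example "disabled").
    display_manager = run(["systemctl", "is-enabled", "display-manager"])
    # One line per GPU reporting whether persistence mode is on.
    persistence = run(["nvidia-smi", "--query-gpu=persistence_mode",
                       "--format=csv,noheader"])
    return {
        "display_manager": display_manager or None,
        "persistence_mode": persistence.splitlines() if persistence else None,
    }


if __name__ == "__main__":
    for key, value in check_setup().items():
        print(f"{key}: {value}")
```

If the display manager still shows as enabled, or any GPU reports persistence mode as disabled, the steps above have not fully applied yet.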