System reboots when training with PyTorch

Tesla V100 32G
Ubuntu 20.04
PyTorch 1.8
CUDA 11.2

My computer restarts while training an object detection model with YOLOv5. The images are 2048 in size and are resized to 1280 for training. The training set has 13,000 images and the validation set has 1,500. Typically, after a few epochs, running on the validation set triggers a system reboot.
GPU memory during training is sufficient…
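
To back up the memory observation and see what the GPU reports just before a reboot, something like the sketch below could run in a second terminal during training. It is only a rough idea, not part of YOLOv5: the helper names `parse_sample` and `log_gpu` and the file name `gpu_log.csv` are made up for illustration, and it assumes `nvidia-smi` is on the PATH. Each sample is flushed to disk so the last readings should survive a sudden restart.

```python
import os
import subprocess
import time

# Fields requested from `nvidia-smi --query-gpu`.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "memory.used", "memory.total"]


def parse_sample(line):
    """Split one CSV line of nvidia-smi output into a dict keyed by FIELDS."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def log_gpu(path="gpu_log.csv", interval_s=1.0):
    """Append one nvidia-smi sample per interval to `path` until interrupted."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    with open(path, "a") as f:
        while True:
            out = subprocess.run(
                cmd, capture_output=True, text=True, check=True
            ).stdout
            f.write(out)
            # Force the sample to disk so it is not lost if the machine reboots.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

Calling `log_gpu()` before starting training and reading the tail of the log after a reboot would show the last recorded temperature, power draw, and memory use; `parse_sample` turns any logged line back into named fields.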
nvidia-bug-report.log.gz (538.4 KB)