nvidia-smi hangs with RTX 2070 and driver 440.82

While my deep learning program was saving a PyTorch model, it got stuck and stopped responding. The nvidia-smi command also hangs. Unfortunately, nvidia-bug-report.sh hangs as well, so I’m not sure whether the log file is complete.
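For anyone probing a machine in this state, here is a minimal sketch of running nvidia-smi under a timeout so the check itself cannot block the terminal. The helper name `probe` and the 10-second limit are my own assumptions, not something from the driver or its tools. It deliberately does not wait for the child after killing it, since a process blocked inside the kernel driver may never exit.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and return 'ok', 'failed', 'timeout', or 'missing'.

    Hypothetical helper for illustration; the name and return values
    are assumptions, not part of any NVIDIA tool.
    """
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return "missing"
    try:
        code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child afterwards: a process
        # stuck inside the driver may not exit, and waiting for it would
        # hang this script as well.
        proc.kill()
        return "timeout"
    return "ok" if code == 0 else "failed"


if __name__ == "__main__":
    print(probe(["nvidia-smi"]))
```

A result of "timeout" only says that the command did not finish in time; it does not say why.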

The OS version is Ubuntu 14.04.6 LTS and the driver version is 440.82. I am using PyTorch 1.4.0. The card was manufactured by Colorful, and the model is iGame GeForce RTX 2070 Neptune OC (it is water-cooled, so I assume temperature is not a problem). No display has been attached since system boot.

The file generated by nvidia-bug-report.sh and the dmesg output are attached.

nvidia-bug-report.log (272.9 KB) dmesg.log (141.2 KB)

This happened again today…

In both incidents, the GPU hung while the program was calling torch.save(), which writes the model parameters from GPU memory to local disk. Could this be caused by a fault on the PCIe bus?
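One way to narrow down where the hang occurs is to separate the device-to-host copy from the disk write. The sketch below is not a fix. It assumes the saved object is a state dict of tensors, and the helper name `cpu_copy` and the file name in the usage comment are placeholders of mine. If the program stalls inside `cpu_copy`, the transfer from the GPU is the step that hangs; if it stalls in torch.save() afterwards, the problem is more likely on the disk side.

```python
def cpu_copy(state_dict):
    """Return a new dict whose tensors have been copied to host memory.

    Hypothetical helper for illustration. Each value is expected to
    behave like a torch.Tensor (it must provide .detach() and .cpu()).
    """
    return {name: tensor.detach().cpu() for name, tensor in state_dict.items()}


# Hypothetical usage (model and file name are placeholders, not from the post):
#   import torch
#   host_state = cpu_copy(model.state_dict())  # device-to-host copy happens here
#   torch.save(host_state, "model.pt")         # writes tensors already in host memory
```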

The main problem seems to be that you’re running out of memory, so the driver can’t allocate any and crashes.
Also, please disable Xorg from starting, enable nvidia-persistenced to start on boot, and make sure it’s continuously running.
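To check the memory situation before a long run, here is a rough sketch that reads /proc/meminfo. It assumes the memory in question is host RAM, which the reply does not state explicitly. The function names are hypothetical, and the fallback for kernels that do not report MemAvailable is only a crude estimate.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: size in kB}.

    Lines without a kB unit (for example the HugePages counters) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            fields[name.strip()] = int(parts[0])
    return fields


def available_kb(fields):
    """Return MemAvailable if present, else a rough estimate from other fields."""
    if "MemAvailable" in fields:
        return fields["MemAvailable"]
    # Assumption: on kernels without MemAvailable, approximate it as
    # free memory plus buffers plus page cache.
    return fields.get("MemFree", 0) + fields.get("Buffers", 0) + fields.get("Cached", 0)


def read_available_kb(path="/proc/meminfo"):
    """Read the given meminfo file (Linux only) and return available kB."""
    with open(path) as f:
        return available_kb(parse_meminfo(f.read()))
```

Calling `read_available_kb()` just before the save step, and logging the result, would show whether free memory is close to zero at that point.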