PC restarting when training DL model

My computer restarts while executing a deep learning style transfer notebook (PyTorch 1.8a on a 3090, Ubuntu Linux). The thing is that I don’t know how to get an error trace after the restart. It also doesn’t seem to be a temperature issue, because I have installed the sensors and the temperature stays low or normal.

My last attempt was to test again with some breakpoints set inside VS Code within the training loop. After some epochs, not sure which operation was hit, the computer restarted even though some breakpoints were being respected (which also means the temperature was under control, because the run was going at the speed of a human checking data and hitting continue to the next breakpoint).

I have raised an issue, but they believe it comes from the wrapper library (fastai): https://github.com/pytorch/pytorch/issues/51850. I believe it is the combination of the latest approved drivers for Ubuntu and PyTorch 1.8 compiled from source.

So people, do you know how to trace, debug, print, or do something with my program to get the exact place where this restart is triggered? It has been pretty hard to pinpoint.

Spontaneous restarts or shutdowns are only initiated by the mainboard or PSU, most often due to power instabilities, overcurrent, or PCIe bus problems.
Please try reseating the GPU in its slot, check the PSU, and try limiting the GPU clocks using nvidia-smi -lgc.
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

The PSU is a Cooler Master V1300 Platinum.

Here is the log after a reboot caused by the problem I’m having: nvidia-bug-report_.log.gz (352.6 KB)

Can you provide an example, like nvidia-smi -lgc 1000,3000? I’m not so sure how I would use that. Will it only be valid during the current session, or do I need to “set it back to normal” afterwards?

Maybe
nvidia-smi -lgc 210,1500
This should stay active until a reboot or until different values are set.
No errors in the logs, but that was expected for a spontaneous reboot.

I see, I will try to debug step by step and jump “into” functions to see if I can get the exact line in the training loop; the problem is that it seems to be triggered only after some epochs.
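
For the step-by-step part, what I have in mind is roughly the sketch below: write a marker line to a file and fsync it after every operation, so whatever is on disk after the reboot tells me the last step that completed. The model and data here are just placeholders, not the real notebook; the loss.item() call also forces a CUDA sync, so the log should line up with the GPU work that actually finished.

import os
import torch

# Placeholder model/data standing in for the real style transfer notebook.
model = torch.nn.Linear(10, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(64, 10), torch.randn(64, 1)) for _ in range(100)]

log_file = open("train_progress.log", "a")

def mark(msg):
    # Write, flush, and fsync so the line reaches disk before any hard reset.
    log_file.write(msg + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())

for epoch in range(5):
    for i, (x, y) in enumerate(data):
        mark(f"epoch {epoch} batch {i}: forward")
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        mark(f"epoch {epoch} batch {i}: backward")
        loss.backward()
        opt.step()
        opt.zero_grad()
        # loss.item() forces a CUDA synchronization, so by the time this line
        # is logged the GPU work for this batch has really completed.
        mark(f"epoch {epoch} batch {i}: done, loss {loss.item():.4f}")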

But right now I have used the clock-limiting command with that range, and it is training correctly on the transfer learning notebook I provided.

(xla) tyoc213@u:~/Documents/github$ sudo nvidia-smi -lgc 210,1500
[sudo] password for tyoc213: 
GPU clocks set to "(gpuClkMin 210, gpuClkMax 1500)" for GPU 00000000:02:00.0

Warning: persistence mode is disabled on device 00000000:02:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.

So, knowing that, what do you think could be the problem? Or should I just try bigger and bigger ranges until I hit the error again, and then settle on a “good range” found by trial and error?
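
If it does come down to trial and error, something like the sketch below is what I mean: step the max clock up in stages, run a short burst of training at each level, and fsync a record of the clock being tested, so after a reboot the log shows the highest range that was still being exercised. run_burst here is just a hypothetical stand-in for a shortened run of my notebook, and the nvidia-smi call needs root.

import os
import subprocess

def set_max_clock(mhz):
    # Same command as before, driven from Python; needs to run as root.
    subprocess.run(["nvidia-smi", "-lgc", f"210,{mhz}"], check=True)

def run_burst():
    # Hypothetical placeholder: a few minutes of the style transfer training.
    pass

log = open("clock_sweep.log", "a")

def mark(msg):
    log.write(msg + "\n")
    log.flush()
    os.fsync(log.fileno())

for max_clock in range(1500, 2101, 100):  # 1500 MHz up to 2100 MHz in 100 MHz steps
    mark(f"testing max clock {max_clock} MHz")
    set_max_clock(max_clock)
    run_burst()
    mark(f"max clock {max_clock} MHz survived")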

My guess would be a power problem. Looking at the specs of your PSU, it has a single-/multi-rail switch. Please make sure it is switched to single-rail.

Sorry for my bad info, it is a Corsair HX1000i. I will check the single-rail/multi-rail setting you mentioned on this PSU.

Also, thanks for the hint; currently running with sudo nvidia-smi -lgc 210,1800 without problems.