Jetson TX2 rebooting by itself

I’ve been testing some PyTorch algorithms on a freshly updated Jetson TX2 dev kit (JetPack 4.2.2), but I’ve been running into random reboots. Most of the time it happens in the middle of training, when saving a model, or even before training starts (right after executing the script).

I’m running with nvpmodel mode 3 and jetson_clocks enabled. However, the same thing happened with nvpmodel mode 0 and jetson_clocks enabled.

At first I thought it could be some sort of GPU overload, so I monitored GPU memory usage with the code provided at this link ( https://devtalk.nvidia.com/default/topic/974063/jetson-tx1/caffe-failed-with-py-faster-rcnn-demo-py-on-tx1/post/5010194/ ).
Based on the following output (captured right before a reboot, just after I had launched a training routine but before training actually started), I don’t think it is memory related, at least not on the GPU side.

GPU memory usage: used = 3020.32, free = 4839.71 MB, total = 7860.04 MB
GPU memory usage: used = 3026.94, free = 4833.1 MB, total = 7860.04 MB
GPU memory usage: used = 3023.36, free = 4836.67 MB, total = 7860.04 MB
GPU memory usage: used = 3023.24, free = 4836.79 MB, total = 7860.04 MB
GPU memory usage: used = 3023.12, free = 4836.91 MB, total = 7860.04 MB
GPU memory usage: used = 3023.15, free = 4836.88 MB, total = 7860.04 MB
GPU memory usage: used = 3023.15, free = 4836.88 MB, total = 7860.04 MB
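For context, the monitoring loop is roughly of this shape (just a sketch using pycuda’s mem_get_info; the actual code is in the linked post, and note that on the TX2 the GPU shares physical RAM with the CPU):

import time
import pycuda.autoinit  # creates a CUDA context on the default device
import pycuda.driver as cuda

MB = 1024.0 * 1024.0

while True:
    # Free/total device-visible memory in bytes; on the TX2 this is the shared system RAM.
    free, total = cuda.mem_get_info()
    used = total - free
    print("GPU memory usage: used = %.2f, free = %.2f MB, total = %.2f MB"
          % (used / MB, free / MB, total / MB))
    time.sleep(1)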

Based on this link ( https://devtalk.nvidia.com/default/topic/1042139/jetson-tx2/jetson-tx2-reset-powerdown-issue/ ), I collected some outputs that might point to why this is happening.

nvidia@nvidia:~$ cat /proc/device-tree/chosen/reset/pmc-reset-reason/reset-level
1
nvidia@nvidia:~$ cat /proc/device-tree/chosen/reset/pmc-reset-reason/reset-source
MAINSWRST
nvidia@nvidia:~$ cat /proc/device-tree/chosen/reset/pmic-reset-reason/reason
NIL_OR_MORE_THAN_1_BIT
nvidia@nvidia:~$ cat /proc/device-tree/chosen/reset/pmic-reset-reason/register-value
0x00

If anyone knows what could be going on, help would be deeply appreciated.
Thanks!

Is your system running out of RAM and the OOM killer nuking a critical process?

How can I check if that is happening?

You might find evidence of this in /var/log/syslog. You could also run the system with a serial console connected, where you’ll see the kernel output.

This probably isn’t your problem, though. The OOM killer, I think, would usually pick a non-critical process that is using a lot of RAM.
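If it helps, a quick way to look for OOM-killer traces is to scan syslog for the usual kernel messages, along these lines (a rough sketch; the exact wording and which rotated log file to check vary by kernel and setup):

# Scan syslog for lines that typically accompany an OOM kill.
keywords = ("Out of memory", "oom-killer", "Killed process")
with open("/var/log/syslog", errors="replace") as f:
    for line in f:
        if any(k in line for k in keywords):
            print(line.rstrip())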

After provoking a couple of resets, I couldn’t find anything in syslog that pointed to the problem.

However, I have new information. I usually don’t plug a monitor into the Jetson, since one isn’t always available, but this time I connected one and provoked the reset so I could watch what happens.

It turns out a message saying “Low memory warning!! Memory available for new process: 48MB, free 76MB, buffers/cache 139MB” appeared at the top of the screen as soon as the system froze. Then the screen faded, and a few seconds later the Jetson rebooted.
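To confirm whether host RAM is really running out (the GPU on the TX2 shares physical RAM with the CPU), I’ve started watching /proc/meminfo with a small loop like this (just a rough sketch):

import time

def meminfo_mb():
    # Parse /proc/meminfo; values there are reported in kB.
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0]) // 1024
    return info

while True:
    m = meminfo_mb()
    print("MemFree = %d MB, MemAvailable = %d MB, buffers/cache = %d MB"
          % (m["MemFree"], m["MemAvailable"], m["Buffers"] + m["Cached"]))
    time.sleep(1)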

I believe it is related to how I’m loading the data. I’ll try to change the code and see if it helps.
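Concretely, I’m thinking of something along these lines, a sketch only (the paths and the .pt file layout are made up, since my real loading code isn’t shown here): load each sample lazily in __getitem__ instead of reading the whole dataset into RAM up front, and keep the DataLoader worker count low so each worker process stays small.

import glob
import torch
from torch.utils.data import Dataset, DataLoader

class LazyDataset(Dataset):
    # Only the file paths are kept in memory; samples are read from disk on demand.
    def __init__(self, sample_paths):
        self.sample_paths = sample_paths

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        # Hypothetical layout: each .pt file holds an (input, label) tuple.
        x, y = torch.load(self.sample_paths[idx])
        return x, y

paths = sorted(glob.glob("data/*.pt"))  # hypothetical dataset location
loader = DataLoader(LazyDataset(paths), batch_size=8,
                    num_workers=1, pin_memory=False, shuffle=True)

for x, y in loader:
    pass  # training step would go here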

Thanks anyway!!