The machine hung and restarted while running train.py in section "Re-training on the Cat/Dog Dataset"

Hi all,

While running train.py in the section "Re-training on the Cat/Dog Dataset", the machine hung and then restarted suddenly. I tried changing the batch size from 4 to 1, but the result was the same. How should I diagnose this? Thank you.

The command I used is:
python3 train.py --model-dir=models/cat_dog --batch-size=1 --workers=1 --epochs=1 data/cat_dog

Regards,
Anthony

Hi,

Could you monitor the memory with tegrastats on another console to see whether there is an out-of-memory (OOM) issue?

$ sudo tegrastats
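
If the output is saved to a file, the peak memory figures can be pulled out with a few lines of Python. This is only a minimal sketch: it assumes each tegrastats line contains fields shaped like "RAM 3512/3964MB" and "SWAP 1021/1982MB", and the function name peak_usage is made up for illustration.

```python
import re

# Assumed field shapes (not guaranteed for every release):
#   "RAM 3512/3964MB (lfb 2x1MB) SWAP 1021/1982MB (cached 3MB) ..."
RAM_RE = re.compile(r"\bRAM (\d+)/(\d+)MB")
SWAP_RE = re.compile(r"\bSWAP (\d+)/(\d+)MB")


def peak_usage(lines):
    """Return the peak (used_mb, total_mb) seen for RAM and SWAP.

    A value stays None if that field never appears in the input.
    """
    peaks = {"RAM": None, "SWAP": None}
    for line in lines:
        for name, pattern in (("RAM", RAM_RE), ("SWAP", SWAP_RE)):
            match = pattern.search(line)
            if match is None:
                continue
            used, total = int(match.group(1)), int(match.group(2))
            # Keep the sample with the highest "used" value.
            if peaks[name] is None or used > peaks[name][0]:
                peaks[name] = (used, total)
    return peaks
```

For example, calling peak_usage(open("tegrastats.txt")) on a captured log gives the highest RAM and swap usage seen. If both peaks sit close to their totals just before the hang, an OOM condition is the likely cause.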

Thanks.

Thank you so much, @AastaLLL. I allocated swap and that seemed to solve the issue. However, when I tried to train my own model, following the 'tools' model in the video, the issue came back. I therefore ran your command to capture the information. Please see the attached 'tegrastats.txt' for details.

The command I ran is:
python3 train.py --model-dir=models/anthony_data1 --batch-size=1 --workers=1 --epochs=1 data/anthony_data1

Thanks again!
Anthony

tegrastats.txt (20 KB)

Hi @sangwong416, when you mounted swap, did you also disable ZRAM as per these instructions?

You can also try disabling the desktop GUI, as those instructions describe.
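
To check whether ZRAM is still active alongside the swap you mounted, you can look at /proc/swaps. Here is a rough Python sketch, assuming the usual layout of that file (one header row, then one row per swap area with the device or file name in the first column) and assuming ZRAM devices have "zram" in their name. The helper name classify_swaps is hypothetical.

```python
def classify_swaps(proc_swaps_text):
    """Split active swap areas into zram devices and everything else.

    Expects the text of /proc/swaps: a header row followed by one row
    per swap area, with the name in the first whitespace-separated column.
    """
    zram, other = [], []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        name = fields[0]
        # Assumption: ZRAM-backed swap shows up as /dev/zram0, /dev/zram1, ...
        if "zram" in name:
            zram.append(name)
        else:
            other.append(name)
    return zram, other
```

Running classify_swaps(open("/proc/swaps").read()) returns two lists. If the first one is not empty, ZRAM is still enabled.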
