Jetson-inference: Retraining cat_dog using train.py is not running

Hi,

I have built jetson-inference and am trying to retrain the cat_dog model.

When I run the command below, training starts and I can see the first epoch's completion printed on the screen. After that, within about a minute, the Jetson Nano unit shuts down.

Command:
cd jetson-inference/python/training/classification
python3.6 train.py --model-dir=cat_dog ~/datasets/cat_dog

PyTorch: 1.3.0
Torchvision: 0.5.0
TensorRT: 6.0.1.10
TensorFlow: 1.13.1
JetPack SDK: 4.3

This issue happens every time, and I need to restart the unit. If I set the default epoch count to 1 instead of 35, then training runs to completion and generates the model. So whenever train.py runs for more than about a minute, the system shuts down.
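
(For reference, the epoch count can also be limited from the command line, assuming train.py accepts an --epochs option rather than only the hard-coded default, e.g.:
python3.6 train.py --model-dir=cat_dog --epochs=1 ~/datasets/cat_dog)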

Let me know what is wrong here and the solution for this. Thank you.

Regards,
Shankar

Hi Shankar, can you run tegrastats in the background to keep an eye on the memory usage during training? If the process is consuming all RAM and swap memory, you may need to mount additional swap space (see here).
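
One simple way to capture the stats is to log them to a file from another terminal, e.g. "tegrastats | tee tegrastats.log". And as a rough sketch of adding a 4GB swap file (the path and size here are only examples, adjust them to your storage):

sudo fallocate -l 4G /mnt/4GB.swap
sudo chmod 600 /mnt/4GB.swap
sudo mkswap /mnt/4GB.swap
sudo swapon /mnt/4GB.swap

To have the swap file mounted automatically after a reboot, you can also add a corresponding entry to /etc/fstab.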

The other possibility is that the board is shutting down due to the power supply. Which power supply are you using, and could you try one of those listed here (ideally a 5V⎓4A DC barrel jack adapter)?

Hi Dusty_nv,

Here is the tegrastats output from before training and after 30 seconds of training. This is what I could capture before it shut down. Let me know if this memory increase is the reason.

Before:

RAM 1292/3956MB (lfb 397x4MB) SWAP 0/3002MB (cached 0MB) CPU [13%@102,11%@102,14%@102,21%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C CPU@28.5C iwlwifi@33C PMIC@100C GPU@27.5C AO@37C thermal@28C POM_5V_IN 1890/2345 POM_5V_GPU 41/86 POM_5V_CPU 123/463

After 30 seconds of running:

RAM 3353/3956MB (lfb 107x4MB) SWAP 346/3002MB (cached 1MB) CPU [19%@102,22%@102,18%@102,10%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@27.5C CPU@29C iwlwifi@33C PMIC@100C GPU@28C AO@37.5C thermal@28.5C POM_5V_IN 1931/2805 POM_5V_GPU 82/230 POM_5V_CPU 123/565

Regarding the power adapter, I am using one rated at 9V / 1A.

Regards,
Shankar

The memory usage looks OK, since there is still sufficient swap space remaining.

Which specific power adapter are you using? The Nano should use a 5V adapter; for training it should supply at least 2.5A (ideally the 4A barrel jack adapter).

Another thing you can try, to determine whether the issue is power-related, is setting your Nano to 5W mode by running the command "sudo nvpmodel -m 1". The training will take longer, but if it runs continuously, it means your power adapter isn't able to consistently supply the needed current under the load of max-performance mode. In that case, you should try a different power adapter.
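
For reference, the typical commands look like this (exact output may vary slightly between JetPack releases):

sudo nvpmodel -q      # query the currently active power mode
sudo nvpmodel -m 1    # switch to 5W mode
sudo nvpmodel -m 0    # switch back to 10W / max-performance mode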

Hi Dusty,

I tried using a USB power adapter (5V / 4A), and with this I am also seeing the same behaviour.

If I need to run this training on a host PC (Windows), can you share the list of packages and the steps to do this?

Regards,
Shankar

Do you see the same behavior when running your Nano in 5W mode (sudo nvpmodel -m 1)? If it runs fine in 5W mode, that would indicate a power supply issue under sustained load, and you should try the DC barrel jack adapter instead of the USB power adapter.

You would need to install PyTorch for Windows (along with the CUDA Toolkit and cuDNN for Windows), and you would want to use my torchvision fork (v0.3.0 branch) found here: https://github.com/dusty-nv/vision/tree/v0.3.0
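
Roughly, building that fork from source would look like the following (assuming PyTorch, the CUDA Toolkit, and cuDNN are already installed on the machine):

git clone -b v0.3.0 https://github.com/dusty-nv/vision
cd vision
python setup.py install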

Hi Dusty,

Using the USB power adapter (5V / 4A) and keeping the unit in 5W power mode helped. Thank you.
I have tried training for up to 5 epochs and it's working; I have not tried beyond that for now.

Regards,
Shankar

OK, gotcha. In that case, that would point to some issue with the USB power supply. It could be that the power supply is unable to consistently deliver under sustained load, or perhaps its cable doesn't use thick enough gauge wiring, causing the voltage to droop under load. In any case, I recommend you pick up one of these DC barrel jack adapters along with a jumper:

https://www.adafruit.com/product/1466

Then you should be able to return your Nano to 10W mode (sudo nvpmodel -m 0) for maximum performance.