Jetson-inference: Retraining cat_dog using train.py is not running

Hi,

I have built jetson-inference and am trying to retrain the cat_dog model.

When I run the command below, training starts and I can see the first epoch's completion printed on the screen. After that, within about a minute, the Jetson Nano unit shuts down.

Command:
cd jetson-inference/python/training/classification
python3.6 train.py --model-dir=cat_dog ~/datasets/cat_dog

PyTorch: 1.3.0
Torchvision: 0.5.0
TensorRT: 6.0.1.10
TensorFlow: 1.13.1
JetPack SDK: 4.3

This issue happens every time, and I need to restart the unit. If I set the default epoch count to 1 instead of 35, then training runs to completion and generates the model. So whenever train.py runs for more than about a minute, the system shuts down.
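
(For reference, the epoch count can also be limited from the command line, assuming train.py accepts an --epochs option rather than only the hard-coded default, e.g.:
python3.6 train.py --model-dir=cat_dog --epochs=1 ~/datasets/cat_dog)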

Let me know what is wrong here and the solution for this. Thank you.

Regards,
Shankar

Hi Shankar, can you run tegrastats in the background to keep an eye on the memory usage during training? If the process is consuming all RAM and swap memory, you may need to mount additional swap space (see here).
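
One simple way to capture the stats is to log them to a file from another terminal, e.g. "tegrastats | tee tegrastats.log". And as a rough sketch of adding a 4GB swap file (the path and size here are only examples, adjust them to your storage):

sudo fallocate -l 4G /mnt/4GB.swap
sudo chmod 600 /mnt/4GB.swap
sudo mkswap /mnt/4GB.swap
sudo swapon /mnt/4GB.swap

To have the swap file mounted automatically after a reboot, you can also add a corresponding entry to /etc/fstab.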

The other possibility is that the board is shutting down due to the power supply. Which power supply are you using, and could you try one of those listed here (ideally a 5V⎓4A DC barrel jack adapter)?

Hi Dusty_nv,

Here is the tegrastats output from before training and after 30 seconds of training. This is what I could capture before it shut down. Let me know if this memory increase is the reason.

Before:

RAM 1292/3956MB (lfb 397x4MB) SWAP 0/3002MB (cached 0MB) CPU [13%@102,11%@102,14%@102,21%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C CPU@28.5C iwlwifi@33C PMIC@100C GPU@27.5C AO@37C thermal@28C POM_5V_IN 1890/2345 POM_5V_GPU 41/86 POM_5V_CPU 123/463

After 30 seconds of running:

RAM 3353/3956MB (lfb 107x4MB) SWAP 346/3002MB (cached 1MB) CPU [19%@102,22%@102,18%@102,10%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@27.5C CPU@29C iwlwifi@33C PMIC@100C GPU@28C AO@37.5C thermal@28.5C POM_5V_IN 1931/2805 POM_5V_GPU 82/230 POM_5V_CPU 123/565

Regarding the power adapter, I am using one rated at 9V / 1A.

Regards,
Shankar

The memory usage looks OK, since there is still sufficient swap space remaining.

Which specific power adapter are you using? The Nano should use a 5V adapter; for training it should supply at least 2.5A (ideally the 4A barrel jack adapter).

Another thing you can try, to determine whether the issue is power-related, is setting your Nano to 5W mode by running the command "sudo nvpmodel -m 1". The training will take longer, but if it runs continuously, it means your power adapter isn't able to consistently supply the needed current under the load of max-performance mode. In that case, you should try a different power adapter.
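
For reference, the typical commands look like this (exact output may vary slightly between JetPack releases):

sudo nvpmodel -q      # query the currently active power mode
sudo nvpmodel -m 1    # switch to 5W mode
sudo nvpmodel -m 0    # switch back to 10W / max-performance mode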

Hi Dusty,

I tried using a USB power adapter (5V / 4A), and with this I am also seeing the same behaviour.

If I need to run this training on a host PC (Windows), can you share the list of packages and the steps to do this?

Regards,
Shankar

Do you see the same behavior when running your Nano in 5W mode (sudo nvpmodel -m 1)? If it runs fine in 5W mode, that would indicate a power supply issue under sustained load, and you should try the DC barrel jack adapter instead of the USB power adapter.

You would need to install PyTorch for Windows (along with the CUDA Toolkit and cuDNN for Windows), and you would want to use my torchvision fork (v0.3.0 branch) found here: https://github.com/dusty-nv/vision/tree/v0.3.0
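
Roughly, building that fork from source would look like the following (assuming PyTorch, the CUDA Toolkit, and cuDNN are already installed on the machine):

git clone -b v0.3.0 https://github.com/dusty-nv/vision
cd vision
python setup.py install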

Hi Dusty,

Using the USB power adapter (5V / 4A) and keeping the unit in 5W power mode helped. Thank you.
I have tried training for up to 5 epochs and it's working; I have not tried beyond that for now.

Regards,
Shankar

OK, gotcha. In that case, that would point to some issue with the USB power supply. It could be that the power supply is unable to consistently deliver under sustained load, or perhaps its cable doesn't use thick enough gauge wiring, causing the voltage to droop under load. In any case, I recommend you pick up one of these DC barrel jack adapters along with a jumper:

https://www.adafruit.com/product/1466

Then you should be able to return your Nano to 10W mode (sudo nvpmodel -m 0) for maximum performance.