Jetson Nano - train a model for my own object detection

Hi,
I tried to follow the guide (jetson-inference/pytorch-collect-detection.md at master · dusty-nv/jetson-inference · GitHub) to train my own object detection model.
After collecting the images of the different classes, I ran the command below:
python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --epochs=1

and here is my log with the error (EOFError: Ran out of input):

Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/phone', dataset_type='voc', datasets=['data/phone'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
Prepare training datasets.
VOC Labels read from file: ('BACKGROUND', 'APPLE', 'HUAWEI')
Stored labels into file models/phone/labels.txt.
Train dataset size: 314
Prepare Validation datasets.
VOC Labels read from file: ('BACKGROUND', 'APPLE', 'HUAWEI')
Validation dataset size: 39
Build network.
Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
Traceback (most recent call last):
File "train_ssd.py", line 311, in <module>
net.init_from_pretrained_ssd(args.pretrained_ssd)
File "/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 119, in init_from_pretrained_ssd
state_dict = torch.load(model, map_location=lambda storage, loc: storage)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 580, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 750, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

Would you please help identify the possible root causes? Thanks.

Hi @lzlallen1980, it appears that your base model (models/mobilenet-v1-ssd-mp-0_675.pth) may not have downloaded correctly or is corrupt. Can you try re-running this?

$ cd jetson-inference/python/training/detection/ssd
$ wget https://nvidia.box.com/shared/static/djf5w54rjvpqocsiztzaandq1m3avr7c.pth -O models/mobilenet-v1-ssd-mp-0_675.pth
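If you want to verify the file before training again, a quick sanity check (my own suggestion, not part of the guide) is to confirm the download is non-zero in size and that PyTorch can deserialize it; an empty or truncated download is exactly what produces the "EOFError: Ran out of input":

$ ls -l models/mobilenet-v1-ssd-mp-0_675.pth
$ python3 -c "import torch; sd = torch.load('models/mobilenet-v1-ssd-mp-0_675.pth', map_location='cpu'); print(len(sd), 'entries loaded OK')"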

Hi @dusty_nv
Thanks a lot, the problem was solved by re-downloading the base model. May I know how to use the --resume flag in train_ssd.py and train.py? I don't want to re-train my model from scratch.

I checked the --help output, but it doesn't work for me when I execute the following commands:
train.py →
--model-dir=models/phone data/phone --resume=modles/phone/checkpoint.pth.tar --start-epoch=30

train_ssd.py →
python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --resume=models/phone

(P.S. I cannot find the Checkpoint state_dict file after executing train_ssd.py.)

Could it be that there is a typo in --resume=modles/phone/checkpoint.pth.tar? (It should be --resume=models/phone/checkpoint.pth.tar.)

Note that when you are restarting, the number of epochs to train is relative to --start-epoch. Since the default number of classification training epochs is 35, it would train for 5 more epochs in your case. If you want to increase the number of epochs, use the --epochs=<N> argument.

You need to point --resume to a specific checkpoint, not the entire models directory.
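For example, something like this should work (a rough sketch; the checkpoint file names are hypothetical and will differ on your machine):

# classification - fix the "modles" typo and point at the saved checkpoint
$ python3 train.py --model-dir=models/phone data/phone --resume=models/phone/checkpoint.pth.tar --start-epoch=30 --epochs=50

# detection - resume from one specific .pth checkpoint, not the models/phone directory
$ python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --resume=models/phone/mb1-ssd-Epoch-29-Loss-4.07.pth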

Thanks @dusty_nv
The classification problem is solved, but for object detection (train_ssd.py), what I get after the training process is a lot of .pth files with "loss" in the file name.
In this case, can I use the lowest-loss file as the parameter for --resume if I want to retrain the model based on the best result from the last training round?

-rwxrwxrwx 1 root root 23 Okt 27 14:59 labels.txt
-rwxrwxrwx 1 root root 27082289 Okt 27 15:03 mb1-ssd-Epoch-0-Loss-5.173318338394165.pth
-rwxrwxrwx 1 root root 27082657 Okt 27 14:32 mb1-ssd-Epoch-0-Loss-5.181716239452362.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:32 mb1-ssd-Epoch-10-Loss-4.57422052025795.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 15:34 mb1-ssd-Epoch-11-Loss-4.3036856889724735.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:37 mb1-ssd-Epoch-12-Loss-3.8889721393585206.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:39 mb1-ssd-Epoch-13-Loss-4.22403473854065.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:41 mb1-ssd-Epoch-14-Loss-4.558090430498123.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 15:44 mb1-ssd-Epoch-15-Loss-4.365999013185501.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 15:46 mb1-ssd-Epoch-16-Loss-3.660797029733658.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 15:48 mb1-ssd-Epoch-17-Loss-4.379636001586914.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:50 mb1-ssd-Epoch-18-Loss-4.020171248912812.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:52 mb1-ssd-Epoch-19-Loss-4.088538992404938.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 15:06 mb1-ssd-Epoch-1-Loss-4.0988994359970095.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:55 mb1-ssd-Epoch-20-Loss-4.448697566986084.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 15:57 mb1-ssd-Epoch-21-Loss-4.52673025727272.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 16:00 mb1-ssd-Epoch-22-Loss-3.93531693816185.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 16:02 mb1-ssd-Epoch-23-Loss-3.6904357612133025.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:04 mb1-ssd-Epoch-24-Loss-4.123652732372284.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:07 mb1-ssd-Epoch-25-Loss-3.5936761081218718.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 16:09 mb1-ssd-Epoch-26-Loss-4.15365492105484.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:11 mb1-ssd-Epoch-27-Loss-4.30386244058609.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:13 mb1-ssd-Epoch-28-Loss-3.892255371809006.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:15 mb1-ssd-Epoch-29-Loss-4.065651631355285.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:09 mb1-ssd-Epoch-2-Loss-4.154015421867371.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:17 mb1-ssd-Epoch-30-Loss-4.001027858257293.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:19 mb1-ssd-Epoch-31-Loss-4.378226900100708.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:22 mb1-ssd-Epoch-32-Loss-3.9157397031784056.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:24 mb1-ssd-Epoch-33-Loss-3.9947033286094666.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:26 mb1-ssd-Epoch-34-Loss-4.177697479724884.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:28 mb1-ssd-Epoch-35-Loss-3.998384600877762.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 16:30 mb1-ssd-Epoch-36-Loss-3.5488924205303194.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:32 mb1-ssd-Epoch-37-Loss-4.240429884195327.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:34 mb1-ssd-Epoch-38-Loss-3.8180229544639586.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 16:36 mb1-ssd-Epoch-39-Loss-4.163911831378937.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 15:12 mb1-ssd-Epoch-3-Loss-5.932780969142914.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:38 mb1-ssd-Epoch-40-Loss-4.105100697278976.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:40 mb1-ssd-Epoch-41-Loss-4.260145890712738.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:42 mb1-ssd-Epoch-42-Loss-4.006967744231224.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:45 mb1-ssd-Epoch-43-Loss-3.618067067861557.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:46 mb1-ssd-Epoch-44-Loss-3.935636246204376.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 16:49 mb1-ssd-Epoch-45-Loss-3.751825910806656.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:51 mb1-ssd-Epoch-46-Loss-4.232554602622986.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:53 mb1-ssd-Epoch-47-Loss-4.179930770397187.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:55 mb1-ssd-Epoch-48-Loss-4.08715660572052.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 16:57 mb1-ssd-Epoch-49-Loss-3.726569575071335.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:15 mb1-ssd-Epoch-4-Loss-4.2940898656845095.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 15:18 mb1-ssd-Epoch-5-Loss-5.1472421169281.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 15:21 mb1-ssd-Epoch-6-Loss-3.9237293541431426.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:23 mb1-ssd-Epoch-7-Loss-4.046696293354034.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:26 mb1-ssd-Epoch-8-Loss-3.512920105457306.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 15:29 mb1-ssd-Epoch-9-Loss-4.293288427591324.pth
-rwxrwxrwx 1 root root 27110270 Okt 27 18:17 ssd-mobilenet.onnx
-rwxrwxrwx 1 root root 16500486 Okt 27 18:27 ssd-mobilenet.onnx.1.1.7103.GPU.FP16.engine

You could pick either the model with the lowest loss or the one from the latest epoch. Even though the latest epoch may not have the lowest loss, the loss may be on its way back down in a future epoch.

In any case, I would back up these models to a different directory, because when you resume training the epoch count may be reset to 0 again. Storing each training run in its own directory will help you keep the models straight.
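For example, a rough shell sketch (my own, assuming the Epoch-N-Loss-X naming shown in your listing and a hypothetical backup directory):

$ # back up the finished run into its own directory
$ mkdir -p models/phone_run1 && cp models/phone/*.pth models/phone/labels.txt models/phone_run1/
$ # pick the checkpoint with the smallest loss by sorting on the value after "Loss-"
$ BEST=$(ls models/phone_run1/mb1-ssd-Epoch-*-Loss-*.pth | sort -t- -k6 -g | head -n1)
$ echo $BEST
$ # resume detection training from that checkpoint
$ python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --resume=$BEST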

Hello,
I am having the same problem. How did you re-download the base model?

Hi @khy_14, the steps are found here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-ssd.md#setup

$ cd jetson-inference/python/training/detection/ssd
$ wget https://nvidia.box.com/shared/static/djf5w54rjvpqocsiztzaandq1m3avr7c.pth -O models/mobilenet-v1-ssd-mp-0_675.pth

Hi @dusty_nv
Thanks for the reply.
The problem appeared when I followed the "Building the Project From Source" instructions; even after re-downloading, the problem was not solved.
But when I followed the "Running the Docker Container" instructions, no problems appeared and everything works fine.
I just cannot figure out the difference between the two sets of instructions.

Thanks

Hmm, both ways would be using the same copy of the base model, because that directory is mapped into the container. In that case, I'm not sure what the issue may be; perhaps it is related to the PyTorch install on the host. Since the container is working correctly for you, I would recommend continuing to use the container for the PyTorch training, at least.
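For reference, this is roughly how to launch the container and run the training inside it (docker/run.sh mounts the data/ and models/ directories from the host, so the dataset and checkpoints are the same either way):

$ cd jetson-inference
$ docker/run.sh
# then, inside the container:
$ cd python/training/detection/ssd
$ python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1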

Thanks for the quick reply.
I will try to reinstall PyTorch and let you know if everything is OK.