Jetson Nano - train a model for my own object detection

Hi,
I tried to follow the guide (jetson-inference/pytorch-collect-detection.md at master · dusty-nv/jetson-inference · GitHub) to train my own object detection model.
After collecting the images of the different classes, I ran the command below:
python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --epochs=1

and here is my log with the error (EOFError: Ran out of input):

Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/phone', dataset_type='voc', datasets=['data/phone'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
Prepare training datasets.
VOC Labels read from file: ('BACKGROUND', 'APPLE', 'HUAWEI')
Stored labels into file models/phone/labels.txt.
Train dataset size: 314
Prepare Validation datasets.
VOC Labels read from file: ('BACKGROUND', 'APPLE', 'HUAWEI')
Validation dataset size: 39
Build network.
Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
Traceback (most recent call last):
File "train_ssd.py", line 311, in <module>
net.init_from_pretrained_ssd(args.pretrained_ssd)
File "/jetson-inference/python/training/detection/ssd/vision/ssd/ssd.py", line 119, in init_from_pretrained_ssd
state_dict = torch.load(model, map_location=lambda storage, loc: storage)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 580, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 750, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

Would you please help identify the possible root causes? Thanks.

Hi @lzlallen1980, it appears that your base model (models/mobilenet-v1-ssd-mp-0_675.pth) may not have downloaded correctly or is corrupt. Can you try re-running this?

$ cd jetson-inference/python/training/detection/ssd
$ wget https://nvidia.box.com/shared/static/djf5w54rjvpqocsiztzaandq1m3avr7c.pth -O models/mobilenet-v1-ssd-mp-0_675.pth
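If you want to verify the file before training again, a quick sanity check (my own suggestion, not part of the guide) is to confirm the download is non-zero in size and that PyTorch can deserialize it; an empty or truncated download is exactly what produces the "EOFError: Ran out of input":

$ ls -l models/mobilenet-v1-ssd-mp-0_675.pth
$ python3 -c "import torch; sd = torch.load('models/mobilenet-v1-ssd-mp-0_675.pth', map_location='cpu'); print(len(sd), 'entries loaded OK')"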

Hi @dusty_nv
Thanks a lot, the problem was solved by re-downloading the base model. May I know how to use the --resume flag in train_ssd.py and train.py? I don't want to re-train my model from scratch.

I checked the --help output, but it doesn't work for me when I execute the following commands:
train.py →
--model-dir=models/phone data/phone --resume=modles/phone/checkpoint.pth.tar --start-epoch=30

train_ssd.py →
python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --resume=models/phone

(P.S. I cannot find the Checkpoint state_dict file after executing train_ssd.py.)

Could it be that there is a typo in --resume=modles/phone/checkpoint.pth.tar? (It should be --resume=models/phone/checkpoint.pth.tar.)

Note that when you are restarting, the number of epochs to train is relative to --start-epoch. Since the default number of classification training epochs is 35, it would train for 5 more epochs in your case. If you want to increase the number of epochs, use the --epochs=<N> argument.

You need to point --resume to a specific checkpoint, not the entire models directory.
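For example, something like this should work (a rough sketch; the checkpoint file names are hypothetical and will differ on your machine):

# classification - fix the "modles" typo and point at the saved checkpoint
$ python3 train.py --model-dir=models/phone data/phone --resume=models/phone/checkpoint.pth.tar --start-epoch=30 --epochs=50

# detection - resume from one specific .pth checkpoint, not the models/phone directory
$ python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --resume=models/phone/mb1-ssd-Epoch-29-Loss-4.07.pth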

Thanks @dusty_nv
The classification problem is solved, but for object detection (train_ssd.py), what I get after the training process is a lot of .pth files with "loss" in the file name.
In this case, can I use the lowest-loss file as the parameter for --resume if I want to retrain the model based on the best result from the last training round?

-rwxrwxrwx 1 root root 23 Okt 27 14:59 labels.txt
-rwxrwxrwx 1 root root 27082289 Okt 27 15:03 mb1-ssd-Epoch-0-Loss-5.173318338394165.pth
-rwxrwxrwx 1 root root 27082657 Okt 27 14:32 mb1-ssd-Epoch-0-Loss-5.181716239452362.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:32 mb1-ssd-Epoch-10-Loss-4.57422052025795.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 15:34 mb1-ssd-Epoch-11-Loss-4.3036856889724735.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:37 mb1-ssd-Epoch-12-Loss-3.8889721393585206.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:39 mb1-ssd-Epoch-13-Loss-4.22403473854065.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:41 mb1-ssd-Epoch-14-Loss-4.558090430498123.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 15:44 mb1-ssd-Epoch-15-Loss-4.365999013185501.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 15:46 mb1-ssd-Epoch-16-Loss-3.660797029733658.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 15:48 mb1-ssd-Epoch-17-Loss-4.379636001586914.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:50 mb1-ssd-Epoch-18-Loss-4.020171248912812.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:52 mb1-ssd-Epoch-19-Loss-4.088538992404938.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 15:06 mb1-ssd-Epoch-1-Loss-4.0988994359970095.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:55 mb1-ssd-Epoch-20-Loss-4.448697566986084.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 15:57 mb1-ssd-Epoch-21-Loss-4.52673025727272.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 16:00 mb1-ssd-Epoch-22-Loss-3.93531693816185.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 16:02 mb1-ssd-Epoch-23-Loss-3.6904357612133025.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:04 mb1-ssd-Epoch-24-Loss-4.123652732372284.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:07 mb1-ssd-Epoch-25-Loss-3.5936761081218718.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 16:09 mb1-ssd-Epoch-26-Loss-4.15365492105484.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:11 mb1-ssd-Epoch-27-Loss-4.30386244058609.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:13 mb1-ssd-Epoch-28-Loss-3.892255371809006.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:15 mb1-ssd-Epoch-29-Loss-4.065651631355285.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:09 mb1-ssd-Epoch-2-Loss-4.154015421867371.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:17 mb1-ssd-Epoch-30-Loss-4.001027858257293.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:19 mb1-ssd-Epoch-31-Loss-4.378226900100708.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:22 mb1-ssd-Epoch-32-Loss-3.9157397031784056.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:24 mb1-ssd-Epoch-33-Loss-3.9947033286094666.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:26 mb1-ssd-Epoch-34-Loss-4.177697479724884.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:28 mb1-ssd-Epoch-35-Loss-3.998384600877762.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 16:30 mb1-ssd-Epoch-36-Loss-3.5488924205303194.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:32 mb1-ssd-Epoch-37-Loss-4.240429884195327.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:34 mb1-ssd-Epoch-38-Loss-3.8180229544639586.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 16:36 mb1-ssd-Epoch-39-Loss-4.163911831378937.pth
-rwxrwxrwx 1 root root 27082289 Okt 27 15:12 mb1-ssd-Epoch-3-Loss-5.932780969142914.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:38 mb1-ssd-Epoch-40-Loss-4.105100697278976.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:40 mb1-ssd-Epoch-41-Loss-4.260145890712738.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:42 mb1-ssd-Epoch-42-Loss-4.006967744231224.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:45 mb1-ssd-Epoch-43-Loss-3.618067067861557.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 16:46 mb1-ssd-Epoch-44-Loss-3.935636246204376.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 16:49 mb1-ssd-Epoch-45-Loss-3.751825910806656.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:51 mb1-ssd-Epoch-46-Loss-4.232554602622986.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 16:53 mb1-ssd-Epoch-47-Loss-4.179930770397187.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 16:55 mb1-ssd-Epoch-48-Loss-4.08715660572052.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 16:57 mb1-ssd-Epoch-49-Loss-3.726569575071335.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:15 mb1-ssd-Epoch-4-Loss-4.2940898656845095.pth
-rwxrwxrwx 1 root root 27082287 Okt 27 15:18 mb1-ssd-Epoch-5-Loss-5.1472421169281.pth
-rwxrwxrwx 1 root root 27082285 Okt 27 15:21 mb1-ssd-Epoch-6-Loss-3.9237293541431426.pth
-rwxrwxrwx 1 root root 27082290 Okt 27 15:23 mb1-ssd-Epoch-7-Loss-4.046696293354034.pth
-rwxrwxrwx 1 root root 27082288 Okt 27 15:26 mb1-ssd-Epoch-8-Loss-3.512920105457306.pth
-rwxrwxrwx 1 root root 27082286 Okt 27 15:29 mb1-ssd-Epoch-9-Loss-4.293288427591324.pth
-rwxrwxrwx 1 root root 27110270 Okt 27 18:17 ssd-mobilenet.onnx
-rwxrwxrwx 1 root root 16500486 Okt 27 18:27 ssd-mobilenet.onnx.1.1.7103.GPU.FP16.engine

You could pick either the model with the lowest loss or the one from the latest epoch. Even though the latest epoch may not have the lowest loss, the loss may be on its way back down in a future epoch.

In any case, I would back up these models to a different directory, because when you resume training the epoch count may be reset to 0 again. Storing each training run in its own directory will help you keep the models straight.
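For example, a rough shell sketch (my own, assuming the Epoch-N-Loss-X naming shown in your listing and a hypothetical backup directory):

$ # back up the finished run into its own directory
$ mkdir -p models/phone_run1 && cp models/phone/*.pth models/phone/labels.txt models/phone_run1/
$ # pick the checkpoint with the smallest loss by sorting on the value after "Loss-"
$ BEST=$(ls models/phone_run1/mb1-ssd-Epoch-*-Loss-*.pth | sort -t- -k6 -g | head -n1)
$ echo $BEST
$ # resume detection training from that checkpoint
$ python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1 --resume=$BEST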

Hello,
I am having the same problem. How did you re-download the base model?

Hi @khy_14, the steps are found here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-ssd.md#setup

$ cd jetson-inference/python/training/detection/ssd
$ wget https://nvidia.box.com/shared/static/djf5w54rjvpqocsiztzaandq1m3avr7c.pth -O models/mobilenet-v1-ssd-mp-0_675.pth

Hi @dusty_nv
Thanks for the reply.
The problem appeared when I followed the "Building the Project From Source" instructions; even after re-downloading, the problem was not solved.
But when I followed the "Running the Docker Container" instructions, no problems appeared and everything works fine.
I just cannot figure out the difference between the two sets of instructions.

Thanks

Hmm, both ways would be using the same copy of the base model, because that directory is mapped into the container. In that case, I'm not sure what the issue may be; perhaps it is related to the PyTorch install on the host. Since the container is working correctly for you, I would recommend continuing to use the container for the PyTorch training, at least.
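For reference, this is roughly how to launch the container and run the training inside it (docker/run.sh mounts the data/ and models/ directories from the host, so the dataset and checkpoints are the same either way):

$ cd jetson-inference
$ docker/run.sh
# then, inside the container:
$ cd python/training/detection/ssd
$ python3 train_ssd.py --dataset-type=voc --data=data/phone --model-dir=models/phone --batch-size=2 --workers=1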

Thanks for the quick reply.
I will try to reinstall PyTorch and let you know if everything is OK.