Retraining ssd_mobilenet on Jetson Nano

While working through the Jetson AI Fundamentals retraining tutorial for the above network, I ran train_ssd.py for a couple of epochs and converted the model to ONNX format without problems, but when I come to output the results into the test directory I find there are no bounding boxes or labels on the images. I have followed Dusty's instructions to the letter, but I am at a loss to understand why this is happening.
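For reference, the steps I followed were roughly the tutorial commands (the dataset and model directory names are abbreviated here as placeholders):

   $ python3 train_ssd.py --dataset-type=open_images --data=data/<dataset> --model-dir=models/<model> --epochs=2
   $ python3 onnx_export.py --model-dir=models/<model>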
Any ideas would be welcome.

Regards
Colin

Hi @colin.gaffney53, how many epochs did you train it for? I think the default is 30. Here is the fruits model that I trained for 100 epochs:

https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz

When you run the model with the detectnet program, you can also decrease the detection threshold using the --threshold argument (e.g. --threshold=0.25). The default is 0.5.
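For example, a detectnet invocation along these lines (the model directory and image paths are placeholders for your own; the input/output blob names are the ones the tutorial uses for the exported SSD-Mobilenet ONNX model):

   $ detectnet --model=models/<model>/ssd-mobilenet.onnx --labels=models/<model>/labels.txt \
        --input-blob=input_0 --output-cvg=scores --output-bbox=boxes \
        --threshold=0.25 "data/<dataset>/test/*.jpg" output_%i.jpg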

Hi Dusty

I’ve been tinkering around with various things but still not had much success.

I trained the network for 30 epochs and used a threshold value of 0.25 in detectnet as you suggested. The only images identified correctly were the strawberries; four other images were labelled as apples, and that was the lot. There were no bounding boxes or labels on any of the others. Only 7 out of 20 images were labelled, and of those just 3 correctly.

It’s not a big deal, but I’m curious as to why it’s not working as it should.

Regards
Colin

Try deleting the .engine file in your model’s directory, then re-run the detectnet program. If you trained the model again, TensorRT may still be loading the old .engine file.
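For example, assuming your model lives under models/<model> (the serialized engine filename encodes the TensorRT version, so a wildcard is the simplest way to catch it):

   $ rm models/<model>/*.engine

The next detectnet run will then rebuild the engine from the current ONNX file, so expect the first run afterwards to take a few minutes while TensorRT re-optimizes the network.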

Also, try using this model that I trained for 100 epochs, just to see if it works for you: https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz
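Something along these lines should fetch and unpack it (I'm assuming the download is a gzipped tarball containing the ONNX model and labels.txt, like the other tutorial models, and the destination directory is just a suggestion):

   $ cd jetson-inference/python/training/detection/ssd/models
   $ wget https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz -O fruit_100epochs.tar.gz
   $ tar -xzvf fruit_100epochs.tar.gz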

Hi Dusty

I loaded the completed model that you sent me and it worked perfectly.

I deleted the engine file as recommended and reran detectnet. There was some improvement in detection: nowhere near as good as your model but reasonable.

I also found that there was some ‘ghosting’ of the bounding boxes, with sometimes 3 or 4 boxes superimposed on each image, each box showing a different confidence percentage. Is this a consequence of lowering the detection threshold to 0.25?

Regards

Colin

Sorry for the delay - yes, lowering the detection threshold can also produce spurious detections.

Hopefully you have had better luck training other models with the pytorch-ssd code. I wonder if the difference with your fruits model was down to a different random initialization from mine.

Thanks for your reply. I am having further problems when trying to use my own dataset.
After following your instructions for camera-capture and creating a directory with a labels file, I captured many images, but when I came to train the network I got the following error message:

Traceback (most recent call last):
  File "train_ssd.py", line 243, in <module>
    target_transform=target_transform, is_test=True)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 33, in __init__
    raise IOError("missing ImageSet file {:s}".format(image_sets_file))
TypeError: unsupported format string passed to PosixPath.__format__

I can’t understand why I have received this message. Any help would be appreciated.

Regards
Colin

That TypeError is actually masking the real problem: the code is trying to raise an IOError about a missing ImageSets file, but the error message fails to format the PosixPath, so you see the TypeError instead. What files do you have under ImageSets/Main? If you only have one file under there (e.g. train.txt), can you copy it so there are the following files (example copy commands are sketched after the listing):

+ ImageSets
   + Main
      - test.txt
      - train.txt
      - trainval.txt
      - val.txt
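For example, from inside your dataset directory (assuming train.txt is the file you already have):

   $ cd ImageSets/Main
   $ cp train.txt val.txt
   $ cp train.txt trainval.txt
   $ cp train.txt test.txt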

Hi Dusty

That worked fine; training went ahead without any problems.

I only had train.txt and trainval.txt in the folder, so I copied train.txt to make the other two files as you suggested. Why were the files missing in the first place?

One last point that I can’t resolve: when trying to resume training using the argument --resume=<saved checkpoint>, I find that training begins again at epoch 0. I’ve deleted the saved checkpoint at epoch 0, but this has no effect.

Regards
Colin

If you check the ‘Merge Sets’ button in the camera-capture tool, it will automatically duplicate your annotations across the train/val/test sets and create those other files. Otherwise, you’ll need to collect separate train/val/test sets so that those files are created for each set.

Merging the sets isn’t advised for making production-quality models, where you want independent train/val/test sets, but it is fine for just playing around with (and it saves you the time of collecting additional val/test images that are separate from your training images).

When you resume, it doesn’t know the previous epoch number, so it restarts at 0. I recommend using a new --model-dir when you resume training so that you keep the checkpoints separate for clarity.
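For example, something along these lines (the dataset path and checkpoint filename are placeholders; point --resume at the latest .pth checkpoint that train_ssd.py saved into your original model directory):

   $ python3 train_ssd.py --dataset-type=voc --data=data/<dataset> \
        --model-dir=models/<dataset>-resumed \
        --resume=models/<dataset>/<latest-checkpoint>.pth \
        --batch-size=4 --epochs=30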

Thanks for all your help Dusty, but I’m still not getting this resume business.

My understanding is that when you supply the last saved checkpoint (which in my case is epoch 29) to the --resume option, training continues from where it left off. I would expect it to then go on to epoch 30, 31, etc. until it finished.

I ran the training again following your advice and saved it into another model directory. However, the training began exactly as before, from epoch 0.

Am I missing something here?

It doesn’t save/parse the last epoch that it ended on, so when you restart it, it thinks it is starting at epoch 0 again. However, since it is loading the weights that you already trained, it is in a better starting position than if it had truly started from scratch.

The classification training code does save the previous epoch number into the checkpoint, but it isn’t something I’ve gotten around to implementing in the pytorch-ssd code that I forked and used for this tutorial.