Retraining ssd_mobilenet on Jetson Nano

While working through the Jetson AI Fundamentals retraining tutorial for the above network, I ran train_ssd.py for a couple of epochs and converted the model to ONNX format without problems, but when I come to output the results into the test directory I find there are no bounding boxes or labels on the images. I have followed Dusty's instructions to the letter, but I am at a loss to understand why this is happening.
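For reference, the steps I followed were roughly the tutorial commands (the dataset and model directory names are abbreviated here as placeholders):

   $ python3 train_ssd.py --dataset-type=open_images --data=data/<dataset> --model-dir=models/<model> --epochs=2
   $ python3 onnx_export.py --model-dir=models/<model>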
Any ideas would be welcome.

Regards
Colin

Hi @colin.gaffney53, how many epochs did you train it for? I think the default is 30. Here is the fruits model that I trained for 100 epochs:

https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz

When you run the model with the detectnet program, you can also decrease the detection threshold using the --threshold argument (e.g. --threshold=0.25). The default is 0.5.
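For example, a detectnet invocation along these lines (the model directory and image paths are placeholders for your own; the input/output blob names are the ones the tutorial uses for the exported SSD-Mobilenet ONNX model):

   $ detectnet --model=models/<model>/ssd-mobilenet.onnx --labels=models/<model>/labels.txt \
        --input-blob=input_0 --output-cvg=scores --output-bbox=boxes \
        --threshold=0.25 "data/<dataset>/test/*.jpg" output_%i.jpg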

Hi Dusty

I’ve been tinkering around with various things but still not had much success.

I trained the network for 30 epochs and used a threshold value of 0.25 in detectnet as you suggested. The only images identified correctly were the strawberries; four other images were labelled as apples, and that was the lot. There were no bounding boxes or labels on any of the others. Only 7 out of 20 images were labelled, and of those just 3 correctly.

It’s not a big deal, but I’m curious as to why it’s not working as it should.

Regards
Colin

Try deleting the .engine file in your model’s directory, then re-run the detectnet program. If you trained the model again, TensorRT may still be loading the old .engine file.
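For example, assuming your model lives under models/<model> (the serialized engine filename encodes the TensorRT version, so a wildcard is the simplest way to catch it):

   $ rm models/<model>/*.engine

The next detectnet run will then rebuild the engine from the current ONNX file, so expect the first run afterwards to take a few minutes while TensorRT re-optimizes the network.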

Also, try using this model that I trained for 100 epochs, just to see if it works for you: https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz
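Something along these lines should fetch and unpack it (I'm assuming the download is a gzipped tarball containing the ONNX model and labels.txt, like the other tutorial models, and the destination directory is just a suggestion):

   $ cd jetson-inference/python/training/detection/ssd/models
   $ wget https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz -O fruit_100epochs.tar.gz
   $ tar -xzvf fruit_100epochs.tar.gz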

Hi Dusty

I loaded the completed model that you sent me and it worked perfectly.

I deleted the engine file as recommended and reran detectnet. There was some improvement in detection: nowhere near as good as your model but reasonable.

I also found that there was some ‘ghosting’ of the bounding boxes, with sometimes 3 or 4 boxes superimposed on each image, each box showing a different confidence percentage. Is this a consequence of lowering the detection threshold to 0.25?

Regards

Colin

Sorry for the delay - yes, lowering the detection threshold can also produce spurious detections.

Hopefully you have had better luck training other models with the pytorch-ssd code. I wonder if the difference with your fruits model was down to a different random initialization from mine.

Thanks for your reply. I am having further problems when trying to use my own dataset.
After following your instructions for camera-capture and creating a directory with a labels file, I captured many images, but when I came to train the network I got the following error message:

Traceback (most recent call last):
  File "train_ssd.py", line 243, in <module>
    target_transform=target_transform, is_test=True)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 33, in __init__
    raise IOError("missing ImageSet file {:s}".format(image_sets_file))
TypeError: unsupported format string passed to PosixPath.__format__

I can’t understand why I have received this message. Any help would be appreciated.

Regards
Colin

That TypeError is actually masking the real problem: the code is trying to raise an IOError about a missing ImageSets file, but the error message fails to format the PosixPath, so you see the TypeError instead. What files do you have under ImageSets/Main? If you only have one file under there (e.g. train.txt), can you copy it so there are the following files (example copy commands are sketched after the listing):

+ ImageSets
   + Main
      - test.txt
      - train.txt
      - trainval.txt
      - val.txt
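For example, from inside your dataset directory (assuming train.txt is the file you already have):

   $ cd ImageSets/Main
   $ cp train.txt val.txt
   $ cp train.txt trainval.txt
   $ cp train.txt test.txt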

Hi Dusty

That worked fine; training went ahead without any problems.

I only had train.txt and trainval.txt in the folder, so I copied train.txt to make the other two files as you suggested. Why were the files missing in the first place?

One last point that I can’t resolve: when trying to resume training using the argument --resume=<saved checkpoint>, I find that training begins again at epoch 0. I’ve deleted the saved checkpoint at epoch 0, but this has no effect.

Regards
Colin

If you check the ‘Merge Sets’ button in the camera-capture tool, it will automatically duplicate your annotations across the train/val/test sets and create those other files. Otherwise, you’ll need to collect separate train/val/test sets so that those files are created for each set.

Merging the sets isn’t advised for making production-quality models, where you want independent train/val/test sets, but it is fine for just playing around with (and it saves you the time of collecting additional val/test images that are separate from your training images).

When you resume, it doesn’t know the previous epoch number, so it restarts at 0. I recommend using a new --model-dir when you resume training so that you keep the checkpoints separate for clarity.
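For example, something along these lines (the dataset path and checkpoint filename are placeholders; point --resume at the latest .pth checkpoint that train_ssd.py saved into your original model directory):

   $ python3 train_ssd.py --dataset-type=voc --data=data/<dataset> \
        --model-dir=models/<dataset>-resumed \
        --resume=models/<dataset>/<latest-checkpoint>.pth \
        --batch-size=4 --epochs=30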

Thanks for all your help Dusty, but I’m still not getting this resume business.

My understanding is that when you supply the last saved checkpoint (which in my case is epoch 29) to the --resume option, training continues from where it left off. I would expect it to then go on to epoch 30, 31, etc. until it finished.

I ran the training again following your advice and saved it into another model directory. However, the training began exactly as before, from epoch 0.

Am I missing something here?

It doesn’t save/parse the last epoch that it ended on, so when you restart it, it thinks it is starting at epoch 0 again. However, since it is loading the weights that you already trained, it is in a better starting position than if it had truly started from scratch.

The classification training code does save the previous epoch number into the checkpoint, but it isn’t something I’ve gotten around to implementing in the pytorch-ssd code that I forked and used for this tutorial.