Bad object detection predictions with SSD MobileNet v2 lite

I am using Dusty’s jetson-inference repository to create a real-time object detection program for a custom dataset.

Background info

Right now I am training and predicting on a PC with an NVIDIA graphics card and CUDA installed. The goal is to put a well-trained model (in ONNX format) on the Jetson Nano and perform inference with TensorRT optimization. However, the model is not predicting correctly…

I am using the VisDrone2019-DET dataset, which has 8629 images. I have transformed the data to conform to the VOC standard (a sketch of the conversion follows the annotation example below); this is what my dataset looks like:

VisDrone
├── Annotations
│   ├── 000001.xml
│   ├── 000002.xml
│   ├── 000003.xml
│   ├── 000004.xml
│   ├── 000005.xml
├── ImageSets
│   ├── Main
│   │   ├── test.txt
│   │   └── trainval.txt
├── JPEGImages
│   ├── 000001.jpg
│   ├── 000002.jpg
│   ├── 000003.jpg
│   ├── 000004.jpg
│   └── 000005.jpg
└── label.txt

Example of the annotation files:

<annotation>
	<object>
		<name>person</name>
		<bndbox>
			<xmax>374</xmax>
			<xmin>160</xmin>
			<ymax>481</ymax>
			<ymin>217</ymin>
		</bndbox>
		<truncation>0</truncation>
		<occlusion>0</occlusion>
	</object>
	<object>
		<name>person</name>
		<bndbox>
			<xmax>20</xmax>
			<xmin>1</xmin>
			<ymax>98</ymax>
			<ymin>1</ymin>
		</bndbox>
		<truncation>0</truncation>
		<occlusion>0</occlusion>
	</object>
	<object>
		<name>car</name>
		<bndbox>
			<xmax>375</xmax>
			<xmin>310</xmin>
			<ymax>289</ymax>
			<ymin>165</ymin>
		</bndbox>
		<truncation>0</truncation>
		<occlusion>0</occlusion>
	</object>
</annotation>
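
For completeness, a minimal sketch of the kind of conversion involved (assuming the standard VisDrone-DET annotation format of bbox_left,bbox_top,bbox_width,bbox_height,score,category,truncation,occlusion per line; the category mapping below is illustrative, not my exact script):

# Sketch: convert one VisDrone-DET annotation .txt into a VOC-style .xml
# matching the example above. Category IDs follow the VisDrone convention;
# IDs 0 (ignored regions) and 11 (others) are skipped.
import xml.etree.ElementTree as ET

CLASSES = {1: 'pedestrian', 2: 'person', 3: 'bicycle', 4: 'car', 5: 'van',
           6: 'truck', 7: 'tricycle', 8: 'awning tricycle', 9: 'bus', 10: 'motor'}

def visdrone_to_voc(txt_path, xml_path):
    root = ET.Element('annotation')
    with open(txt_path) as f:
        for line in f:
            if not line.strip():
                continue
            left, top, w, h, score, cat, trunc, occ = map(int, line.strip().split(',')[:8])
            if cat not in CLASSES:
                continue
            obj = ET.SubElement(root, 'object')
            ET.SubElement(obj, 'name').text = CLASSES[cat]
            box = ET.SubElement(obj, 'bndbox')
            ET.SubElement(box, 'xmin').text = str(left)
            ET.SubElement(box, 'ymin').text = str(top)
            ET.SubElement(box, 'xmax').text = str(left + w)
            ET.SubElement(box, 'ymax').text = str(top + h)
            ET.SubElement(obj, 'truncation').text = str(trunc)
            ET.SubElement(obj, 'occlusion').text = str(occ)
    ET.ElementTree(root).write(xml_path)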

The data looks really good to me, although I suspect it may be too hard for this choice of model. I wanted to post a picture of an annotated example image from the dataset, but the forum won’t allow me to add more than one embedded picture per topic…
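
In lieu of that, here is a small sketch of how the annotations can be drawn onto an image for inspection (the paths are placeholders; assumes Pillow is installed):

# Sketch: draw the VOC boxes of one training image for visual inspection
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

img = Image.open('VisDrone/JPEGImages/000001.jpg')
draw = ImageDraw.Draw(img)
for obj in ET.parse('VisDrone/Annotations/000001.xml').getroot().iter('object'):
    box = obj.find('bndbox')
    xmin, ymin, xmax, ymax = (int(box.find(k).text) for k in ('xmin', 'ymin', 'xmax', 'ymax'))
    draw.rectangle([xmin, ymin, xmax, ymax], outline='red', width=2)
    draw.text((xmin, max(ymin - 12, 0)), obj.find('name').text, fill='red')
img.save('annotated_000001.jpg')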

Now, when I train the model, I execute this:

python train_ssd.py --net mb2-ssd-lite --pretrained-ssd models/mb2-lite.pth --data ../../VisDrone --model-dir models/visdrone_model  --dataset-type voc --epochs 100

I have played around a little with --batch-size and --learning-rate, but when testing different options over fewer epochs, the results are still pretty bad.
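
For reference, such a run looks roughly like this (the batch size and learning rate values here are just illustrative, not a recommendation):

python train_ssd.py --net mb2-ssd-lite --pretrained-ssd models/mb2-lite.pth --data ../../VisDrone --model-dir models/visdrone_model --dataset-type voc --epochs 30 --batch-size 24 --learning-rate 0.001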

This is the average precision of the model (note that it is calculated using the training images):

python eval_ssd.py --net mb2-ssd-lite --trained_model models/visdrone_model/mb2-ssd-lite-Epoch-99-Loss-6.650372860745224.pth --dataset_type voc --dataset ../../VisDrone/ --label_file models/visdrone_model/labels.txt --eval_dir models/visdrone_model_eval

[...redacted...]

Average Precision Per-class:
car: 0.08366060023699787
motor: 0.0036423337837340614
person: 0.0023940388984447654
pedestrian: 0.002341346690480933
awning tricycle: 0.002820879803813936
tricycle: 0.0019855374143438924
bicycle: 0.0004966070280048979
truck: 0.053817137171613544
van: 0.01781248643689646
bus: 0.11825235193326702

Average Precision Across All Classes: 0.02872233193975974

To predict images with run_ssd_example.py, I had to make one change, since it gave the error RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! To fix this, I changed the line:

predictor = create_mobilenetv2_ssd_lite_predictor(net, candidate_size=200)

to:

import torch
predictor = create_mobilenetv2_ssd_lite_predictor(net, candidate_size=200, device=torch.device('cuda:0'))
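
A slightly more defensive variant (my own tweak, not part of the original script) also runs on machines without CUDA:

import torch
# Fall back to CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
predictor = create_mobilenetv2_ssd_lite_predictor(net, candidate_size=200, device=device)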

Then using run_ssd_example.py, I predicted objects in the same image. The prediction looks really bad to me…


To me it looks like the training did some things right, but you can see that it is really not good enough.

Please note that this image processing is done with PyTorch, not with TensorRT. When I use detectnet with TensorRT, I get similar results, though.

Now that I have flooded you with information, I have some questions for you:

  • Any ideas for changing the batch size and learning rate (and maybe other options) to make the training work better?
  • As you can see, the precision and predictions are really bad; any idea why? I thought maybe the dataset is too hard for SSD MobileNet v2 lite. If so, how can I improve on this while keeping a real-time model?

Let me know if you could use any more information!

This is an example from the training data, with annotations:

Hi,

Sorry for the late update.

You may need a higher-resolution detector.
In SSD MobileNet, the input image is rescaled to 300x300.
It will be hard for the detector to distinguish such small bounding boxes.
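
For example, a rough calculation (the frame and box sizes are illustrative assumptions):

# A 30x60 px pedestrian in a 1920x1080 frame after rescaling to 300x300
frame_w, frame_h = 1920, 1080
box_w, box_h = 30, 60
print(box_w * 300 / frame_w, box_h * 300 / frame_h)  # ~4.7 x ~16.7 px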

Thanks.

Dear AastaLLL,

I have now trained a model with a 512x512 input resolution.
It already works a lot better than the other model, but it is still not perfect.
I guess it is not possible to do better than that with SSD MobileNet (I do not want to add difficult components myself, only use code from jetson-inference pytorch-ssd).
I will regard this question as solved!
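
In case it is useful to others: newer versions of dusty-nv's pytorch-ssd appear to expose a resolution option on train_ssd.py, so the change would look roughly like this (the --resolution flag is an assumption; on older versions the image_size in the SSD config module has to be adjusted instead):

python train_ssd.py --net mb2-ssd-lite --data ../../VisDrone --model-dir models/visdrone_model_512 --dataset-type voc --epochs 100 --resolution 512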

Thanks for helping!
