Difficulty of Retraining Object Detection Model with Custom Data

I am struggling with training a new object detection model.

What I want to do is detect mouths (and eyes, if possible).

I captured 100 to 200 images using the built-in camera on the Jetson TX2.
(All training images are very similar…)

I followed the instructions (https://github.com/dusty-nv/jetson-inference) to train the model.

In DIGITS, the loss seemed to decrease well. However, the model did not work well on the validation data, even though it is very similar to the training data.

What should I do to train a model? Specifically,

  1. Roughly how many images do I need?
  2. Should I change the network itself in addition to the DIGITS parameters (e.g., strides)?
  3. Should I take the sizes of the bounding boxes into account?
  4. Any useful advice for training with a small dataset?

Thank you!

To train a customized DetectNet, your training dataset would ideally consist of several thousand annotated images. You should be able to detect pedestrian- or animal-sized objects, like those in the tutorial, without changing the detection sizes in the network prototxt, but for smaller objects such as faces or facial features, you may want to decrease the dimensions. You can consult the facenet model, which was trained on FDDB. I would recommend first trying to work with FDDB or the DIGITS pedestrian detection tutorial with KITTI.
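Before touching the prototxt, it can help to check how large the annotated boxes actually are. Below is a minimal Python sketch that sorts boxes by size. The helper name `box_size_report`, the example boxes, and the 50 to 400 pixel range are all assumptions for illustration (the range follows commonly cited DetectNet guidance, not anything measured in this thread), so adjust them to your own setup.

```python
# Hypothetical helper: groups annotated boxes by whether they fall inside an
# assumed "detectable" size range. The 50-400 px default is an assumption,
# not a value taken from this thread.

def box_size_report(boxes, min_side=50, max_side=400):
    """boxes: list of (left, top, right, bottom) tuples in pixels."""
    report = {"ok": [], "too_small": [], "too_large": []}
    for box in boxes:
        left, top, right, bottom = box
        width, height = right - left, bottom - top
        if min(width, height) < min_side:
            report["too_small"].append(box)
        elif max(width, height) > max_side:
            report["too_large"].append(box)
        else:
            report["ok"].append(box)
    return report


# Made-up example annotations.
example_boxes = [
    (100, 100, 130, 120),  # 30x20 px, e.g. an eye in a wide shot
    (200, 150, 300, 210),  # 100x60 px, e.g. a mouth in a close-up
    (0, 0, 640, 480),      # 640x480 px, fills the whole frame
]
report = box_size_report(example_boxes)
for label, items in report.items():
    print(label, len(items))
```

If most of your mouth or eye boxes land in the "too_small" group, that is a hint that either the images should be cropped or upscaled, or the network's detection dimensions need to be reduced as described above.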

I trained a DetectNet model to detect fish with roughly 3,000 images in the training set. During training, I got an mAP as high as 78.6, and the resulting model seemed to work OK on new (previously unseen) images. Check out my blog posts if you are interested.

https://jkjung-avt.github.io/fisheries-dataset/
https://jkjung-avt.github.io/detectnet-training/

Cool result, thanks for sharing!

I added a link to your resource from the wiki. You may want to add a link to your second post at the end of your first post.

Dustin, thanks. I’ve added the link as you suggested.

A rough rule of thumb is that you need about one training image per learned parameter in your network.
If your network has millions of parameters (many large convolution kernels with many output channels, and so on), that could mean millions of images!
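To get a feel for how quickly parameters add up, here is a small Python sketch that counts the learned parameters of a few convolution layers. The layer shapes are hypothetical and are not taken from DetectNet or any specific network; the only thing assumed is the standard count of kernel height times kernel width times input channels times output channels, plus one bias per output channel.

```python
def conv_params(kernel, in_channels, out_channels, bias=True):
    """Learned parameters in one square 2D convolution layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical layers: (kernel size, input channels, output channels).
layers = [
    (7, 3, 64),
    (3, 64, 192),
    (3, 192, 256),
]
total = sum(conv_params(k, cin, cout) for k, cin, cout in layers)
print(total)  # 562880
```

Even these three layers hold over half a million parameters, which is far more than a dataset of 100 to 200 images by this rule of thumb.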

You can of course get good results with fewer training images, but there is a significant risk of over-fitting in that case. Over-fitting is what you are seeing when the training loss gets low but validation fails.
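One way to spot this pattern is to compare the per-epoch training and validation losses. The Python sketch below is only an illustration: `first_diverging_epoch` is a hypothetical helper, the `patience` threshold is an arbitrary choice, and the loss values are made up rather than taken from any real training run.

```python
def first_diverging_epoch(train_losses, val_losses, patience=2):
    """Return the index of the first epoch in a run of `patience` consecutive
    epochs where training loss falls while validation loss rises.
    Returns None if no such run exists."""
    run = 0
    for epoch in range(1, min(len(train_losses), len(val_losses))):
        train_down = train_losses[epoch] < train_losses[epoch - 1]
        val_up = val_losses[epoch] > val_losses[epoch - 1]
        run = run + 1 if (train_down and val_up) else 0
        if run >= patience:
            return epoch - patience + 1
    return None


# Made-up loss curves: training keeps improving, validation turns around.
train = [2.0, 1.4, 1.0, 0.7, 0.5, 0.3]
val = [2.1, 1.6, 1.3, 1.4, 1.6, 1.9]
print(first_diverging_epoch(train, val))  # 3
```

A snapshot saved just before the reported epoch is usually a more sensible candidate for deployment than the final one.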

To reduce over-fitting, try adding a few dropout layers to your network during training, with the dropout ratio in each such layer set to around 50%.
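For anyone unsure what a dropout layer actually does, here is a minimal pure-Python sketch of the common "inverted dropout" variant. It is an illustration of the idea only, not the code DIGITS or Caffe runs; in a Caffe prototxt you would add a Dropout layer with a dropout ratio instead. The function name and arguments are hypothetical.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and scale the survivors by 1 / (1 - rate) so the
    expected value stays the same. At inference time, pass values through."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


features = [1.0, 2.0, 3.0, 4.0]
print(dropout(features, rate=0.5, rng=random.Random(0)))  # some values zeroed
print(dropout(features, training=False))                  # unchanged
```

Because a different random subset of activations is dropped on every pass, the network cannot rely on any single feature, which is what makes it harder to memorize a small training set.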