Overlapping boxes after Faster RCNN Inference

Hi!

I followed the TLT 3.0 notebook for Faster RCNN, and I have an issue with the inference boxes: multiple overlapping boxes are produced for the same object. Here are some results:




As you can see, my goal is to properly detect moths in the image. All these moths belong to the same large family, Noctuidae.
Instead of using the data suggested in the notebook, I only used my own data, annotated manually in KITTI format (3,500 images). Here are my config files:
default_spec_resnet18_retrain_spec_custom.txt (4.2 KB)
default_spec_resnet18_custom.txt (3.9 KB)
Here is an example of the KITTI format and its photo used for the training :
0-noctuidae.txt (57 Bytes)


and here is the result after inference :
0-noctuidae.txt (201 Bytes)

Here is the output of the evaluation after pruning:
100%|█████████████████████████████████████████| 465/465 [00:16<00:00, 28.43it/s]

Class       AP      precision  recall  RPN_recall
noctuidae   0.9784  0.1633     0.9808  0.9915

mAP@0.5 = 0.9784

We used ResNet-18, as in the notebook from the examples folder.
So my questions are:

  1. How can we get one big box instead of multiple small boxes?
  2. Is there a parameter (maybe a threshold for the inference boxes) that would help?
  3. As we can sometimes see, it looks like the model is only trying to detect wings instead of the whole insect. Is this because butterflies are symmetric, so one wing looks like a butterfly seen from the side?
  4. Should we use transfer learning with purpose-built models for this task? If so, which one?
  5. My goal is also to automate the labeling process, so I also fed already-annotated images from the training set into the test set, and it looks like the model tries to put more boxes than needed on these images. However, these additional boxes have lower confidence.
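Regarding questions 1, 2, and 5: one common workaround, independent of the TLT config, is to post-process the detections yourself, i.e. drop low-confidence boxes and then merge any remaining boxes that still overlap into a single enclosing box. Here is a minimal sketch of that idea; the function names, thresholds, and the `[x1, y1, x2, y2]` box format are my own assumptions, not part of the TLT output spec.

```python
# Post-processing sketch (hypothetical helpers, not part of TLT):
# 1) filter detections below a confidence threshold,
# 2) greedily merge boxes whose IoU exceeds a threshold into one
#    enclosing box, so several partial boxes become one big box.
# Boxes are assumed to be [x1, y1, x2, y2] in pixels.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def merge_overlapping(boxes, scores, conf_thresh=0.5, iou_thresh=0.3):
    """Drop boxes below conf_thresh, then union overlapping survivors."""
    kept = [b for b, s in zip(boxes, scores) if s >= conf_thresh]
    merged = []
    for box in kept:
        for i, m in enumerate(merged):
            if iou(box, m) >= iou_thresh:
                # Replace the stored box with the enclosing box of both.
                merged[i] = [min(box[0], m[0]), min(box[1], m[1]),
                             max(box[2], m[2]), max(box[3], m[3])]
                break
        else:
            merged.append(list(box))
    return merged
```

For example, two half-insect boxes `[0, 0, 10, 10]` and `[2, 2, 12, 12]` would be merged into one `[0, 0, 12, 12]` box, while a stray low-confidence detection is dropped by the confidence filter. The same confidence threshold also addresses the extra low-confidence boxes mentioned in question 5.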

Thanks for your answers !

From the training spec file, I can see that you want to train a 1248x384 network. First of all, did you resize all your training images/labels to this resolution (1248x384)?
Also, what is the average resolution of your training images? Please train a model whose input size is close to the average resolution of your images.
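If the images are resized offline, the KITTI labels must be scaled by the same factors, since the 2D bounding box is stored in absolute pixel coordinates. A minimal sketch of the label side, assuming the standard 15-column KITTI layout with the 2D bbox (left, top, right, bottom) at indices 4-7; the function name is hypothetical:

```python
# Scale the 2D bbox columns of KITTI label lines when an image of size
# (src_w, src_h) is resized to (dst_w, dst_h). Standard KITTI layout is
# assumed: 15 space-separated fields, bbox at indices 4-7.

def scale_kitti_labels(lines, src_w, src_h, dst_w=1248, dst_h=384):
    sx, sy = dst_w / src_w, dst_h / src_h
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) < 8:
            continue  # skip malformed lines
        # left/right scale with x, top/bottom scale with y
        for i, s in ((4, sx), (5, sy), (6, sx), (7, sy)):
            parts[i] = f"{float(parts[i]) * s:.2f}"
        out.append(" ".join(parts))
    return out
```

The image itself would be resized with the same (dst_w, dst_h) target, e.g. via `PIL.Image.resize`, so pixels and boxes stay aligned.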