Trained DetectNet_v2 on the Pascal VOC dataset, but the accuracy is low

I trained DetectNet_v2 (ResNet-18) on the Pascal VOC 2012 dataset, but the accuracy is low.
Is there any way to improve the accuracy?

The training results are as follows:

Validation cost: 0.000081
Mean average_precision (in %): 9.7308

class name      average precision (in %)
------------  --------------------------
aeroplane                     27.6491
bicycle                        4.44473
bird                           0.0382677
boat                           0.357756
bottle                         2.17242
bus                           32.6177
car                           10.8047
cat                           30.4376
chair                          1.27091
cow                            0.566813
diningtable                    1.32209
dog                           17.4295
horse                          1.16366
motorbike                     14.0428
person                        35.1342
pottedplant                    0
sheep                          2.62056
sofa                           4.63491
train                          7.70868
tvmonitor                      0.199589

Median Inference Time: 0.008577
2022-05-30 07:16:02,292 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.741
2022-05-30 07:16:03,093 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.741
Time taken to run iva.detectnet_v2.scripts.train:main: 3:03:38.935997.

• Training spec file
detectnet_v2_train_resnet18_voc.txt (23.4 KB)

• TLT Version
tlt-streamanalytics:v2.0_py3

• Network Type
Detectnet_v2(resnet18)

output_image_width: 512
output_image_height: 400

Did you resize the training images/labels to 512x400?

No, I did not resize them.

The documentation says:

If the output image height and the output image width of the preprocessing block doesn’t match with the dimensions of the input image, the dataloader either pads with zeros, or crops to fit to the output resolution. It does not resize the input images and labels to fit.

If the input image is not sized correctly, the dataloader either crops it or pads it with zeros, right?
Therefore, we did not think it was necessary to resize all the images in advance.

I apologize if my understanding is incorrect.

For the DetectNet_v2 network, the images/labels need to be resized offline.

See DetectNet_v2 — TAO Toolkit 3.22.02 documentation:

The train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly.

I'll resize the images to the same resolution in advance and try training again.

Do I need to resize the validation dataset in the same way?

Yes, it is needed.
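For reference, resizing offline means scaling both the images and their bounding-box labels by the same factors. A minimal sketch of the label side (the `scale_bbox` helper is hypothetical, not part of TAO; the image itself can be resized with e.g. Pillow's `Image.resize`):

```python
def scale_bbox(bbox, orig_size, target_size):
    """Scale an (xmin, ymin, xmax, ymax) box from an image of
    orig_size (w, h) to one resized to target_size (w, h)."""
    sx = target_size[0] / orig_size[0]
    sy = target_size[1] / orig_size[1]
    xmin, ymin, xmax, ymax = bbox
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


# Example: a box in a 1000x800 image, after resizing the image to 500x400
print(scale_bbox((10, 20, 110, 220), (1000, 800), (500, 400)))
# -> (5.0, 10.0, 55.0, 110.0)
```

Every box in every label file must be rescaled with the same per-image factors used to resize that image.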

I resized the Pascal VOC images to 496x400 in advance and retrained, but the accuracy is still low.

Validation cost: 0.000105
Mean average_precision (in %): 21.7120

class name      average precision (in %)
------------  --------------------------
aeroplane                       40.2799
bicycle                          1.86966
bird                            26.0981
boat                            17.6851
bottle                           4.11341
bus                             52.4331
car                             10.6682
cat                             33.5806
chair                           18.1309
cow                              9.06108
diningtable                      6.90725
dog                             40.0285
horse                           33.3164
motorbike                       14.472
person                          44.0251
pottedplant                     11.3704
sheep                            3.42361
sofa                            13.4291
train                           53.3468
tvmonitor                        0

Median Inference Time: 0.005719
2022-06-01 08:29:52,196 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.680
2022-06-01 08:29:53,013 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.680
Time taken to run iva.detectnet_v2.scripts.train:main: 2:54:01.435409.

• Training spec file
detectnet_v2_train_resnet18_voc.txt (23.4 KB)

Is there anything I can do to improve accuracy?

  1. If you resized to 496x400, please check the resolution of each object. DetectNet_v2 may not be able to detect objects that are smaller than 16x16 pixels.
  2. See Frequently Asked Questions — TAO Toolkit 3.22.05 documentation.

The following parameters can help you improve AP on smaller objects:

  • Increase num_layers of the ResNet backbone
  • Increase class_weight for the small-object classes
  • Increase the coverage_radius_x and coverage_radius_y parameters of the bbox_rasterizer_config section for the small-object classes
  • Decrease minimum_detection_ground_truth_overlap
  • Lower minimum_height to cover more small objects in evaluation

  3. Alternatively, try the yolo_v4_tiny network instead.
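Since objects smaller than 16x16 pixels may be missed, it can help to count how many ground-truth boxes fall below that size after resizing. A rough sketch (the sample boxes are made up for illustration):

```python
def count_small_boxes(boxes, min_size=16):
    """Count (xmin, ymin, xmax, ymax) boxes whose width or height
    is below min_size pixels."""
    return sum(
        1 for (xmin, ymin, xmax, ymax) in boxes
        if (xmax - xmin) < min_size or (ymax - ymin) < min_size
    )


# Boxes already expressed at the training resolution (e.g. 496x400)
boxes = [(0, 0, 10, 40), (100, 100, 200, 180), (30, 30, 45, 44)]
print(count_small_boxes(boxes))  # -> 2
```

If a large fraction of a class's boxes is under the threshold, that class is a likely candidate for the small-object tuning above.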

Thank you very much.
I will try 1 and 2 first.
By the way, have you ever achieved high accuracy when training Pascal VOC with DetectNet_v2?

No, we do not have a baseline for Pascal VOC with DetectNet_v2.

Thank you very much.
Let me ask one more question.

For the tvmonitor class, the AP is still 0. What could be the cause?
I checked the bbox sizes, and both width and height are about 130 px on average.
There are also 412 tvmonitor labels in the training set.
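The average size was computed with a simple script along these lines (the sample boxes below are illustrative only, not actual VOC data):

```python
def mean_box_size(boxes):
    """Return the mean (width, height) of (xmin, ymin, xmax, ymax) boxes."""
    n = len(boxes)
    w = sum(xmax - xmin for (xmin, _, xmax, _) in boxes) / n
    h = sum(ymax - ymin for (_, ymin, _, ymax) in boxes) / n
    return (w, h)


# tvmonitor boxes collected from the resized labels (illustrative values)
print(mean_box_size([(0, 0, 120, 130), (10, 10, 150, 140)]))  # -> (130.0, 130.0)
```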

Could you try to set up another experiment that trains only one class: tvmonitor?

When we trained only the tvmonitor class, the accuracy increased.

Validation cost: 0.000035
Mean average_precision (in %): 34.6166

class name      average precision (in %)
------------  --------------------------
tvmonitor                        34.6166

Median Inference Time: 0.003489
2022-06-07 02:00:03,813 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 6.655
2022-06-07 02:00:04,613 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 6.655
Time taken to run iva.detectnet_v2.scripts.train:main: 1:46:26.249857.

Besides the above suggestions, please consider the following further experiments:

  1. Try a larger backbone, for example resnet50 or vgg19.
  2. Set minimum_bounding_box_height to 1.
  3. Set all minimum_height and minimum_width values to 20.
  4. Set the same bbox objective weights for all classes:

       name: "bbox"
       initial_weight: 10.0
       weight_target: 10.0

  5. VOC is an imbalanced dataset; see Frequently Asked Questions — TAO Toolkit 3.22.05 documentation, "How do I balance the weight between classes if the dataset has significantly higher samples for one class versus another?"

To account for the imbalance, increase the class_weight for classes with fewer samples. You can also try disabling enable_autoweighting; in that case initial_weight is used to control the cov/regression weighting. It is important to keep the number of samples of the different classes balanced, which helps improve mAP.

  6. Try to fine-tune the batch size, for example 8 or 4.

  7. Try to fine-tune the learning rate, for example max_lr: 1.25e-4, min_lr: 1.25e-5.
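One common heuristic for picking class_weight values (a sketch, not an official TAO formula) is to weight each class inversely to its sample count, normalized so the most frequent class gets weight 1.0:

```python
def inverse_frequency_weights(counts):
    """Map class name -> weight, inversely proportional to sample count,
    normalized so the most frequent class has weight 1.0."""
    max_count = max(counts.values())
    return {cls: max_count / n for cls, n in counts.items()}


# Illustrative label counts, not real VOC statistics
counts = {"person": 4000, "tvmonitor": 400, "sheep": 800}
print(inverse_frequency_weights(counts))
# -> {'person': 1.0, 'tvmonitor': 10.0, 'sheep': 5.0}
```

The resulting values can be used as starting points for class_weight in the cost_function_config and then refined experimentally.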