Very low precision while training detectnet_v2 model using custom data in TAO

Hi, I am trying to train a detectnet_v2 model for object detection using custom data, but I am getting a very low mAP = 25.2992.
One factor worth noting: unlike the demo dataset, where each image contains only a single class, some of my images have multiple classes in a single image.
Please find my training config file in the attachments.

• Hardware GeForce RTX 4070 Ti
• Network Type (Detectnet_v2)
• TLT Version (format_version: 2.0, toolkit_version: 4.0.1)
• Training spec file(
detectnet_v2_train_resnet18_kitti.txt (8.6 KB)
Any help will be appreciated.

How about the average resolution of the objects in your training images? Are they small?
And are all the training images the same resolution?

Well, some images have objects which are small (by average resolution I assume you mean the area covered by the object in the image), while some are big; overall I would say the dataset is a mix of small and large objects (not entirely sure though). The dataset is a combination of many live videos from different areas of India; some images are from open datasets, and videos with different camera angles are also used.
Hope this information is useful.
Also, when I trained a custom YOLOv5 model on this dataset, I got an accuracy of around 80-85%.

Since the training images are not all the same resolution, for detectnet_v2, refer to DetectNet_v2 - NVIDIA Docs and set the enable_auto_resize parameter to true in the augmentation_config module of the spec file.
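For reference, the relevant part of the spec might look like the sketch below (the 640x640 output dimensions and 3-channel setting are assumptions based on this thread, not values taken from your attached spec):

```
augmentation_config {
  preprocessing {
    output_image_width: 640
    output_image_height: 640
    output_image_channel: 3
    # Resize images (and labels) on the fly to the output dimensions,
    # so no offline resizing is needed.
    enable_auto_resize: true
  }
}
```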

And since there are small objects, refer to Frequently Asked Questions - NVIDIA Docs:

In DetectNet_V2, are there any parameters that can help improve AP (average precision) on training small objects?

The following parameters can help you improve AP on smaller objects:

  • Increase num_layers of resnet
  • class_weight for small objects
  • Increase the coverage_radius_x and coverage_radius_y parameters of the bbox_rasterizer_config section for the small objects class
  • Decrease minimum_detection_ground_truth_overlap
  • Lower minimum_height to cover more small objects for evaluation.
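Applied to the spec file, these suggestions might look like the following sketch. The class name "motorbike" and all numeric values here are illustrative assumptions for a small-object class, not tuned recommendations:

```
bbox_rasterizer_config {
  target_class_config {
    key: "motorbike"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0    # increased coverage radius for the small-object class
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
}
evaluation_config {
  minimum_detection_ground_truth_overlap {
    key: "motorbike"
    value: 0.5             # decreased overlap threshold
  }
  evaluation_box_config {
    key: "motorbike"
    value {
      minimum_height: 4    # lowered to cover more small objects in evaluation
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}
```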

Well, all my images are resized to 640 x 640 pixels as mentioned in the document, then converted to KITTI format and finally to TFRecords.
Also, as you mentioned, if I tweak these parameters, will it give wrong results for big objects of the same class in other images?
Also, while visualising I saw some images with only noisy pixels in them; can you please help me understand why that is?
Please check these two images that I have added.


Some other images as well:

Can you run "tao detectnet_v2 inference" and share the result?
Additionally, you can try the experiments below.

  • Use a deeper backbone, like resnet50
  • Train only one class and check the difference.

Also, you mentioned the images were "resized to 640 x 640 pixels"; did you resize the labels accordingly?
Actually, you do not need to resize offline; you can train on the original images/labels by setting enable_auto_resize: true.
Also note that detectnet_v2 may not be able to detect objects that are smaller than 16x16 pixels.
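For reference, since KITTI labels store pixel coordinates, resizing an image means the four bbox fields of each label line must be scaled by the same factors. A minimal sketch of an offline label rescale (the function name and the sample label line are illustrative, not from TAO):

```python
def rescale_kitti_line(line, orig_w, orig_h, new_w=640, new_h=640):
    """Scale the bbox fields (columns 4-7) of one KITTI label line
    to match an image resized from (orig_w, orig_h) to (new_w, new_h)."""
    fields = line.split()
    sx, sy = new_w / orig_w, new_h / orig_h
    # KITTI format: class, truncated, occluded, alpha, x1, y1, x2, y2, ...
    x1, y1, x2, y2 = (float(v) for v in fields[4:8])
    fields[4:8] = [f"{x1 * sx:.2f}", f"{y1 * sy:.2f}",
                   f"{x2 * sx:.2f}", f"{y2 * sy:.2f}"]
    return " ".join(fields)

# Example: a 1280x720 frame resized to 640x640
label = "car 0.0 0 0.0 100.0 200.0 300.0 400.0 0 0 0 0 0 0 0"
print(rescale_kitti_line(label, 1280, 720))
# → car 0.0 0 0.0 50.00 177.78 150.00 355.56 0 0 0 0 0 0 0
```

With enable_auto_resize: true this is handled for you, so the snippet is only relevant if you keep the offline-resize workflow.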

I did resize the labels accordingly.
I am trying the vgg19 backbone now; should I use resnet50 instead?
Sorry, I did not run inference and have deleted the notebook, but I have wandb (Weights & Biases) reports from training and retraining the pruned model, which I will share with you.
detectnet_v2_resnet18_retrain_pruned_model.pdf (13.9 MB)
detectnet_v2_resnet18_train.pdf (12.7 MB)

It is fine to go ahead with vgg19.


Results after 60 epochs:
Validation cost: 0.000395
Mean average precision (in %): 23.7055

class name    average precision (in %)
bus           11.6197
car           44.7707
motorbike     16.605
rickshaw      41.9241
truck         3.60784

My dataset contains 6098 training images and 744 test images; is that a very small dataset?

After revisiting the images, most of the objects have a resolution lower than 50x50. Also, there are different data distributions; for example, some images have a night background and some a daylight background, etc. For training these small objects across these different scenes, you can follow the guide mentioned above. Or you can train with YOLOv4_tiny instead.

Here I have taken VGG19 as the backbone. Evaluation results for the trained model before pruning are:
Validation cost: 0.001066
Mean average precision (in %): 29.0707

class name    average precision (in %)
bus           19.3519
car           47.8369
motorbike     19.6476
rickshaw      47.7149
truck         10.8022

Should I go ahead with pruning the model, which at this accuracy does not make sense to me?
I am sharing a few images from my dataset (size 640x640), which I think may help you better understand the scenario. (8.4 MB)

Now here are some questions I would like to ask.

  • Please have a look at my train config file.
    detectnet_v2_train_resnet18_kitti.txt (8.6 KB)

  • Should I continue with the VGG19 backbone?

  • If yes, can I increase the number of layers from 19 to something drastically higher, say 80-90?

  • Should I go with the RESNET50 backbone as suggested earlier, since it has more layers, and increase the layers further?

  • Should I go for YOLOv4 or YOLOv4_tiny as suggested? If yes, please elaborate more on the same.

I had selected detectnet_v2 because it is NVIDIA's own model, and since I am going to deploy my model to DeepStream on a Jetson Xavier NX for four IP cameras installed at traffic signals, I need higher FPS with better accuracy. Currently I use Deepsort-YOLOv5, which has around 85% accuracy, but the application is very slow, with a lot of frame drops and low FPS; I am sure using DeepStream will address these issues.

From the dataset, the scenario indeed varies a lot. I suggest you go with YOLOv4_tiny.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.