0 mAP after switching from TAO 4 to TAO 5

Hello! I am using TAO Toolkit 4.0.1 to fine-tune the peoplenet model. My dataset consists of images that contain people and an additional object that I want the model to learn to detect.

People in my dataset are labeled with YOLO, and the objects to be detected are labeled manually. Since the model is already pretrained on people, I reduced the “class_weight” for “person” to 0.1. Using TAO Toolkit 4.0.1, this allowed me to achieve mAP of about 90%.

I then decided to upgrade to a newer version, TAO Toolkit 5.0.0. Using the same config, I get an mAP of 0%. Next I tried setting the “class_weight” for “person” back to its initial value of 1.0: during training, person_AP = 70% after the first epoch and person_AP = 35% after the 5th epoch. The model is clearly degrading rather than learning. What could be the problem?

More about the dataset: the original images are 720x1280, and I convert them to 960x544 with padding (black stripes on the sides). This is necessary because it exactly matches the input that will be sent to the model in the future. I also tried generating the tfrecords for TAO Toolkit 5.0.0 separately, using the corresponding container.
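For reference, the padding step can be sketched roughly as below. This is only an illustration, not my exact preprocessing script: the function names are made up for this post, and it assumes the source images are 720 wide by 1280 tall (which is why the stripes end up on the sides) and that the padding is split evenly between left and right.

```python
def letterbox_params(src_w, src_h, dst_w=960, dst_h=544):
    """Return (scale, pad_left, pad_top) for an aspect-preserving resize.

    Hypothetical helper: the image is scaled to fit inside dst_w x dst_h
    and the remaining space is split evenly as black padding.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return scale, pad_left, pad_top


def letterbox_box(box, scale, pad_left, pad_top):
    """Map an (x1, y1, x2, y2) box from source pixels to padded pixels."""
    x1, y1, x2, y2 = box
    return (
        x1 * scale + pad_left,
        y1 * scale + pad_top,
        x2 * scale + pad_left,
        y2 * scale + pad_top,
    )


# Assumed source size: 720 (width) x 1280 (height).
scale, pad_left, pad_top = letterbox_params(720, 1280)
full_frame = letterbox_box((0, 0, 720, 1280), scale, pad_left, pad_top)
```

The bounding boxes in the label files have to go through the same scale and offset as the pixels, otherwise the annotations no longer line up with the padded images.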

The latest thing I am trying: I noticed that my label files contain integer values in several fields, i.e.: person 0.0 0 0 590.62 188.62 632.66 432.96 0 0 0 0 0 0 0 . I am now converting these files to the format: person 0.00 0 0.00 590.62 188.62 632.66 432.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00, hoping that this will help solve the problem.
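The conversion described above can be sketched like this. It is a minimal example and not an official TAO utility: the function name is hypothetical, and it assumes each line has the 15 KITTI-style fields shown in the example, with the occluded flag (third field) kept as an integer and every other numeric field written with two decimals.

```python
def normalize_kitti_line(line):
    """Rewrite one KITTI-style label line with two-decimal float fields.

    Hypothetical helper. Expected field layout (15 fields):
    name, truncated, occluded, alpha, 4 bbox values, 7 3D values.
    """
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}: {line!r}")

    name = fields[0]
    truncated = float(fields[1])
    occluded = int(float(fields[2]))  # stays an integer flag
    rest = [float(v) for v in fields[3:]]

    parts = [name, f"{truncated:.2f}", str(occluded)]
    parts += [f"{v:.2f}" for v in rest]
    return " ".join(parts)


old_line = "person 0.0 0 0 590.62 188.62 632.66 432.96 0 0 0 0 0 0 0"
new_line = normalize_kitti_line(old_line)
```

Applied to every line of every label file, this only changes how the numbers are written, not their values, so it tests whether the formatting alone matters.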

Any suggestions for other approaches would be welcome.

• Hardware - Ubuntu 24.04 LTS
• Network Type - Detectnet_v2
• TLT Version - Docker container: nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 / nvidia/tao/tao-toolkit:5.0.5-tf1.15.5
• Training spec file -
spec_train.txt (4.8 KB)

For detectnet_v2, please run with the 4.0.1 docker, since we have found a regression in 5.0. The internal team is still checking on it. It may stem from some of the autotune variables that are set in the DLFW version of TF we use in 5.0.0.

Thanks for the quick reply! I will continue my experiments with TAO 4.0.1. I will soon create a separate topic on another issue regarding inference.