Improving Poor Object Detection Model Accuracy

Hello everyone,

We’ve generated synthetic data using Isaac Sim and trained some DetectNetv2 object detection models, following the official tutorial here: Unfortunately, the models perform well on synthetic images but not as well as we had hoped on real-world data.

We have spent some time trying to improve our synthetic data by setting up the Unity scene and randomization parameters to better match the real-world scene, but we still have not had much success. Since the model is currently trained purely on synthetic data, we figure that training on a mix of real-world and synthetic data could improve its accuracy. However, we believe we don’t have enough real-world data to mix the datasets effectively. Our synthetic dataset size is 20000 per object, while our real-world dataset is small: 200 frames or fewer per object we want to detect. The sheer size of the synthetic dataset would likely dilute the influence of the real-world data. Due to overfitting concerns, we don’t want to train purely on real-world data either.
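To make the dilution concern concrete, here is a minimal sketch of the arithmetic. It also estimates how many times each real frame would have to be repeated in a mixed training list to reach a chosen share of the data. Repeating frames is only a stand-in for per-sample weighting, not something the training tool does for us, and the function names and the 20% target are hypothetical values chosen for illustration.

```python
import math
from fractions import Fraction


def real_fraction(n_synthetic, n_real, repeat=1):
    """Share of the mixed dataset that is real data when each real frame appears `repeat` times."""
    real_entries = n_real * repeat
    return real_entries / (n_synthetic + real_entries)


def repeats_for_target(n_synthetic, n_real, target_fraction):
    """Smallest whole number of copies per real frame that reaches `target_fraction`."""
    # Fraction avoids floating-point rounding pushing the result up by one.
    p = Fraction(str(target_fraction))
    if not 0 < p < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    exact = p * n_synthetic / ((1 - p) * n_real)
    return math.ceil(exact)


if __name__ == "__main__":
    # Figures from the post: 20000 synthetic and at most 200 real frames per object.
    print(f"real share without repeats: {real_fraction(20000, 200):.2%}")
    # 0.2 is an arbitrary illustrative target, not a recommendation.
    print("copies per real frame for a 20% share:", repeats_for_target(20000, 200, 0.2))
```

With the numbers above, the real frames make up about 1% of a plain mix, and each would need to appear 25 times to reach a 20% share. Heavy repetition of so few frames could bring back the overfitting concern, so this is a rough guide rather than a fix.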

For reference, here are some example values from the models we’ve trained:

  • Model Type: DetectNetv2 - ResNet18
  • Batch Size: 16
  • Epochs: 120
  • Dataset size: 20000
  • Final Validation Cost: 0.000026
  • mAP on Synthetic Data: 99.6%
  • mAP on Real Data: 23.4%
  • Training Tool Used: NVIDIA Transfer Learning Toolkit (TLT)

Next Steps
Do you have any advice on steps we can take to improve the accuracy of our object detection models? Are there any changes we can make on the synthetic data generation side? On the training side, we thought about including the small real-world dataset in the training data but increasing its impact on the training process, perhaps by modifying the loss function to weight real-world data more heavily than synthetic data. Is that something we could feasibly do using NVIDIA’s tools and models? Are there any other changes we should try on the training side?

Hi @erick,

Thank you for exercising the DetectNetv2 training pipeline with Isaac Sim. Based on the details you have provided, it seems you have already tried modifying the training scene to best match the real data used for inference (lighting conditions are very important here). One suggestion for error analysis is to find the failure cases and outliers on which the model performs poorly, and use these to guide changes to the training scene. Additionally, I’d suggest gathering data from multiple scenes (not just one) and adjusting the amount of training data per scene based on the error analysis.
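As a rough illustration of the error analysis described above, the sketch below ranks images by how many objects were missed and how many detections were spurious. It assumes you can export ground-truth and predicted boxes per image as (x1, y1, x2, y2) tuples. All function names are hypothetical, the 0.5 IoU threshold is an arbitrary choice, and class labels and confidence scores are ignored for brevity.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_errors(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Greedily match predictions to ground truth; return (missed, false_positives)."""
    unmatched = list(pred_boxes)
    missed = 0
    for gt in gt_boxes:
        best = max(unmatched, key=lambda p: iou(gt, p), default=None)
        if best is not None and iou(gt, best) >= iou_threshold:
            unmatched.remove(best)  # each prediction can explain one object only
        else:
            missed += 1
    return missed, len(unmatched)


def rank_failures(samples, iou_threshold=0.5):
    """samples maps image name -> (gt_boxes, pred_boxes); worst images come first."""
    rows = []
    for name, (gt_boxes, pred_boxes) in samples.items():
        missed, false_pos = count_errors(gt_boxes, pred_boxes, iou_threshold)
        rows.append((name, missed, false_pos))
    return sorted(rows, key=lambda r: r[1] + r[2], reverse=True)


if __name__ == "__main__":
    # Made-up boxes purely to show the call shape; not real results.
    demo = {
        "frame_a.png": ([(0, 0, 10, 10)], [(0, 0, 10, 10)]),
        "frame_b.png": ([(0, 0, 10, 10)], [(50, 50, 60, 60)]),
        "frame_c.png": ([], [(1, 1, 5, 5)]),
    }
    for name, missed, false_pos in rank_failures(demo):
        print(name, "missed:", missed, "false positives:", false_pos)
```

Looking through the top-ranked real images for shared traits (lighting, viewpoint, background clutter) is one way to decide what to change in the training scene.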

As for training on a small set of real data: this would definitely help, but per-sample importance weighting is currently not exposed through TLT (class-wise weighting, however, is available). If you are seeing a lot of false positives in the real domain, you could train on a set of real images that contain no objects of interest. This provides negative samples to the network without incurring any labeling cost, since you can simply supply empty label files to the training module.
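A minimal sketch of generating those empty label files is shown below. It assumes a layout with one .txt label file per image, sharing the image's base name; the directory names and the helper function are hypothetical, so adjust them to match your dataset.

```python
from pathlib import Path

# Assumed image extensions; extend if your real frames use another format.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def write_empty_labels(image_dir, label_dir):
    """Create an empty .txt label for every image that has no label file yet."""
    image_dir, label_dir = Path(image_dir), Path(label_dir)
    label_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        label_path = label_dir / (image_path.stem + ".txt")
        if label_path.exists():
            continue  # never overwrite an existing annotation
        label_path.touch()  # zero-byte file: no objects in this image
        created.append(label_path)
    return created


if __name__ == "__main__":
    # Hypothetical paths for the background-only real images.
    new_files = write_empty_labels("real_negatives/images", "real_negatives/labels")
    print(f"created {len(new_files)} empty label files")
```

Only run this on images you have confirmed contain none of the target objects; an empty label on an image that does contain one would teach the network to ignore it.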

Evaluation of object detection models will be a part of the next Isaac SDK release (coming soon), and will help to compute metrics and visualize outliers on both simulated and real data.