Retraining with imbalanced dataset

• Hardware (T4/V100/Xavier/Nano/etc)
Ubuntu 20.04.3 LTS, Intel x64, RTX3090.
• Network Type
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
tao info :

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I’m using TAO for transfer learning to detect several classes of objects in my scenario:

  • people
  • bicycle
  • custom door sticker

I believe enough people and bicycle images can be obtained from public datasets, but the custom door sticker images are collected by myself and will be very limited (say 1000 pictures with 1 target in each).

So my questions:

  1. Will an imbalanced dataset impact precision in transfer learning in TAO?
  2. Does TAO have built-in functions to help extract part of the data from a large public dataset to align with a small custom dataset?
    Say I have the PASCAL VOC dataset with 20 classes (or any other well-known dataset) and I want to extract only people and bicycle from it. Furthermore, for these 2 classes, only 1000 samples of each would be extracted and combined with my small custom dataset, to finally form a balanced dataset for retraining.


  1. Yes. For an imbalanced dataset with the detectnet_v2 network, refer to Frequently Asked Questions — TAO Toolkit 3.22.05 documentation:

Distribute the dataset class: How do I balance the weight between classes if the dataset has significantly higher samples for one class versus another?

To account for imbalance, increase the class_weight for classes with fewer samples. You can also try disabling enable_autoweighting; in this case initial_weight is used to control cov/regression weighting. It is important to keep the number of samples of different classes balanced, which helps improve mAP.
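As a starting point for choosing class_weight values, one common heuristic is to weight each class inversely to its sample count. The sketch below is only an illustration of that heuristic, not a rule taken from the TAO documentation; the function name, the cap of 10.0, and the sample counts are all hypothetical and would need tuning against validation mAP.

```python
def inverse_frequency_weights(counts, cap=10.0):
    """Weight each class by (largest class count / its own count), capped at `cap`.

    The most frequent class gets weight 1.0; rarer classes get larger weights.
    The cap is an assumption, added so a very rare class does not dominate the loss.
    """
    largest = max(counts.values())
    return {name: min(cap, largest / n) for name, n in counts.items()}


# Hypothetical object counts per class (not from the original post).
counts = {"people": 20000, "bicycle": 5000, "door_sticker": 1000}
weights = inverse_frequency_weights(counts)

for name, weight in weights.items():
    print(f"{name}: class_weight = {weight:.1f}")
```

The resulting numbers would then be entered by hand as the class_weight of each class in the detectnet_v2 training spec.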

  2. Yes. When you run "tao detectnet_v2 dataset_convert", it generates a set of tfrecords files. You can select some of them to combine with your small custom dataset.
  1. I'm still a bit confused.
    Was class_weight introduced for the imbalanced dataset scenario? If yes, what is the guideline for setting its value? In my case, I only have 1000 samples of private data, which is almost nothing compared to the public dataset.
    Will enable_autoweighting help in my case (very imbalanced), or do I have to manually balance the data before training?

  2. tfrecords are binary data, so how can I know which class a file is for? I only need to pick out people and bicycle from the 20 classes.

  1. Yes, increase the class_weight for classes with fewer samples. enable_autoweighting cannot help in very imbalanced cases.

  2. You can inspect each tfrecord file. You can refer to Tensor reshape error when evaluating a Detectnet_v2 model - #7 by Morganh
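An alternative to picking through binary tfrecords is to choose the subset before conversion, while the annotations are still readable. The sketch below assumes PASCAL VOC style XML annotations (one file per image, with an object/name element per labelled object); the function name select_balanced_subset and its parameters are hypothetical and are not part of TAO. It counts images rather than objects, and an image containing both classes can be picked for both.

```python
import random
import xml.etree.ElementTree as ET
from pathlib import Path


def select_balanced_subset(annotation_dir, wanted, per_class, seed=0):
    """Return {class_name: [annotation paths]} with at most `per_class` files per class.

    A file is a candidate for a class if it contains at least one object
    whose <name> equals that class.
    """
    by_class = {name: [] for name in wanted}
    for xml_path in sorted(Path(annotation_dir).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        # Collect every class name that appears in this annotation file.
        names = {obj.findtext("name") for obj in root.iter("object")}
        for name in wanted:
            if name in names:
                by_class[name].append(xml_path)

    # Fixed seed so the same subset is drawn on every run.
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(paths, min(per_class, len(paths))))
        for name, paths in by_class.items()
    }
```

For the case in this thread, a call might look like select_balanced_subset("VOC2012/Annotations", ["person", "bicycle"], per_class=1000), where the directory path is an assumption about the local layout. The selected files may still contain objects of other classes, which would have to be filtered out or handled separately before the subset is converted and combined with the custom data.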
