Constructing a balanced dataset when the target classes are imbalanced in every image

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Ubuntu 20, RTX3090.
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Say there are 2 classes in my dataset that need to be detected: Apple and Bottle.
The problem is that nearly every image in my dataset contains many Apples and just 1 Bottle, like below:

This is a typical imbalanced dataset, and the imbalance can’t be avoided even by adding more data.
I want to know whether it is possible to construct a balanced dataset from this one by:

  • Is it reasonable to manually omit the labels for some of the Apples?
    I doubt it, since the omitted objects would still be proposed during the validation step of training, which would greatly affect the accuracy assessment.

  • Cropping or dimming the original images to exclude some Apples?
    The images would then look like:

What is the target of your model? To detect only the bottle, right?

Hi Morgan, I updated my post, please take a look.

You can run detectnet_v2 network with the existing dataset.

To account for imbalance, increase the class_weight for classes with fewer samples. You can also try disabling enable_autoweighting; in this case initial_weight is used to control cov/regression weighting. (Frequently Asked Questions - NVIDIA Docs)
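As a rough illustration of choosing those weights, the sketch below counts objects per class in KITTI-format label lines and derives a weight for each class. The inverse-frequency rule and the helper names (`count_instances`, `inverse_frequency_weights`) are assumptions made for this example, not something the docs prescribe; treat the resulting numbers as a starting point to tune, and copy them into the `class_weight` fields of the spec file by hand.

```python
from collections import Counter


def count_instances(label_lines):
    """Count objects per class; the class name is the first field of a KITTI label line."""
    counts = Counter()
    for line in label_lines:
        fields = line.split()
        if fields:
            counts[fields[0].lower()] += 1
    return counts


def inverse_frequency_weights(counts):
    """Hypothetical heuristic: weight = largest class count / this class's count."""
    largest = max(counts.values())
    return {name: largest / n for name, n in counts.items()}


# Toy label file contents: 8 apples and 1 bottle, as in a typical image from the question.
apple = "apple 0.00 0 0.00 10.00 10.00 50.00 50.00 0 0 0 0 0 0 0"
bottle = "bottle 0.00 0 0.00 60.00 10.00 90.00 80.00 0 0 0 0 0 0 0"
labels = [apple] * 8 + [bottle]

counts = count_instances(labels)
weights = inverse_frequency_weights(counts)
for name in sorted(weights):
    print(f"{name}: {counts[name]} objects, suggested class_weight {weights[name]:.1f}")
```

In practice the counts would be accumulated over every label file in the training set rather than a single image.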

You can also try to run yolov4_tiny network.

Regarding omitting the labels for some Apples: it is possible to do that.

Regarding cropping or dimming the images: it is not suggested, since this will destroy the data distribution. The images will no longer look like real images.

When images like this are used for training, some of them will be split off for validation. As I understand it, as the model's accuracy improves, the unlabelled Apple areas will be proposed as detected objects. Since no ground-truth label exists for those areas, the proposals will be counted as false positives. Does this matter?
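To make the concern concrete, here is a minimal sketch of how precision drops when correct detections have no matching label. It uses a simple greedy IoU match with a 0.5 threshold; that matching rule and the function names are assumptions for illustration and are not the evaluation code that detectnet_v2 actually runs.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision(detections, ground_truth, iou_threshold=0.5):
    """Greedily match each detection to one unused ground-truth box; unmatched ones are FPs."""
    matched = set()
    true_positives = 0
    for det in detections:
        best_idx, best_iou = None, iou_threshold
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            score = iou(det, gt)
            if score >= best_iou:
                best_idx, best_iou = idx, score
        if best_idx is not None:
            matched.add(best_idx)
            true_positives += 1
    return true_positives / len(detections) if detections else 0.0


# Ten real apples in one image; the model finds all ten perfectly.
apples = [(i * 20, 0, i * 20 + 10, 10) for i in range(10)]

fully_labelled = precision(apples, apples)        # every detection has a label
partly_labelled = precision(apples, apples[:3])   # only 3 of the 10 apples were labelled

print(f"all apples labelled: precision = {fully_labelled:.2f}")
print(f"3 of 10 labelled:    precision = {partly_labelled:.2f}")
```

With only 3 of the 10 apples labelled, the 7 other correct detections are scored as false positives even though the model made no mistake.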

Yes, it matters. To avoid this issue, please make sure all the apples in the validation dataset are labeled. You can create a separate folder for the validation dataset. In the validation dataset, do not omit the labels for any apples.
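One way to follow this advice is sketched below: Apple labels are thinned out only in the training split, while validation labels are copied unchanged. It assumes KITTI-format label lines held in memory as a dict; the function names, the `keep_fraction` parameter, and the image IDs are hypothetical and exist only for this example.

```python
import random


def thin_class(label_lines, class_name, keep_fraction, seed=0):
    """Randomly keep about keep_fraction of the lines for class_name; keep all other lines."""
    rng = random.Random(seed)
    kept = []
    for line in label_lines:
        fields = line.split()
        is_target = bool(fields) and fields[0].lower() == class_name.lower()
        if is_target and rng.random() >= keep_fraction:
            continue  # drop this label from the training set only
        kept.append(line)
    return kept


def prepare_split(samples, val_ids, class_name="apple", keep_fraction=0.3):
    """samples maps image_id -> label lines. Validation images keep every label."""
    train, val = {}, {}
    for image_id, lines in samples.items():
        if image_id in val_ids:
            val[image_id] = list(lines)
        else:
            train[image_id] = thin_class(lines, class_name, keep_fraction)
    return train, val


apple = "apple 0.00 0 0.00 10.00 10.00 50.00 50.00 0 0 0 0 0 0 0"
bottle = "bottle 0.00 0 0.00 60.00 10.00 90.00 80.00 0 0 0 0 0 0 0"
samples = {
    "img1": [apple] * 8 + [bottle],
    "img2": [apple] * 6 + [bottle],
}

train, val = prepare_split(samples, val_ids={"img2"})
print("train img1 labels:", len(train["img1"]), "of", len(samples["img1"]))
print("val img2 labels:  ", len(val["img2"]), "of", len(samples["img2"]))
```

Note that apples left unlabelled in training images are treated as background during training, so thinning too aggressively may hurt Apple recall.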

BTW, you can use Omniverse or Isaac to generate a large amount of similar data and train a pretrained model on it, then fine-tune that model on your own dataset.
