Hi,
I would like to train an SSD network on several classes such as “Human body, Baseball bat, Knife, Hammer”. However, the Human body class is very large: it accounts for about 97% of all my data. How can I limit the number of images per class, so that I keep as many Baseball bat, Knife and Hammer images as possible while limiting the number of Human body images? The --max-images option caps the total number of images for the whole multi-class dataset, so it doesn’t solve my problem.
Thanks.
T.
Hi @theo17300, I just updated the open_images_downloader.py script in commit 8ed84 to accept a new argument, --max-annotations-per-class. This limits each class to the specified number of bounding boxes; if a class has fewer than that number available, all of its annotations/images are used.
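To illustrate the idea of a per-class cap on bounding boxes, here is a minimal Python sketch. It is not the code from open_images_downloader.py: the function name limit_annotations_per_class and the (image_id, class_name) tuple layout are hypothetical, chosen only for this example, and the real script may select which boxes to keep differently.

```python
from collections import defaultdict

def limit_annotations_per_class(annotations, max_per_class):
    """Keep at most max_per_class boxes per class, preserving input order.

    annotations: list of (image_id, class_name) tuples, one per bounding box.
    (Hypothetical layout, used only for this illustration.)
    """
    kept = []
    counts = defaultdict(int)
    for image_id, class_name in annotations:
        # Classes with fewer boxes than the cap keep all of them.
        if counts[class_name] < max_per_class:
            counts[class_name] += 1
            kept.append((image_id, class_name))
    return kept

annotations = [
    ("img1", "Human body"), ("img1", "Human body"), ("img2", "Human body"),
    ("img2", "Knife"), ("img3", "Human body"), ("img3", "Hammer"),
]
limited = limit_annotations_per_class(annotations, max_per_class=2)

# Images that still have at least one box after limiting.
image_ids = {image_id for image_id, _ in limited}
print(len(limited), sorted(image_ids))
```

In this toy input, the Human body boxes are cut from four to two, while the single Knife and Hammer boxes are kept.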
This is the stock data without --max-annotations-per-class applied:
$ python3 open_images_downloader.py --class-names "Human body, Baseball bat, Knife, Hammer" --stats-only
-------------------------------------
'train' set statistics
-------------------------------------
Image count: 45484
Bounding box count: 169801
Bounding box distribution:
Human body: 167639/169801 = 0.99
Baseball bat: 1204/169801 = 0.01
Knife: 834/169801 = 0.00
Hammer: 124/169801 = 0.00
-------------------------------------
'validation' set statistics
-------------------------------------
Image count: 3847
Bounding box count: 6314
Bounding box distribution:
Human body: 6217/6314 = 0.98
Knife: 77/6314 = 0.01
Baseball bat: 19/6314 = 0.00
Hammer: 1/6314 = 0.00
-------------------------------------
'test' set statistics
-------------------------------------
Image count: 11627
Bounding box count: 19027
Bounding box distribution:
Human body: 18763/19027 = 0.99
Knife: 216/19027 = 0.01
Baseball bat: 48/19027 = 0.00
-------------------------------------
Overall statistics
-------------------------------------
Image count: 60958
Bounding box count: 195142
And these are the results using --max-annotations-per-class=10000:
$ python3 open_images_downloader.py --data data/body_bat --class-names "Human body, Baseball bat, Knife, Hammer" --stats-only --max-annotations-per-class 10000
2021-05-21 14:05:34 - Limiting 'Human body' in train dataset to: 10000 boxes (7985 images)
2021-05-21 14:05:34 - Limiting 'Human body' in test dataset to: 10000 boxes (7207 images)
2021-05-21 14:05:34 - Total images after limiting annotations per-class: 21031
2021-05-21 14:05:34 - Total boxes after limiting annotations per-class: 28740
-------------------------------------
'train' set statistics
-------------------------------------
Image count: 9771
Bounding box count: 12162
Bounding box distribution:
Human body: 10000/12162 = 0.82
Baseball bat: 1204/12162 = 0.10
Knife: 834/12162 = 0.07
Hammer: 124/12162 = 0.01
-------------------------------------
'validation' set statistics
-------------------------------------
Image count: 3851
Bounding box count: 6314
Bounding box distribution:
Human body: 6217/6314 = 0.98
Knife: 77/6314 = 0.01
Baseball bat: 19/6314 = 0.00
Hammer: 1/6314 = 0.00
-------------------------------------
'test' set statistics
-------------------------------------
Image count: 7409
Bounding box count: 10264
Bounding box distribution:
Human body: 10000/10264 = 0.97
Knife: 216/10264 = 0.02
Baseball bat: 48/10264 = 0.00
-------------------------------------
Overall statistics
-------------------------------------
Image count: 21031
Bounding box count: 28740
So the Human body class was limited to 10,000 annotations, while all the other classes used every annotation available. Note that train_ssd.py also has a --balance-data option, which undersamples the more frequent labels and can help with training.
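For readers unfamiliar with undersampling, the following Python sketch shows one common way to do it: sample the same number of images for every class, using the size of the rarest class as the target. This is an assumption-laden illustration, not the implementation behind --balance-data in train_ssd.py; the function name undersample_images and the dictionary layout are hypothetical.

```python
import random
from collections import defaultdict

def undersample_images(image_labels, seed=0):
    """Sample an equal number of images per class (the rarest class's count).

    image_labels: dict mapping image_id -> set of class names in that image.
    (Hypothetical layout, used only for this illustration.)
    """
    # Group image ids by the classes they contain.
    images_by_class = defaultdict(list)
    for image_id, labels in image_labels.items():
        for label in labels:
            images_by_class[label].append(image_id)

    # The rarest class determines how many images each class contributes.
    target = min(len(ids) for ids in images_by_class.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = set()
    for ids in images_by_class.values():
        sampled.update(rng.sample(ids, target))
    return sorted(sampled)

image_labels = {f"img{i}": {"Human body"} for i in range(1, 7)}
image_labels.update({"img7": {"Knife"}, "img8": {"Knife"}, "img9": {"Hammer"}})

balanced = undersample_images(image_labels)
print(balanced)
```

With this toy input, Hammer has a single image, so each class contributes one image and the balanced set has three images. The trade-off is that most of the majority-class data is discarded.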
Good idea to add this parameter! Thank you so much @dusty_nv.