CVAT dataset split for train_ssd.py

gdefender · January 25, 2022, 5:10pm

I have been using CVAT to annotate data for train_ssd.py. I was wondering if I should worry about the train/test/val data sets. I don’t see a way to differentiate or split them in CVAT, so I assume I am getting one data set for all three.

Should I attempt to split the dataset with torch.utils.data.random_split ?

dusty_nv · January 25, 2022, 6:36pm

Hi @gdefender, if you are going for a production-quality model, then yes it may be prudent to have independent train/val/test sets for training. Typical use-cases for train_ssd.py are for testing & development (and educational learning) so it’s not a huge deal just to re-use the training set for those. Folks may find the TAO Toolkit good for training production-quality models.

Anyways, in Pascal VOC-style datasets all of the data is intermixed, and the train/val/test splits are dictated by the text files under ImageSets/. So you could make a little Python script that randomly generated these text files from the master ‘train.txt’. Or as you suggested, you could attempt to modify the train_ssd.py source to use torch.utils.data.random_split. I would probably do the first way to remain consistent with how Pascal VOC datasets are and to not have to do further debugging.
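
A minimal sketch of the "little Python script" approach described above, not a tested part of jetson-inference. It assumes the master list holds one image ID per line, that the split files are named train.txt, val.txt, test.txt and trainval.txt as in Pascal VOC, and that a 10%/10% val/test share is acceptable. The function names `split_ids` and `write_splits` and the path in the usage note are hypothetical.

```python
import random
from pathlib import Path


def split_ids(image_ids, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle image IDs and partition them into disjoint train/val/test lists."""
    ids = sorted(set(image_ids))        # de-duplicate; sort so the result is repeatable
    random.Random(seed).shuffle(ids)    # seeded shuffle: same seed gives the same split
    n_val = int(len(ids) * val_fraction)
    n_test = int(len(ids) * test_fraction)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }


def write_splits(master_file, out_dir, **split_kwargs):
    """Read one image ID per line from master_file and write the split files to out_dir."""
    lines = Path(master_file).read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    splits = split_ids(ids, **split_kwargs)

    # trainval.txt is the Pascal VOC convention for the union of train and val.
    splits["trainval"] = sorted(splits["train"] + splits["val"])

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, members in splits.items():
        # One ID per line; an empty split produces an empty file.
        (out_dir / f"{name}.txt").write_text("".join(f"{m}\n" for m in members))
    return splits
```

For example, `write_splits("ImageSets/Main/default.txt", "ImageSets/Main")` would read a CVAT-exported list and write the four split files next to it (the `ImageSets/Main` location is an assumption about the export layout). If the master list is itself named train.txt in the output directory, it gets overwritten, so keep a copy of it first.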

Hope that helps!

gdefender · January 26, 2022, 6:19am

Thanks for the fast response. I am an adult mentor working with a FIRST Robotics Competition high-school team. We are attempting to use jetson-inference to identify red and blue balls for our robot to pick up with a Jetson Nano. Our most powerful NVIDIA hardware is a Jetson TX2, so TAO is out for training at this time.
I am hopeful we will be able to get a reliable enough model with train_ssd.py for our application. We are having mixed success, but we’re making progress!

dusty_nv · January 26, 2022, 6:06pm

OK gotcha - in your situation, to be honest I would just use all of the training data that you have in the train set and not worry about the splits. This will let you use all of your custom-annotated data for training the model. CVAT will output a default.txt for the ImageSet and train_ssd.py knows to use this file for train/val/test.

Wish you and your team the best of luck this season in FIRST!
