Dataset Practices

My goal is to detect cardboard boxes, but boxes come in all shapes and sizes and are organized differently in each warehouse. If I train my model on only one environment of cardboard boxes, it will not generalize well.

Currently, I have training images from 6 different warehouses that treat cardboard boxes differently. My goal is to generalize well. How many more environments should I include to achieve that generalization? More specifically, TLT uses sequences, where each sequence describes a different video. How many of these sequences are generally used as good practice?

This is a general deep learning question, and the answer depends on many factors. We cannot give an exact number of images required to reach good accuracy on a dataset. More training data is generally better, but it costs more training time. I suggest training on part of your dataset first in order to tune the hyperparameters, then increasing the dataset to improve the mAP further.
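One way to pick the "part of your dataset" is to take the same fraction of images from every warehouse, so the small tuning set still covers all environments. The following is only a minimal sketch of that idea in plain Python. The function name `sample_subset`, the `warehouse_N` keys, and the image file names are hypothetical placeholders, not part of TLT, and the per-warehouse image count of 100 is made up for illustration.

```python
import random


def sample_subset(images_by_env, fraction, seed=0):
    """Pick the same fraction of images from every environment.

    images_by_env maps an environment name to a list of image paths.
    At least one image is kept per environment.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for env, images in images_by_env.items():
        k = max(1, round(len(images) * fraction))
        # Sort first so the result does not depend on input order.
        subset[env] = rng.sample(sorted(images), k)
    return subset


# Hypothetical layout: 6 warehouses with 100 placeholder image names each.
images_by_env = {
    f"warehouse_{i}": [f"warehouse_{i}/img_{j:04d}.jpg" for j in range(100)]
    for i in range(1, 7)
}

# Tune hyperparameters on a small slice, then grow it for later runs.
small = sample_subset(images_by_env, fraction=0.2)
larger = sample_subset(images_by_env, fraction=0.5)
```

The selected paths would still need to be converted to whatever format your TLT training spec expects. That step is not shown here.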

For reference on TLT training data, see the model pages below.
PeopleNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet

TrafficCamNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_trafficcamnet

DashCamNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_dashcamnet

FaceDetectIR - https://ngc.nvidia.com/catalog/models/nvidia:tlt_facedetectir