Dataset Practices

This is a general deep learning question, and the answer depends on many factors. There is no exact number of images that guarantees good accuracy on a given dataset. More training data generally helps, but it also increases training time. We suggest training on part of your dataset first in order to tune the hyper-parameters, then increasing the dataset size to improve the mAP further.
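As a rough illustration of the "train on part of the dataset first" suggestion, here is a minimal Python sketch that draws a reproducible random subset of image paths. It is not part of TLT; the function name `sample_subset`, the file names, and the fractions are hypothetical and only for illustration. Because the same seed is reused, a smaller subset is always contained in a larger one, so growing the dataset later keeps the images you already tuned on.

```python
import random


def sample_subset(items, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of `items`.

    Hypothetical helper, not a TLT API. With a fixed seed the full list is
    shuffled the same way every time, so subsets for increasing fractions
    are nested (each one is a prefix of the next).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    count = max(1, round(len(shuffled) * fraction))
    return shuffled[:count]


# Hypothetical file names standing in for a real image list.
all_images = [f"image_{i:05d}.jpg" for i in range(1000)]

# Tune hyper-parameters on a small subset, then grow the training set.
tuning_set = sample_subset(all_images, 0.1)
larger_set = sample_subset(all_images, 0.5)
```

You would still need to select the matching label files and convert the subset to whatever format your training pipeline expects; that step is left out here.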

For reference on the training data used for TLT models, see the pages below.
PeopleNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet

TrafficCamNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_trafficcamnet

DashCamNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_dashcamnet

FaceDetectIR - https://ngc.nvidia.com/catalog/models/nvidia:tlt_facedetectir