This is a general deep learning question, and the answer depends on many factors. There is no exact number of images that guarantees good accuracy on a given dataset. More training data generally helps, but it also increases training time. I suggest training on part of your dataset first to tune the hyper-parameters, then increasing the dataset size to improve the mAP further.
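One possible way to set up the "start with part of the dataset, then grow it" workflow is to draw nested subsets with a fixed seed, so every larger subset contains the smaller one and the runs stay comparable. This is only a rough sketch in plain Python and is not part of TLT; the function name `nested_subsets`, the fractions, and the image file names are all hypothetical placeholders.

```python
import random


def nested_subsets(items, fractions, seed=0):
    """Return one subset per fraction, smallest first.

    The items are shuffled once with a fixed seed, and each subset is a
    prefix of that shuffled list, so a larger subset always contains
    every sample from the smaller ones.
    """
    for frac in fractions:
        if not 0 < frac <= 1:
            raise ValueError(f"fraction must be in (0, 1], got {frac}")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    subsets = []
    for frac in sorted(fractions):
        count = max(1, round(frac * len(shuffled)))
        subsets.append(shuffled[:count])
    return subsets


# Hypothetical image list; replace with the file names of your own dataset.
image_ids = [f"img_{i:05d}.jpg" for i in range(1000)]

# Assumed fractions: tune hyper-parameters on 10%, then retrain on 50% and 100%.
small, medium, full = nested_subsets(image_ids, [0.1, 0.5, 1.0])
print(len(small), len(medium), len(full))
```

You would tune the hyper-parameters on `small`, then retrain with the same settings on `medium` and `full` and check whether the mAP is still improving. If it has stopped improving, adding more images of the same kind is unlikely to help much.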
For reference on TLT training data, see the models below.
PeopleNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_peoplenet
TrafficCamNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_trafficcamnet
DashCamNet - https://ngc.nvidia.com/catalog/models/nvidia:tlt_dashcamnet
FaceDetectIR - https://ngc.nvidia.com/catalog/models/nvidia:tlt_facedetectir