To train a customized DetectNet, ideally your training dataset would consist of several thousand annotated images. You should be able to detect pedestrian- or animal-sized objects like those in the tutorial without changing the detection sizes in the network prototxt, but for smaller objects such as faces or facial features, you may want to decrease those dimensions. You can consult the facenet model, which was trained on FDDB. I would recommend trying FDDB first, or this DIGITS pedestrian detection tutorial with KITTI.
I trained a DetectNet model to detect fish with roughly 3,000 images in the training set. During training, I got mAP as high as 78.6, and the resulting model seemed to work OK on new (previously unseen) images. Check out my blog posts if you are interested.
A rough rule of thumb is that you need about one training image per learned parameter in your network. If your network has a million parameters (many large convolution kernels with many output channels, and so on), that's likely to mean millions of images!
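If you want to see where your network stands relative to that rule of thumb, here is a minimal sketch that counts the learnable parameters in a Caffe model with pycaffe (assuming caffe is on your PYTHONPATH; the prototxt path below is just a placeholder for your own):

```python
import caffe

caffe.set_mode_cpu()
# Placeholder path -- point this at your own DetectNet prototxt.
net = caffe.Net('deploy.prototxt', caffe.TEST)

total = 0
for layer_name, blobs in net.params.items():
    # Each entry holds the learnable blobs for one layer
    # (e.g. weights and biases for a convolution layer).
    layer_count = sum(blob.data.size for blob in blobs)
    print('{:<25s} {:>12,d}'.format(layer_name, layer_count))
    total += layer_count

print('Total learnable parameters: {:,d}'.format(total))
```

Compare the total against the size of your training set to get a feel for how far into over-fitting territory you might be.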
You can of course get good results with fewer training images, but there is a significant risk of over-fitting in that case. Over-fitting is what is happening when training reaches a low loss but validation loss stays high: the network memorizes the training set instead of generalizing.
To reduce over-fitting, try adding a few dropout layers to your network during training, setting the dropout ratio in each such layer to perhaps 50% or so.
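For reference, here is a minimal sketch of what that looks like using pycaffe's NetSpec; the layer names and dimensions are made up for illustration, and you would splice the generated Dropout layer into your own train_val prototxt wherever it fits:

```python
from caffe import layers as L
from caffe import NetSpec

n = NetSpec()
# Hypothetical input shape and fully connected layer, just for illustration.
n.data = L.Input(shape=dict(dim=[1, 3, 384, 1248]))
n.fc6 = L.InnerProduct(n.data, num_output=512)
n.relu6 = L.ReLU(n.fc6, in_place=True)
# 50% dropout; Caffe only applies it during the TRAIN phase and rescales the
# surviving activations, so the layer is a pass-through at test time.
n.drop6 = L.Dropout(n.relu6, dropout_ratio=0.5, in_place=True)

# Emits the prototxt fragment you can paste into your network definition.
print(n.to_proto())
```

Using in_place=True keeps the layer from allocating an extra blob, which is the usual convention for ReLU and Dropout in Caffe networks.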