Description
While performing transfer learning of the PeopleNet model on the COCO dataset using TAO Toolkit, several training-performance issues were encountered. They are summarised below.
Issue Summary
Unable to improve person and bag detection through transfer learning. Baseline mAP was below 10% before training and showed no improvement after training. In some cases, performance degraded to 0% mAP.
Approaches used:
using a balanced dataset with the same number of labels for each class (~12,000)
using a range of initial learning rates from 5e-6 to 0.1
removing all other classes from the COCO dataset and retaining only the person and bag classes
training at various resolutions: (960, 544), (1280, 720)
training with and without freezing layers
using ground truth bounding box labels that are tightly cropped to the objects
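As an illustration of the class-filtering and label-counting steps listed above, here is a minimal Python sketch. It is not the script used in this thread. The function name filter_coco and the mapping of the COCO categories backpack, handbag and suitcase to "bag" are assumptions made for the example; the post does not say which COCO categories were treated as bags.

```python
from collections import Counter

# ASSUMPTION: which COCO category names count as "bag" is not stated in the post.
TARGET_CLASSES = {
    "person": "person",
    "backpack": "bag",
    "handbag": "bag",
    "suitcase": "bag",
}


def filter_coco(coco, mapping=TARGET_CLASSES):
    """Keep only annotations whose category is in `mapping`.

    `coco` is a dict in the COCO annotation layout, with "categories"
    and "annotations" lists. Returns (kept_annotations, label_counts).
    """
    # Map category id -> target class name, for the kept categories only.
    id_to_target = {
        cat["id"]: mapping[cat["name"]]
        for cat in coco["categories"]
        if cat["name"] in mapping
    }
    kept = []
    for ann in coco["annotations"]:
        target = id_to_target.get(ann["category_id"])
        if target is None:
            continue  # drop every other class
        kept.append({**ann, "target_class": target})
    # Per-class label counts, useful for checking the class balance.
    counts = Counter(ann["target_class"] for ann in kept)
    return kept, counts


if __name__ == "__main__":
    toy = {
        "categories": [
            {"id": 1, "name": "person"},
            {"id": 27, "name": "backpack"},
            {"id": 18, "name": "dog"},
        ],
        "annotations": [
            {"id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
            {"id": 2, "category_id": 27, "bbox": [5, 5, 10, 10]},
            {"id": 3, "category_id": 18, "bbox": [0, 0, 50, 50]},
        ],
    }
    kept, counts = filter_coco(toy)
    print(len(kept), dict(counts))
```

Printing the per-class counts before training is a quick way to confirm that the balancing step produced the intended number of labels per class.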
Since the original COCO images come in many different resolutions, you need to set enable_auto_resize: true in the spec file. Please set it and train again.
Also, since the COCO images have varied backgrounds and context, they may be quite different from the dataset described in the PeopleNet model card (PeopleNet | NVIDIA NGC). Thus, performance may be lower than what the model card reports.
@Morganh Hi, we already resized the images to the same resolution and adjusted the labels before loading them into TAO Toolkit.
This was done by choosing a fixed target resolution, for example 1280 x 720, scaling each image to the closest size that fits while maintaining the aspect ratio, then padding it to reach the target resolution.
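The resize-and-pad step described above, together with the matching label adjustment, can be sketched in Python as follows. This is an illustration only, not the code actually used: the function names are hypothetical, and the padding is assumed to be split evenly on both sides, which the post does not specify. If the padding is applied only on the right and bottom, pad_x and pad_y would be 0.

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Scale factor and padding offsets for an aspect-preserving resize."""
    # The smaller ratio guarantees the scaled image fits inside the target.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    # ASSUMPTION: centered padding (equal borders on opposite sides).
    pad_x = (dst_w - new_w) // 2
    pad_y = (dst_h - new_h) // 2
    return scale, pad_x, pad_y


def letterbox_box(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from source pixels to padded-target pixels."""
    x1, y1, x2, y2 = box
    return (
        x1 * scale + pad_x,
        y1 * scale + pad_y,
        x2 * scale + pad_x,
        y2 * scale + pad_y,
    )


if __name__ == "__main__":
    # A 640 x 480 image placed into a 1280 x 720 canvas.
    scale, pad_x, pad_y = letterbox_params(640, 480, 1280, 720)
    print(scale, pad_x, pad_y)
    print(letterbox_box((100, 50, 300, 400), scale, pad_x, pad_y))
```

A useful sanity check is to draw the transformed boxes on a few padded images: labels that are offset by the padding amount or scaled by the wrong factor are a common cause of near-zero mAP.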
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks
For your existing result, is the mAP from tao evaluation still near 0? Could you run tao inference to double-check?
Also, you can train from scratch on your training dataset, that is, without the PeopleNet pretrained model. An mAP of 0 is not usually expected. For the input size, consider using the average resolution of the training dataset, for example 512x512.
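One way to estimate an average resolution for the training set is sketched below in Python. The function name average_resolution is hypothetical, and snapping each dimension to a multiple of 16 is an assumption about the network's input-size constraint; please check the requirements of the model you are training.

```python
def average_resolution(sizes, multiple=16):
    """Mean (width, height) of a dataset, snapped to a multiple of `multiple`.

    `sizes` is a non-empty list of (width, height) tuples, one per image.
    """
    n = len(sizes)
    mean_w = sum(w for w, _ in sizes) / n
    mean_h = sum(h for _, h in sizes) / n

    def snap(value):
        # ASSUMPTION: input dimensions should be multiples of `multiple`.
        return max(multiple, round(value / multiple) * multiple)

    return snap(mean_w), snap(mean_h)


if __name__ == "__main__":
    # Hypothetical image sizes, for illustration only.
    sizes = [(640, 480), (500, 375), (427, 640)]
    print(average_resolution(sizes))
```

The image sizes can be read from the width and height fields of a COCO annotation file, or from the image files themselves.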