How are occluded points, face bounding boxes and tfrecord generation handled in Fpenet custom training? Very poor custom retraining results

Background
The off-the-shelf FPEnet model gives poor results when the face is tilted to the left or right, in low lighting, with sun glare, etc.
(Facial Landmarks Estimation | NVIDIA NGC)

So we decided to fine-tune the FPEnet model using only 16 points on our custom dataset.
We have run training using the FPEnet Jupyter notebook:
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html

16 Point Labelling

  • 1–6: Points of the eye on the left in the image

  • 7–12: Points of the eye on the right in the image

  • 13: Nose point

  • 14: Mouth corner on the left in the image

  • 15: Mouth corner on the right in the image

  • 16: Chin point
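
For reference, the same mapping written out as a small Python dict (the index numbers come from the list above; the shorthand names are just ours, and "left"/"right" mean left/right in the image):

LANDMARK_NAMES = {
    1: "left_eye_1", 2: "left_eye_2", 3: "left_eye_3",
    4: "left_eye_4", 5: "left_eye_5", 6: "left_eye_6",
    7: "right_eye_1", 8: "right_eye_2", 9: "right_eye_3",
    10: "right_eye_4", 11: "right_eye_5", 12: "right_eye_6",
    13: "nose_tip",
    14: "mouth_corner_left",
    15: "mouth_corner_right",
    16: "chin",
}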

On images where not all of the points are visible, we have marked them as occluded based on this format:
https://docs.nvidia.com/tao/tao-toolkit/text/data_annotation_format.html#json-label-data-format

Face Bounding Box Labelling
A single rectangular face bounding box is labelled and added as ground truth in the labelling job.
I have set the outer and tight bounding box values to be the same.

Label Json File
afw.json (643.9 KB)

Example where some points are occluded:

 "filename": "/workspace/tao-experiments/fpenet/afw/smartdvr-1424221015803-usb-Generic_Camera-RGB_200901010001-video-index0_20220928060534-20220928060625-45.png",
        "class": "image",
        "annotations": [
            {
                "class": "FaceBbox",
                "tool-version": "1.0",
                "Occlusion": 0,
                "face_outer_bboxx": 548.0,
                "face_outer_bboxy": 50.0,
                "face_outer_bboxwidth": 251.0,
                "face_outer_bboxheight": 425.0,
                "face_tight_bboxx": 548.0,
                "face_tight_bboxy": 50.0,
                "face_tight_bboxwidth": 251.0,
                "face_tight_bboxheight": 425.0
            },
            {
                "tool-version": "1.0",
                "version": "v1",
                "class": "FiducialPoints",
                "P1x": 0.0,
                "P1y": 0.0,
                "P1occluded": true,
                "P2x": 0.0,
                "P2y": 0.0,
                "P2occluded": true,
                "P3x": 0.0,
                "P3y": 0.0,
                "P3occluded": true,
                "P4x": 0.0,
                "P4y": 0.0,
                "P4occluded": true,
                "P5x": 0.0,
                "P5y": 0.0,
                "P5occluded": true,
                "P6x": 0.0,
                "P6y": 0.0,
                "P6occluded": true,
                "P7x": 0.0,
                "P7y": 0.0,
                "P7occluded": true,
                "P8x": 0.0,
                "P8y": 0.0,
                "P8occluded": true,
                "P9x": 0.0,
                "P9y": 0.0,
                "P9occluded": true,
                "P10x": 0.0,
                "P10y": 0.0,
                "P10occluded": true,
                "P11x": 0.0,
                "P11y": 0.0,
                "P11occluded": true,
                "P12x": 0.0,
                "P12y": 0.0,
                "P12occluded": true,
                "P13x": 792.0,
                "P13y": 277.0,
                "P14x": 723.0,
                "P14y": 360.0,
                "P15x": 0.0,
                "P15y": 0.0,
                "P15occluded": true,
                "P16x": 696.0,
                "P16y": 449.0
            }
        ]
}

Training spec file:
experiment_spec_16.yaml (2.2 KB)

Dataset Size: 338 images

Results:
The inference results, even on training-set images (especially when points are occluded), are completely wrong, so I am not sure whether the experiment config is correct and whether the training is actually using the images with occlusions.

Occluded Image Results



Other Results



Questions:

  • Is the face bounding box set in the JSON actually being used, or is it recalculated?
    When I look at the TensorBoard image examples, the images are cropped differently and do not seem to use the bounding box provided. Is the bounding box recalculated based on the points? How does that work when only partial points are labelled, e.g. just the eye points?

  • What is the preprocessing logic applied in the dataset_convert step?
    I see that around 50 images are dropped when the tfrecord file is generated, but I could not find any documentation explaining the discrepancy. I am keen to understand which images are removed and by what criteria.

  • How are the images with some occluded points handled? Are they used in training?
    It feels like the images that have some occluded points are not being used in training.
    I have tried setting the points in two ways: (1) no coordinates, only the occluded flag, e.g. "P9occluded": true; (2) coordinates at a fixed point plus the occluded flag, e.g. "P9x": 45, "P9y": 45, "P9occluded": true.

  • How do I get the confidence score for the output points when running tao inference in the notebook?
    I want to check the confidence score for the output points. How can I get it from the tao inference command? I could not find any relevant flag in the help docs.

Can someone please help answer the above questions, as we want to ensure we have the right setup before labelling more data and retraining?
Thanks

When running "fpenet dataset_convert -e dataset_config.yaml", the tfrecords generation calls the "detect_bbox" function. For this function, please refer to the Jupyter notebook, or download it via:
wget --content-disposition 'https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.4.1/files/fpenet/sample_calibration_images.py'
If the bbox is None, the image is dropped.
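
For context on why images get dropped, here is a rough sketch of what a detect_bbox-style check does, written only from the behaviour described in this thread (the function name, signature and NumPy implementation are assumptions, not the actual TAO code):

import numpy as np

def detect_bbox_sketch(keypoints, occluded):
    # Illustrative only: derive a square face bbox from the labelled keypoints.
    # keypoints: (N, 2) array of (x, y) pixel coordinates
    # occluded:  (N,) booleans, True where the point is occluded
    # Returns (x1, y1, x2, y2), or None if there are no usable points.
    pts = np.asarray(keypoints, dtype=np.float32)
    valid = ~np.asarray(occluded, dtype=bool)
    valid &= ~np.all(pts == 0.0, axis=1)      # points left at (0, 0) are unusable too
    if not valid.any():
        return None                           # bbox is None -> image dropped from the tfrecords
    x1, y1 = pts[valid].min(axis=0)
    x2, y2 = pts[valid].max(axis=0)
    side = max(x2 - x1, y2 - y1)              # square box, per the reply further down
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)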

It is not available to end users yet.

Is the face bounding box set in the JSON actually being used, or is it recalculated?

Why is the bounding box recalculated using just the keypoints when I have also supplied the face bbox ground truth in the annotation file? What is the purpose of the bbox in the ground-truth file?

What will happen with files like these where we only label the right-eye points? The bounding box generated from just those keypoints will not cover the whole face.
Is there a threshold for the bbox size? I don't see one being applied in the detect_bbox method.

How are the images with some occluded points handled? Are they used in training?

What happens to images with occluded points? We are retraining the model with only 16 points, and not all of them are visible all the time. Is the expectation to label all of those points for every image, using an approximation?


The annotation file just provides the keypoints. FPEnet finds the xmin, ymin, xmax, ymax of the points and calculates a square face bounding box from them. That box is then cropped from the image, and the keypoints are scaled to the target resolution (80x80 in your spec file).
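
As an illustration of that crop-and-scale step (a minimal sketch under the assumptions above, not the actual FPEnet preprocessing code; the 80x80 target comes from the spec file):

import cv2
import numpy as np

def crop_and_scale_sketch(image, keypoints, target=80):
    # Illustrative only: square-crop around the keypoint extent and rescale
    # both the crop and the keypoint coordinates to target x target.
    pts = np.asarray(keypoints, dtype=np.float32)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    side = int(round(max(x2 - x1, y2 - y1)))              # square box from the point extent
    x0 = max(int(round((x1 + x2) / 2.0 - side / 2.0)), 0)
    y0 = max(int(round((y1 + y2) / 2.0 - side / 2.0)), 0)
    crop = cv2.resize(image[y0:y0 + side, x0:x0 + side], (target, target))
    scaled_pts = (pts - [x0, y0]) * (target / float(side))  # keypoints in crop space
    return crop, scaled_pts

Note that if only the eye points are labelled, this square covers just the eye region rather than the whole face, which would explain the crops you are seeing in TensorBoard.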

That does not meet the labelling guide. For your case, please label all 16 keypoints in the correct order, as accurately as possible. See more info in Facial Landmarks Estimation | NVIDIA NGC.

There is no threshold. The bbox is calculated from the xmin, ymin, xmax, ymax of the points, then cropped and scaled to the target resolution (80x80 in your spec file).

The occluded points are not included in the training.
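
One common way to realize "occluded points are not included in the training" is to mask those points out of the keypoint loss so they contribute no error signal; a minimal sketch of that idea (illustrative only, not the actual FPEnet loss implementation):

import numpy as np

def masked_keypoint_loss_sketch(pred, target, occluded):
    # Illustrative only: mean squared keypoint error that ignores points
    # flagged as occluded, so they contribute nothing to the gradient.
    # pred, target: (N, 2) arrays of keypoint coordinates
    # occluded:     (N,) booleans, True where the point is occluded
    mask = ~np.asarray(occluded, dtype=bool)
    if not mask.any():
        return 0.0                              # nothing visible to supervise
    diff = np.asarray(pred, dtype=np.float32)[mask] - np.asarray(target, dtype=np.float32)[mask]
    return float(np.mean(np.sum(diff ** 2, axis=1)))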

Points that are not visible in the image need to be accounted for. When a point is not visible, please label it in the general area where the point should be.
For example, even though the eyes are closed you can approximate the location of the pupils.
