How are occluded points, face bounding boxes and tfrecord generation handled in Fpenet custom training? Very poor custom retraining results

Background
The off-the-shelf FPEnet model gives poor results when the face is tilted to the left or right, in low lighting, with sun glare, etc.
(Facial Landmarks Estimation | NVIDIA NGC)

So we decided to fine-tune the FPEnet model using only 16 points on our custom dataset.
We have run training using the FPEnet Jupyter notebook:
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html

16 Point Labelling

  • 1–6: Points of the eye on the left in the image

  • 7–12: Points of the eye on the right in the image

  • 13: Nose point

  • 14: Mouth corner on the left in the image

  • 15: Mouth corner on the right in the image

  • 16: Chin point
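
For reference, the same mapping written out as a small Python dict (the index numbers come from the list above; the shorthand names are just ours, and "left"/"right" mean left/right in the image):

LANDMARK_NAMES = {
    1: "left_eye_1", 2: "left_eye_2", 3: "left_eye_3",
    4: "left_eye_4", 5: "left_eye_5", 6: "left_eye_6",
    7: "right_eye_1", 8: "right_eye_2", 9: "right_eye_3",
    10: "right_eye_4", 11: "right_eye_5", 12: "right_eye_6",
    13: "nose_tip",
    14: "mouth_corner_left",
    15: "mouth_corner_right",
    16: "chin",
}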

On images where not all of the points are visible, we have marked them as occluded based on this format:
https://docs.nvidia.com/tao/tao-toolkit/text/data_annotation_format.html#json-label-data-format

Face Bounding Box Labelling
A single rectangular face bounding box is labelled and added as ground truth in the labelling job.
I have set the outer and tight bounding box values to be the same.

Label Json File
afw.json (643.9 KB)

Example where some points are occluded:

 "filename": "/workspace/tao-experiments/fpenet/afw/smartdvr-1424221015803-usb-Generic_Camera-RGB_200901010001-video-index0_20220928060534-20220928060625-45.png",
        "class": "image",
        "annotations": [
            {
                "class": "FaceBbox",
                "tool-version": "1.0",
                "Occlusion": 0,
                "face_outer_bboxx": 548.0,
                "face_outer_bboxy": 50.0,
                "face_outer_bboxwidth": 251.0,
                "face_outer_bboxheight": 425.0,
                "face_tight_bboxx": 548.0,
                "face_tight_bboxy": 50.0,
                "face_tight_bboxwidth": 251.0,
                "face_tight_bboxheight": 425.0
            },
            {
                "tool-version": "1.0",
                "version": "v1",
                "class": "FiducialPoints",
                "P1x": 0.0,
                "P1y": 0.0,
                "P1occluded": true,
                "P2x": 0.0,
                "P2y": 0.0,
                "P2occluded": true,
                "P3x": 0.0,
                "P3y": 0.0,
                "P3occluded": true,
                "P4x": 0.0,
                "P4y": 0.0,
                "P4occluded": true,
                "P5x": 0.0,
                "P5y": 0.0,
                "P5occluded": true,
                "P6x": 0.0,
                "P6y": 0.0,
                "P6occluded": true,
                "P7x": 0.0,
                "P7y": 0.0,
                "P7occluded": true,
                "P8x": 0.0,
                "P8y": 0.0,
                "P8occluded": true,
                "P9x": 0.0,
                "P9y": 0.0,
                "P9occluded": true,
                "P10x": 0.0,
                "P10y": 0.0,
                "P10occluded": true,
                "P11x": 0.0,
                "P11y": 0.0,
                "P11occluded": true,
                "P12x": 0.0,
                "P12y": 0.0,
                "P12occluded": true,
                "P13x": 792.0,
                "P13y": 277.0,
                "P14x": 723.0,
                "P14y": 360.0,
                "P15x": 0.0,
                "P15y": 0.0,
                "P15occluded": true,
                "P16x": 696.0,
                "P16y": 449.0
            }
        ]
}

Training spec file:
experiment_spec_16.yaml (2.2 KB)

Dataset Size: 338 images

Results:
The inference results, even on training-set images (especially when points are occluded), are completely wrong, so I am not sure whether the experiment config is correct and whether the training is actually using the images with occlusions.

Occluded Image Results



Other Results



Questions:

  • Is the face bounding box set in the JSON actually being used, or is it recalculated?
    When I look at the TensorBoard image examples, the images are cropped differently and do not seem to use the bounding box provided. Is the bounding box recalculated based on the points? How does that work when only partial points are labelled, e.g. just the eye points?

  • What is the preprocessing logic applied in the dataset_convert step?
    I see that around 50 images are dropped when the tfrecord file is generated, but I could not find any documentation explaining the discrepancy. I am keen to understand which images are removed and by what criteria.

  • How are the images with some occluded points handled? Are they used in training?
    It feels like the images that have some occluded points are not being used in training.
    I have tried setting the points in two ways: (1) no coordinates, only the occluded flag, e.g. "P9occluded": true; (2) coordinates at a fixed point plus the occluded flag, e.g. "P9x": 45, "P9y": 45, "P9occluded": true.

  • How do I get the confidence score for the output points when running tao inference in the notebook?
    I want to check the confidence score for the output points. How can I get it from the tao inference command? I could not find any relevant flag in the help docs.

Can someone please help answer the above questions, as we want to ensure we have the right setup before labelling more data and retraining?
Thanks

When running "fpenet dataset_convert -e dataset_config.yaml", the tfrecords generation calls the "detect_bbox" function. For this function, please refer to the Jupyter notebook, or download it via:
wget --content-disposition 'https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.4.1/files/fpenet/sample_calibration_images.py'
If the bbox is None, the image is dropped.
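
For context on why images get dropped, here is a rough sketch of what a detect_bbox-style check does, written only from the behaviour described in this thread (the function name, signature and NumPy implementation are assumptions, not the actual TAO code):

import numpy as np

def detect_bbox_sketch(keypoints, occluded):
    # Illustrative only: derive a square face bbox from the labelled keypoints.
    # keypoints: (N, 2) array of (x, y) pixel coordinates
    # occluded:  (N,) booleans, True where the point is occluded
    # Returns (x1, y1, x2, y2), or None if there are no usable points.
    pts = np.asarray(keypoints, dtype=np.float32)
    valid = ~np.asarray(occluded, dtype=bool)
    valid &= ~np.all(pts == 0.0, axis=1)      # points left at (0, 0) are unusable too
    if not valid.any():
        return None                           # bbox is None -> image dropped from the tfrecords
    x1, y1 = pts[valid].min(axis=0)
    x2, y2 = pts[valid].max(axis=0)
    side = max(x2 - x1, y2 - y1)              # square box, per the reply further down
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)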

It is not available to end users yet.

Is the face bounding box set in the JSON actually being used, or is it recalculated?

Why is the bounding box recalculated using just the keypoints when I have also supplied the face bbox ground truth in the annotation file? What is the purpose of the bbox in the ground-truth file?

What will happen with files like these where we only label the right-eye points? The bounding box generated from just those keypoints will not cover the whole face.
Is there a threshold for the bbox size? I don't see one being applied in the detect_bbox method.

How are the images with some occluded points handled? Are they used in training?

What happens to images with occluded points? We are retraining the model with only 16 points, and not all of them are visible all the time. Is the expectation to label all of those points for every image, using an approximation?


The annotation file just provides the keypoints. FPEnet finds the xmin, ymin, xmax, ymax of the points and calculates a square face bounding box from them. That box is then cropped from the image, and the keypoints are scaled to the target resolution (80x80 in your spec file).
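
As an illustration of that crop-and-scale step (a minimal sketch under the assumptions above, not the actual FPEnet preprocessing code; the 80x80 target comes from the spec file):

import cv2
import numpy as np

def crop_and_scale_sketch(image, keypoints, target=80):
    # Illustrative only: square-crop around the keypoint extent and rescale
    # both the crop and the keypoint coordinates to target x target.
    pts = np.asarray(keypoints, dtype=np.float32)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    side = int(round(max(x2 - x1, y2 - y1)))              # square box from the point extent
    x0 = max(int(round((x1 + x2) / 2.0 - side / 2.0)), 0)
    y0 = max(int(round((y1 + y2) / 2.0 - side / 2.0)), 0)
    crop = cv2.resize(image[y0:y0 + side, x0:x0 + side], (target, target))
    scaled_pts = (pts - [x0, y0]) * (target / float(side))  # keypoints in crop space
    return crop, scaled_pts

Note that if only the eye points are labelled, this square covers just the eye region rather than the whole face, which would explain the crops you are seeing in TensorBoard.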

That does not meet the labelling guide. For your case, please label all 16 keypoints in the correct order, as accurately as possible. See more info in Facial Landmarks Estimation | NVIDIA NGC.

There is no threshold. The bbox is calculated from the xmin, ymin, xmax, ymax of the points, then cropped and scaled to the target resolution (80x80 in your spec file).

The occluded points are not included in the training.
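
One common way to realize "occluded points are not included in the training" is to mask those points out of the keypoint loss so they contribute no error signal; a minimal sketch of that idea (illustrative only, not the actual FPEnet loss implementation):

import numpy as np

def masked_keypoint_loss_sketch(pred, target, occluded):
    # Illustrative only: mean squared keypoint error that ignores points
    # flagged as occluded, so they contribute nothing to the gradient.
    # pred, target: (N, 2) arrays of keypoint coordinates
    # occluded:     (N,) booleans, True where the point is occluded
    mask = ~np.asarray(occluded, dtype=bool)
    if not mask.any():
        return 0.0                              # nothing visible to supervise
    diff = np.asarray(pred, dtype=np.float32)[mask] - np.asarray(target, dtype=np.float32)[mask]
    return float(np.mean(np.sum(diff ** 2, axis=1)))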

Points that are not visible in the image need to be accounted for. When a point is not visible, please label it in the general area where the point should be.
For example, even though the eyes are closed you can approximate the location of the pupils.
