Facial Landmark Estimator (FPENet) annotation guidelines

I would like to know whether there is an annotation guideline beyond the one on the FPENet page in the NVIDIA NGC catalog: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/fpenet

In the overview section of the FPENet page, the landmarks are shown in a picture and numbered. Some points are clear, but others are ambiguous, especially the additional eye landmark points (81-104) and the pupil landmark points (69-76). For example, how exactly should points 60-68 be annotated? Is there a specific place for them, and when the mouth is closed, should these points overlap?
For the landmark points in the eyes, there are many overlapping points that are unclear. For instance, it is not clear how the pupil points and the additional eye landmarks should be distributed inside the eye. For this reason, I would like to ask whether there is a more detailed 104-keypoint landmark annotation guideline that annotators can follow when labeling data.

There is no more detailed information about the keypoints. Please still refer to the model card.

This model predicts 68, 80, or 104 keypoints for a given face: Chin: 1-17, Eyebrows: 18-27, Nose: 28-36, Eyes: 37-48, Mouth: 49-61, Inner Lips: 62-68, Pupils: 69-76, Ears: 77-80, Additional eye landmarks: 81-104.
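Using the 1-based index ranges from the model card above, a small helper (hypothetical, not part of the TAO Toolkit) can map each landmark index to its facial region, which is handy for sanity-checking annotations:

```python
# Hypothetical helper: maps 1-based FPENet landmark indices to face regions,
# based on the ranges listed in the model card. Not part of TAO Toolkit.
LANDMARK_REGIONS = {
    "chin": range(1, 18),
    "eyebrows": range(18, 28),
    "nose": range(28, 37),
    "eyes": range(37, 49),
    "mouth": range(49, 62),
    "inner_lips": range(62, 69),
    "pupils": range(69, 77),
    "ears": range(77, 81),
    "additional_eye": range(81, 105),
}

def region_of(index: int) -> str:
    """Return the face region for a 1-based landmark index (1..104)."""
    for name, idx_range in LANDMARK_REGIONS.items():
        if index in idx_range:
            return name
    raise ValueError(f"landmark index out of range: {index}")
```

Note that the nine ranges add up to exactly 104 points, matching the largest model variant; the 68- and 80-point variants are prefixes of the same ordering.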


A pre-trained (trainable) model is available, trained on a combination of an NVIDIA internal dataset and the Multi-PIE dataset. The NVIDIA internal data has approximately 500k images and Multi-PIE has 750k images.

The ground-truth dataset is created by human labelers annotating facial keypoints.

If you are looking to re-train with your own dataset, please follow the guidelines below.

  • Label the keypoints in the correct order and as accurately as possible. The human labeler should be able to zoom in to a face region to correctly localize each keypoint.
  • For keypoints that are not easily distinguishable, such as those on the chin or nose, the human labeler should make a best estimate. Other keypoints, such as mouth corners or eye corners, are easily distinguishable.
  • Label a keypoint as “occluded” if the keypoint is not visible due to an external object or due to extreme head pose angles. A keypoint is considered occluded when the keypoint is in the image but not visible.
  • To reduce discrepancy between multiple human labelers, the same keypoint ordering and instructions should be used across labelers. An independent human labeler may be used to test the quality of the annotated landmarks and suggest corrections.

Face bounding boxes labeling:

  • Face bounding boxes should be as tight as possible.
  • Label each face bounding box with an occlusion level ranging from 0 to 9. 0 means the face is fully visible and 9 means the face is 90% or more occluded. For training, only faces with occlusion level 0-5 are considered.
  • The datasets consist of webcam images, so truncation is rarely seen. If a face is at the edge of the frame with less than 60% visibility due to truncation, the image is dropped from the dataset.
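The bounding-box rules above can be sketched as a simple training-set filter. The field names (`occlusion_level`, `truncated`, `visibility`) are illustrative assumptions, not an official schema:

```python
# Sketch of applying the bounding-box rules when assembling a training set:
# keep faces with occlusion level 0-5, and drop truncated faces with less
# than 60% visibility. Field names are illustrative, not an official schema.
def keep_for_training(face):
    if face["occlusion_level"] > 5:
        return False  # too occluded for training
    if face.get("truncated", False) and face.get("visibility", 1.0) < 0.6:
        return False  # truncated at the frame edge with <60% visibility
    return True

faces = [
    {"occlusion_level": 0},                                       # kept
    {"occlusion_level": 7},                                       # dropped
    {"occlusion_level": 2, "truncated": True, "visibility": 0.4}, # dropped
]
kept = [f for f in faces if keep_for_training(f)]
```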

The Sloth and Label-Studio tools have been utilized for labeling.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.