Label information required for GazeNet training

• Network Type (gazenet)
• TLT Version (v3)

Hello,
I am learning TLT for GazeNet. I want to convert my own dataset(json format) to tfrecords format.

Transfer Learning Toolkit V3.0
tlt_cv_samples_v1.1.0/gazenet/gazenet.ipynb
3. Generate tfrecords from labels in json format

Annotation is taking me a long time, so please tell me the minimum parameters required for this model. The guide below states that FaceBox and FiducialPoints are required.
https://docs.nvidia.com/tlt/tlt-user-guide/text/data_annotation_format.html#json-label-data-format

However, FiducialPoints has many parameters, so it is difficult to annotate/label the data.
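For reference, my current understanding of the label format is roughly the sketch below (written in Python so the JSON can be generated for many frames). The field names such as face_tight_bboxx and P<n>x/P<n>y are my guesses from the guide above, so please correct me if they are wrong:

```python
import json

# Rough sketch of one per-image label entry in the json label format.
# Key names (face_tight_bbox*, P<n>x/P<n>y, tool-version) are assumptions
# based on the data annotation guide and should be verified for your TLT version.
def make_label(filename, face_box, landmarks):
    """face_box = (x, y, width, height); landmarks = [(x1, y1), (x2, y2), ...]."""
    fiducials = {"class": "FiducialPoints", "tool-version": "1.0"}
    for i, (px, py) in enumerate(landmarks, start=1):
        fiducials["P%dx" % i] = px
        fiducials["P%dy" % i] = py
    return {
        "filename": filename,
        "class": "image",
        "annotations": [
            {
                "class": "FaceBbox",
                "tool-version": "1.0",
                "face_tight_bboxx": face_box[0],
                "face_tight_bboxy": face_box[1],
                "face_tight_bboxwidth": face_box[2],
                "face_tight_bboxheight": face_box[3],
            },
            fiducials,
        ],
    }

# Only two landmark points shown for brevity.
labels = [make_label("frame_0001.png", (210, 140, 180, 180), [(250, 200), (300, 198)])]
with open("frame_0001.json", "w") as f:
    json.dump(labels, f, indent=2)
```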

Question 1:
What label information do I need to retrain this model?

Question 2:
In the notebook procedure, the MPII dataset is converted by the Python program into the following:
• Data
• Labels (json)
• Config

  1. Prepare dataset and pre-trained model
    B. Convert datasets and labels to required format

Can I use this config file for my own dataset?
I don't understand the parameters in the config files.

  1. Please see NVIDIA NGC

The training dataset is created by labeling ground-truth bounding-boxes and landmarks by human labelers. The face bounding box and fiducial landmarks are used to prepare inputs (face crop image, left eye crop image, right eye crop image, and facegrid) to the gaze model. For Face bounding boxes labeling, please refer to the FaceNet model card. For Facial landmarks labeling, please refer to the FPENet model card.
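As a rough illustration of how those four inputs can be derived from a face bounding box and eye landmarks (the crop margins, the 25x25 facegrid size, and the landmark handling here are assumptions for this example, not the actual TLT implementation):

```python
import numpy as np

FACEGRID_SIZE = 25  # assumed grid resolution; the real pipeline may differ

def prepare_inputs(image, face_box, left_eye_pts, right_eye_pts):
    """Derive face crop, eye crops, and a facegrid from a bbox and eye landmarks.

    image: HxWx3 array; face_box: (x, y, w, h) in pixels;
    *_eye_pts: iterable of (x, y) eye-corner landmarks.
    """
    img_h, img_w = image.shape[:2]
    x, y, w, h = face_box
    face_crop = image[y:y + h, x:x + w]

    def eye_crop(pts, margin=15):
        pts = np.asarray(pts, dtype=float)
        x0, y0 = np.maximum(pts.min(axis=0) - margin, 0).astype(int)
        x1, y1 = (pts.max(axis=0) + margin).astype(int)
        return image[y0:y1, x0:x1]

    left_crop = eye_crop(left_eye_pts)
    right_crop = eye_crop(right_eye_pts)

    # Facegrid: a coarse binary mask marking where the face box sits in the frame.
    grid = np.zeros((FACEGRID_SIZE, FACEGRID_SIZE), dtype=np.uint8)
    gx0, gy0 = int(x / img_w * FACEGRID_SIZE), int(y / img_h * FACEGRID_SIZE)
    gx1, gy1 = int((x + w) / img_w * FACEGRID_SIZE), int((y + h) / img_h * FACEGRID_SIZE)
    grid[gy0:gy1, gx0:gx1] = 1
    return face_crop, left_crop, right_crop, grid
```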

==> NVIDIA NGC

Training Data Ground-truth Labeling Guidelines

The ground truth dataset is created by labeling ground-truth facial keypoints by human labellers.

If you are looking to re-train with your own dataset, please follow the guideline below.

  • Label the keypoints in the correct order as accurately as possible. The human labeler would be able to zoom in to a face region to correctly localize the keypoint.
  • For keypoints that are not easily distinguishable such as chin or nose, the best estimate should be made by the human labeler. Some keypoints are easily distinguishable such as mouth corners or eye corners.
  • Label a keypoint as “occluded” if the keypoint is not visible due to an external object or due to extreme head pose angles. A keypoint is considered occluded when the keypoint is in the image but not visible.
  • To reduce discrepancy in labeling between multiple human labelers, the same keypoint ordering and instructions should be used across labelers. An independent human labeler may be used to test the quality of the annotated landmarks and potential corrections.
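A small illustrative consistency check over labeled keypoints, following the guidelines above (the per-point layout with an "occluded" flag is just an assumption made for this example, not a TLT format):

```python
# Verify every labeled frame carries the same number of keypoints in the same
# order, and report how many are marked occluded per frame.
def check_keypoints(frames, expected_count):
    for name, points in frames.items():
        assert len(points) == expected_count, (
            "%s has %d keypoints, expected %d" % (name, len(points), expected_count))
        occluded = sum(1 for p in points if p["occluded"])
        print("%s: %d/%d keypoints marked occluded" % (name, occluded, expected_count))

check_keypoints(
    {"frame_0001.png": [{"x": 250, "y": 200, "occluded": False}] * 80},
    expected_count=80,  # assumed count; use whatever your labeling scheme defines
)
```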

Face bounding boxes labeling:

  • Face bounding boxes should be as tight as possible.
  • Label each face bounding box with an occlusion level ranging from 0 to 9. 0 means the face is fully visible and 9 means the face is 90% or more occluded. For training, only faces with occlusion level 0-5 are considered.
  • The datasets consist of webcam images so truncation is rarely seen. If faces are at the edge of the frame with visibility less than 60% due to truncation, this image is dropped from the dataset.
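For illustration, filtering records by the two rules above could look like the sketch below (the record fields "occlusion", "truncated", and "visibility" are assumptions for this example, not TLT field names):

```python
def keep_face(record):
    if record["occlusion"] > 5:                  # only occlusion levels 0-5 are trained on
        return False
    if record["truncated"] and record["visibility"] < 0.6:
        return False                             # drop faces at the edge with <60% visible
    return True

records = [
    {"file": "a.png", "occlusion": 2, "truncated": False, "visibility": 1.0},
    {"file": "b.png", "occlusion": 7, "truncated": False, "visibility": 1.0},
    {"file": "c.png", "occlusion": 1, "truncated": True,  "visibility": 0.4},
]
print([r["file"] for r in records if keep_face(r)])  # -> ['a.png']
```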

The Sloth and Label-Studio tools have been utilized for labeling.

==> NVIDIA NGC

Training Data Ground-truth Labeling Guidelines

The training dataset is created by labeling ground-truth bounding-boxes and categories by human labellers. The following guidelines were used while labelling the training data for the NVIDIA FaceNet model.

FaceNet project labelling guidelines

  • Face bounding boxes should be as tight as possible.
  • Label each face bounding box with an occlusion level ranging from 0 to 9. 0 means the face is fully visible and 9 means the face is 90% or more occluded. For training, only faces with occlusion level 0-5 are considered.
  • If faces are at the edge of the frame with visibility less than 60% due to truncation, this image is dropped from the dataset.
  1. For the parameters in the config files, you can refer to tlt_cv_samples_v1.1.0/gazenet/utils_gazeviz.py
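If it helps while reading utils_gazeviz.py, a quick way to see what the generated config files actually contain is to dump them from the Config folder produced in step 1.B (the path below is a placeholder; point it at your own conversion output):

```python
import glob
import os

# Print the first few hundred characters of each file under the Config folder.
for path in sorted(glob.glob("MPIIFaceGaze/Config/**/*", recursive=True)):
    if os.path.isfile(path):
        with open(path) as f:
            head = f.read(300)
        print("====", path)
        print(head)
```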