Facenet question

Please provide the following information when requesting support.

Hardware: T4
Network type: Facenet, FPEnet

I have ported parts of the TAO facial landmarks sample app to Go: deepstream_tao_apps/apps/tao_others/deepstream-faciallandmark-app at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub

I’m not interested in annotating video with facial landmarks. I’m interested in “fingerprinting” faces in order to be able to recognize them again.

  • My input is an RTSP video stream, 30 fps, HD
  • A DeepStream 7 pipeline consisting of a primary detector running Facenet and a secondary detector running FPENet
  • Detected faces are square-aligned (as the C++ app does it), so that the W/H of the crop is identical before it goes into the secondary model
  • Since the resolution varies, I normalize the landmark coordinates to the current resolution
  • In a training process I use 10 seconds of video of a non-moving face (so I don’t have to deal with landmark positions changing over time)
  • This finally gives me a “fingerprint” per person, consisting of the 80 X/Y float landmarks, averaged over the captured frames and weighted by confidence (see the sketch after this list)
  • This can be repeated with different poses to have more fingerprints per person
  • Results are stored into a database
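
For illustration, a minimal Go sketch of the averaging step (the types and names are mine; it assumes the 80 landmarks are already normalized to the square crop and come with a per-landmark confidence):

```go
package fingerprint

// Point is one normalized landmark (0..1, relative to the square crop).
type Point struct{ X, Y float32 }

// Frame holds the 80 FPENet landmarks of one video frame together
// with their per-landmark confidences.
type Frame struct {
	Landmarks [80]Point
	Conf      [80]float32
}

// Average builds a confidence-weighted mean of the landmarks over all
// captured frames, i.e. the stored “fingerprint”.
func Average(frames []Frame) [80]Point {
	var fp [80]Point
	var wsum [80]float32
	for _, f := range frames {
		for i, p := range f.Landmarks {
			w := f.Conf[i]
			fp[i].X += p.X * w
			fp[i].Y += p.Y * w
			wsum[i] += w
		}
	}
	for i := range fp {
		if wsum[i] > 0 {
			fp[i].X /= wsum[i]
			fp[i].Y /= wsum[i]
		}
	}
	return fp
}
```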

Now comes the problem:

  • In the recognition process I calculate the Euclidean distance between landmark points (separately for chin, eyebrows, eyes, etc.) and finally average this into a “distance” value between a stored fingerprint and the current test fingerprint (the current landmark tensor), roughly as sketched after this list
  • If the distance for a given database entry is below a certain threshold, I consider this a “recognized person”.
  • Unfortunately this gives ambiguous results (meaning: I hold my face into the camera and the wrong person is recognized).
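
Continuing the sketch above (same package, reusing Point), the matching step looks roughly like this; note that the region index ranges are purely illustrative, not the official FPENet 80-point layout:

```go
import "math"

// regions groups landmark indices per facial feature. The index
// ranges are illustrative only, NOT the official FPENet layout.
var regions = map[string][]int{
	"chin":     indexRange(0, 16),
	"eyebrows": indexRange(17, 26),
	"eyes":     indexRange(27, 38),
	// ... remaining features up to index 79
}

func indexRange(lo, hi int) []int {
	idx := make([]int, 0, hi-lo+1)
	for i := lo; i <= hi; i++ {
		idx = append(idx, i)
	}
	return idx
}

// Distance averages the per-region mean Euclidean distances between a
// stored fingerprint and the current landmark tensor. A person counts
// as recognized when Distance(stored, test) is below the threshold.
func Distance(a, b [80]Point) float64 {
	var total float64
	for _, idx := range regions {
		var d float64
		for _, i := range idx {
			dx := float64(a[i].X - b[i].X)
			dy := float64(a[i].Y - b[i].Y)
			d += math.Sqrt(dx*dx + dy*dy)
		}
		total += d / float64(len(idx))
	}
	return total / float64(len(regions))
}
```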

I have gathered some experience with DLIB, and I thought I had learned that it also uses the geometric distance between facial landmarks. But honestly I’m not sure I’m on the right track here, especially since I’m not making any attempt to “morph” or “flatten” the face image in case it is not an exactly frontal shot (other than making sure the face crop has the same width/height).

Is there any information on how to use the facial landmarks, as they come out of FPENet, for recognition?

Does this make sense, or is there anything else I should do?

You can use a ReID network. See more info in ReIdentificationNet - NVIDIA Docs or ReIdentificationNet Transformer - NVIDIA Docs.

Thanks. Is there sample code?

Yes, you can take a look at ReIdentificationNet - NVIDIA Docs.

The TAO Triton Apps provide an inference sample for ReIdentificationNet. It consumes a TensorRT engine and supports running with a directory of query (probe) images and a directory of test (gallery) images containing the same identities.

Hmm. This seems to be all Triton server stuff. I don’t find any DeepStream-related code…

…even though DeepStream is mentioned there. Well, once again through the NVIDIA revolving door and back? No thanks.

Maybe this: deepstream_tao_apps/apps/tao_others/deepstream-mdx-perception-app at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub?

Yes, you can take a look at this.

OK, I’m using this model now as an SGIE after a PGIE running Facenet.

I added a probe to the src pad of the SGIE and got what is presumably tensor data:

The model info says it has an output layer “fc_pred” with 256 floats (?):

I carved that out and got, for example, this (the layer name in the tensor info matched):

embeddings: [-0.36572266 0.5307617 -0.8671875 -0.7807617 -0.60546875 1.3427734 -1.6210938 0.90771484 -0.17700195 -0.93359375 0.94873047 -0.9091797 0.20812988 -0.056762695 -0.56103516 -0.7973633 0.18334961 -2.2851562 0.22570801 1.0263672 -0.3100586 0.1784668 -0.33862305 -0.93310547 0.16625977 -2.3027344 -0.7553711 1.1425781 0.8222656 0.7519531 -1.1054688 0.60253906 -0.4658203 0.9379883 2.3613281 -1.0976562 0.99560547 1.4501953 -0.8442383 0.7216797 -2.1464844 -0.21130371 -0.17150879 2.6386719 -0.080322266 -0.53271484 0.5800781 1.0634766 -2.015625 -1.8398438 0.87841797 -1.6191406 0.9970703 2.7519531 0.55322266 -1.3710938 0.93359375 -0.10864258 -0.5961914 -1.7578125 -1.2734375 2.0859375 -2.1621094 0.5625 1.5986328 -0.8388672 0.22949219 -2.015625 -1.5136719 0.09442139 -1.4746094 -1.0693359 -0.609375 -2.9589844 -2.7480469 -0.89501953 -1.2939453 -1.2285156 -1.0126953 -1.2128906 2.8945312 -1.6474609 2.5097656 -1.7705078 -0.6010742 -1.5771484 0.3972168 -0.07128906 -0.5883789 -1.1298828 0.31323242 0.21472168 1.0722656 -0.1685791 1.8994141 1.0253906 0.42578125 1.4296875 0.7939453 0.6040039 -0.6123047 -1.3125 -1.7753906 1.1240234 -3.1308594 3.2558594 -0.6767578 -0.94677734 -3.8027344 0.68408203 0.34375 -0.68115234 0.3852539 -1.2880859 -2.984375 -0.07647705 -1.90625 0.9506836 0.3408203 -0.8779297 2.0820312 -1.7373047 -1.5419922 -0.099487305 -0.123168945 0.61816406 1.0009766 -0.34692383 0.28393555 1.7802734 -0.94433594 0.39624023 1.5283203 -0.7138672 -1.2841797 0.9194336 -1.703125 -0.26489258 -1.2050781 -0.2319336 0.74902344 -0.1850586 -0.7446289 0.12310791 1.1494141 0.62060547 -1.0830078 -0.16320801 0.8305664 0.7524414 2.2558594 -1.1767578 1.4414062 -0.9663086 0.09509277 2.4140625 2.78125 0.7211914 2.5234375 -0.4404297 2.8535156 -1.6152344 0.49414062 -0.8432617 -2.0078125 -0.9301758 -0.23168945 1.9335938 1.0498047 -0.6699219 0.06359863 0.5620117 -0.5732422 0.8129883 -0.3359375 -2.8476562 -0.69628906 0.19274902 1.8759766 0.75341797 -0.4506836 0.57373047 0.08453369 -1.0058594 -0.03451538 0.1505127 2.2089844 -0.3671875 1.6787109 0.33666992 -1.6699219 -1.2402344 -1.0644531 -0.4375 0.10021973 0.3569336 -1.0996094 1.0175781 -0.828125 -0.17504883 -0.2668457 -0.35742188 -0.09515381 -1.5126953 -1.4082031 -0.90625 0.5800781 0.70410156 0.18493652 0.60595703 -1.0371094 -0.9711914 1.015625 1.9238281 1.7060547 0.3581543 0.068847656 0.67333984 -1.0283203 0.15551758 -0.47509766 -1.2880859 1.1484375 -1.9267578 0.32617188 2.7910156 0.049438477 -0.1619873 -2.0996094 -0.91308594 0.21789551 0.39624023 -1.8847656 -0.7080078 0.87158203 -0.51464844 1.5537109 -2.2988281 -0.109436035 -2.7929688 -0.28149414 -0.2919922 1.4033203 -1.7675781 -0.04232788 0.54833984 0.9633789 -0.025436401 -1.2763672 0.6015625 -0.5151367 2.7109375 -1.3115234 -0.22998047 -1.1132812 0.7006836]
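
The carving itself boils down to reinterpreting the layer’s host buffer as float32. A minimal Go sketch, assuming the cgo side already hands over the fc_pred host pointer (e.g. from NvDsInferTensorMeta’s out_buf_ptrs_host):

```go
package probe

import "unsafe"

// embeddingFromHostBuf reinterprets the raw fc_pred host buffer
// (n float32 values, 256 for this model) as a Go slice and copies it
// out, since the underlying memory is owned by DeepStream.
func embeddingFromHostBuf(hostBuf unsafe.Pointer, n int) []float32 {
	src := unsafe.Slice((*float32)(hostBuf), n)
	dst := make([]float32, n)
	copy(dst, src)
	return dst
}
```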

Question: Is this something I could save in a training phase and later use for recognition by calculating the Euclidean distance to a test vector?

Sorry if this is a stupid question.

The output layer “fc_pred” is a tensor of float32[batch, embedding_size].
You can download the model from ReIdentificationNet | NVIDIA NGC.
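
The buffer is flat, so for a batch of N objects the b-th embedding occupies elements b*256 to (b+1)*256, for example:

```go
// embeddingAt returns the b-th embedding from the flattened
// float32[batch*256] “fc_pred” buffer.
func embeddingAt(out []float32, b int) []float32 {
	return out[b*256 : (b+1)*256]
}
```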

Sorry, how should I understand this? Batch times embedding_size? I get exactly 256 floats, and if I try to obtain more, the rest is filled with zeros.

Yes, for this NGC ONNX model, the output size of the feature embeddings is 256.

Thanks for the confirmation. I guess I’m on the right track trying to calculate Euclidean distances between these embeddings.
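
For example (a minimal sketch; cosine similarity over the embeddings is a common alternative to plain Euclidean distance for ReID):

```go
package reid

import "math"

// EuclideanDistance between two embeddings of equal length
// (256 floats for this model).
func EuclideanDistance(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i] - b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// CosineSimilarity is often used instead for ReID embeddings:
// values near 1 suggest the same identity.
func CosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}
```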

Is my understanding right: This embedding is basically just a fingerprint of the face, right? No other information contained.

This ONNX model generates embeddings for identifying people captured in different scenes, not faces. For more info, you can refer to:

  • H. Luo, Y. Gu, X. Liao, S. Lai and W. Jiang, “Bag of Tricks and a Strong Baseline for Deep Person Re-Identification,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1487-1495, doi: 10.1109/CVPRW.2019.00190.
  • L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang and Q. Tian, “Scalable Person Re-identification: A Benchmark,” 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116-1124, doi: 10.1109/ICCV.2015.133.
  • M. Naphade, S. Wang, D. C. Anastasiu, Z. Tang, M.-C. Chang, Y. Yao, L. Zheng, M. S. Rahman, M. S. Arya, A. Sharma, Q. Feng, V. Ablavsky, S. Sclaroff, P. Chakraborty, S. Prajapati, A. Li, S. Li, K. Kunadharaju, S. Jiang and R. Chellappa, “The 7th AI City Challenge,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023.

Well, you recommended it. I didn’t talk about person identification in my first post.

For faces, you can train a new ReID model with a face dataset.

There is always the problem with the assets… I don’t have face datasets, what you can get for free is bullshit, and the entire TAO training is a nightmarish experience, IMHO. No, I’d rather go with a combination of Facenet and DLIB; it’s not that performant, but it’s pretty reliable.

The NGC model cannot cover all scenarios, and TAO is designed for end users to fine-tune on their own dataset, starting from the NGC pretrained model.

I’m wondering how one could create reference vectors (“mug shots”) of persons in order to use them later for distance comparisons.

I mean, it might not necessarily be a DeepStream question, more a GStreamer question: how can a pipeline be fed from still images instead of a video stream, in order to run JPEG input → primary → secondary inference? Is anything known about this?
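
What I have in mind is roughly this (a sketch only; it assumes the go-gst bindings, github.com/go-gst/go-gst, and mirrors the element chain of the deepstream-image-decode-test sample; config file paths are placeholders):

```go
package main

import "github.com/go-gst/go-gst/gst"

func main() {
	gst.Init(nil)

	// JPEG in, hardware-decoded, batched, then PGIE → SGIE.
	launch := "filesrc location=mugshot.jpg ! jpegparse ! nvv4l2decoder ! " +
		"m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 ! " +
		"nvinfer config-file-path=pgie_facenet.txt ! " +
		"nvinfer config-file-path=sgie_reid.txt ! fakesink"

	pipeline, err := gst.NewPipelineFromString(launch)
	if err != nil {
		panic(err)
	}
	if err := pipeline.SetState(gst.StatePlaying); err != nil {
		panic(err)
	}
	// ...attach the same fc_pred probe as in the video pipeline and
	// wait for EOS on the bus instead of blocking forever.
	select {}
}
```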

Another option would be to feed the pipeline with, say, a 10-second video and take N snapshots of the fc_pred vector in order to assign them to a person for later comparisons…
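
Averaging those snapshots would then be trivial, e.g. (optionally L2-normalizing the result before storing it):

```go
// MeanEmbedding averages N snapshots of the 256-float fc_pred vector
// into one reference (“mug shot”) vector for a person.
func MeanEmbedding(snaps [][]float32) []float32 {
	if len(snaps) == 0 {
		return nil
	}
	mean := make([]float32, len(snaps[0]))
	for _, s := range snaps {
		for i, v := range s {
			mean[i] += v
		}
	}
	for i := range mean {
		mean[i] /= float32(len(snaps))
	}
	return mean
}
```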