I’m not interested in annotating video with facial landmarks. I’m interested in “fingerprinting” faces in order to be able to recognize them again
My input is an RTSP video stream, 30 fps, HD
A DeepStream 7 pipeline consisting of a primary detector running Facenet and a secondary inference (SGIE) running FPENet
Detected faces are square-aligned (as the C++ app does it), so that the width and height of the crop are identical before it goes into the secondary model
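To make that concrete, the square alignment I mean is roughly this (a Python sketch on my side; the function name and bbox convention are illustrative, not the actual C++ code):

```python
# Sketch: pad the shorter side of the detector's bbox so the crop that
# goes into FPENet is square. Bbox convention (left, top, w, h) is assumed.
def square_align(left, top, width, height, frame_w, frame_h):
    side = max(width, height)
    # center the square on the original box
    cx = left + width / 2.0
    cy = top + height / 2.0
    new_left = max(0.0, cx - side / 2.0)
    new_top = max(0.0, cy - side / 2.0)
    # clamp to the frame; the crop may shrink but stays square
    side = min(side, frame_w - new_left, frame_h - new_top)
    return new_left, new_top, side, side
```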
Since the resolution varies, I normalize the landmark coordinates to the current resolution
In a training process I use 10 seconds of video of a non-moving face (so I don’t have to deal with landmark positions changing over time)
This finally gives me a “fingerprint” for a person, consisting of the 80 X/Y float landmarks, averaged over the captured frames and weighted by confidence
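Roughly like this (Python/NumPy sketch; the array shapes and names are my own assumptions, not FPENet’s actual output layout):

```python
import numpy as np

def build_fingerprint(landmarks, confidences, crop_w, crop_h):
    """landmarks: (frames, 80, 2) pixel coords; confidences: (frames, 80)."""
    pts = np.asarray(landmarks, dtype=np.float32).copy()
    conf = np.asarray(confidences, dtype=np.float32)
    # normalize to [0, 1] so varying crop resolutions become comparable
    pts[..., 0] /= float(crop_w)
    pts[..., 1] /= float(crop_h)
    # confidence-weighted mean over the time axis, per landmark
    w = conf[..., np.newaxis]                     # (frames, 80, 1)
    return (pts * w).sum(axis=0) / w.sum(axis=0)  # (80, 2)
```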
This can be repeated with different poses to have more fingerprints per person
Results are stored into a database
Now here’s the problem:
In the recognition process I calculate the Euclidean distance between corresponding landmark points (separately for chin, eyebrows, eyes, etc.) and finally average this into a single “distance” value between a stored fingerprint and the current test fingerprint (the current landmark tensor).
If the distance for a given database entry is below a certain threshold, I consider this a “recognised person”.
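In code, the matching looks roughly like this (the region index ranges and the threshold are illustrative assumptions on my side, not taken from the FPENet model card):

```python
import numpy as np

# Assumed landmark regions within the 80-point layout (illustrative only)
REGIONS = {
    "chin": slice(0, 17),
    "eyebrows": slice(17, 27),
    "nose": slice(27, 36),
    "eyes": slice(36, 48),
    "mouth": slice(48, 68),
}

def fingerprint_distance(stored, probe):
    """stored, probe: (80, 2) normalized landmark arrays."""
    region_dists = [
        # mean point-to-point Euclidean distance within each region
        np.linalg.norm(stored[r] - probe[r], axis=1).mean()
        for r in REGIONS.values()
    ]
    return float(np.mean(region_dists))

def is_match(stored, probe, threshold=0.02):  # threshold purely illustrative
    return fingerprint_distance(stored, probe) < threshold
```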
Unfortunately this gives ambiguous results (meaning: I hold my own face into the camera and a different person is detected).
I have gathered some experience with DLIB, and I thought I had learned that it also uses the geometric distance between facial landmarks, but honestly I’m not sure I’m on the right track here. Especially since, apart from making sure the face image has the same width and height, I’m not making any attempt to “morph” or “flatten” the face image in case it is not an exactly frontal shot.
Is there any information on how to use the facial landmarks, as they come out of FPENet, for recognition?
The TAO Triton Apps provide an inference sample for ReIdentificationNet. It consumes a TensorRT engine and supports running with a directory of query (probe) images and a directory of test (gallery) images containing the same identities.
Question: Is this now something I could save during a training phase and use later for recognition by calculating the Euclidean distance to a test vector?
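What I have in mind is basically this (a sketch; the file names, vector storage, and the assumption that plain Euclidean distance is the right metric for these embeddings are all mine):

```python
import numpy as np

# Hypothetical gallery of embeddings saved during a "training" phase
gallery = {
    "alice": np.load("alice_embedding.npy"),
    "bob": np.load("bob_embedding.npy"),
}

def closest_identity(test_vec):
    """Return the gallery entry with the smallest Euclidean distance."""
    name, best = None, float("inf")
    for person, ref in gallery.items():
        d = float(np.linalg.norm(ref - test_vec))
        if d < best:
            name, best = person, d
    return name, best
```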
This ONNX model generates embeddings for identifying people captured in different scenes, not faces. For more info, you can refer to:
H. Luo, Y. Gu, X. Liao, S. Lai and W. Jiang, “Bag of Tricks and a Strong Baseline for Deep Person Re-Identification,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1487-1495, doi: 10.1109/CVPRW.2019.00190.
L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang and Q. Tian, “Scalable Person Re-identification: A Benchmark,” 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116-1124, doi: 10.1109/ICCV.2015.133.
M. Naphade, S. Wang, D. C. Anastasiu, Z. Tang, M.-C. Chang, Y. Yao, L. Zheng, M. S. Rahman, M. S. Arya, A. Sharma, Q. Feng, V. Ablavsky, S. Sclaroff, P. Chakraborty, S. Prajapati, A. Li, S. Li, K. Kunadharaju, S. Jiang and R. Chellappa, “The 7th AI City Challenge,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023.
There is always the problem with the assets… I don’t have face datasets, what you can get for free is bullshit, and the entire TAO training is a nightmarish experience, IMHO. No, I’d rather go with a combination of Facenet and DLIB; it’s not that performant, but it’s pretty reliable.
I’m wondering how one could create some reference vectors (“mug shots”) of persons in order to use them later for distance comparisons
I mean, it might not necessarily be a DeepStream question, more a GStreamer question: how do you feed a pipeline from still images instead of a video stream, in order to run JPEG input → primary → secondary inference? Is anything known about this?
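Something like the following is what I’m picturing (untested sketch via Gst.parse_launch; the file pattern, resolution, and config paths are placeholders):

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
# multifilesrc + jpegdec turns numbered JPEGs into a 1 fps feed for nvstreammux
pipeline = Gst.parse_launch(
    'multifilesrc location=mugshots/img_%04d.jpg '
    'caps="image/jpeg,framerate=1/1" ! jpegdec ! videoconvert ! '
    'nvvideoconvert ! video/x-raw(memory:NVMM),format=NV12 ! '
    'mux.sink_0 nvstreammux name=mux batch-size=1 width=1280 height=720 ! '
    'nvinfer config-file-path=pgie_facenet.txt ! '
    'nvinfer config-file-path=sgie_fpenet.txt ! fakesink'
)
pipeline.set_state(Gst.State.PLAYING)
# block until all images are consumed (EOS) or an error occurs
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```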
Another option would be to feed the pipeline with, say, a 10-second video and make N snapshots of the fc_pred vector, in order to assign these to a person for later comparisons…
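I.e. roughly this (sketch; whether averaging and normalizing fc_pred vectors like this is meaningful is exactly my question):

```python
import numpy as np

def average_snapshots(snapshots):
    """snapshots: list of 1-D fc_pred vectors captured over the clip."""
    mean_vec = np.mean(np.asarray(snapshots, dtype=np.float32), axis=0)
    return mean_vec / np.linalg.norm(mean_vec)  # unit length for L2/cosine
```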