Is DeepStream suitable for face detection using MTCNN?

Hi,

I am very new to DeepStream and am looking for advice. Could you kindly help?

I want to implement a face detection application using the MTCNN model with DeepStream running on a Jetson Nano. MTCNN cascades 3 different models: PNet => RNet => ONet. That can be done by chaining the primary GIE with secondary GIEs, with the primary GIE running the PNet model.

However, according to the MTCNN algorithm, the input to the PNet model must be an image pyramid. That is, each frame of the stream is scaled down to several sizes, and each scaled image is fed to PNet. The resulting bounding boxes are converted to the coordinates of the original image, and NMS is applied to them. This is repeated for every image in the pyramid, and the bounding boxes from all scales are gathered and passed through NMS one more time.
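To make the two NMS passes concrete, here is a minimal sketch of the merge step. It assumes the boxes of every pyramid level have already been converted to original-frame coordinates. The `Box` type, the function names, and the threshold parameters are hypothetical names chosen for illustration; they are not taken from DeepStream or from any MTCNN implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical box type for illustration only (not a DeepStream type).
struct Box {
    float x1, y1, x2, y2;  // corners, already in original-frame pixels
    float score;           // face confidence reported by PNet
};

// Intersection over union of two axis-aligned boxes.
float iou(const Box& a, const Box& b) {
    const float iw = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
    const float ih = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
    if (iw <= 0.0f || ih <= 0.0f) return 0.0f;
    const float inter = iw * ih;
    const float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

// Greedy NMS: visit boxes by descending score and keep a box only if it
// does not overlap an already kept box by more than `threshold`.
std::vector<Box> nms(std::vector<Box> boxes, float threshold) {
    assert(threshold >= 0.0f && threshold <= 1.0f);
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (iou(candidate, k) > threshold) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) kept.push_back(candidate);
    }
    return kept;
}

// First pass: NMS inside each pyramid level.
// Second pass: NMS over the union of the survivors.
std::vector<Box> mergePyramidDetections(
        const std::vector<std::vector<Box>>& perScale,
        float perScaleThreshold, float globalThreshold) {
    std::vector<Box> all;
    for (const std::vector<Box>& boxes : perScale) {
        const std::vector<Box> kept = nms(boxes, perScaleThreshold);
        all.insert(all.end(), kept.begin(), kept.end());
    }
    return nms(all, globalThreshold);
}
```

Calling `mergePyramidDetections(perScaleBoxes, 0.5f, 0.7f)` would run both passes; the two threshold values here are only illustrative.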

How could I implement that with DeepStream for the primary GIE?

From what I know about DeepStream, the streammuxer passes the entire frame to nvinfer at the PGIE. There seems to be no way to inject custom code to build the image pyramid. The only place where I can customize nvinfer is the bounding box parser function. But even assuming the pyramid can be built, the bounding box parser function will not know the scale factor of the image being inferred, which it needs in order to convert the bounding boxes to the original image coordinates.

Thanks for your help

Hi @hotribao,
Sorry for the long delay!

I found the topics below, which run MTCNN with TRT, but neither of them seems to mention an image pyramid. Could you share more details about the image pyramid?

Thanks!

Hi @mchi,

Thanks a lot for your response. All responses are valuable to me, as I am a new DeepStream learner and have been scratching my head trying to fit my application into DeepStream’s paradigm, so far without finding a way…

Before creating this topic, I searched this forum for all posts related to DeepStream+MTCNN but could not find anyone who claimed to have implemented this combination successfully. I saw the post you mentioned as well; however, it is a standalone TRT application. Instead of writing a standalone TRT application, I would like to use DeepStream in order to utilize all of its accelerators, not only TRT.

The source code you quoted does mention the pyramid. It calculates the list of scale factors to be applied to the input image and stores them in the vector scales_:

    /*config  the pyramids */
    float minl = row<col?row:col;
    int MIN_DET_SIZE = 12;
    float m = (float)MIN_DET_SIZE/minsize;
    minl *= m;
    float factor = 0.709;
    int factor_count = 0;
    while(minl>MIN_DET_SIZE){
        if(factor_count>0)m = m*factor;
        scales_.push_back(m);
        minl *= factor;
        factor_count++;
    }

With a 640x480 image, there will be 7 scales.
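For anyone who wants to check that count, below is a self-contained sketch of the same scale computation. The quoted excerpt does not show where `row`, `col` and `minsize` come from, so they are function parameters here. The function name `pyramidScales` and the minimum face size of 60 used in the example call are my own assumptions, not values taken from the repository.

```cpp
#include <cassert>
#include <vector>

// Sketch of the scale list computation from the excerpt above.
// `minFaceSize` plays the role of `minsize`; its value is an assumption.
std::vector<float> pyramidScales(int rows, int cols, int minFaceSize,
                                 float factor = 0.709f) {
    assert(rows > 0 && cols > 0 && minFaceSize > 0);
    assert(factor > 0.0f && factor < 1.0f);
    const int kMinDetSize = 12;  // PNet input size
    float m = static_cast<float>(kMinDetSize) / minFaceSize;
    float minl = (rows < cols ? rows : cols) * m;  // shorter side after first scaling
    std::vector<float> scales;
    while (minl > kMinDetSize) {
        scales.push_back(m);  // scale for this pyramid level
        m *= factor;          // next level is `factor` times smaller
        minl *= factor;
    }
    return scales;
}
```

With 480 rows, 640 columns and an assumed minimum face size of 60, `pyramidScales(480, 640, 60)` returns 7 scales, starting at about 0.2 and shrinking by 0.709 per level. A different minimum face size gives a different count.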

A bit further down, for each scale, it generates a TRT engine of the PNet model dedicated to that input shape. This is needed when running in TRT, but the original algorithm does not require this step.

https://github.com/PKUZHOU/MTCNN_FaceDetection_TensorRT/blob/dfad60565216a68413f434b500168c456fdd2587/src/mtcnn.cpp#L41

    //generate pnet models
    pnet_engine = new Pnet_engine[scales_.size()];
    simpleFace_ = (Pnet**)malloc(sizeof(Pnet*)*scales_.size());
    for (size_t i = 0; i < scales_.size(); i++) {
        int changedH = (int)ceil(row*scales_.at(i));
        int changedW = (int)ceil(col*scales_.at(i));
        pnet_engine[i].init(changedH,changedW);
        simpleFace_[i] =  new Pnet(changedH,changedW,pnet_engine[i]);
    }

Next, in the function “findFace”, we can see that the input image is scaled and then fed to the PNet model:

https://github.com/PKUZHOU/MTCNN_FaceDetection_TensorRT/blob/master/src/mtcnn.cpp#L69

    for (size_t i = 0; i < scales_.size(); i++) {
        int changedH = (int)ceil(image.rows*scales_.at(i));
        int changedW = (int)ceil(image.cols*scales_.at(i));
        clock_t run_first_time = clock();
        resize(image, reImage, Size(changedW, changedH), 0, 0, cv::INTER_LINEAR);
        (*simpleFace_[i]).run(reImage, scales_.at(i),pnet_engine[i]);

Now comes my first obstacle: the Primary GIE in DeepStream accepts only one image, but this algorithm requires the input image to be scaled down to multiple sizes.

In the post you mentioned, the author of the standalone TRT application takes another approach: he scales the input image to multiple smaller sizes, stacks all of them into one big image, and feeds that image to the PNet network. Whichever approach is used, MTCNN always requires an image pre-processing step, which I cannot see DeepStream supporting.
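To illustrate the stacking idea, here is a rough sketch of how a layout for such a combined image could be computed. I am assuming the simplest possible arrangement, with the levels stacked vertically and left-aligned; the post I am referring to may pack them differently. `Tile`, `Canvas` and `packPyramid` are hypothetical names, and the sketch only computes positions and sizes, it does not resize or copy pixels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Placement of one pyramid level inside the combined image (hypothetical type).
struct Tile {
    int x, y;     // top-left corner inside the canvas
    int w, h;     // size of the scaled frame
    float scale;  // scale factor that produced this level
};

// Combined image size plus the placement of every level (hypothetical type).
struct Canvas {
    int w = 0;
    int h = 0;
    std::vector<Tile> tiles;
};

// Assumed layout: levels stacked top to bottom, all starting at x = 0.
Canvas packPyramid(int rows, int cols, const std::vector<float>& scales) {
    assert(rows > 0 && cols > 0);
    Canvas canvas;
    for (float s : scales) {
        assert(s > 0.0f);
        // Same rounding as the excerpt above: ceil(dimension * scale).
        const int h = static_cast<int>(std::ceil(rows * s));
        const int w = static_cast<int>(std::ceil(cols * s));
        canvas.tiles.push_back(Tile{0, canvas.h, w, h, s});
        canvas.w = std::max(canvas.w, w);  // canvas is as wide as the widest level
        canvas.h += h;                     // next level starts below this one
    }
    return canvas;
}
```

For a 640x480 frame and scales 0.5 and 0.25, this gives a 320x360 canvas with the second level starting at row 240.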

In general, I know DeepStream supports a custom bounding box parser function, which covers the post-processing phase. What about image pre-processing, not only at the Primary GIE but also at the Secondary GIE? In a face recognition application, a detected face box needs to be aligned using the 5 facial landmarks. That would be the required pre-processing for an SGIE.
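As an illustration of what that alignment step involves, here is a sketch of one common way to do it: fit a similarity transform (rotation, uniform scale, translation) that maps the 5 detected landmarks onto 5 template positions in the least-squares sense, then warp the face crop with it. The types and function names are hypothetical, the template coordinates are left to the caller, and the warp itself is not shown. Nothing here is a DeepStream API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Point { double x, y; };

// Similarity transform: x' = a*x - b*y + tx,  y' = b*x + a*y + ty.
struct Similarity { double a, b, tx, ty; };

// Least-squares fit of a similarity transform mapping src[i] onto dst[i].
Similarity estimateSimilarity(const std::array<Point, 5>& src,
                              const std::array<Point, 5>& dst) {
    // Centroids of both point sets.
    Point ms{0.0, 0.0}, md{0.0, 0.0};
    for (std::size_t i = 0; i < src.size(); ++i) {
        ms.x += src[i].x; ms.y += src[i].y;
        md.x += dst[i].x; md.y += dst[i].y;
    }
    ms.x /= 5.0; ms.y /= 5.0;
    md.x /= 5.0; md.y /= 5.0;

    // Closed-form solution on the centred points.
    double numA = 0.0, numB = 0.0, den = 0.0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const double px = src[i].x - ms.x, py = src[i].y - ms.y;
        const double qx = dst[i].x - md.x, qy = dst[i].y - md.y;
        numA += px * qx + py * qy;
        numB += px * qy - py * qx;
        den  += px * px + py * py;
    }
    assert(den > 0.0);  // landmarks must not all coincide

    Similarity t;
    t.a = numA / den;
    t.b = numB / den;
    // Translation maps the source centroid onto the destination centroid.
    t.tx = md.x - (t.a * ms.x - t.b * ms.y);
    t.ty = md.y - (t.b * ms.x + t.a * ms.y);
    return t;
}

// Apply the transform to a single point.
Point applySimilarity(const Similarity& t, const Point& p) {
    return Point{t.a * p.x - t.b * p.y + t.tx, t.b * p.x + t.a * p.y + t.ty};
}
```

The point is that this step needs the landmark coordinates of each detected object before the SGIE input is prepared, which is why it has to happen in pre-processing.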

At the last line of the source code quoted above ( (*simpleFace_[i]).run(.... ), the method “run” of class Pnet calls the method “generateBbox”. That method transforms the bounding boxes found on the scaled image into the coordinates of the original input image:

https://github.com/PKUZHOU/MTCNN_FaceDetection_TensorRT/blob/master/src/pnet_rt.cpp#L133

                bbox.x1 = round((stride * row + 1) / scale);
                bbox.y1 = round((stride * col + 1) / scale);
                bbox.x2 = round((stride * row + 1 + cellsize) / scale);
                bbox.y2 = round((stride * col + 1 + cellsize) / scale);

Here comes my second obstacle: even assuming the first obstacle is solved, the scale in use is unknown inside the custom bounding box parser function.
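One idea I can think of, purely as a sketch: if all pyramid levels were stacked vertically into a single input image with a layout known in advance, the scale could be recovered from where a box lies in that image. The code below assumes such a layout (left-aligned levels, each occupying a band of rows). `Band`, `Rect`, `findBand` and `toOriginal` are hypothetical names, and I have not verified that this can be wired into a DeepStream parser function.

```cpp
#include <cassert>
#include <vector>

// One pyramid level inside a vertically stacked input image (hypothetical type).
struct Band {
    int y0;       // first row of this level in the stacked image
    int h;        // number of rows occupied by this level
    float scale;  // scale factor that produced this level
};

struct Rect { float x1, y1, x2, y2; };

// Index of the band containing row `y`, or -1 if the row is outside every band.
int findBand(const std::vector<Band>& bands, float y) {
    for (std::size_t i = 0; i < bands.size(); ++i) {
        if (y >= bands[i].y0 && y < bands[i].y0 + bands[i].h) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Convert a box from stacked-image coordinates to original-frame coordinates.
// The band is chosen from the top edge of the box; returns false if none matches.
bool toOriginal(const std::vector<Band>& bands, const Rect& stacked, Rect* out) {
    assert(out != nullptr);
    const int i = findBand(bands, stacked.y1);
    if (i < 0) return false;
    const Band& b = bands[i];
    assert(b.scale > 0.0f);
    out->x1 = stacked.x1 / b.scale;           // levels start at x = 0
    out->x2 = stacked.x2 / b.scale;
    out->y1 = (stacked.y1 - b.y0) / b.scale;  // remove the band offset first
    out->y2 = (stacked.y2 - b.y0) / b.scale;
    return true;
}
```

Boxes that straddle two bands would still need to be handled or discarded, which this sketch does not attempt.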

One quick question: is it possible for you to use the TLT FaceDetectIR model? See "Transfer Learning Toolkit (TLT) Integration with DeepStream" in the DeepStream 5.1 Release documentation.

Yes. While waiting for a response to this topic, I tried FaceDetect and it works. If MTCNN doesn’t work with DeepStream, I will have to use it instead. The only drawback is that it doesn’t return the 5 facial landmark points that MTCNN does, which are used for face alignment.