Hi @mchi,
Thanks a lot for your response. All responses are valuable to me, as I am a new DeepStream learner and have been scratching my head to find a way to fit my application into DeepStream's paradigm, and I still haven't found a way out…
Before creating this topic, I searched this forum for all posts related to DeepStream + MTCNN, but I couldn't find anyone who claimed to have successfully implemented this combination. I saw the post you mentioned as well; however, it is a standalone TRT application. Instead of writing a standalone TRT application, I would like to use DeepStream in order to utilize all of its accelerators, not just TRT.
In the source code that you quoted, it does mention the pyramids. There, it calculates the list of scale factors to be applied to the input image and stores them in the vector scales_:
/* config the pyramids */
float minl = row < col ? row : col;
int MIN_DET_SIZE = 12;
float m = (float)MIN_DET_SIZE / minsize;
minl *= m;
float factor = 0.709;
int factor_count = 0;
while (minl > MIN_DET_SIZE) {
    if (factor_count > 0) m = m * factor;
    scales_.push_back(m);
    minl *= factor;
    factor_count++;
}
With a 640x480 image, this produces 7 scales.
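To make the pyramid concrete, here is a minimal standalone sketch of the same loop for a 640x480 input (minsize = 60 is my assumption here, not a value I checked in the repo); it simply prints the resulting scale factors and the corresponding resized dimensions:

#include <cstdio>
#include <vector>

int main() {
    // Assumed input size and detector parameter (minsize = 60 is an assumption).
    int row = 480, col = 640;
    int minsize = 60;
    const int MIN_DET_SIZE = 12;
    const float factor = 0.709f;

    std::vector<float> scales;
    float minl = (row < col ? row : col);
    float m = (float)MIN_DET_SIZE / minsize;
    minl *= m;
    int factor_count = 0;
    while (minl > MIN_DET_SIZE) {
        if (factor_count > 0) m = m * factor;
        scales.push_back(m);
        minl *= factor;
        factor_count++;
    }

    // Each scale corresponds to one resized copy of the input that PNet must see.
    for (size_t i = 0; i < scales.size(); i++)
        printf("scale %zu: %.4f -> %dx%d\n", i, scales[i],
               (int)(col * scales[i] + 0.5f), (int)(row * scales[i] + 0.5f));
    return 0;
}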
Then, coming down a bit, for each scale it prepares/generates a TRT engine of the PNet model dedicated to that specific input shape. This step is only needed when running with TRT; in the original algorithm it is not required.
https://github.com/PKUZHOU/MTCNN_FaceDetection_TensorRT/blob/dfad60565216a68413f434b500168c456fdd2587/src/mtcnn.cpp#L41
//generate pnet models
pnet_engine = new Pnet_engine[scales_.size()];
simpleFace_ = (Pnet**)malloc(sizeof(Pnet*) * scales_.size());
for (size_t i = 0; i < scales_.size(); i++) {
    int changedH = (int)ceil(row * scales_.at(i));
    int changedW = (int)ceil(col * scales_.at(i));
    pnet_engine[i].init(changedH, changedW);
    simpleFace_[i] = new Pnet(changedH, changedW, pnet_engine[i]);
}
Next, in the function “findFace”, we can see that the input image is scaled and then fed to the PNet model:
https://github.com/PKUZHOU/MTCNN_FaceDetection_TensorRT/blob/master/src/mtcnn.cpp#L69
for (size_t i = 0; i < scales_.size(); i++) {
    int changedH = (int)ceil(image.rows * scales_.at(i));
    int changedW = (int)ceil(image.cols * scales_.at(i));
    clock_t run_first_time = clock();
    // resize the frame to this pyramid level, then run PNet on it
    resize(image, reImage, Size(changedW, changedH), 0, 0, cv::INTER_LINEAR);
    (*simpleFace_[i]).run(reImage, scales_.at(i), pnet_engine[i]);
}
Now comes my first obstacle: the Primary GIE in DeepStream only accepts one image, but this algorithm requires the input image to be scaled down into multiple sizes.
In the post you mentioned, the author of that standalone TRT application takes another approach: he scales the input image into multiple smaller sizes, stacks all of them into one big image, and feeds that into the PNet network (see the sketch below). Whichever approach is taken, MTCNN always requires an image pre-processing step, and I couldn't see how DeepStream supports that.
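Here is a rough sketch of what I understand that packing step to look like, using plain OpenCV; the vertical-stacking layout and the function name packPyramid are my own assumptions for illustration, not the exact layout used in that post:

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Pack all pyramid levels into one tall canvas so PNet only needs a single input.
// The vertical stacking layout here is an assumption for illustration.
cv::Mat packPyramid(const cv::Mat& image, const std::vector<float>& scales,
                    std::vector<cv::Rect>& regions) {
    int canvasH = 0, canvasW = 0;
    std::vector<cv::Size> sizes;
    for (float s : scales) {
        cv::Size sz((int)std::ceil(image.cols * s), (int)std::ceil(image.rows * s));
        sizes.push_back(sz);
        canvasH += sz.height;
        canvasW = std::max(canvasW, sz.width);
    }
    cv::Mat canvas = cv::Mat::zeros(canvasH, canvasW, image.type());
    int y = 0;
    for (size_t i = 0; i < scales.size(); i++) {
        cv::Mat scaled;
        cv::resize(image, scaled, sizes[i], 0, 0, cv::INTER_LINEAR);
        cv::Rect roi(0, y, sizes[i].width, sizes[i].height);
        scaled.copyTo(canvas(roi));
        regions.push_back(roi);   // remember where each scale lives in the canvas
        y += sizes[i].height;
    }
    return canvas;
}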
In general, I know DeepStream supports a custom bounding-box parser function, which covers the post-processing phase. But what about image pre-processing, not only at the Primary GIE but also at the Secondary GIE? In a face recognition application, a detected face box needs to be aligned using 5 facial landmarks; that would be the required pre-processing for an SGIE (a sketch of that alignment step follows).
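To illustrate the kind of SGIE pre-processing I mean, here is a minimal face-alignment sketch using only OpenCV. The 5 reference landmark positions for a 112x112 crop are the commonly used ArcFace-style template, and the whole function is just my assumption of how the step could look, not something DeepStream provides:

#include <opencv2/opencv.hpp>
#include <vector>

// Align a detected face to a canonical 112x112 crop using 5 facial landmarks.
// Reference points are the commonly used ArcFace-style template (assumption).
cv::Mat alignFace(const cv::Mat& frame, const std::vector<cv::Point2f>& landmarks) {
    static const std::vector<cv::Point2f> reference = {
        {38.2946f, 51.6963f},   // left eye
        {73.5318f, 51.5014f},   // right eye
        {56.0252f, 71.7366f},   // nose tip
        {41.5493f, 92.3655f},   // left mouth corner
        {70.7299f, 92.2041f}    // right mouth corner
    };
    // Estimate a similarity transform from the detected landmarks to the template.
    cv::Mat M = cv::estimateAffinePartial2D(landmarks, reference);
    cv::Mat aligned;
    cv::warpAffine(frame, aligned, M, cv::Size(112, 112));
    return aligned;
}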
At the last line of the above quoted source code ((*simpleFace_[i]).run(...)), going inside the method “run” of class Pnet, we can see that it calls the method “generateBbox”. That method transforms the bounding boxes found on the scaled image back into the coordinates of the original input image:
https://github.com/PKUZHOU/MTCNN_FaceDetection_TensorRT/blob/master/src/pnet_rt.cpp#L133
bbox.x1 = round((stride * row + 1) / scale);
bbox.y1 = round((stride * col + 1) / scale);
bbox.x2 = round((stride * row + 1 + cellsize) / scale);
bbox.y2 = round((stride * col + 1 + cellsize) / scale);
Here comes my second obstacle: even assuming the first obstacle is solved, inside the custom bounding-box parser function the scale that was used is unknown, so the boxes cannot be mapped back to the original image (see the small sketch below).
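Just to make explicit why the parser needs the scale, here is the same back-mapping from the quoted pnet_rt.cpp written as a standalone helper; the struct name and the function signature are my own, and only the arithmetic comes from the repo (stride = 2 and cellsize = 12 are the usual PNet values):

#include <cmath>

struct Bbox { float x1, y1, x2, y2; };

// Map a PNet grid cell (row, col) at a given pyramid scale back to the
// original image coordinates. Without knowing `scale`, this is impossible,
// which is exactly the problem inside a DeepStream bbox parser.
Bbox cellToOriginal(int row, int col, float scale) {
    const int stride = 2;      // PNet output stride
    const int cellsize = 12;   // PNet receptive field
    Bbox b;
    b.x1 = std::round((stride * row + 1) / scale);
    b.y1 = std::round((stride * col + 1) / scale);
    b.x2 = std::round((stride * row + 1 + cellsize) / scale);
    b.y2 = std::round((stride * col + 1 + cellsize) / scale);
    return b;
}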