[TensorRT] MTCNN Input_dims valid error - pyramid size

My goal is face recognition. The first step is to find faces and landmarks with an MTCNN network.
My MTCNN model is a TensorFlow .pb file.
I converted it to a UFF file.

Its input node, “pnet/input_image”, has dims [3, -1, -1] (C x H x W).
Because MTCNN uses an image pyramid (the same image rescaled to several sizes), -1 means the size is unknown.
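
To make the "pyramid" concrete, here is a minimal sketch of how the scales are commonly derived. The function names are hypothetical, and the defaults (a 12x12 PNet cell, a 0.709 factor, a minimum face size in pixels) are assumptions taken from common open-source MTCNN implementations, not from the model in this post.

```python
import math

def pyramid_scales(height, width, min_face_size=20, factor=0.709, cell=12):
    """Return the list of scale factors for the image pyramid.

    Assumed defaults: 12x12 PNet cell, 0.709 shrink factor per level.
    """
    base = cell / min_face_size           # maps the smallest face onto one cell
    min_side = min(height, width) * base  # shorter side after the first scaling
    scales = []
    while min_side >= cell:               # stop once the image is smaller than a cell
        scales.append(base * factor ** len(scales))
        min_side *= factor
    return scales

def scaled_shapes(height, width, scales):
    """Return the (H, W) of every pyramid level."""
    return [(math.ceil(height * s), math.ceil(width * s)) for s in scales]

# Example values (assumptions): a 1280x720 frame and a 40-pixel minimum face.
scales = pyramid_scales(720, 1280, min_face_size=40)
shapes = scaled_shapes(720, 1280, scales)
for s, (h, w) in zip(scales, shapes):
    print(f"scale={s:.4f} -> input dims [3, {h}, {w}]")
```

Every level has a different H and W, which is why the input node is declared with -1 for both.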

This works in TensorFlow, but I don’t know how to use it in TensorRT.
I tested it with “tensorrt/bin/trtexec”.

The result is:
[E] [TRT] Parameter check failed at: …/builder/Network.cpp::addInput::465, condition: isValidDims(dims)
[E] [TRT] UFFParser: failed to parseInput for node pnet/input_image
[E] [TRT] UffParse: Parser error: pnet/input_image: Failed to parse node - Invalid Tensor found at node pnet/input_image
[E] Engine could not be created

Is an “unknown size” possible for an input node?



You can start with our tutorial here:
TF-TRT: https://github.com/NVIDIA-AI-IOT/tf_trt_models
TRT: https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification

TensorRT doesn’t support dynamic input dimensions, which means the -1 option is not supported.
To work around this, you can fix your input size and down-scale or up-scale your images to meet the requirement.
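
As a rough illustration of the fixed-size approach, the sketch below fits an arbitrary frame into one fixed network input while preserving the aspect ratio, and maps detections back to the source frame. The helper names and the 720x1280 target size are assumptions for illustration only, not part of any TensorRT API.

```python
def fit_to_fixed(src_h, src_w, dst_h, dst_w):
    """Compute how to resize and pad a src frame into a fixed dst canvas."""
    scale = min(dst_h / src_h, dst_w / src_w)   # largest scale that still fits
    new_h = min(dst_h, round(src_h * scale))
    new_w = min(dst_w, round(src_w * scale))
    pad_top = (dst_h - new_h) // 2              # center the resized frame
    pad_left = (dst_w - new_w) // 2
    return scale, new_h, new_w, pad_top, pad_left

def to_source_coords(x, y, scale, pad_top, pad_left):
    """Map a point in the fixed canvas back to the original frame."""
    return (x - pad_left) / scale, (y - pad_top) / scale

# Example (assumed sizes): a 640x480 frame placed into a fixed 1280x720 input.
scale, new_h, new_w, pad_top, pad_left = fit_to_fixed(480, 640, 720, 1280)
print(scale, new_h, new_w, pad_top, pad_left)
print(to_source_coords(160, 0, scale, pad_top, pad_left))
```

With this arrangement the engine always sees the same dims, and only the pre- and post-processing depend on the camera resolution.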


I do not think this solution is correct.

If I understand the algorithm correctly, the neural network operates on a 12x12 region with a 3x3 conv. The image is scanned with a stride, and each 12x12 sub-region is passed to the network, which returns a 0.0-1.0 score and a bounding box candidate for that region. Scaling the whole image to 12x12 would be really bad.

It seems more like we need to add an additional conv layer in front for each of the scaled images.

The moderator has accepted the scaling reply, but I believe this is completely different from a fully convolutional layer. It would be much more like subsampling into 12x12 regions and passing them as inputs of the same size in a batch.
So if the current scaled image is 640x480, we sample it into 12x12 regions with a stride of 2x2 and pass them as inputs with a batch size of, say, 320 per row/iteration, 240 times.
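
To put numbers on that, here is a small sketch that counts the 12x12 windows for a given scaled image and maps a window position back to a box in the original image. It assumes only windows that lie fully inside the image are used, and the function names are hypothetical. Under that assumption the 640x480 case gives 315 windows per row and 235 rows, slightly fewer than the rough 320 and 240 above.

```python
def pnet_grid(height, width, cell=12, stride=2):
    """Number of window positions (rows, cols) that fit inside the image."""
    rows = (height - cell) // stride + 1
    cols = (width - cell) // stride + 1
    return rows, cols

def window_to_box(row, col, scale, cell=12, stride=2):
    """Map a window at grid position (row, col) of a scaled image back to
    (x1, y1, x2, y2) in the original image. `scale` is the pyramid scale."""
    x1 = round(col * stride / scale)
    y1 = round(row * stride / scale)
    x2 = round((col * stride + cell) / scale)
    y2 = round((row * stride + cell) / scale)
    return x1, y1, x2, y2

rows, cols = pnet_grid(480, 640)
total_windows = rows * cols
print(rows, cols, total_windows)

# A window two steps right and one step down, in a level scaled by 0.5:
print(window_to_box(1, 2, scale=0.5))
```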

I am trying to insert a custom layer to handle this case because the “kernel” (the 12x12 region) dictates how the output is interpreted.

I have shared a TensorRT implementation (optimized from trained caffe model files) of MTCNN face detector on GitHub. Feel free to check out the code. I’ll find time to write a blog post about it.


I will have a look on the weekend. I just added the parametrized relu as a plugin.

The one blocker I ran into was that I cannot deserialize a model once it has been saved on Windows. I need cross-platform support and don’t really want to build the model every time for what seems to be a simple issue. I am guessing it is alignment or byte order.

Sorry it took so long for me to have a look but thanks for this!

I too ran into that problem on Windows when loading the model.

AFAIK, Caffe models have dynamic dimensions and TensorRT does not. That creates what they call a “fully convolutional layer”: if you pass a 1920x1080 frame to Caffe, it will do the stage 1 12x12 sampling with stride 2x2 into the input of the CNN, which applies a 3x3 kernel to it.

I could not see where the first 12x12 sampling scan of the input is being performed.
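
One way to test whether an explicit scan is needed is a small experiment. The sketch below uses a toy single-channel stack with a PNet-like layout (3x3 conv, 2x2 max pool with stride 2, then two more 3x3 convs, all without padding). This layout, the random weights, and the names are assumptions for illustration, not the real PNet. It runs the stack once over a whole image and once per 12x12 crop taken at stride 2, then compares the results.

```python
import random

def conv3x3_valid(img, kernel):
    """3x3 cross-correlation, no padding, stride 1."""
    h, w = len(img), len(img[0])
    return [[sum(img[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(3) for dx in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]

def maxpool2x2(img):
    """2x2 max pooling with stride 2 (floor mode)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[max(img[2 * y][2 * x], img[2 * y][2 * x + 1],
                 img[2 * y + 1][2 * x], img[2 * y + 1][2 * x + 1])
             for x in range(w)]
            for y in range(h)]

def toy_pnet(img, kernels):
    """conv -> pool -> conv -> conv; a 12x12 input yields a 1x1 output."""
    k1, k2, k3 = kernels
    out = maxpool2x2(conv3x3_valid(img, k1))
    return conv3x3_valid(conv3x3_valid(out, k2), k3)

def crop(img, top, left, size=12):
    return [row[left:left + size] for row in img[top:top + size]]

rng = random.Random(0)
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(3)]
image = [[rng.random() for _ in range(20)] for _ in range(16)]  # H=16, W=20

# One pass over the whole image gives a map of outputs.
full_map = toy_pnet(image, kernels)

# Compare every map cell with the output for the matching 12x12 crop.
max_diff = 0.0
for i, row in enumerate(full_map):
    for j, value in enumerate(row):
        window_out = toy_pnet(crop(image, 2 * i, 2 * j), kernels)[0][0]
        max_diff = max(max_diff, abs(value - window_out))

print(len(full_map), len(full_map[0]), max_diff)
```

If the difference is zero, then for a stack of this shape the 12x12 scan with stride 2 is already implied by running the convolutions over the full image, and no separate sampling step would show up in the code.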

I have the same issue with other samples, and either I am missing it totally or everyone is playing follow-the-leader in using a Caffe model with TensorRT.

On the bright side, if I am correct, there is potential for a large amount of batching in the first pass.

I pre-computed all scales of a 1280x720 input image and “stacked them up vertically into 1 big image”, so I only need to run PNet inference once. After I implemented this optimization, PNet’s detect() function indeed took much less time to run. However, PNet still accounted for the majority of the program’s running time.
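
For readers who want to picture the stacking, here is a minimal sketch of the layout arithmetic only. It is not the code from the repository: the function name, the example shapes, and the choice to start every sub-image on an even row (so it lines up with the assumed stride of 2) are all assumptions.

```python
def stack_vertically(shapes, cell=12, stride=2):
    """Place scaled images of the given (h, w) shapes one below another.

    Returns the canvas (H, W) and, per image, where it sits in the canvas
    and which rows/cols of the PNet output map belong to it.
    """
    placements = []
    top = 0
    for h, w in shapes:
        placements.append({
            "top": top, "h": h, "w": w,
            "out_top": top // stride,                 # first output row
            "out_rows": (h - cell) // stride + 1,     # valid output rows
            "out_cols": (w - cell) // stride + 1,     # valid output cols
        })
        top += h
        top += (-top) % stride    # keep the next image aligned to the stride
    last = placements[-1]
    canvas = (last["top"] + last["h"], max(w for _, w in shapes))
    return canvas, placements

# Hypothetical shapes of three pyramid levels.
canvas, placements = stack_vertically([(216, 384), (153, 272), (109, 193)])
print(canvas)
for p in placements:
    print(p)
```

Output rows that fall between two sub-images come from windows that straddle a boundary, so in this sketch only the rows and columns listed per image would be read back.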

I explained it in more detail in this blog post, “Optimizing TensorRT MTCNN”: https://jkjung-avt.github.io/optimize-mtcnn/

I have read the material and looked at a lot of implementations, and had to go back to the original caffemodel code to confirm. I believe this is still missing the sliding 12x12 window across each of the scaled images in the image pyramid.

this person sounds like they actually “got it”