How to parse CRNN model / OCR Text Recognition

• Hardware Platform: amd64
• DeepStream Version: 5.0.0
• TensorRT Version: 7.2.2-1+cuda10.2
• NVIDIA GPU Driver Version (valid for GPU only): 450.51

Hello,

I’m new to DeepStream and I’m trying to use one of the ONNX pre-trained models shared in the OpenCV tutorial TextDetectionModel and TextRecognitionModel.

ONNX Models folder: trained_model_for_text_recognition - Google Drive

I’m able to run the DeepStream app using the ONNX model as a classifier; it is converted to a TensorRT engine with the following layers:

0 INPUT kFLOAT input 1x32x100
1 OUTPUT kHALF output 1x37

But according to OpenCV documentation:

the output of the text recognition model should be a probability matrix. The shape should be (T, B, Dim), where

  • T is the sequence length
  • B is the batch size (only B=1 is supported in inference)
  • and Dim is the length of the vocabulary + 1 (the CTC blank is at index 0 of Dim).

Netron shows the ONNX network output as float32(26,1,37): https://imgur.com/2mWcHR9 (Dim = 37 matches the 36-character vocabulary plus the CTC blank.)

Apparently DeepStream gets a different output shape, missing the T sequence length.
Am I doing something wrong, or did the conversion to TensorRT not perform well?

I also tried adding a custom function to parse the output (std::vector<NvDsInferLayerInfo> const &outputLayersInfo), but when running the pipeline I get this on every frame:

numAttributes = 1
numClasses = 1
layerHeight = 37
layerWidth = 0
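
For context, the dimensions DeepStream reports can be dumped from inside the custom parser along these lines (a minimal sketch; dumpLayerDims is an illustrative helper, not part of the DeepStream API):

#include <cstdio>
#include <vector>
#include "nvdsinfer.h"

// Dump the dimensions DeepStream reports for each output layer.
// With these engines, DeepStream reports 1x37, i.e. the T dimension is gone,
// even though trtexec shows 26x1x37 for the same model.
static void dumpLayerDims(std::vector<NvDsInferLayerInfo> const &outputLayersInfo)
{
    for (const NvDsInferLayerInfo &layer : outputLayersInfo) {
        std::printf("%s:", layer.layerName);
        for (unsigned int i = 0; i < layer.inferDims.numDims; ++i)
            std::printf(" %u", layer.inferDims.d[i]);
        std::printf(" (numElements=%u)\n", layer.inferDims.numElements);
    }
}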

Any help to solve this issue or some example to parse the output of similar networks is much appreciated.

Thank you

Hi,

Please run your model with trtexec and check the output dimensions first.
For example, with mnist.onnx:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --dumpOutput

Output:

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --dumpOutput
[01/29/2021-12:08:17] [I] === Model Options ===
[01/29/2021-12:08:17] [I] Format: ONNX
[01/29/2021-12:08:17] [I] Model: /usr/src/tensorrt/data/mnist/mnist.onnx
[01/29/2021-12:08:17] [I] Output:
...
[01/29/2021-12:08:43] [I] Output Tensors:
[01/29/2021-12:08:43] [I] Plus214_Output_0: (1x10)
[01/29/2021-12:08:43] [I] -1.60912 -0.901603 1.55434 1.57656 -0.0528269 -0.897766 0.831163 0.671341 -0.281258 -0.289106
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --dumpOutput

The output dimension is (1x10).

Thanks.

Hi @AastaLLL

For both ONNX models I tried, the output dimensions reported by trtexec seem to be correct:

[01/29/2021-10:26:33] [I] Output Tensors:
[01/29/2021-10:26:33] [I] output: (26x1x37)
[01/29/2021-10:26:33] [I] -5.94975 -12.1903 -9.81156 -10.347 -11.3837 -11.4158 -12.1832 -12.044 -12.0649 (…)
(…)
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ResNet_CTC.onnx --dumpOutput

[01/29/2021-10:28:13] [I] Output Tensors:
[01/29/2021-10:28:13] [I] output: (24x1x37)
[01/29/2021-10:28:13] [I] -2.05273 -8.35156 -6.10156 -6.12109 -6.78906 -7.73828 -8.1875 -8.17188 -8.11719 (…)
(…)
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=CRNN_VGG_BiLSTM_CTC_float16.onnx --dumpOutput

Any idea why DeepStream shows an output of just 1x37 for both models?

Thank you

Maybe this post can help you: How to get result_label from custom classification parser

The output shape of your model must be 1x24x37.

Hi @PhongNT,

Thanks a lot for your help.
You mean I need to permute/transpose the axes of the original model in order to make it work with DeepStream?

I can’t figure out why the output shape is correct in TensorRT but not in DeepStream.

@AastaLLL any thoughts on this?

Yes, in my experience.

Hi,

You will need a customized parser, as suggested by phongnguyen0812.
The DeepStream workflow looks like this:

Input → Preprocessing (e.g., format conversion) → TensorRT → Output parsing (e.g., tensor to bbox)

So based on the experiment above, the tensor output from TensorRT is correct,
but something goes wrong when parsing the tensor into the final DeepStream output.

Please note that DeepStream doesn’t have a parser that supports text output.
You will need to implement it on your own:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_using_custom_model.html#custom-output-parsing
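
As a starting point, here is a minimal sketch of such a classifier parser (assumptions: the output binding is FP32 and its buffer still holds the full T x 1 x Dim values even though the reported dims drop T; T is hard-coded; greedy CTC decoding; the function name NvDsInferParseTextRecognition is illustrative):

#include <algorithm>
#include <string>
#include <vector>

#include "nvdsinfer_custom_impl.h"

// Assumption: Dim = 37, i.e. the CTC blank at index 0 followed by the
// 36-character vocabulary used by the OpenCV text recognition models.
static const std::string kAlphabet = "0123456789abcdefghijklmnopqrstuvwxyz";

extern "C" bool NvDsInferParseTextRecognition(
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    float classifierThreshold,
    std::vector<NvDsInferAttribute> &attrList,
    std::string &descString)
{
    const NvDsInferLayerInfo &layer = outputLayersInfo[0];

    // Assumption: FP32 buffer. The kHALF engine above would need to be
    // rebuilt with an FP32 output (or the buffer read as half precision).
    const float *probs = static_cast<const float *>(layer.buffer);

    const int dim = kAlphabet.size() + 1;  // 37
    const int seqLen = 26;                 // T of ResNet_CTC (24 for the CRNN)

    // Greedy CTC decoding: argmax per time step, collapse repeats,
    // drop blanks (index 0).
    std::string text;
    int prev = 0;
    for (int t = 0; t < seqLen; ++t) {
        const float *step = probs + t * dim;
        int best = std::max_element(step, step + dim) - step;
        if (best != 0 && best != prev)
            text += kAlphabet[best - 1];
        prev = best;
    }

    // A full parser would also fill attrList with an NvDsInferAttribute;
    // here only the decoded string is returned.
    descString = text;
    return true;
}

CHECK_CUSTOM_CLASSIFIER_PARSE_FUNC_PROTOTYPE(NvDsInferParseTextRecognition);

The compiled library is then hooked up through the parse-classifier-func-name and custom-lib-path keys in the nvinfer configuration file.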

Thanks.