OpenCV built-in face recognition CNN model takes around 500 ms (poor performance) on Jetson Nano. Help!

Hello Experts,

CC: @Honey_Patouceul @DaneLLL @Amycao @kayccc @icornejo.a @AastaLLL @dusty_nv @forumuser @Jeffli @fpsychosis

I am using Python to run face recognition on a Jetson Nano evaluation board with JetPack 4.4.

I have already compiled dlib with CUDA enabled and installed it.

When I profile the built-in CNN face recognition model, it takes around 500 ms per frame, which feels very high for such a powerful GPU on a simple 720p image.
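For reference, here is a minimal sketch of how I am timing it (assuming the dlib-backed face_recognition package; the image path is a placeholder):

import time
import face_recognition  # dlib-backed; assumes dlib was built with CUDA

image = face_recognition.load_image_file("frame_720p.jpg")  # placeholder path

# Warm-up call so CUDA context creation is not counted in the measurement
face_recognition.face_locations(image, model="cnn")

start = time.time()
boxes = face_recognition.face_locations(image, model="cnn")
print("CNN face detection: %.1f ms" % ((time.time() - start) * 1000.0))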


Hi,

Would you mind checking the GPU utilization with tegrastats first?
To reach optimal performance, you would expect to see ~99% GPU utilization.

$ sudo tegrastats

If I remember correctly, the built-in DNN inference in OpenCV is a CPU implementation rather than a GPU one.
You will need to build OpenCV from source with the extra modules (opencv_contrib) to get GPU support.
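Once OpenCV is rebuilt with CUDA support, you can ask the DNN module to run on the GPU. A minimal sketch (the model file below is the OpenFace Torch model commonly used with cv2.dnn for face embeddings; file paths are placeholders):

import cv2

# Requires OpenCV >= 4.2 built with CUDA and the opencv_contrib modules;
# without that, these flags fall back to the CPU backend.
net = cv2.dnn.readNetFromTorch("openface_nn4.small2.v1.t7")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

face = cv2.imread("face.jpg")  # placeholder: an already-cropped face image
blob = cv2.dnn.blobFromImage(face, 1.0 / 255, (96, 96),
                             (0, 0, 0), swapRB=True, crop=False)
net.setInput(blob)
embedding = net.forward()  # 128-D face embedding vector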

Here is a tutorial on building OpenCV from source for your reference:

Thanks.

hi techguyz:
1: As @AastaLLL said, make sure the GPU is utilized.
2: To fully speed up your model's inference time, you should use TensorRT to optimize your model and create a TRT engine, which is the preferred way to run it on the GPU.

Hi @AastaLLL

Please find the tegrastats logs below. It seems the GPU is already fully utilized.

60/6205 POM_5V_GPU 2432/2655 POM_5V_CPU 1144/1200
RAM 2731/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [63%@1479,15%@1479,6%@1479,42%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@47.5C CPU@51C PMIC@100C GPU@51.5C AO@55.5C thermal@51C POM_5V_IN 6403/6244 POM_5V_GPU 3106/2745 POM_5V_CPU 1107/1182
RAM 2734/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [45%@1479,15%@1479,27%@1479,48%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@47.5C CPU@51C PMIC@100C GPU@51C AO@56C thermal@51.25C POM_5V_IN 6046/6211 POM_5V_GPU 2499/2704 POM_5V_CPU 1144/1175
RAM 2731/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [34%@1479,29%@1479,14%@1479,44%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@47.5C CPU@51C PMIC@100C GPU@51.5C AO@55.5C thermal@51.25C POM_5V_IN 6368/6234 POM_5V_GPU 2856/2726 POM_5V_CPU 1073/1161
RAM 2732/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [50%@1479,37%@1479,27%@1479,34%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@47.5C CPU@51.5C PMIC@100C GPU@51.5C AO@56C thermal@51.5C POM_5V_IN 6200/6229 POM_5V_GPU 2683/2720 POM_5V_CPU 1218/1168
RAM 2732/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [59%@1479,15%@1479,43%@1479,12%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@47.5C CPU@51C PMIC@100C GPU@51.5C AO@56C thermal@51.25C POM_5V_IN 6403/6249 POM_5V_GPU 3106/2763 POM_5V_CPU 1071/1157
RAM 2732/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [52%@1479,18%@1479,37%@1479,20%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@47.5C CPU@51.5C PMIC@100C GPU@51C AO@56C thermal@52.25C POM_5V_IN 6368/6260 POM_5V_GPU 2964/2783 POM_5V_CPU 1001/1141
RAM 2734/3956MB (lfb 2x4MB) SWAP 530/1978MB (cached 38MB) CPU [50%@1479,13%@1479,32%@1479,43%@1479] EMC_FREQ 0% GR3D_FREQ 99% PLL@48C CPU@51C PMIC@100C GPU@52C AO@56C thermal@51.5C POM_5V_IN 6153/6

Hello @Jeffli

As the model comes bundled with OpenCV, is there any guide on how to load it, optimize it, and plug the optimized version back into OpenCV?

Also, when I initially installed dlib without CUDA support, enabling the CNN model caused the device to hang.
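A quick check to confirm whether the installed dlib build actually has CUDA enabled (using dlib's own flags) is:

import dlib

# True only if dlib was compiled with CUDA support (DLIB_USE_CUDA=1)
print("dlib built with CUDA:", dlib.DLIB_USE_CUDA)
print("CUDA devices visible:", dlib.cuda.get_num_devices())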

hi techguyz:
This face recognition model (FaceNet) seems to come from Torch (t7 format), so there is some extra work involved if you want to speed it up.
1: Try converting the t7 file to ONNX; TensorRT supports ONNX. Here is a reference guide, but I am not sure whether it will work correctly.

2: Use TensorRT to convert the ONNX file into a TensorRT engine file; this is also not an easy step (see the sketch below).
3: If you can successfully get the engine file (the model optimized via TensorRT), use the C++ interface of TensorRT to integrate it into your OpenCV code, load your engine file, and run inference from it.
This is the guide for the C++ interface of TRT.

None of these steps is easy to get done; the only way is to try, and keep trying, if you want to fully use the GPU's capacity.
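As a rough illustration of step 2, here is a minimal sketch of building an engine from an ONNX file with the TensorRT 7 Python API that ships with JetPack 4.4 (the file names are placeholders, and this assumes the ONNX model parses cleanly):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build a TensorRT engine from an ONNX file (TensorRT 7.x API on JetPack 4.4).
explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(explicit_batch) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    builder.max_workspace_size = 1 << 28  # 256 MB of build workspace
    with open("facenet.onnx", "rb") as f:  # placeholder ONNX file name
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise SystemExit("ONNX parsing failed")
    engine = builder.build_cuda_engine(network)
    with open("facenet.engine", "wb") as out:  # serialized engine for reuse
        out.write(engine.serialize())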

Hi @Jeffli

This looks a bit complex, and my Torch-to-ONNX conversion didn't work as expected.

Is there any other quick hack? I am OK with using a new model as long as the performance is good.

hi techguyz:
The easy way is to find a model that already supports TRT, so you just need to convert it to a TRT engine file; otherwise you have to speed the model up with TensorRT yourself.
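For example, the jetson-inference project ships models that are already TensorRT-optimized. A minimal sketch using its bundled "facenet" face detection network (assuming a recent jetson-inference build; the image path is a placeholder, and the first run spends a few minutes building the engine):

import jetson.inference
import jetson.utils

# "facenet" is jetson-inference's TensorRT-optimized face detection model
net = jetson.inference.detectNet("facenet", threshold=0.5)

img = jetson.utils.loadImage("face.jpg")  # placeholder image path
detections = net.Detect(img)
print("Detected {:d} face(s)".format(len(detections)))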