OpenCV cv::cuda::CascadeClassifier performance

itaowazard · August 14, 2019, 10:06am

Hi,

I’m trying to implement simple object detection (OpenCV Haar cascades). Since the Jetson TX2 can use CUDA for this kind of processing, the OpenCV CUDA implementation looked like the right way to go. However, after implementing both the CPU and GPU versions, I noticed no significant performance difference between the two approaches (about 200 ms per frame for both CPU and GPU).

Input: camera-captured image (1280x720)
jetson_clocks and nvpmodel -m 0 are set
OpenCV 3.4.0 built with CUDA support
Release build (as I’ve already noticed, this is very important for CUDA performance)

CascadeClassifier instances are created according to the example code (provided with the OpenCV distribution):

  g_pFaceClassifier = cv::CascadeClassifier(HAAR_CASCADE_FILENAME);               // CPU based
  g_pFaceClassifier = cv::cuda::CascadeClassifier::create(HAAR_CASCADE_FILENAME); //GPU based

HAAR_CASCADE_FILENAME is correct in both cases (a different cascade file is used for each).

Detection calls:
  g_pFaceClassifier.detectMultiScale(gray,found,1.1, 2,  0  | cv::CASCADE_SCALE_IMAGE, cv::Size(32, 32)); //CPU based
  g_pFaceClassifier->detectMultiScale(gpuImg,outImg); //GPU based
  g_pFaceClassifier->convert(outImg, found);          //GPU based requires extra func call to convert output

Result: both versions take about 200 ms to process one frame.
I checked CPU usage with the htop utility: the CPU-based version used 100% of all 6 cores (nvpmodel -m 0), while the GPU-based version used about 15% on some cores (except the one that handles the OS and application routine calls).
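One way to see where the 200 ms goes is to time each step of the GPU path on its own (upload of the frame, detectMultiScale, convert) instead of the whole frame. The helper below is only a minimal sketch using the standard library: `time_ms` and `average_ms` are hypothetical names, not OpenCV functions, and the numbers are only meaningful if the timed call blocks until its work has finished.

```cpp
#include <cassert>
#include <chrono>

// Runs `stage` once and returns its wall-clock duration in milliseconds.
template <typename Stage>
double time_ms(Stage&& stage) {
    const auto start = std::chrono::steady_clock::now();
    stage();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Averages `runs` timed calls after `warmup` untimed ones. The warm-up is
// there because the first CUDA call in a process typically also pays for
// one-time initialisation, which would distort a single measurement.
template <typename Stage>
double average_ms(Stage&& stage, int runs, int warmup = 1) {
    assert(runs > 0 && warmup >= 0);
    for (int i = 0; i < warmup; ++i) stage();
    double total = 0.0;
    for (int i = 0; i < runs; ++i) total += time_ms(stage);
    return total / runs;
}
```

With the calls from this post, usage would look something like `average_ms([&] { g_pFaceClassifier->detectMultiScale(gpuImg, outImg); }, 20)`, and the same for the upload and convert steps. Comparing the three averages should show whether the detection itself or the data movement around it dominates.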

Has anyone used or tried the OpenCV CUDA-based CascadeClassifier implementation for object detection? I would much appreciate a comment with performance figures (I probably did something wrong).
PS. I’ve also checked the cuda::HOG sample provided with the OpenCV distribution (I only needed to modify the VideoCapture pipeline setup to get the stream from the onboard camera). It ran at about 10 FPS for 1280x720 in CUDA mode (and about 4 FPS in CPU mode). That is less than I expected.

Thanks to anyone who can provide any information or advice.

AastaLLL · August 15, 2019, 6:23am

Hi,

There are some common issues I’d like to check with you first.

1. Please note that the performance scripts need to be executed in order.

It’s recommended to set the device’s power mode first and then lock the clocks to the maximum.
The reverse order will reset the clocks to dynamic.

sudo nvpmodel -m 0
sudo tegrastats

2. Have you compiled OpenCV for the TX2’s compute capability, which should be sm_62?

cmake -D WITH_CUDA=ON -D CUDA_ARCH_BIN="6.2" -D CUDA_ARCH_PTX=""  ...

Thanks.

itaowazard · August 15, 2019, 6:50am

Hi AastaLLL,


OK, I’ve set the device to power mode (nvpmodel -m 0) and got this tegrastats output:

RAM 2690/7859MB (lfb 985x4MB) SWAP 0/3929MB (cached 0MB) CPU [23%@2035,51%@2035,43%@2035,15%@2035,14%@2035,17%@2035] EMC_FREQ 0% GR3D_FREQ 31% PLL@52C MCPU@52C PMIC@100C Tboard@45C GPU@54C BCPU@52C thermal@52.7C Tdiode@55C VDD_SYS_GPU 5349/5387 VDD_SYS_SOC 1146/1146 VDD_4V0_WIFI 76/76 VDD_IN 11691/11682 VDD_SYS_CPU 1680/1584 VDD_SYS_DDR 1875/1860
RAM 2690/7859MB (lfb 985x4MB) SWAP 0/3929MB (cached 0MB) CPU [25%@499,45%@2035,50%@2035,13%@499,15%@499,21%@499] EMC_FREQ 0% GR3D_FREQ 27% PLL@51.5C MCPU@51.5C PMIC@100C Tboard@45C GPU@54.5C BCPU@51.5C thermal@53C Tdiode@54.5C VDD_SYS_GPU 5501/5409 VDD_SYS_SOC 1146/1146 VDD_4V0_WIFI 19/64 VDD_IN 11538/11653 VDD_SYS_CPU 1527/1573 VDD_SYS_DDR 1837/1856

I’m not sure yet which parameters are the most important to track when looking for a performance shortfall.
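For reading these lines, the two fields most relevant here are probably the per-core `CPU [load%@MHz,...]` block and `GR3D_FREQ`, which reports GPU load in percent. The sketch below pulls both out of a single tegrastats line. It assumes the field layout seen in the two lines above; `TegrastatsSample`, `parse_tegrastats` and `all_cores_at` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct TegrastatsSample {
    std::vector<int> cpu_load_pct;  // per-core load, percent
    std::vector<int> cpu_freq_mhz;  // per-core clock, MHz
    int gr3d_pct = -1;              // GPU load, percent; -1 if not found
};

// Extracts the CPU block and GR3D_FREQ from one tegrastats line.
TegrastatsSample parse_tegrastats(const std::string& line) {
    static const std::regex cpu_block(R"(CPU \[([^\]]*)\])");
    static const std::regex core(R"((\d+)%@(\d+))");
    static const std::regex gr3d(R"(GR3D_FREQ (\d+)%)");

    TegrastatsSample sample;
    std::smatch match;
    if (std::regex_search(line, match, cpu_block)) {
        const std::string cores = match[1].str();
        for (std::sregex_iterator it(cores.begin(), cores.end(), core), end;
             it != end; ++it) {
            sample.cpu_load_pct.push_back(std::stoi((*it)[1].str()));
            sample.cpu_freq_mhz.push_back(std::stoi((*it)[2].str()));
        }
    }
    if (std::regex_search(line, match, gr3d)) {
        sample.gr3d_pct = std::stoi(match[1].str());
    }
    return sample;
}

// True only if at least one core was parsed and every core reports `mhz`.
bool all_cores_at(const TegrastatsSample& sample, int mhz) {
    assert(mhz > 0);
    if (sample.cpu_freq_mhz.empty()) return false;
    for (int freq : sample.cpu_freq_mhz) {
        if (freq != mhz) return false;
    }
    return true;
}
```

Applied to the output above, the first line has every core at 2035 MHz, while the second has four cores at 499 MHz, which suggests the CPU clocks were not pinned at that moment. GR3D_FREQ is 31% and 27% respectively.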

My full OpenCV cmake command:
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local \
-D WITH_CUDA=ON -D CUDA_ARCH_BIN="6.2" -D CUDA_ARCH_PTX="" \
-D WITH_CUBLAS=ON -D ENABLE_FAST_MATH=ON -D CUDA_FAST_MATH=ON \
-D ENABLE_NEON=ON -D WITH_LIBV4L=ON -D BUILD_TESTS=OFF \
-D BUILD_PERF_TESTS=OFF -D BUILD_EXAMPLES=OFF \
-D WITH_QT=ON -D WITH_OPENGL=ON …

I think it should be fine.

AastaLLL · August 16, 2019, 3:17am

Hi,

Could you profile your application with nvprof first?
It can show you where the performance bottleneck is.

$ sudo /usr/local/cuda-10.0/bin/nvprof [app]

It looks like the GPU utilization is only 20%~30%.
This indicates that the kernel might be waiting for the input source most of the time.

Thanks.

itaowazard · August 16, 2019, 5:35am

Hi AastaLLL,

Thanks for the advice.

I’ve checked the OpenCV samples (HOG and CascadeClassifier with haarcascade_frontalface_alt.xml) and got the profiler output, but I’m not sure I can interpret the results correctly, especially since I’m not the one who wrote the code. I’ve attached the profiler output for both cases in case you can or want to look at it. Perhaps this (the performance) is something for OpenCV to resolve.
hog_profile.txt (3.58 KB)
haar_profile.txt (6.18 KB)

AastaLLL · August 29, 2019, 9:27am

Hi,

The GPU utilization doesn’t reach 100%, which indicates there is still some room for improvement.
It looks like the CUDA kernel is waiting for data from the CPU, e.g. [CUDA memcpy HtoD].

You can try adding one more pipeline to keep the GPU always busy.
While pipeline A is running the algorithm with the CUDA kernel, another pipeline B can start preparing the data for the next frame.
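The two-pipeline idea can be sketched as a producer/consumer pair: one thread prepares frame N+1 while the other processes frame N. The code below is a generic standard-library sketch, not OpenCV code. `BoundedQueue` and `run_pipeline` are hypothetical names; `prepare` stands in for pipeline B (capture, grayscale conversion, upload) and `process` for pipeline A (detection). Whether the actual OpenCV CUDA calls overlap well when driven this way is an assumption that would need to be measured.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Bounded hand-off queue between the two stages.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the queue is full, so the producer cannot run far ahead.
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Signals that no more items will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    bool closed_ = false;
    std::queue<T> items_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs prepare(i) on a worker thread (pipeline B) while the calling thread
// runs process(frame) (pipeline A). Results come back in frame order.
template <typename Frame, typename Result, typename Prepare, typename Process>
std::vector<Result> run_pipeline(std::size_t frame_count, Prepare prepare,
                                 Process process) {
    BoundedQueue<Frame> ready(2);  // at most two prepared frames in flight
    std::thread producer([&] {
        for (std::size_t i = 0; i < frame_count; ++i) ready.push(prepare(i));
        ready.close();
    });
    std::vector<Result> results;
    Frame frame;
    while (ready.pop(frame)) results.push_back(process(std::move(frame)));
    producer.join();
    return results;
}
```

The queue capacity of 2 is an arbitrary choice for the sketch: it lets one frame be prepared ahead without piling up frames when detection is the slower stage. Error handling (an exception in either stage) is left out.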

Thanks.
