OpenCV cv::cuda::CascadeClassifier performance

Hi,

I’m trying to implement simple object detection (OpenCV Haar cascade). Because the Jetson TX2 platform can use CUDA for this kind of processing, the OpenCV CUDA implementation looked like the right way to go. However, after implementing both the CPU and GPU versions, I noticed no significant performance difference between the two approaches (about 200 ms per frame for both CPU and GPU).

Input: camera-captured image (1280x720)
jetson_clocks and nvpmodel -m 0 are set
OpenCV 3.4.0 built with CUDA support
Release build (as I’ve already noticed, this is very important for CUDA performance)

  CascadeClassifier instances are created according to the example code (provided with the OpenCV distribution):

  g_pFaceClassifier = cv::CascadeClassifier(HAAR_CASCADE_FILENAME);               // CPU based
  g_pFaceClassifier = cv::cuda::CascadeClassifier::create(HAAR_CASCADE_FILENAME); //GPU based

HAAR_CASCADE_FILENAME is correct in both cases (different cascade files are used for the CPU and GPU versions).

  Detection calls:
  g_pFaceClassifier.detectMultiScale(gray, found, 1.1, 2, 0 | cv::CASCADE_SCALE_IMAGE, cv::Size(32, 32)); // CPU based
  g_pFaceClassifier->detectMultiScale(gpuImg, outImg); // GPU based
  g_pFaceClassifier->convert(outImg, found);           // GPU based: extra call needed to convert the output

RESULT:
Both versions take about 200 ms to process one frame.
I checked CPU usage with htop: for the CPU version it was 100% on all 6 cores (nvpmodel -m 0), and for the GPU version it was about 15% on some cores (except the one that handles the OS and application routine calls).
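
To see where the 200 ms goes, each step (upload, detectMultiScale, convert) can be timed on its own. Below is a minimal sketch of such a measurement. The helper name averageMs is hypothetical and not part of OpenCV; it simply runs a callable a few times untimed (so one-time costs such as CUDA initialization are excluded) and then averages the remaining runs.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of OpenCV): average wall-clock time of `stage`
// in milliseconds. The warm-up calls are executed but not timed, so one-time
// start-up costs do not inflate the result.
template <typename Stage>
double averageMs(Stage&& stage, std::size_t warmupRuns, std::size_t timedRuns)
{
    for (std::size_t i = 0; i < warmupRuns; ++i)
        stage();

    if (timedRuns == 0)
        return 0.0;  // nothing to average

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < timedRuns; ++i)
        stage();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(timedRuns);
}
```

For example, wrapping only the GPU detectMultiScale call in a lambda and passing it to averageMs with a few warm-up runs would show how much of the frame time is the detection itself, as opposed to capture, upload or convert. This assumes the wrapped call blocks until the GPU work has finished; if it does not, the lambda has to synchronize before returning.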

Has anyone used or tried the OpenCV CUDA-based CascadeClassifier implementation for object detection? I would much appreciate a comment with performance numbers (probably I did something wrong).
PS. I’ve also checked the cuda::HOG sample provided with the OpenCV distribution (I only needed to modify the VideoCapture pipeline setup to get the stream from the onboard camera). It ran at about 10 FPS for 1280x720 with CUDA (and about 4 FPS in CPU mode). That is less than I expected.

Thanks to anyone who can provide any information or advice.

Hi,

There are some common issues I want to check with you first.

1. Please note that the performance scripts need to be executed in order.

It’s recommended to set the device’s power mode first and then lock the clocks to the maximum.
The reverse order will reset the clocks to dynamic scaling.

sudo nvpmodel -m 0
sudo jetson_clocks
sudo tegrastats

2. Have you compiled OpenCV for the TX2 compute capability, which should be sm_62?

cmake -D WITH_CUDA=ON -D CUDA_ARCH_BIN="6.2" -D CUDA_ARCH_PTX=""  ...

Thanks.

Hi AastaLLL,

OK, I’ve set the device’s power mode (nvpmodel -m 0) and got this tegrastats output:

RAM 2690/7859MB (lfb 985x4MB) SWAP 0/3929MB (cached 0MB) CPU [23%@2035,51%@2035,43%@2035,15%@2035,14%@2035,17%@2035] EMC_FREQ 0% GR3D_FREQ 31% PLL@52C MCPU@52C PMIC@100C Tboard@45C GPU@54C BCPU@52C thermal@52.7C Tdiode@55C VDD_SYS_GPU 5349/5387 VDD_SYS_SOC 1146/1146 VDD_4V0_WIFI 76/76 VDD_IN 11691/11682 VDD_SYS_CPU 1680/1584 VDD_SYS_DDR 1875/1860
RAM 2690/7859MB (lfb 985x4MB) SWAP 0/3929MB (cached 0MB) CPU [25%@499,45%@2035,50%@2035,13%@499,15%@499,21%@499] EMC_FREQ 0% GR3D_FREQ 27% PLL@51.5C MCPU@51.5C PMIC@100C Tboard@45C GPU@54.5C BCPU@51.5C thermal@53C Tdiode@54.5C VDD_SYS_GPU 5501/5409 VDD_SYS_SOC 1146/1146 VDD_4V0_WIFI 19/64 VDD_IN 11538/11653 VDD_SYS_CPU 1527/1573 VDD_SYS_DDR 1837/1856

I’m not sure yet which parameters are the most important to track when looking for the cause of the poor performance.

My full OpenCV build cmake command:
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local \
-D WITH_CUDA=ON -D CUDA_ARCH_BIN="6.2" -D CUDA_ARCH_PTX="" \
-D WITH_CUBLAS=ON -D ENABLE_FAST_MATH=ON -D CUDA_FAST_MATH=ON \
-D ENABLE_NEON=ON -D WITH_LIBV4L=ON -D BUILD_TESTS=OFF \
-D BUILD_PERF_TESTS=OFF -D BUILD_EXAMPLES=OFF \
-D WITH_QT=ON -D WITH_OPENGL=ON …

I think it should be fine.

Hi,

Could you profile your application with nvprof first?
It can show you where the performance bottleneck is.

$ sudo /usr/local/cuda-10.0/bin/nvprof [app]

It looks like the GPU utilization is only 20%~30%.
This indicates that the kernel might be waiting for the input source most of the time.

Thanks.

Hi AastaLLL,

Thanks for the advice.

I’ve run the OpenCV samples (HOG and CascadeClassifier with haarcascade_frontalface_alt.xml) under the profiler and got the output, but I’m not sure I can interpret the results correctly, especially because I’m not the one who wrote the code. I’ve attached the profiler output for both cases if you can/want to view it. Perhaps the performance is something for OpenCV to resolve on its side.
hog_profile.txt (3.58 KB)
haar_profile.txt (6.18 KB)

Hi,

The GPU utilization doesn’t reach 100%, which indicates there is still some room for improvement.
It looks like the CUDA kernel is waiting for data from the CPU, e.g. [CUDA memcpy HtoD].

You can try adding one more pipeline to keep the GPU always busy.
While pipeline A is running the algorithm with the CUDA kernel, another pipeline B can start preparing the data for the next frame.
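
A rough sketch of that two-pipeline idea, using only the C++ standard library. Everything here is an assumption for illustration: Frame, FrameQueue and runPipelined are hypothetical names, prepare stands for capture plus grayscale conversion (pipeline B), and detect stands for the upload and CUDA detection (pipeline A). It is not taken from the OpenCV samples.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a captured frame that is ready for detection.
struct Frame {
    int index;
};

// Small bounded queue that hands frames from the preparing thread to the detecting thread.
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full, so the producer cannot run far ahead.
    void push(Frame frame) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return frames_.size() < capacity_; });
        frames_.push(std::move(frame));
        notEmpty_.notify_one();
    }

    // Returns false once the producer has closed the queue and it is drained.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !frames_.empty() || closed_; });
        if (frames_.empty())
            return false;
        out = std::move(frames_.front());
        frames_.pop();
        notFull_.notify_one();
        return true;
    }

    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::queue<Frame> frames_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    bool closed_ = false;
};

// Runs `prepare` (pipeline B) on its own thread while `detect` (pipeline A)
// consumes frames on the calling thread. Returns the frame indices in the
// order they were detected.
template <typename Prepare, typename Detect>
std::vector<int> runPipelined(int frameCount, Prepare prepare, Detect detect)
{
    FrameQueue queue(2);  // two slots: one being detected, one being prepared
    std::thread producer([&] {
        for (int i = 0; i < frameCount; ++i)
            queue.push(prepare(i));
        queue.close();
    });

    std::vector<int> processed;
    Frame frame{};
    while (queue.pop(frame)) {
        detect(frame);
        processed.push_back(frame.index);
    }
    producer.join();
    return processed;
}
```

With this layout the next frame is being prepared on the CPU while the current one is in detection, so the detection stage does not have to wait for input. The sketch does no error handling: if detect throws, the producer thread is never joined.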

Thanks.