I’m trying to implement simple object detection (OpenCV Haar cascades). Since the Jetson TX2 platform can use CUDA for this kind of processing, the OpenCV CUDA implementation looked like the right way to go. However, after implementing both the CPU and GPU versions, I’ve noticed no significant performance difference between the two approaches (about 200 ms per frame for both CPU and GPU).
INPUT camera captured image (1280x720)
jetson_clocks and nvpmodel -m 0 are set
OpenCV 3.4.0 built with CUDA support
release build (as I’ve already noticed, this is very important for CUDA performance)
CascadeClassifier instances are created according to the example code (provided by the OpenCV distribution):
g_pFaceClassifier = cv::CascadeClassifier(HAAR_CASCADE_FILENAME); // CPU based
g_pFaceClassifier = cv::cuda::CascadeClassifier::create(HAAR_CASCADE_FILENAME); //GPU based
HAAR_CASCADE_FILENAME is correct in both cases (a different cascade file is used for each).
detection calls:
g_pFaceClassifier.detectMultiScale(gray,found,1.1, 2, 0 | cv::CASCADE_SCALE_IMAGE, cv::Size(32, 32)); //CPU based
g_pFaceClassifier->detectMultiScale(gpuImg,outImg); //GPU based
g_pFaceClassifier->convert(outImg, found); //GPU based requires extra func call to convert output
RESULT:
As a result, both versions take about 200 ms to process one frame.
// I’ve checked the CPU usage with the htop utility: for the CPU-based version it was 100% on all 6 cores (nvpmodel -m 0), and for the GPU-based version it was about 15% on some cores (except the one that handles the OS and application routine calls)
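To see where the 200 ms actually goes, each stage can be timed on its own. Below is a minimal sketch of a wall-clock timing helper using only the C++ standard library; timeMs is a hypothetical name, not part of OpenCV.

```cpp
#include <chrono>

// Hypothetical helper: run `stage` once and return its wall-clock
// duration in milliseconds, measured with a monotonic clock.
template <typename Stage>
double timeMs(Stage stage)
{
    const auto start = std::chrono::steady_clock::now();
    stage();  // the work being measured
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Assuming the grayscale frame is copied to the GPU with something like gpuImg.upload(gray) (the upload step is not shown above, so this is an assumption), the upload, the detectMultiScale call and the convert call could each be wrapped in a lambda and passed to timeMs separately. The first CUDA call in a process usually includes one-time initialization, so it is probably worth discarding the first few frames before comparing numbers. Note also that the CPU call above passes an explicit scale factor, neighbor count and minimum size, while the GPU call relies on the classifier defaults, so the two timings may not cover the same amount of work.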
Has anyone used or tried the OpenCV CUDA-based CascadeClassifier implementation for object detection? I would much appreciate a comment with performance numbers (I have probably done something wrong).
PS. I’ve also checked the cuda::HOG sample provided with the OpenCV distribution (I only needed to modify the VideoCapture pipeline setup to get the stream from the onboard camera). It ran at about 10 FPS for 1280x720 in CUDA mode (and about 4 FPS in CPU mode). That is less than I expected.
Thanks to anyone who can provide any information or advice.
I’ve checked the OpenCV samples (HOG and CascadeClassifier with haarcascade_frontalface_alt.xml) and got profiler output, but I’m not sure I can interpret the results correctly, especially since I’m not the one who wrote the code. I’ve attached the profiler output for both cases in case you want to look at it. Perhaps the performance is something that has to be resolved on the OpenCV side. hog_profile.txt (3.58 KB) haar_profile.txt (6.18 KB)
GPU utilization doesn’t reach 100%, which indicates there is still some room for improvement.
It looks like the CUDA kernel is waiting for data from the CPU, e.g. [CUDA memcpy HtoD].
You can try adding one more pipeline to keep the GPU always busy.
While pipeline A is running the algorithm with the CUDA kernel, pipeline B can start preparing the data for the next frame.
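The two-pipeline idea can be sketched with only the C++ standard library. This is a minimal illustration, not OpenCV code: runPipelined, prepare and detect are hypothetical names, where prepare(i) stands in for capturing and preprocessing frame i on the CPU, and detect(frame) stands in for the upload plus the CUDA detection.

```cpp
#include <future>
#include <vector>

// Hypothetical sketch: overlap preparation of frame i+1 with
// detection on frame i. `prepare` and `detect` are placeholders
// for the real capture/preprocess and GPU detection steps.
template <typename Prepare, typename Detect>
auto runPipelined(int frameCount, Prepare prepare, Detect detect)
    -> std::vector<decltype(detect(prepare(0)))>
{
    using Frame = decltype(prepare(0));
    std::vector<decltype(detect(prepare(0)))> results;
    if (frameCount <= 0) return results;
    results.reserve(frameCount);

    // Prime the pipeline: start preparing frame 0 on a worker thread.
    std::future<Frame> next = std::async(std::launch::async, prepare, 0);

    for (int i = 0; i < frameCount; ++i) {
        Frame current = next.get();  // wait until frame i is ready
        if (i + 1 < frameCount) {
            // Pipeline B: begin preparing frame i+1 in the background ...
            next = std::async(std::launch::async, prepare, i + 1);
        }
        // Pipeline A: ... while frame i is being processed here.
        results.push_back(detect(current));
    }
    return results;
}
```

With this structure the time per frame tends toward the slower of the two stages rather than their sum, provided prepare and detect do not share mutable state. In a real application the frame type would presumably be an image buffer and detect would wrap the GPU calls; OpenCV also offers cv::cuda::Stream for asynchronous transfers, which this sketch does not use.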