OpenCV compiled with CUDA is too slow, why?

Face detection (Haar cascade method) is unbelievably slow. I use OpenCV compiled with CUDA (using your script: install_opencv4.3.0_Jetson.sh). CUDA support is verified (by cv2.getBuildInformation()), but it is just 0.1 sec faster than without CUDA (0.3 sec vs 0.4 sec per frame) (3 f/s). That is incredible! A Raspberry Pi 3 performs at about 8 f/s. (My Windows i7 with 16 GB, without CUDA, performs at 25 f/s.) I hope the cause is some optimization I have overlooked. But it is not a cv2 camera.read problem, because without detection (just read and display) the rate is 500 f/s.
Can you help me? (Jetson Nano memory: 4G + swap 2G)

An OpenCV build with CUDA support will only show a speedup if you explicitly call OpenCV's CUDA functions.

If you are just calling CPU functions, it will not be much faster (though some packages such as TBB, or compiler flags, may help in some cases).

I cannot tell how many of OpenCV's CUDA functions can be used from Python. Someone else may be able to advise better on this.

Hi,

Here are some initial suggestions.

1. Please maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Please note that the script you shared doesn't enable cuDNN, because of the API change in v8.0.
If you want a build with cuDNN support, please use this script instead:

I suppose Haar doesn't use the cuDNN API, but it is still worth checking.

If the performance doesn't improve, please share a simple sample so that we can reproduce the issue.
Thanks.

Reply to the NVIDIA forum advice on improving OpenCV on the Jetson Nano.

I am trying to evaluate the use of OpenCV on the Jetson Nano, so I spent some time comparing the same application in different environments. The application is essentially face detection with the Haar cascade method.

Basic statements of the Python script (the application):

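import cv2

webcam = cv2.VideoCapture(0)
face = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")
ok, frame = webcam.read()  # read() returns (success flag, image)
frame = cv2.resize(frame, (320, 240))  # or (640, 480)
frameg = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # detection runs on grayscale
faces = face.detectMultiScale(frameg)  # CPU Haar cascade detection
cv2.imshow('Frame', frame)
cv2.waitKey(1)  # lets the window refresh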

Environments:

W   : Windows 10, Intel Core i7-2600K 3.4GHz (8 CPUs), 16 GB memory, no CUDA
R   : Raspberry Pi 3, OS version: jessie 8
JnC : Jetson with OpenCV not compiled with CUDA (as shipped with the JetPack distribution)
JwC1: Jetson with OpenCV compiled with CUDA using install_opencv4.3.0_Jetson.sh
JwC2: Jetson with OpenCV compiled with CUDA using your latest advice: build_opencv.sh

Frame     W                  R                 JnC                JwC1               JwC2
640x480   0.038s (26.3f/s)   0.27s (3.7f/s)    0.15s (6.5f/s)     0.15s (6.6f/s)     0.13s (7.5f/s)
320x240   0.014s (71.5f/s)   0.09s (11.1f/s)   0.055s (18.5f/s)   0.054s (18.5f/s)   0.043s (23f/s)
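
For anyone who wants to reproduce figures of this kind: the frame rate is simply the reciprocal of the time per frame (for example 1 / 0.038 s is about 26.3 f/s). The following is only a minimal sketch of a timing harness; time_per_frame and fps are hypothetical helper names, not part of OpenCV, and the workload in the usage part is a stand-in for the read/resize/detect steps listed earlier.

```python
import time


def time_per_frame(process, n_frames=50):
    """Return the mean wall-clock seconds per call of process()."""
    start = time.perf_counter()
    for _ in range(n_frames):
        process()
    return (time.perf_counter() - start) / n_frames


def fps(seconds_per_frame):
    """Convert seconds per frame to frames per second."""
    return 1.0 / seconds_per_frame


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a function that runs
    # the read/resize/cvtColor/detectMultiScale steps for one frame.
    t = time_per_frame(lambda: sum(range(10000)))
    print(f"{t:.6f}s ({fps(t):.1f}f/s)")
```

Averaging over many frames rather than timing a single call reduces the influence of one-off costs such as the first camera read.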

CUDA support in the build was verified with cv2.getBuildInformation() in the script.
I took your advice and checked that the clocks were at maximum (with the nvpmodel command), but nothing changed.

Summarising:

a) Jetson Nano OpenCV performance with CUDA is almost the same as without CUDA. Why?
b) Jetson Nano OpenCV performance is only about twice that of the Raspberry Pi 3. Does this depend only on the greater efficiency and speed of the four Jetson CPU cores? (The Jetson seems not to use CUDA for OpenCV!)

Finally, every time I had to install OpenCV (with your suggested scripts), the compilation took about 5 hours. Why does it take so long?

In other words, how can I solve this apparent failure of the OpenCV library to use the CUDA accelerator? Note that OpenCV is very often used in AI programs (particularly by researchers and hobbyists).

The code above does NOT enable CUDA; it still runs on the CPU even when OpenCV is compiled with CUDA support. cv2.getBuildInformation() only confirms whether OpenCV was built with CUDA support.
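
To make the difference between "built with CUDA" and "using CUDA" concrete, here is a rough sketch of two separate checks. built_with_cuda and cuda_device_count are hypothetical helper names; the first assumes the build summary contains a line of the form "NVIDIA CUDA: YES (...)", which may differ between OpenCV versions, and the second relies on cv2.cuda.getCudaEnabledDeviceCount().

```python
import re


def built_with_cuda(build_info):
    """True if the build summary text reports NVIDIA CUDA as enabled.

    Assumes a line such as 'NVIDIA CUDA:   YES (ver 10.2, ...)'.
    """
    match = re.search(r"NVIDIA CUDA:\s*(\S+)", build_info)
    return bool(match) and match.group(1).upper() == "YES"


def cuda_device_count():
    """Number of CUDA devices OpenCV can see at runtime (0 if unavailable)."""
    try:
        import cv2
        return cv2.cuda.getCudaEnabledDeviceCount()
    except Exception:
        # cv2 is missing, has no cuda module, or the driver call failed.
        return 0


if __name__ == "__main__":
    try:
        import cv2
        print("built with CUDA:", built_with_cuda(cv2.getBuildInformation()))
    except ImportError:
        print("cv2 is not installed")
    print("CUDA devices visible:", cuda_device_count())
```

Even when both checks succeed, cv2.CascadeClassifier.detectMultiScale still runs on the CPU; the checks only tell you that the cv2.cuda functions are available to be called.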

Some resources on how to use CUDA-enabled DNN-network in OpenCV:

Note: in most cases you must load the image data into GPU memory before you can leverage CUDA acceleration. The overhead of moving data between CPU and GPU memory may reduce overall performance, especially since OpenCV's CPU code is already highly optimized.
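
As an illustration of that upload/download pattern, the sketch below performs the same grayscale conversion once on the CPU and once through the cv2.cuda module. It assumes an OpenCV build that includes the CUDA image processing module (cudaimgproc, from opencv_contrib); gray_cpu and gray_gpu are hypothetical names chosen for this example, and imports are kept inside the functions so the file loads even where cv2 is not installed.

```python
def gray_cpu(frame):
    """Convert a BGR image to grayscale on the CPU."""
    import cv2
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)


def gray_gpu(frame):
    """Convert a BGR image to grayscale on the GPU.

    Requires an OpenCV build with CUDA and the cudaimgproc module.
    """
    import cv2
    gpu_frame = cv2.cuda_GpuMat()
    gpu_frame.upload(frame)  # host -> device copy
    gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)
    return gpu_gray.download()  # device -> host copy


if __name__ == "__main__":
    import numpy as np
    # Synthetic 640x480 BGR frame so no camera is needed.
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    print(gray_cpu(frame).shape)
    print(gray_gpu(frame).shape)  # fails without a CUDA-enabled build
```

For a single small operation like this, the two copies can easily cost more than the conversion itself, so timing both variants on the target device is the only reliable way to see whether the GPU path pays off. Keeping several steps on the GPU between one upload and one download is usually where the benefit comes from.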