Too slow OPENCV with CUDA compiled, why?

daniele.denaro.pubblico · September 22, 2020, 12:15pm

Face detection (Haar cascade method) is unbelievably slow. I use OpenCV compiled with CUDA (using your script: install_opencv4.3.0_Jetson.sh). CUDA support is verified (by cv2.getBuildInformation()), but it is just 0.1 sec faster than without CUDA (0.3 sec vs 0.4 sec per frame, about 3 f/s). That is incredible! A Raspberry Pi 3 performs at about 8 f/s. (My Windows i7 with 16 GB, without CUDA, performs at 25 f/s.) I hope the cause is some optimization I have overlooked. But it is not a cv2 camera read problem, because without detection (just read and display) the rate is 500 f/s.
Can you help me? (Jetson Nano memory: 4G + swap 2G)

Honey_Patouceul · September 22, 2020, 6:43pm

Having opencv built with CUDA support will only show speedup if you explicitly use opencv cuda functions.

If you’re just calling CPU functions, it won’t be much better (some packages such as TBB or compiler flags may improve in some cases, though).

I cannot tell how many of OpenCV's CUDA functions can be used from Python. Someone else may be able to advise better on this.
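
As a rough illustration of what "explicitly using the CUDA functions" can look like from Python, here is a minimal sketch. It assumes an OpenCV build whose Python bindings expose the cv2.cuda module (cv2.cuda_GpuMat, cv2.cuda.cvtColor); the helper names cuda_device_count and to_gray are made up for this example, and the code falls back to the ordinary CPU call when no CUDA device is reported.

```python
try:
    import cv2  # any OpenCV build; CUDA support is optional here
except ImportError:  # lets the sketch load on machines without OpenCV
    cv2 = None


def cuda_device_count():
    """Number of CUDA devices OpenCV reports, or 0 if it cannot use any."""
    if cv2 is None or not hasattr(cv2, "cuda"):
        return 0
    try:
        return int(cv2.cuda.getCudaEnabledDeviceCount())
    except cv2.error:
        return 0


def to_gray(frame_bgr):
    """Convert a BGR frame to grayscale, on the GPU when one is available."""
    if cv2 is None:
        raise RuntimeError("OpenCV (cv2) is not installed")
    if cuda_device_count() > 0:
        gpu_frame = cv2.cuda_GpuMat()
        gpu_frame.upload(frame_bgr)  # host -> device copy
        gpu_gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGR2GRAY)
        return gpu_gray.download()   # device -> host copy
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)  # plain CPU path
```

The point of the sketch is the pattern (upload, call a cv2.cuda function, download), not this particular conversion: calls such as cv2.cvtColor or detectMultiScale on a cv2.CascadeClassifier stay on the CPU regardless of how OpenCV was built.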

AastaLLL · September 23, 2020, 3:55am

Hi,

Here are some initial suggestions for you first.

1. Please maximize the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Please note that the script you shared doesn’t enable cuDNN because of the API change in v8.0.
If you want to build with cuDNN support, please use this script instead:

I suppose Haar doesn’t use the cuDNN API, but it’s still worth checking.

If the performance doesn’t improve, please share a simple reproducible example so we can reproduce the issue.
Thanks.

daniele.denaro.pubblico · October 1, 2020, 12:19am

Reply to NVIDIA forum advice for improving OpenCv on Jetson NANO.

I’m trying to work out how to use OpenCV on the Jetson Nano, so I spent time comparing the same application in different environments. The application is essentially face detection with the Haar cascade method.

Basic statements of python script (application):

import cv2

webcam = cv2.VideoCapture(0)
face = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")
ok, frame = webcam.read()  # read() returns (success flag, image)
frame = cv2.resize(frame, (320, 240))  # or (640, 480)
frameg = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = face.detectMultiScale(frameg)  # Haar cascade detection
cv2.imshow('Frame', frame)

Environments:

W : Windows 10 Intel Core i7-2600K 3.4GHz (8 cpu) mem.: 16 GB NO CUDA
R : Raspberry Pi 3 os-version: jessie 8
JnC : Jetson with OpenCv not compiled with CUDA (as shipped with the JetPack distribution)
JwC1: Jetson with OpenCv compiled with CUDA using: install_opencv4.3.0_Jetson.sh
JwC2: Jetson with OpenCv compiled with CUDA using your last advice: build_opencv.sh

Frame	W	R	JnC	JwC1	JwC2
640x480	0.038s (26.3f/s)	0.27s (3.7f/s)	0.15s (6.5f/s)	0.15s (6.6f/s)	0.13s (7.5f/s)
320x240	0.014s (71.5f/s)	0.09s (11.1f/s)	0.055s (18.5f/s)	0.054s (18.5f/s)	0.043s (23f/s)
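
To put the 640x480 row in perspective, the per-frame times can be turned into frame rates and speedup ratios. This is only arithmetic on the numbers reported in the table; the helper names fps and speedup are made up for the example.

```python
# Per-frame times in seconds, copied from the 640x480 row of the table.
times_640 = {"W": 0.038, "R": 0.27, "JnC": 0.15, "JwC1": 0.15, "JwC2": 0.13}


def fps(seconds_per_frame):
    """Frames per second for a given time per frame."""
    return 1.0 / seconds_per_frame


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# CUDA-enabled build (JwC2) versus the stock JetPack build (JnC).
cuda_gain = speedup(times_640["JnC"], times_640["JwC2"])
# Jetson (JwC2) versus Raspberry Pi 3 (R).
vs_pi = speedup(times_640["R"], times_640["JwC2"])

print(f"JwC2 vs JnC: {cuda_gain:.2f}x, JwC2 vs R: {vs_pi:.1f}x")
print(f"W: {fps(times_640['W']):.1f} f/s")
```

The frame rates computed this way can differ slightly from the f/s figures in the table, because the table's times are rounded.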

The use of CUDA was verified by cv2.getBuildInformation() in the script.
I took your advice and checked that the clocks were at maximum (with the nvpmodel command), but nothing changed.

Summarising:

a) On the Jetson NANO, OpenCv performance with CUDA is almost the same as without CUDA. Why?
b) On the Jetson NANO, OpenCv is only about twice as fast as on the Raspberry Pi 3. Does that depend only on the greater efficiency and speed of the 4 Jetson CPU cores? (The Jetson seems not to use CUDA for OpenCv!)

Finally, I should report that every time I had to install OpenCv (with your suggested scripts), the compilation took about 5 hours. Why does it take so long?

In other words, how can I fix this apparent failure of the OpenCv library to use the CUDA accelerator? Note that OpenCv is very often used in AI programs (particularly by researchers and hobbyists).

dkreutz · October 1, 2020, 8:47am

The code above does NOT enable CUDA; it still runs on the CPU even when OpenCV is compiled with CUDA support. cv2.getBuildInformation() only confirms whether OpenCV was built with CUDA support.
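
For illustration, a build-time check of this kind boils down to reading a line of text. The sketch below assumes the build report contains a line of the form "NVIDIA CUDA: YES (...)"; the helper name build_has_cuda and the sample strings are made up for the example. A True result says nothing about whether a particular call, such as detectMultiScale on cv2.CascadeClassifier, runs on the GPU.

```python
import re


def build_has_cuda(build_info):
    """True if the build report text says 'NVIDIA CUDA: YES'."""
    match = re.search(r"NVIDIA CUDA:\s*(\S+)", build_info)
    return bool(match) and match.group(1).upper().startswith("YES")


# Illustrative lines only; on a real system pass cv2.getBuildInformation().
sample_yes = "  NVIDIA CUDA:                   YES (ver 10.2, CUFFT CUBLAS FAST_MATH)"
sample_no = "  NVIDIA CUDA:                   NO"

print(build_has_cuda(sample_yes), build_has_cuda(sample_no))
```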

Some resources on how to use a CUDA-enabled DNN network in OpenCV:

Note: in most cases you must upload the image data to GPU memory before you can leverage CUDA acceleration. The overhead of moving data between CPU and GPU memory may hurt overall performance, especially since OpenCV's CPU code is already highly optimized.
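
One way to see whether the transfers dominate is to time each stage separately. The helper below is a generic sketch using only the standard library; time_stages is a made-up name, and the stages in the demo are placeholders standing in for upload, GPU processing, and download.

```python
import time


def time_stages(stages, payload, repeats=20):
    """Run the (name, function) stages in order, feeding each output to the
    next stage, and return the mean seconds spent in each stage."""
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(repeats):
        value = payload
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)
            totals[name] += time.perf_counter() - start
    return {name: total / repeats for name, total in totals.items()}


# Placeholder stages; with a CUDA build these would be the upload,
# the cv2.cuda call, and the download of the result.
demo_stages = [("upload", list), ("process", sorted), ("download", tuple)]
timings = time_stages(demo_stages, range(1000))
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us")
```

If the upload and download stages together take longer than the CPU version of the processing step, moving that step to the GPU is unlikely to help at the tested frame size.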
