Is OpenCV really using the GPU for detection?

Hi developers!
Recently I noticed something strange while running a Python script for inference with my custom YOLOv4-tiny model loaded through cv2.dnn.readNetFromDarknet(). The program works just fine but at low FPS; judging by other videos on YouTube, this seems to be normal when detecting objects with YOLO.
When I checked the jtop monitor (an app developed specifically for the Jetson Nano) while running my program, it gave me the following results:


What I think is happening is that the program is only using the four CPU cores instead of the GPU.
Just take a look at the GPU tab of jtop:

It seems that the CUDA cores are barely working. The CPU tab is a whole different story, with all cores at around 95% on average:

Two things may be happening: either jtop is not trustworthy, or the GPU is idling and letting the CPU take all the work. I have compiled OpenCV to work with CUDA:

The BIS parameter is just the blob image size, which I can change in real time from 32x13=416 to 32
I compiled my OpenCV with this guide: Install OpenCV 4.5 on Jetson Nano - Q-engineering
Is there something else I need to install, or to write in my Python script, in order to use the CUDA-accelerated OpenCV, or am I already using it?
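Here is the kind of quick check I can run to see whether the cv2 build itself reports CUDA support (just a sketch, not something taken from the guide):

import cv2

# Sanity check: does the cv2 module that Python imports report CUDA/cuDNN support?
print(cv2.__version__)
print(cv2.cuda.getCudaEnabledDeviceCount())   # should print 1 on a Jetson Nano

for line in cv2.getBuildInformation().splitlines():
    if "CUDA" in line or "cuDNN" in line:
        print(line.strip())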
Thank you in advance for any reply

Hi,

To use the CUDA version of the DNN module, have you built OpenCV with the following configuration?

$ cmake -D CMAKE_BUILD_TYPE=RELEASE \
    ... \
    -D WITH_CUDA=ON \
    -D WITH_CUDNN=ON \
    -D OPENCV_DNN_CUDA=ON \
    -D CUDA_ARCH_BIN=5.3 \
    -D WITH_CUBLAS=1 \
    ...
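
You can also check from Python whether the CUDA backend is visible to the DNN module (a small sketch, assuming OpenCV 4.2 or newer):

import cv2

# List the DNN targets available for the CUDA backend; an empty list means
# the installed cv2 was not built with OPENCV_DNN_CUDA enabled.
print(cv2.dnn.getAvailableTargets(cv2.dnn.DNN_BACKEND_CUDA))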

Thanks

Hello AastaLLL!
I used this configuration for cmake

$ cmake -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr \
    -D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
    -D EIGEN_INCLUDE_PATH=/usr/include/eigen3 \
    -D WITH_OPENCL=OFF \
    -D WITH_CUDA=ON \
    -D CUDA_ARCH_BIN=5.3 \
    -D CUDA_ARCH_PTX="" \
    -D WITH_CUDNN=ON \
    -D WITH_CUBLAS=ON \
    -D ENABLE_FAST_MATH=ON \
    -D CUDA_FAST_MATH=ON \
    -D OPENCV_DNN_CUDA=ON \
    -D ENABLE_NEON=ON \
    -D WITH_QT=ON \
    -D WITH_OPENMP=ON \
    -D WITH_OPENGL=ON \
    -D BUILD_TIFF=ON \
    -D WITH_FFMPEG=ON \
    -D WITH_GSTREAMER=ON \
    -D WITH_TBB=ON \
    -D BUILD_TBB=ON \
    -D BUILD_TESTS=OFF \
    -D WITH_EIGEN=ON \
    -D WITH_V4L=ON \
    -D WITH_LIBV4L=ON \
    -D OPENCV_ENABLE_NONFREE=ON \
    -D INSTALL_C_EXAMPLES=OFF \
    -D INSTALL_PYTHON_EXAMPLES=OFF \
    -D BUILD_NEW_PYTHON_SUPPORT=ON \
    -D BUILD_opencv_python3=TRUE \
    -D OPENCV_GENERATE_PKGCONFIG=ON \
    -D BUILD_EXAMPLES=OFF ..

Hi,

It seems that your GPU utilization is 30%.
Are you running any other GPU application at the same time?

If not, OpenCV may still be using the GPU for inference, just not in a well-optimized way.

Thanks.

I am only running one Python script or process at a time. So does this mean that the DNN module of OpenCV is not optimized to work with the Jetson Nano GPU, or is jtop showing inaccurate information about the GPU workload?

Hi,

It’s more likely an OpenCV optimization issue.

To confirm this, would you mind evaluating the app with our profiler?
It can show you whether any GPU API is used directly.

$ sudo /usr/local/cuda-10.2/bin/nvprof python3 [app.py]
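
If you want to confirm that nvprof picks up CUDA activity from OpenCV at all, a tiny script like the sketch below (just a sanity check, not your app) should show kernel and memcpy events:

import cv2
import numpy as np

# This only exercises the CUDA runtime through OpenCV, so nvprof should list
# memcpy/kernel activity if the cv2 build really has CUDA support.
frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)
gpu_frame = cv2.cuda_GpuMat()
gpu_frame.upload(frame)
resized = cv2.cuda.resize(gpu_frame, (416, 416))
print(resized.download().shape)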

Thanks.

Good morning

A few minutes ago I placed my Python file in /home/my_user, and when I run that command I get:

[ WARN:0] global /home/redeye/opencv/modules/videoio/src/cap_gstreamer.cpp (961) open OpenCV | GStreamer warning: Cannot query video position: status=0, value=-1, duration=-1
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'

Actual Status

Detection:No object to track…
Current frame analysis took 0.939 seconds
Command:OOOOOSOO

Actual Status

Detection:No object to track…
Current frame analysis took 0.748 seconds
Command:OOOOOSOO

Actual Status

Detection:No object to track…
Current frame analysis took 0.699 seconds
Command:OOOOOSOO

Actual Status

Detection:No object to track…
Current frame analysis took 0.685 seconds
Command:OOOOOSOO

Actual Status

Detection:No object to track…
Current frame analysis took 0.695 seconds
Command:OOOOOSOO

Actual Status

Detection:No object to track…
Current frame analysis took 0.709 seconds
Command:OOOOOSOO

Actual Status

Detection:No object to track…
Current frame analysis took 0.695 seconds
Command:OOOOOSOO

Traceback (most recent call last):
  File "Test_Platform-2.py", line 407, in <module>
    serialArduino.write(order.encode('ascii'))
NameError: name 'serialArduino' is not defined
======== Warning: No CUDA application was profiled, exiting
======== Error: Application returned non-zero code 1

Only the last two lines are the output from the nvprof profiler.
So does this mean that my script is not really using the GPU for detection?

Hi,

Thanks for your testing.

We are preparing an OpenCV environment to check this further.
Will share more information with you later.


Hi,

Test with the command below.
We confirmed that the example doesn’t use the GPU.

$ python3 object_detection.py --config=yolov4-tiny.cfg --model=yolov4-tiny.weights --classes=../data/dnn/object_detection_classes_coco.txt --width=416 --height=416 --scale=0.00392 --input=/opt/nvidia/deepstream/deepstream-5.1/samples/streams/sample_1080p_h264.mp4 --rgb

It seems to vary with the model type or the model format you used.
For object detection, the supported target platform doesn’t include the GPU:

Thanks

Hi there.
A few moments ago I finally found my solution. Now my script is using the CUDA cores!
Before the modification I averaged 1.25 FPS with a 416x416 input blob, and now it runs at 6.5 FPS at the same input size. A big improvement considering it is an edge device.
In order to truly use the CUDA cores, you need to add the following two lines after
net = cv2.dnn.readNetFromDarknet(cfgPath,weightsPath)

net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

I did not know that these two lines were needed to work properly with the GPU.
This is the page that pointed me in right direction: How to use OpenCV DNN Module with NVIDIA GPUs on Linux
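
For anyone who finds this thread later, this is roughly how those two lines fit into a complete detection pass (a minimal sketch; the file names and the 0.5 confidence threshold are placeholders, not my exact script):

import cv2
import numpy as np

cfgPath = "yolov4-tiny.cfg"          # placeholder paths
weightsPath = "yolov4-tiny.weights"

net = cv2.dnn.readNetFromDarknet(cfgPath, weightsPath)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)   # DNN_TARGET_CUDA_FP16 is another option

frame = cv2.imread("test.jpg")                     # placeholder input frame
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row is [cx, cy, w, h, objectness, class scores...]
for out in outputs:
    for det in out:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            print(class_id, float(scores[class_id]))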

And these are my current GPU graphs in the jtop app:

And the CPU is no longer doing all the math like before:

Even so, many thanks to @AastaLLL for all the replies and for taking the time to solve my issue.

Something I do not understand is why the learnopencv blog uses the argparse module in their Python script to activate CUDA. Can someone explain that?
These are fragments of the script:

import argparse

parser = argparse.ArgumentParser(description='Run keypoint detection')
parser.add_argument("--device", default="cpu", help="Device to inference on")

net = cv2.dnn.readNetFromCaffe(protoFile, weightsFile)

if args.device == "cpu":
    net.setPreferableBackend(cv2.dnn.DNN_TARGET_CPU)
    print("Using CPU device")
elif args.device == "gpu":
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    print("Using GPU device")

Thanks

Hi,

Thanks for sharing this information with us.

The script is just trying to support both CPU and GPU modes.
You can launch the script with python3 [app.py] --device cpu to deploy the model on the CPU,
and python3 [app.py] --device gpu for the GPU case.
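
A compact version of that pattern, with the missing parse_args() call filled in and setPreferableTarget used in the CPU branch (a sketch, not the exact blog code; the Caffe file names are placeholders):

import argparse
import cv2

parser = argparse.ArgumentParser(description="Run keypoint detection")
parser.add_argument("--device", default="cpu", choices=["cpu", "gpu"],
                    help="Device to inference on")
args = parser.parse_args()

net = cv2.dnn.readNetFromCaffe("pose.prototxt", "pose.caffemodel")  # placeholder files

if args.device == "gpu":
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    print("Using GPU device")
else:
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
    print("Using CPU device")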

Thanks.
