I am running code to binarize an image using OpenCV in C++.
This is the output for generic OpenCV that runs on the CPU -
And this is the output for OpenCV with CUDA support -
Shouldn’t the GPU be faster than the CPU? What could cause such a drastic increase in time?
Could you first verify that the implementation really uses the GPU?
You can monitor GPU utilization with the tegrastats tool:
$ sudo tegrastats
This is the output of running tegrastats
Based on your screenshot, the GPU utilization is 0% (GR3D_FREQ 0%@ 420), so no task is running on the GPU.
Please double-check the CUDA implementation first.
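If it helps to check this without reading every line by eye, here is a minimal sketch that pulls the GPU load out of a tegrastats line. It assumes the field looks like the one quoted above (GR3D_FREQ followed by a number and a percent sign); the function name parse_gr3d_percent is hypothetical and not part of any NVIDIA tool.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the GPU load percentage found after the
// "GR3D_FREQ" field of one tegrastats line, or -1 if the field (or a number
// followed by '%') is not present. The field layout is an assumption based
// on the line quoted in this thread.
int parse_gr3d_percent(const std::string& line)
{
    const std::string key = "GR3D_FREQ";
    std::size_t pos = line.find(key);
    if (pos == std::string::npos) return -1;

    pos += key.size();
    while (pos < line.size() && line[pos] == ' ') ++pos;  // skip spaces after the key

    const std::size_t start = pos;
    int value = 0;
    while (pos < line.size() && line[pos] >= '0' && line[pos] <= '9') {
        value = value * 10 + (line[pos] - '0');            // accumulate decimal digits
        ++pos;
    }

    // Require at least one digit and a trailing '%'.
    if (pos == start || pos >= line.size() || line[pos] != '%') return -1;
    return value;
}
```

Feeding each line of the tegrastats output through this while your program runs should show whether the value ever rises above 0.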
I ran the following code, with CUDA support for OpenCV -
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/cudaarithm.hpp>

using namespace cv;
using namespace std;

int main (int argc, char* argv[])
{
    int64 work_begin = getTickCount();
    Mat frame, h_result;
    cv::cuda::GpuMat d_result, d_img;
    //open the Webcam (device index 0 is an assumption; this line is missing from the post)
    VideoCapture cap(0);
    // if not success, exit program
    if (cap.isOpened() == false)
    {
        cout << "Cannot open Webcam" << endl;
        return -1;
    }
    //get the frame rate of the video from webcam
    double frames_per_second = cap.get(CAP_PROP_FPS);
    cout << "Frames per seconds : " << frames_per_second << endl;
    cout<<"Press Q to Quit" <<endl;
    String win_name = "Webcam Video";
    namedWindow(win_name); //create a window
    while (true)
    {
        bool flag = cap.read(frame); // read a new frame from video
        if (flag == false) break;
        d_img.upload(frame);         // host -> device copy
        cv::cuda::threshold(d_img, d_result, 128.0, 255.0, cv::THRESH_BINARY);
        d_result.download(h_result); // device -> host copy
        //show the frame in the created window
        imshow("Binary image", h_result);
        //Measure difference in time ticks
        int64 delta = getTickCount() - work_begin;
        double freq = getTickFrequency();
        //Measure frames per second
        double work_fps = freq / delta;
        std::cout <<"Performance of Thresholding on GPU: " <<std::endl;
        std::cout <<"Time: " << (1/work_fps) <<std::endl;
        std::cout <<"FPS: " <<work_fps <<std::endl;
        if (waitKey(1) == 'q')
            break;
    }
    return 0;
}
With the following output for tegrastats -
If the OpenCV code is not utilising the GPU, is there a way to ensure that it does?
Based on the second output you shared, OpenCV does use the GPU but does not fully utilize it.
In some samples, GR3D_FREQ rises to 36~38%.
That’s because the only part that actually runs on the GPU is the cv::cuda::threshold call.
However, there is a memory copy before (d_img.upload) and after (d_result.download) the CUDA task.
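One way to confirm where the time goes is to time the three stages separately instead of the whole loop. Below is a minimal sketch using std::chrono; the helper name time_stage_ms is hypothetical, and the OpenCV calls in the comment are the ones from your code. As far as I know, these OpenCV CUDA calls block until completion when no cv::cuda::Stream is passed, which is what makes host-side timing meaningful here.

```cpp
#include <chrono>

// Hypothetical helper: runs one stage (any callable) and returns the
// wall-clock time it took, in milliseconds.
template <typename F>
double time_stage_ms(F&& stage)
{
    const auto t0 = std::chrono::steady_clock::now();
    stage();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Intended use inside the capture loop (needs OpenCV built with CUDA):
//   double up_ms   = time_stage_ms([&] { d_img.upload(frame); });
//   double gpu_ms  = time_stage_ms([&] {
//       cv::cuda::threshold(d_img, d_result, 128.0, 255.0, cv::THRESH_BINARY); });
//   double down_ms = time_stage_ms([&] { d_result.download(h_result); });
```

The first few iterations are best ignored, since the first CUDA call also pays for context initialization and buffer allocation.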
Thresholding is a relatively fast job, so the data transfer might be the bottleneck in your use case.
Although the GPU accelerates the thresholding itself, the memory copy is an extra cost of running a task on the GPU.
That might be the reason you cannot see an obvious improvement on the GPU.
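To illustrate how little work there is per pixel, here is a plain C++ reference of what THRESH_BINARY computes on 8-bit data. It is only a sketch (the function name threshold_binary is hypothetical, not an OpenCV API): one comparison and one store per pixel, while in the GPU path every pixel also has to be copied to the device and back.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference for THRESH_BINARY on 8-bit pixels:
//   dst = (src > thresh) ? maxval : 0
// Hypothetical helper for illustration only.
std::vector<std::uint8_t> threshold_binary(const std::vector<std::uint8_t>& src,
                                           std::uint8_t thresh,
                                           std::uint8_t maxval)
{
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = (src[i] > thresh) ? maxval : 0;  // one compare, one store per pixel
    return dst;
}
```

With so little arithmetic per byte, the two copies can easily cost more than the kernel saves, so the GPU tends to pay off only when several operations are chained on the device between one upload and one download.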