Inconsistencies in OpenCV GPU Results

Introduction

Hello, I'm not sure if this is the best place to post this question or if I should be going to the OpenCV forums. Regardless, I think this is something people here might at least find interesting. Any insight into why this is occurring, or into what alternatives I could pursue, would be greatly appreciated.

I've been working on a program that uses OpenCV 2.4.8 in C++ to do real-time motion tracking. With the program working the way we want it to, my team is trying to port the OpenCV functions that eat up a lot of processing time to their GPU counterparts. This is all on the NVIDIA Jetson TK1 board (specs here):

https://developer.nvidia.com/jetson-tk1

In particular, I'm interested in FAST feature detection and the (sparse) Lucas-Kanade method of optical flow calculation using pyramids. Here is the OpenCV documentation on these functions (a rough sketch of how both are called follows the list):

  • CPU FAST feature detection: http://docs.opencv.org/modules/features2d/doc/common_interfaces_of_feature_detectors.html#featuredetector
  • CPU Optical Flow: http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowpyrlk
  • GPU FAST feature detection: http://docs.opencv.org/modules/gpu/doc/feature_detection_and_description.html#gpu-fast-gpu
  • GPU Optical Flow: http://docs.opencv.org/modules/gpu/doc/video.html#gpu-pyrlkopticalflow-sparse
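
For context, here is roughly how the two FAST code paths are invoked in the 2.4 API (a simplified sketch, not my actual program; the threshold value is just a placeholder):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <vector>

// Sketch only: 'frame' is an 8-bit grayscale image, 20 is a placeholder threshold.
void detectBothWays(const cv::Mat& frame)
{
    // CPU path
    std::vector<cv::KeyPoint> cpuKeypoints;
    cv::FAST(frame, cpuKeypoints, 20, true /* non-max suppression */);

    // GPU path: upload the frame, run FAST_GPU, download the keypoints
    cv::gpu::GpuMat d_frame(frame);
    cv::gpu::FAST_GPU fastGpu(20, true /* non-max suppression */);
    std::vector<cv::KeyPoint> gpuKeypoints;
    fastGpu(d_frame, cv::gpu::GpuMat(), gpuKeypoints);
}
```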
Problem

    While the CPU versions of these functions behave predictably, the GPU implementations’ processing time varies wildly for equivalent input.

Since at first I wasn't sure whether this was just a perceived problem, I ran multiple trials comparing these functions with their CPU counterparts and wrote some MATLAB scripts to interpret the results. I'd like to point out that I understand the GPU versions may be slower than the CPU versions because of the time it takes to write to device memory, and I'm okay with that. I only need to achieve more consistent processing times, or at least to understand why that's impossible.
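
To be clear about what I'm measuring, the per-frame processing time is collected with something along these lines (a simplified sketch rather than my actual harness; CPU FAST stands in for whichever call is under test, and the threshold is a placeholder):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <cstdio>
#include <vector>

// Simplified sketch of the per-frame measurement: time one call with
// cv::getTickCount and log one line per frame for later analysis.
void timeFrame(const cv::Mat& frame, int frameIndex)
{
    std::vector<cv::KeyPoint> keypoints;

    int64 start = cv::getTickCount();
    cv::FAST(frame, keypoints, 20, true);  // stand-in for the call under test
    double ms = (cv::getTickCount() - start) * 1000.0 / cv::getTickFrequency();

    // frame number, elapsed time in ms, number of key points found
    std::printf("%d %.3f %d\n", frameIndex, ms, (int)keypoints.size());
}
```

Each trial produces one number per frame, and the MATLAB scripts then aggregate those per-frame numbers across trials.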

    About the Tests

All tests were carried out with the same video, using the same initial conditions. The CPU program is compiled with gcc 4.8.2, and the GPU version is compiled with nvcc 6.0.1. I haven't written any custom CUDA kernels in the GPU code. Other than that, the CPU and GPU versions differ only slightly, in ways that (as best I can tell) are irrelevant to the issue I'm having.

My test video runs about 1600 frames, and the number of KeyPoints I expect FAST to find varies from around 200 to several thousand.

    Examining FAST Feature Detection

I ran 3 trials each for the CPU and GPU FAST detectors. Here I've plotted the standard deviation of processing time across trials versus the frame number, for both the CPU and the GPU:

    http://i.imgur.com/klORMre.jpg

The CPU's standard deviation is 0.48 ms on average; the GPU's is 2.5 ms. This gives an overall idea of the inconsistency, but what's much more interesting is looking at processing time versus the number of key points the CPU and GPU FAST detectors find:

    http://i.imgur.com/c3qzzBo.jpg

The CPU is about what you'd expect: a fairly neat, linear increase in time as the number of key points increases. Trials are fairly consistent, though there is some variance. The GPU version, on the other hand, looks like this:

    http://i.imgur.com/3Fbv1CR.jpg

There are 3 distinct (and 1 faint) 'bands' (for lack of a better term) that are immediately apparent. All 3 trials populate these bands, and they exist for any number of key points detected. I have no idea what this means, or whether I could potentially lock the GPU FAST detector into using only one of these bands by killing threads that take too long to process. My application only needs approximately 200-300 key points to run accurately, so anything over that is overkill.
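
For what it's worth, capping the detector output itself is straightforward, for example by keeping only the strongest responses (a sketch; this trims the result after the fact and would not change how long the detection kernel runs, which is the part I can't control):

```cpp
#include <opencv2/features2d/features2d.hpp>
#include <algorithm>
#include <vector>

// Sketch: keep only the n key points with the strongest FAST response.
static bool byResponse(const cv::KeyPoint& a, const cv::KeyPoint& b)
{
    return a.response > b.response;
}

void keepStrongest(std::vector<cv::KeyPoint>& keypoints, size_t n)
{
    if (keypoints.size() <= n)
        return;
    std::partial_sort(keypoints.begin(), keypoints.begin() + n,
                      keypoints.end(), byResponse);
    keypoints.resize(n);
}
```

features2d also provides cv::KeyPointsFilter::retainBest, which does essentially the same thing.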

    Examining Optical Flow Calculation

For Lucas-Kanade optical flow calculation, I ran 5 trials instead of 3. The results are similar to what I found with the FAST feature detector. The processing time is greater overall, as is the standard deviation, which can be seen here:

    http://i.imgur.com/8muOgJl.jpg

The standard deviation for the CPU is 0.88 ms on average; on the GPU the average is 5.7 ms. As I mentioned earlier, my application only needs 200-300 key points to run effectively. I filter these key points even further, and by this point in the program generally no more than 60 key points will have their optical flow calculated. This means you'll see an artificial ceiling at 60 key points in my tests.
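
For reference, the GPU call for that step looks roughly like this (a sketch of the cv::gpu::PyrLKOpticalFlow sparse interface rather than my exact code; the window size and pyramid depth are placeholders):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <vector>

// Sketch: track a small, already-filtered set of points (<= ~60 here) from
// prevFrame to nextFrame with the GPU sparse pyramidal LK implementation.
// Returns how many points were tracked successfully.
int trackSparseGpu(const cv::Mat& prevFrame, const cv::Mat& nextFrame,
                   const std::vector<cv::Point2f>& prevPts)
{
    if (prevPts.empty())
        return 0;

    // The GPU API expects the input points as a 1 x N matrix of CV_32FC2.
    cv::Mat ptsMat = cv::Mat(prevPts).reshape(2, 1);

    cv::gpu::GpuMat d_prev(prevFrame), d_next(nextFrame);
    cv::gpu::GpuMat d_prevPts(ptsMat), d_nextPts, d_status;

    cv::gpu::PyrLKOpticalFlow lk;
    lk.winSize  = cv::Size(21, 21);  // placeholder parameters
    lk.maxLevel = 3;
    lk.sparse(d_prev, d_next, d_prevPts, d_nextPts, d_status);

    // Status is a 1 x N CV_8UC1 matrix: nonzero means the point was tracked.
    cv::Mat status;
    d_status.download(status);
    return cv::countNonZero(status);
}
```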

When you look at processing time versus the number of key points whose optical flow was successfully found, you get results similar to those from graphing FAST feature detection time versus key points found. On the CPU you see a broader but still distinctly linear increase:

    http://i.imgur.com/ev1o65h.jpg

And on the GPU you can again make out 3 bands, though they are notably harder to distinguish:

    http://i.imgur.com/xvq6WWC.jpg

I would expect that if I were calculating optical flow for hundreds or thousands of key points, I would see more distinct bands, as I did with feature detection.

    Conclusion

My goal in posting here is to understand why there is such high variance in processing time, and what (if anything) I can do about it. It just doesn't make sense to me that the time it takes to write to the GPU, run the calculations, and read the results back to host memory varies to the degree I've seen. I'm fairly new to the world of parallel computing, so it could very well be that there's something obvious I just don't understand at this point.
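
To make the upload / compute / readback split concrete, here is a rough sketch of how the stages could be timed separately, using FAST as the example (illustrative only):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>
#include <vector>

// Sketch: time the host-to-device copy, the detection, and the readback
// separately to see which stage is responsible for the variance.
void timeStages(const cv::Mat& frame, cv::gpu::FAST_GPU& detector)
{
    double msPerTick = 1000.0 / cv::getTickFrequency();

    int64 t0 = cv::getTickCount();
    cv::gpu::GpuMat d_frame(frame);                       // upload
    int64 t1 = cv::getTickCount();
    cv::gpu::GpuMat d_keypoints;
    detector(d_frame, cv::gpu::GpuMat(), d_keypoints);    // detect on the GPU
    int64 t2 = cv::getTickCount();
    std::vector<cv::KeyPoint> keypoints;
    detector.downloadKeypoints(d_keypoints, keypoints);   // readback
    int64 t3 = cv::getTickCount();

    std::printf("upload %.2f ms, detect %.2f ms, download %.2f ms\n",
                (t1 - t0) * msPerTick, (t2 - t1) * msPerTick,
                (t3 - t2) * msPerTick);
}
```

If the variance turned out to live almost entirely in the detection stage rather than the transfers, that would at least narrow down where to look.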

    I can post more results from my tests or parts of my code. Unfortunately, the program is too bulky to show in full.

    Thank you everyone in advance for any guidance you can give me.

    Hi @dmoreno

I understand it has now been more than two years since you posted this question. I am also looking for a fast solution to an optical flow problem using CUDA. Were you able to find an answer to your issue?

It would be great if you still have the code.