Inconsistencies in OpenCV GPU Results

Introduction

Hello, I'm not sure if this is the best place to post this question or if I should be going to the OpenCV forums. Regardless, I think this is something people here might at least find interesting. Any insight into why this is occurring, or into what alternatives I could pursue, would be greatly appreciated.

I've been working on a program that uses OpenCV 2.4.8 in C++ to do real-time motion tracking. With the program working the way we want it to, my team is trying to port the OpenCV functions that eat up a lot of processing time to their GPU counterparts. This is all on the NVIDIA Jetson TK1 board (specs here):

https://developer.nvidia.com/jetson-tk1

In particular, I'm interested in FAST feature detection and the (sparse) Lucas-Kanade method of optical flow calculation using pyramids. Here is the OpenCV documentation on these functions (a rough sketch of how both are called follows the list):

  • CPU FAST feature detection: http://docs.opencv.org/modules/features2d/doc/common_interfaces_of_feature_detectors.html#featuredetector
  • CPU Optical Flow: http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowpyrlk
  • GPU FAST feature detection: http://docs.opencv.org/modules/gpu/doc/feature_detection_and_description.html#gpu-fast-gpu
  • GPU Optical Flow: http://docs.opencv.org/modules/gpu/doc/video.html#gpu-pyrlkopticalflow-sparse
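
For context, here is roughly how the two FAST code paths are invoked in the 2.4 API (a simplified sketch, not my actual program; the threshold value is just a placeholder):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <vector>

// Sketch only: 'frame' is an 8-bit grayscale image, 20 is a placeholder threshold.
void detectBothWays(const cv::Mat& frame)
{
    // CPU path
    std::vector<cv::KeyPoint> cpuKeypoints;
    cv::FAST(frame, cpuKeypoints, 20, true /* non-max suppression */);

    // GPU path: upload the frame, run FAST_GPU, download the keypoints
    cv::gpu::GpuMat d_frame(frame);
    cv::gpu::FAST_GPU fastGpu(20, true /* non-max suppression */);
    std::vector<cv::KeyPoint> gpuKeypoints;
    fastGpu(d_frame, cv::gpu::GpuMat(), gpuKeypoints);
}
```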
Problem

    While the CPU versions of these functions behave predictably, the GPU implementations’ processing time varies wildly for equivalent input.

Since at first I wasn't sure whether this was just a perceived problem, I ran multiple trials comparing these functions with their CPU counterparts and wrote some MATLAB scripts to interpret the results. I'd like to point out that I understand the GPU versions may be slower than the CPU versions because of the time it takes to write to device memory, and I'm okay with that. I only need to achieve more consistent processing times, or at least to understand why that's impossible.
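
To be clear about what I'm measuring, the per-frame processing time is collected with something along these lines (a simplified sketch rather than my actual harness; CPU FAST stands in for whichever call is under test, and the threshold is a placeholder):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <cstdio>
#include <vector>

// Simplified sketch of the per-frame measurement: time one call with
// cv::getTickCount and log one line per frame for later analysis.
void timeFrame(const cv::Mat& frame, int frameIndex)
{
    std::vector<cv::KeyPoint> keypoints;

    int64 start = cv::getTickCount();
    cv::FAST(frame, keypoints, 20, true);  // stand-in for the call under test
    double ms = (cv::getTickCount() - start) * 1000.0 / cv::getTickFrequency();

    // frame number, elapsed time in ms, number of key points found
    std::printf("%d %.3f %d\n", frameIndex, ms, (int)keypoints.size());
}
```

Each trial produces one number per frame, and the MATLAB scripts then aggregate those per-frame numbers across trials.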

    About the Tests

All tests were carried out with the same video, using the same initial conditions. The CPU program is compiled with gcc 4.8.2, and the GPU version is compiled with nvcc 6.0.1. I haven't written any custom CUDA kernels in the GPU code. Other than that, the CPU and GPU versions differ only slightly, in ways that (as best I can tell) are irrelevant to the issue I'm having.

My test video runs about 1600 frames, and the number of KeyPoints I expect FAST to find varies from around 200 to several thousand.

    Examining FAST Feature Detection

I ran 3 trials each for the CPU and GPU FAST detectors. Here I've plotted the standard deviation of processing time across trials versus the frame number, for both the CPU and the GPU:

    http://i.imgur.com/klORMre.jpg

The CPU's standard deviation is 0.48 ms on average; the GPU's is 2.5 ms. This gives an overall idea of the inconsistency, but what's much more interesting is looking at processing time versus the number of key points the CPU and GPU FAST detectors find:

    http://i.imgur.com/c3qzzBo.jpg

The CPU is about what you'd expect: a fairly neat, linear increase in time as the number of key points increases. Trials are fairly consistent, though there is some variance. The GPU version, on the other hand, looks like this:

    http://i.imgur.com/3Fbv1CR.jpg

There are 3 distinct (and 1 faint) 'bands' (for lack of a better term) that are immediately apparent. All 3 trials populate these bands, and they exist for any number of key points detected. I have no idea what this means, or whether I could potentially lock the GPU FAST detector into using only one of these bands by killing threads that take too long to process. My application only needs approximately 200-300 key points to run accurately, so anything over that is overkill.
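
For what it's worth, capping the detector output itself is straightforward, for example by keeping only the strongest responses (a sketch; this trims the result after the fact and would not change how long the detection kernel runs, which is the part I can't control):

```cpp
#include <opencv2/features2d/features2d.hpp>
#include <algorithm>
#include <vector>

// Sketch: keep only the n key points with the strongest FAST response.
static bool byResponse(const cv::KeyPoint& a, const cv::KeyPoint& b)
{
    return a.response > b.response;
}

void keepStrongest(std::vector<cv::KeyPoint>& keypoints, size_t n)
{
    if (keypoints.size() <= n)
        return;
    std::partial_sort(keypoints.begin(), keypoints.begin() + n,
                      keypoints.end(), byResponse);
    keypoints.resize(n);
}
```

features2d also provides cv::KeyPointsFilter::retainBest, which does essentially the same thing.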

    Examining Optical Flow Calculation

For Lucas-Kanade optical flow calculation, I ran 5 trials instead of 3. The results are similar to what I found with the FAST feature detector. The processing time is greater overall, as is the standard deviation, which can be seen here:

    http://i.imgur.com/8muOgJl.jpg

The standard deviation for the CPU is 0.88 ms on average; on the GPU the average is 5.7 ms. As I mentioned earlier, my application only needs 200-300 key points to run effectively. I filter these key points even further, and by this point in the program generally no more than 60 key points will have their optical flow calculated. This means you'll see an artificial ceiling at 60 key points in my tests.
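
For reference, the GPU call for that step looks roughly like this (a sketch of the cv::gpu::PyrLKOpticalFlow sparse interface rather than my exact code; the window size and pyramid depth are placeholders):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <vector>

// Sketch: track a small, already-filtered set of points (<= ~60 here) from
// prevFrame to nextFrame with the GPU sparse pyramidal LK implementation.
// Returns how many points were tracked successfully.
int trackSparseGpu(const cv::Mat& prevFrame, const cv::Mat& nextFrame,
                   const std::vector<cv::Point2f>& prevPts)
{
    if (prevPts.empty())
        return 0;

    // The GPU API expects the input points as a 1 x N matrix of CV_32FC2.
    cv::Mat ptsMat = cv::Mat(prevPts).reshape(2, 1);

    cv::gpu::GpuMat d_prev(prevFrame), d_next(nextFrame);
    cv::gpu::GpuMat d_prevPts(ptsMat), d_nextPts, d_status;

    cv::gpu::PyrLKOpticalFlow lk;
    lk.winSize  = cv::Size(21, 21);  // placeholder parameters
    lk.maxLevel = 3;
    lk.sparse(d_prev, d_next, d_prevPts, d_nextPts, d_status);

    // Status is a 1 x N CV_8UC1 matrix: nonzero means the point was tracked.
    cv::Mat status;
    d_status.download(status);
    return cv::countNonZero(status);
}
```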

When you look at processing time versus the number of key points whose optical flow was successfully found, you get results similar to those from graphing FAST feature detection time versus key points found. On the CPU you see a broader but still distinctly linear increase:

    http://i.imgur.com/ev1o65h.jpg

And on the GPU you can again make out 3 bands, though they are notably harder to distinguish:

    http://i.imgur.com/xvq6WWC.jpg

I would expect that if I were calculating optical flow for hundreds or thousands of key points, I would see more distinct bands, as I did with feature detection.

    Conclusion

My goal in posting here is to understand why there is such high variance in processing time, and what (if anything) I can do about it. It just doesn't make sense to me that the time it takes to write to the GPU, run the calculations, and read the results back to host memory varies to the degree I've seen. I'm fairly new to the world of parallel computing, so it could very well be that there's something obvious I just don't understand at this point.
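
To make the upload / compute / readback split concrete, here is a rough sketch of how the stages could be timed separately, using FAST as the example (illustrative only):

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <cstdio>
#include <vector>

// Sketch: time the host-to-device copy, the detection, and the readback
// separately to see which stage is responsible for the variance.
void timeStages(const cv::Mat& frame, cv::gpu::FAST_GPU& detector)
{
    double msPerTick = 1000.0 / cv::getTickFrequency();

    int64 t0 = cv::getTickCount();
    cv::gpu::GpuMat d_frame(frame);                       // upload
    int64 t1 = cv::getTickCount();
    cv::gpu::GpuMat d_keypoints;
    detector(d_frame, cv::gpu::GpuMat(), d_keypoints);    // detect on the GPU
    int64 t2 = cv::getTickCount();
    std::vector<cv::KeyPoint> keypoints;
    detector.downloadKeypoints(d_keypoints, keypoints);   // readback
    int64 t3 = cv::getTickCount();

    std::printf("upload %.2f ms, detect %.2f ms, download %.2f ms\n",
                (t1 - t0) * msPerTick, (t2 - t1) * msPerTick,
                (t3 - t2) * msPerTick);
}
```

If the variance turned out to live almost entirely in the detection stage rather than the transfers, that would at least narrow down where to look.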

    I can post more results from my tests or parts of my code. Unfortunately, the program is too bulky to show in full.

    Thank you everyone in advance for any guidance you can give me.

    Hi @dmoreno

I understand it has now been more than two years since you posted this question. I am also looking for a fast solution to an optical flow problem using CUDA. Were you able to find an answer to your issue?

It would be great if you still have the code.