Hi,
I have been playing around with the camera, generic CUDA, and the hardware image and video features of the TX1. Great fun so far :)
I’d like to get some numbers showing how well OpenCV is accelerated by the TX1. I know there is a build of OpenCV that comes with JetPack, and I have that all set up. Unfortunately, tools such as opencv_perf_gpu do not seem to be packaged anywhere with the JetPack-installed OpenCV.
I’ve compiled OpenCV 2.4.12 with CUDA support in the hope of getting at opencv_perf_gpu and the metrics it can provide. However, running it with --perf_impl=cuda I get slightly slower performance than with --perf_impl=plain (i.e., CPU). At the start of a run I see the GPU info line repeated twice, once for the GM20B and once for a “Run on OS Linux x32” entry. The latter message makes me suspect the CUDA code is being executed on the CPU instead of the GPU.
FWIW the full line to execute is:
$ ./opencv_perf_gpu --gtest_filter=Sz_KernelSz_Filters_Filter2D.Filters_Filter2D/69 --perf_impl=cuda
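To rule out the binary simply not seeing the GPU, here is a minimal sketch I put together (assuming my own OpenCV 2.4.12 build with WITH_CUDA=ON; linked against opencv_core and opencv_gpu) that just asks the library whether a CUDA device is visible:

#include <cstdio>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    // 0 means the library was built without CUDA or no device was found
    int n = cv::gpu::getCudaEnabledDeviceCount();
    std::printf("CUDA-enabled devices: %d\n", n);
    if (n > 0)
    {
        cv::gpu::setDevice(0);            // the TX1's integrated GPU
        cv::gpu::printCudaDeviceInfo(0);  // name, compute capability, memory
    }
    return 0;
}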
I have also tried doing the nasty thing of hand-compiling opencv_perf_gpu from source and forcing it to use the provided system headers and libraries. I might try that again. It only allowed one of the solvePnPRansac tests, since I had to play around a lot to pick off the right sources and get a working implementation into the executable.
Any hints on this would be great! It is probably best to try to get understandable numbers at the raw OpenCV level before trying to get ROS to take advantage of the hardware.
I don’t think you’ll see an explicit perf_gpu benchmark contained within OpenCV4Tegra, because OpenCV4Tegra transparently implements NEON/GPU acceleration in a way that is ABI-compatible with upstream OpenCV. Users can simply re-link their existing CV code and automatically take advantage of the acceleration, so there isn’t an explicit GPU mode to benchmark like OpenCV’s CUDA/GPU module, which resides in a different namespace.
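To make the namespace distinction concrete, a rough sketch (image size and filter settings are just placeholders): the first call is the plain upstream API, which OpenCV4Tegra accelerates behind the scenes with no code changes, while the second goes through the explicit cv::gpu module that opencv_perf_gpu exercises:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::Mat src(1080, 1920, CV_8UC1, cv::Scalar::all(128));  // placeholder image
    cv::Mat dst;

    // 1) Standard OpenCV call: with OpenCV4Tegra any acceleration happens
    //    inside this function, same API as upstream.
    cv::GaussianBlur(src, dst, cv::Size(5, 5), 1.5);

    // 2) Explicit GPU module (what opencv_perf_gpu benchmarks): separate
    //    cv::gpu namespace with explicit upload/download of the data.
    cv::gpu::GpuMat d_src, d_dst;
    d_src.upload(src);
    cv::gpu::GaussianBlur(d_src, d_dst, cv::Size(5, 5), 1.5);
    d_dst.download(dst);

    return 0;
}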
Can you try running this script before launching your OpenCV application on TX1? [url]https://github.com/dusty-nv/jetson-scripts/blob/master/jetson_max_l4t.sh[/url] It increases the clock governor limits to maximum.
Thanks for this information! I have rebuilt OpenCV and then replaced the libraries produced by the build with the ones that came with the TX1. The outcome is that I have all the regular perf tools, but they link against the supplied OpenCV instead of the one I just built. I also used your jetson_max_l4t script before benchmarking. It finally got the fan to come on, although through an explicit demand in the script rather than the kernel deciding to turn it on.
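To double-check which build the perf tools were actually picking up, I used a small sketch like the one below (just a sanity check, nothing TX1-specific); running ldd on the perf binary is another quick way to see which libopencv_core the loader resolves to:

#include <cstdio>
#include <opencv2/core/core.hpp>

int main()
{
    // CV_VERSION reflects the headers this file was compiled against, while
    // getBuildInformation() describes the library linked in at runtime, so a
    // mismatch shows which libopencv_* the loader actually picked up.
    std::printf("Compiled against OpenCV %s\n", CV_VERSION);
    std::printf("%s\n", cv::getBuildInformation().c_str());
    return 0;
}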
The optimizations in the Tegra TX1 OpenCV are interesting. The cvtColor8u::Size_CvtMode::(1920x1080, CV_RGB2YCrCb) test takes 2 ms on the TX1 compared with 11 ms on an Intel 2600K. In general it’s interesting how well the TX1 holds its own against a 2600K. Many tests are slower on the TX1, but these are also micro-benchmarks, and even when GPU-accelerated, data setup and transfer might be more of a bottleneck than the processing itself.
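For anyone wanting a standalone number to compare against the perf framework's output, a timing sketch along these lines is what I had in mind (iteration count and image contents are arbitrary):

#include <cstdio>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

int main()
{
    cv::Mat src(1080, 1920, CV_8UC3, cv::Scalar::all(64));  // synthetic RGB frame
    cv::Mat dst;

    cv::cvtColor(src, dst, CV_RGB2YCrCb);  // warm-up, excludes one-time setup

    const int iters = 100;
    int64 t0 = cv::getTickCount();
    for (int i = 0; i < iters; ++i)
        cv::cvtColor(src, dst, CV_RGB2YCrCb);
    double ms = (cv::getTickCount() - t0) * 1000.0 / (cv::getTickFrequency() * iters);
    std::printf("cvtColor CV_RGB2YCrCb 1920x1080: %.2f ms/iter\n", ms);
    return 0;
}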
Some tests are drastically slower on the TX1, but as with all benchmarking, it comes down to which operations you actually use as to which numbers matter most.