I am developing a real-time video processing application for the Jetson TX1 using OpenCV. Instead of using OpenCV4Tegra, I have been forced to use OpenCV 3.1 with CUDA because of the video input stream bug (https://devtalk.nvidia.com/default/topic/929483/jetson-tx1/opencv-videocapture-usb-camera/). During testing, I’ve noticed that OpenCV4Tegra is an order of magnitude faster than OpenCV 3.1 with CUDA in the following morphological filtering benchmark:
OpenCV4Tegra (~3 ms per filter):
cv::Mat frame(360, 640, CV_8UC1);
randu(frame, cv::Scalar::all(0), cv::Scalar::all(255));
cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(20, 20));
cv::Mat result;
for(int i = 0; i < 100; ++i){
    // start benchmark
    cv::morphologyEx(frame, result, CV_MOP_CLOSE, element);
    // stop benchmark
}
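The // start benchmark / // stop benchmark markers stand in for a simple per-iteration wall-clock timer. Just to make the measurement explicit, it is roughly equivalent to the following sketch (illustrative only, using cv::getTickCount; not necessarily my exact harness):

// timing sketch for a single iteration (what the start/stop markers stand for)
int64 t0 = cv::getTickCount();
cv::morphologyEx(frame, result, CV_MOP_CLOSE, element);
int64 t1 = cv::getTickCount();
double elapsedMs = (t1 - t0) * 1000.0 / cv::getTickFrequency();  // elapsed time in ms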
OpenCV 3.1 with CUDA (~30 ms per filter):
cv::Mat frame(360, 640, CV_8UC1);
randu(frame, cv::Scalar::all(0), cv::Scalar::all(255));
cv::cuda::GpuMat frameGPU;
frameGPU.upload(frame);
cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(20, 20));
cv::Ptr<cv::cuda::Filter> closeFilter = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, CV_8UC1, element);
cv::cuda::GpuMat result;
for(int i = 0; i < 100; ++i){
    // start benchmark
    closeFilter->apply(frameGPU, result);
    // stop benchmark
}
Any idea what could be causing this huge discrepancy? I am already accounting for the CUDA initialization time, roughly along the lines of the sketch below.
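By "accounting for the CUDA initialization time" I mean doing a warm-up call and only starting the timer once the device is idle. A sketch of what I have in mind (this assumes a warm-up apply() plus a wait on the default stream is enough to exclude one-time setup and any asynchronous execution; it is not my exact code):

// warm-up: the first apply() pays the one-time CUDA setup cost
closeFilter->apply(frameGPU, result);
cv::cuda::Stream::Null().waitForCompletion();  // make sure the GPU is idle before timing

int64 t0 = cv::getTickCount();
for(int i = 0; i < 100; ++i){
    closeFilter->apply(frameGPU, result);
}
cv::cuda::Stream::Null().waitForCompletion();  // wait for all queued work before stopping the timer
int64 t1 = cv::getTickCount();
double msPerFilter = (t1 - t0) * 1000.0 / cv::getTickFrequency() / 100.0;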