OpenCV4Tegra vs. OpenCV 3.1 w/ CUDA

I am developing a real-time video processing application for the TX1 using OpenCV. Because of the video input stream bug (https://devtalk.nvidia.com/default/topic/929483/jetson-tx1/opencv-videocapture-usb-camera/), I have been forced to use OpenCV 3.1 with CUDA instead of OpenCV4Tegra. During testing, I noticed that OpenCV4Tegra is an order of magnitude faster than OpenCV 3.1 with CUDA on the following morphological close filter:

OpenCV4Tegra (~3 ms per filter)

cv::Mat frame(360, 640, CV_8UC1);
cv::randu(frame, cv::Scalar::all(0), cv::Scalar::all(255));  // random 8-bit test image

cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(20, 20));
cv::Mat result;
for (int i = 0; i < 100; ++i) {
  // start benchmark
  cv::morphologyEx(frame, result, cv::MORPH_CLOSE, element);  // C++ enum instead of legacy CV_MOP_CLOSE
  // stop benchmark
}

OpenCV 3.1 w/ CUDA (~30 ms per filter)

cv::Mat frame(360, 640, CV_8UC1);
cv::randu(frame, cv::Scalar::all(0), cv::Scalar::all(255));  // random 8-bit test image
cv::cuda::GpuMat frameGPU;
frameGPU.upload(frame);  // host-to-device copy, done once outside the timed loop

cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(20, 20));
cv::Ptr<cv::cuda::Filter> closeFilter = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, CV_8UC1, element);
cv::cuda::GpuMat result;
for (int i = 0; i < 100; ++i) {
  // start benchmark
  closeFilter->apply(frameGPU, result);  // result stays on the GPU; no download is timed
  // stop benchmark
}

Any idea what could be causing this huge discrepancy? I am accounting for the CUDA initialization time.
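For reference, here is a minimal sketch of how the "start benchmark" / "stop benchmark" placeholders could be filled in so that one-time setup cost stays out of the numbers. The helper names timeCalls and median are hypothetical (not part of OpenCV), and this is an assumption about a reasonable harness, not the code actually used for the figures above. It runs a few untimed warm-up calls first, then times each call with std::chrono::steady_clock.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `fn` `warmup` times untimed, then `iterations`
// times timed, and returns the per-call durations in milliseconds.
template <typename Fn>
std::vector<double> timeCalls(Fn&& fn, int warmup, int iterations) {
    assert(warmup >= 0 && iterations > 0);
    for (int i = 0; i < warmup; ++i) fn();  // absorbs one-time setup cost

    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Hypothetical helper: the median is less sensitive to a few slow
// outlier calls than the mean.
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}
```

With the GPU snippet above, the timed call would be something like timeCalls([&] { closeFilter->apply(frameGPU, result); }, 5, 100). If a non-default cv::cuda::Stream is passed to apply(), the call can return before the kernel finishes, so stream.waitForCompletion() would need to go inside the lambda as well.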

It may depend heavily on the build options. I would advise getting the build options of OpenCV4Tegra (cv::getBuildInformation() returns the options it was built with) and reusing them for your OpenCV 3.1 build. Skip the HAVE_TEGRA_OPTIMIZATION definition, as it requires non-open-source files.
Also check the processor flags (the feature flags listed in /proc/cpuinfo show what your target processor supports) and enable them in the CMake config.
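As a rough sketch of the /proc/cpuinfo check, the snippet below collects the feature tokens so they can be compared against the CPU-related options in the CMake config. The function name cpuFeatures is hypothetical, and the parsing assumes the usual "key : value" layout, where ARM kernels label the line "Features" and x86 kernels label it "flags".

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: collects the tokens of every "Features" (ARM) or
// "flags" (x86) line from text in /proc/cpuinfo format.
std::set<std::string> cpuFeatures(std::istream& in) {
    std::set<std::string> features;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        // The key is everything before the colon, minus trailing blanks/tabs.
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);
        if (key != "Features" && key != "flags") continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string tok;
        while (tokens >> tok) features.insert(tok);
    }
    return features;
}

// Convenience wrapper for the running machine (Linux only); returns an
// empty set if /proc/cpuinfo cannot be opened.
std::set<std::string> cpuFeatures() {
    std::ifstream in("/proc/cpuinfo");
    return cpuFeatures(in);
}
```

For example, cpuFeatures().count("neon") tells you whether the kernel reports NEON under that name. Which feature names appear depends on the kernel and on whether it is a 32-bit or 64-bit build, so the mapping from these tokens to CMake options has to be checked by hand.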