OpenCV GPU modules perform slowly on TK1

Hi there.
I’m using the GPU modules in OpenCV to accelerate an image processing program on a Jetson TK1. However, I have found that the performance actually gets worse.
For example, cv::erode() takes 3.41 ms to process an 868×600-pixel image, but cv::gpu::erode() takes almost 6 ms to process the same image, including 0.6 ms for GpuMat upload() and 1 ms for GpuMat download(). I initialized CUDA before running this test, but it did not help.
I don’t know what to do now. Could someone help me? Many thanks.
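
To make the numbers above easier to reason about, here is a minimal C++ sketch. It does two things: it subtracts the reported transfer times from the reported GPU total, and it shows one way to time a call with a few untimed warm-up runs, since the first CUDA call in a process commonly pays a one-time initialization cost. The names GpuTiming, kernel_only_ms and average_ms are hypothetical helpers, not part of OpenCV, and the 6.0 ms total is an approximation of the "almost 6 ms" figure. The sketch also assumes the call being timed is blocking, so wall-clock time covers the whole operation.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Timings for one GPU call, in milliseconds (values come from measurement).
struct GpuTiming {
    double total_ms;     // upload + kernel + download
    double upload_ms;    // host-to-device copy
    double download_ms;  // device-to-host copy
};

// Time left for the GPU operation itself once both transfers are subtracted.
double kernel_only_ms(const GpuTiming& t) {
    return t.total_ms - t.upload_ms - t.download_ms;
}

// Hypothetical helper: call `op` `warmup` times without timing it, then
// return the average wall-clock time per call over `iters` timed calls.
double average_ms(const std::function<void()>& op, int warmup, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup; ++i) {
        op();  // warm-up runs absorb one-time setup costs
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

With the figures from this post, kernel_only_ms(GpuTiming{6.0, 0.6, 1.0}) comes to roughly 4.4 ms, which is still above the 3.41 ms CPU time. So the transfers alone would not account for the gap. Wrapping each call in a lambda and passing it to average_ms gives a steadier per-call figure for both versions.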

Is the processing multi-threaded? A single-threaded application will not perform better on a GPU vs. an ARM processor (or other CPUs).

Sorry for the late reply.
I don’t know how a “multi-threaded application” is defined. I ran the same program on my laptop, which has a Core i5 and a GTX 850M, and there the GPU version is faster than the CPU version.
I don’t know how to tell whether my program is multi-threaded or not. Could you tell me more about this?
Many thanks.

It’s actually normal to get worse performance from morphology functions on the GPU than on the CPU. In my experience, the GPU performs better with BackgroundSubtractor (really good) and threshold, but really badly with MorphologyEx.