Data transfer between CPU and GPU is an issue

I’m computing the FFT of an image using OpenCV’s CUDA module, but the time it takes to transfer the data from CPU to GPU is longer than the FFT itself, so transfer + FFT on the GPU ends up slower than the FFT on the CPU alone. I don’t see any advantage to using the GPU in this case. Am I right? How can this be solved? I’ve heard about shared memory. Has anyone had a similar experience or any ideas?
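
From what I understand, the “shared memory” option would be cv::cuda::HostMem allocated with the SHARED flag: on Tegra the CPU and GPU share physical memory, so one buffer can be mapped into both address spaces and the explicit upload goes away. A rough, untested sketch of what I think that would look like (the helper name is just a placeholder; the input is assumed to be CV_32FC1):

#include "opencv2/core.hpp"
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaarithm.hpp"

// Sketch only: allocate one buffer visible to both CPU and GPU (requires
// hardware support for mapped memory, which Tegra has), fill it on the CPU
// side, and run the DFT on a GpuMat header over the same memory.
cv::cuda::GpuMat dftWithoutUpload(const cv::Mat& img32f)
{
        cv::cuda::HostMem shared(img32f.rows, img32f.cols, CV_32FC1,
                                 cv::cuda::HostMem::SHARED);
        //Ideally the image would be produced straight into this buffer
        img32f.copyTo(shared.createMatHeader());
        //Same memory, seen from the GPU side - no upload
        cv::cuda::GpuMat gpuview = shared.createGpuMatHeader();
        cv::cuda::GpuMat spectrum;
        cv::cuda::dft(gpuview, spectrum, img32f.size());
        return spectrum;
}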

EDIT
This is a sample of my testing code:

#include <stdexcept>
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
#include "opencv2/cudaimgproc.hpp"
#include "opencv2/cudaarithm.hpp"

#include <iostream>

int main (int argc, char *argv[])
{
        cv::Mat image = cv::imread(argv[1], cv::IMREAD_GRAYSCALE);

        int height = image.rows;
        int width = image.cols;

        //Convert to 32-bit floating point
        image.convertTo(image,CV_32FC1);
        
        //GPU-DFT
        int device = cv::cuda::getCudaEnabledDeviceCount();
        std::cout << "Number of CUDA devices: " << device << std::endl;
        cv::cuda::setDevice(cv::cuda::getDevice());
        //Get the optimal DFT size
        int h = cv::getOptimalDFTSize(image.rows);
        int w = cv::getOptimalDFTSize(image.cols);
        cv::Size dftsize(w, h);   //note: cv::Size is (width, height)
        cv::Mat sizedimage;
        cv::Mat transform(h, w/2 + 1, CV_32FC2);
        //Resize image to the optimal DFT size
        cv::resize(image, sizedimage, dftsize);
        //Upload image to GpuMat
        cv::cuda::GpuMat gputransform(h, w/2 + 1, CV_32FC2);
        cv::cuda::GpuMat gpuimage;
        gpuimage.upload(sizedimage);
        //DFT
        for (int i = 0; i < 3; i++)
        {
                double t = (double)cv::getTickCount();
                cv::cuda::dft(gpuimage, gputransform, sizedimage.size());
                t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
                std::cout << "Total time for GPU-DFT: " << t << std::endl;
        }
        //Download transformed image to CPU
        gputransform.download(transform);

        //CPU-DFT
        cv::Mat cputransform(h, w/2 + 1, CV_32FC2);
        double totalcputime = (double)cv::getTickCount();
        cv::dft(sizedimage, cputransform);
        totalcputime = ((double)cv::getTickCount() - totalcputime) / cv::getTickFrequency();
        std::cout << "\nTotal time for CPU-DFT: " << totalcputime << std::endl;

        return 0;
}

And that’s what I get:

Number of CUDA devices: 1
Total time for GPU-DFT: 1.08107
Total time for GPU-DFT: 0.0637337
Total time for GPU-DFT: 0.0400113

Total time for CPU-DFT: 0.785276

Hi JaTxGPU,

Could you share the sample you are using? Are you using opencv4tegra?

@WayneWWW I’ve edited my question. No, I’m not using opencv4tegra. There was an issue with that library on the Jetson, and I just noticed that it was solved a few days ago. I’ll install it today.

Could you try the cuFFT samples from the CUDA 8.0 toolkit and see if the operations are still slow?

The one that uses random data?
As far as I know, the issue is that it is not easy to use OpenCV and CUDA in the same code!

I think one problem here is that we don’t guarantee FFT performance in OpenCV. We have the cuFFT library, and it is still maintained.
Did you raise the GPU clock to maximum?

So how can I use cuFFT with my image?
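
Would something along these lines be the right direction? A rough, untested sketch calling cuFFT directly on a CV_32FC1 image (cufftPlan2d / cufftExecR2C are the standard cuFFT entry points; the helper name is a placeholder, and error checking and plan reuse across frames are omitted):

#include <cuda_runtime.h>
#include <cufft.h>
#include "opencv2/core.hpp"

// Sketch: 2D real-to-complex FFT of a continuous CV_32FC1 image with cuFFT.
// The output has h rows and w/2+1 complex columns (cuFFT's R2C packing).
cv::Mat cufftR2C(const cv::Mat& img)
{
        const int h = img.rows, w = img.cols;

        float* d_in = nullptr;
        cufftComplex* d_out = nullptr;
        cudaMalloc(&d_in, h * w * sizeof(float));
        cudaMalloc(&d_out, h * (w/2 + 1) * sizeof(cufftComplex));

        //Host-to-device copy of the image data (img is assumed continuous)
        cudaMemcpy(d_in, img.ptr<float>(), h * w * sizeof(float), cudaMemcpyHostToDevice);

        //Plan and run the transform
        cufftHandle plan;
        cufftPlan2d(&plan, h, w, CUFFT_R2C);
        cufftExecR2C(plan, d_in, d_out);
        cudaDeviceSynchronize();

        //Device-to-host copy of the packed spectrum into a 2-channel float Mat
        cv::Mat spectrum(h, w/2 + 1, CV_32FC2);
        cudaMemcpy(spectrum.ptr<float>(), d_out,
                   h * (w/2 + 1) * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

        cufftDestroy(plan);
        cudaFree(d_in);
        cudaFree(d_out);
        return spectrum;
}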

Another thing to realize is that “throughput” and “latency” are different things.
In high-performance, high-throughput, GPU-accelerated computing, you will typically have multiple pieces of code working on multiple parts of the problem at the same time.
The GPU might be working on sample N.
The CPU should then be working on dealing with the results of sample N-1, as well as working on preparing sample N+1.
If your code is fully serial (prepare sample, GPU processes sample, work on results of sample; repeat), you will get much lower throughput, because each piece of the pipeline is forced to sit idle while the others are working, and there is no overlapping work to soak up the latency.
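
To make that concrete, here is a rough, untested sketch using cv::cuda::Stream and pinned HostMem buffers, so the copy of one frame overlaps the DFT of another while the CPU consumes a third. The function name and frame source are placeholders; every frame is assumed to be CV_32FC1 of size w x h.

#include <vector>
#include "opencv2/core.hpp"
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaarithm.hpp"

// Sketch: double-buffered DFT pipeline. Two streams plus pinned (page-locked)
// host buffers let the copy of frame N+1 overlap the DFT of frame N while the
// CPU consumes the already-downloaded result of frame N-1.
void pipelineDft(const std::vector<cv::Mat>& frames, int h, int w)
{
        cv::cuda::Stream stream[2];
        cv::cuda::HostMem pinnedIn[2]  = { cv::cuda::HostMem(h, w, CV_32FC1),
                                           cv::cuda::HostMem(h, w, CV_32FC1) };
        cv::cuda::HostMem pinnedOut[2] = { cv::cuda::HostMem(h, w/2 + 1, CV_32FC2),
                                           cv::cuda::HostMem(h, w/2 + 1, CV_32FC2) };
        cv::cuda::GpuMat gpuIn[2], gpuOut[2];

        for (size_t i = 0; i < frames.size(); i++)
        {
                const int s = i % 2;
                stream[s].waitForCompletion();   //make sure this buffer pair is free again

                frames[i].copyTo(pinnedIn[s].createMatHeader());             //stage into pinned memory
                gpuIn[s].upload(pinnedIn[s].createMatHeader(), stream[s]);   //async host-to-device copy
                cv::cuda::dft(gpuIn[s], gpuOut[s], cv::Size(w, h), 0, stream[s]);
                gpuOut[s].download(pinnedOut[s].createMatHeader(), stream[s]);  //async device-to-host copy

                //While stream[s] works on frame i, the CPU could wait on stream[1 - s]
                //and consume frame i-1's result from pinnedOut[1 - s] here.
        }
        stream[0].waitForCompletion();
        stream[1].waitForCompletion();
}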