OpenCV dft vs. gpu::dft Performance

Hello, I am testing the OpenCV discrete Fourier transform (dft) function on my NVIDIA Jetson TX2, and I am wondering why the GPU dft function seems to run much slower than the CPU version. The code and output are shown below.

#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/gpumat.hpp>
#include <iostream>



int main()
{
        cv::Mat image = cv::imread("test.pgm",CV_LOAD_IMAGE_GRAYSCALE);
        cv::imshow("image",image);

        int height = image.rows;
        int width = image.cols;

        //Convert to 32-bit floating point
        image.convertTo(image,CV_32FC1);
        
        //GPU-DFT
        int device = cv::gpu::getCudaEnabledDeviceCount();
        std::cout<<"Number of CUDA devices: "<< device << std::endl;
        int getD = cv::gpu::getDevice();
        cv::gpu::setDevice(getD);
        //Get dft size (note: cv::Size takes width first, then height)
        int h = cv::getOptimalDFTSize( image.rows );
        int w = cv::getOptimalDFTSize( image.cols );
        cv::Size dftsize(w,h);
        cv::Mat sizedimage;
        cv::Mat transform = cv::Mat(h,w/2+1,CV_32FC2);
        //Resize image to the optimal DFT size
        cv::resize(image,sizedimage,dftsize);
        //Allocate GPU output and upload the image to a GpuMat
        cv::gpu::GpuMat gputransform = cv::gpu::GpuMat(h,w/2+1,CV_32FC2);
        cv::gpu::GpuMat gpuimage;
        gpuimage.upload(sizedimage);
        //DFT
        double t = (double)cv::getTickCount();
        cv::gpu::dft(gpuimage,gputransform,sizedimage.size());
        t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();
        std::cout<<"Total time for GPU-DFT: "<<t << std::endl;
        //Download transformed image to CPU
        gputransform.download(transform);

        //CPU-DFT
        cv::Mat cputransform = cv::Mat(h,w/2+1,CV_32FC2);
        double totalcputime = (double)cv::getTickCount();
        cv::dft(sizedimage,cputransform);
        totalcputime = ((double)cv::getTickCount() - totalcputime)/cv::getTickFrequency();
        std::cout<<"Total time for CPU-DFT: "<<totalcputime<<std::endl;
        return 0;
}

Output:

Number of CUDA devices: 1
Total time for GPU-DFT: 0.717235
Total time for CPU-DFT: 0.0867742

Problem is solved. I was unaware that the first iteration is the slowest when running the GPU version. Running the DFT in a for loop for ten iterations gives the timings below (a short warm-up sketch follows the numbers):

GPU-DFT Time: 0.740611
GPU-DFT Time: 0.0112295
GPU-DFT Time: 0.0141018
GPU-DFT Time: 0.00741959
GPU-DFT Time: 0.0055442
GPU-DFT Time: 0.00472504
GPU-DFT Time: 0.00470632
GPU-DFT Time: 0.0044916
GPU-DFT Time: 0.00450056
GPU-DFT Time: 0.00453567

CPU-DFT Time: 0.0560137
CPU-DFT Time: 0.0518014
CPU-DFT Time: 0.0507693
CPU-DFT Time: 0.0505741
CPU-DFT Time: 0.0504789
CPU-DFT Time: 0.0506245
CPU-DFT Time: 0.0499963
CPU-DFT Time: 0.050419
CPU-DFT Time: 0.0503551
CPU-DFT Time: 0.0503879
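
The slow first call is most likely one-time GPU setup (CUDA context creation and related initialization) rather than the transform itself, so that cost only has to be paid once per process. Below is a minimal sketch of how the measurement could exclude it, reusing the gpuimage, gputransform and sizedimage variables from the code above; treating the first call purely as a warm-up is my assumption, not something taken from the OpenCV documentation:

        //Untimed warm-up call: assumed to absorb the one-time GPU initialization cost
        cv::gpu::dft(gpuimage,gputransform,sizedimage.size());

        //Timed call: now measures only the steady-state DFT
        double t = (double)cv::getTickCount();
        cv::gpu::dft(gpuimage,gputransform,sizedimage.size());
        t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();
        std::cout<<"GPU-DFT time after warm-up: "<<t<<std::endl;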

Can I take a look at your CMakeLists.txt file, please?

I believe all I used to compile it was this line in the terminal:

g++ -std=c++11 main.cpp -I/usr/include -lopencv_core -lopencv_gpu -lopencv_highgui -lopencv_imgproc -o OpenCV_dft

Okay, thank you.
Could you post the updated code with the for loop changes, please?

#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/gpumat.hpp>
#include <iostream>



int main()
{
        cv::Mat image = cv::imread("test.pgm",CV_LOAD_IMAGE_GRAYSCALE);
        cv::imshow("image",image);

        int height = image.rows;
        int width = image.cols;

        //Convert to 32-bit floating point
        image.convertTo(image,CV_32FC1);
        
        //GPU-DFT
        int device = cv::gpu::getCudaEnabledDeviceCount();
        std::cout<<"Number of CUDA devices: "<< device << std::endl;
        int getD = cv::gpu::getDevice();
        cv::gpu::setDevice(getD);
        //Get dft size (note: cv::Size takes width first, then height)
        int h = cv::getOptimalDFTSize( image.rows );
        int w = cv::getOptimalDFTSize( image.cols );
        cv::Size dftsize(w,h);
        cv::Mat sizedimage;
        cv::Mat transform = cv::Mat(h,w/2+1,CV_32FC2);
        //Resize image to the optimal DFT size
        cv::resize(image,sizedimage,dftsize);
        //Allocate GPU output and upload the image to a GpuMat
        cv::gpu::GpuMat gputransform = cv::gpu::GpuMat(h,w/2+1,CV_32FC2);
        cv::gpu::GpuMat gpuimage;
        gpuimage.upload(sizedimage);
        
        for (int i=0;i<10;i++) {
                //DFT
                double t = (double)cv::getTickCount();
                cv::gpu::dft(gpuimage,gputransform,sizedimage.size());
                t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();
                std::cout<<"GPU-DFT Time: "<<t << std::endl;
        }

        //Download transformed image to CPU
        gputransform.download(transform);

        //CPU-DFT
        cv::Mat cputransform = cv::Mat(h,w/2+1,CV_32FC2);
        for (int i=0;i<10;i++) {
                double totalcputime = (double)cv::getTickCount();
                cv::dft(sizedimage,cputransform);
                totalcputime = ((double)cv::getTickCount() - totalcputime)/cv::getTickFrequency();
                std::cout<<"CPU-DFT Time: "<<totalcputime<<std::endl;
        }

        return 0;
}

How do you explain that time improvement?
And how can this be useful in practice if the DFT has to be run ten times?