OpenCV dft vs. gpu::dft Performance

Hello, I am testing the OpenCV discrete Fourier transform (dft) function on my NVIDIA Jetson TX2, and I am wondering why the GPU dft function seems to run much slower than the CPU version. The code and output are shown below.

#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/gpumat.hpp>
#include <iostream>



int main()
{
        cv::Mat image = cv::imread("test.pgm",CV_LOAD_IMAGE_GRAYSCALE);
        cv::imshow("image",image);

        int height = image.rows;
        int width = image.cols;

        //Convert to 32-bit floating point
        image.convertTo(image,CV_32FC1);
        
        //GPU-DFT
        int device = cv::gpu::getCudaEnabledDeviceCount();
        std::cout<<"Number of CUDA devices: "<< device << std::endl;
        int getD = cv::gpu::getDevice();
        cv::gpu::setDevice(getD);
        //Get dft size (note: cv::Size takes width first, then height)
        int h = cv::getOptimalDFTSize( image.rows );
        int w = cv::getOptimalDFTSize( image.cols );
        cv::Size dftsize(w,h);
        cv::Mat sizedimage;
        cv::Mat transform = cv::Mat(h,w/2+1,CV_32FC2);
        //Resize image to the optimal DFT size
        cv::resize(image,sizedimage,dftsize);
        //Allocate GPU output and upload the image to a GpuMat
        cv::gpu::GpuMat gputransform = cv::gpu::GpuMat(h,w/2+1,CV_32FC2);
        cv::gpu::GpuMat gpuimage;
        gpuimage.upload(sizedimage);
        //DFT
        double t = (double)cv::getTickCount();
        cv::gpu::dft(gpuimage,gputransform,sizedimage.size());
        t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();
        std::cout<<"Total time for GPU-DFT: "<<t << std::endl;
        //Download transformed image to CPU
        gputransform.download(transform);

        //CPU-DFT
        cv::Mat cputransform = cv::Mat(h,w/2+1,CV_32FC2);
        double totalcputime = (double)cv::getTickCount();
        cv::dft(sizedimage,cputransform);
        totalcputime = ((double)cv::getTickCount() - totalcputime)/cv::getTickFrequency();
        std::cout<<"Total time for CPU-DFT: "<<totalcputime<<std::endl;
        return 0;
}

Output:

Number of CUDA devices: 1
Total time for GPU-DFT: 0.717235
Total time for CPU-DFT: 0.0867742

Problem is solved. I was unaware that the first iteration is the slowest when running the GPU version. Running the DFT in a for loop for ten iterations gives the timings below (a short warm-up sketch follows the numbers):

GPU-DFT Time: 0.740611
GPU-DFT Time: 0.0112295
GPU-DFT Time: 0.0141018
GPU-DFT Time: 0.00741959
GPU-DFT Time: 0.0055442
GPU-DFT Time: 0.00472504
GPU-DFT Time: 0.00470632
GPU-DFT Time: 0.0044916
GPU-DFT Time: 0.00450056
GPU-DFT Time: 0.00453567

CPU-DFT Time: 0.0560137
CPU-DFT Time: 0.0518014
CPU-DFT Time: 0.0507693
CPU-DFT Time: 0.0505741
CPU-DFT Time: 0.0504789
CPU-DFT Time: 0.0506245
CPU-DFT Time: 0.0499963
CPU-DFT Time: 0.050419
CPU-DFT Time: 0.0503551
CPU-DFT Time: 0.0503879
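
The slow first call is most likely one-time GPU setup (CUDA context creation and related initialization) rather than the transform itself, so that cost only has to be paid once per process. Below is a minimal sketch of how the measurement could exclude it, reusing the gpuimage, gputransform and sizedimage variables from the code above; treating the first call purely as a warm-up is my assumption, not something taken from the OpenCV documentation:

        //Untimed warm-up call: assumed to absorb the one-time GPU initialization cost
        cv::gpu::dft(gpuimage,gputransform,sizedimage.size());

        //Timed call: now measures only the steady-state DFT
        double t = (double)cv::getTickCount();
        cv::gpu::dft(gpuimage,gputransform,sizedimage.size());
        t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();
        std::cout<<"GPU-DFT time after warm-up: "<<t<<std::endl;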

Can I take a look at your CMakeLists.txt file, please?

I believe all I used to compile it was this line in the terminal:

g++ -std=c++11 main.cpp -I/usr/include -lopencv_core -lopencv_gpu -lopencv_highgui -lopencv_imgproc -o OpenCV_dft

Okay, thank you.
Could you post the updated code with the for loop changes, please?

#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/gpumat.hpp>
#include <iostream>



int main()
{
        cv::Mat image = cv::imread("test.pgm",CV_LOAD_IMAGE_GRAYSCALE);
        cv::imshow("image",image);

        int height = image.rows;
        int width = image.cols;

        //Convert to 32-bit floating point
        image.convertTo(image,CV_32FC1);
        
        //GPU-DFT
        int device = cv::gpu::getCudaEnabledDeviceCount();
        std::cout<<"Number of CUDA devices: "<< device << std::endl;
        int getD = cv::gpu::getDevice();
        cv::gpu::setDevice(getD);
        //Get dft size (note: cv::Size takes width first, then height)
        int h = cv::getOptimalDFTSize( image.rows );
        int w = cv::getOptimalDFTSize( image.cols );
        cv::Size dftsize(w,h);
        cv::Mat sizedimage;
        cv::Mat transform = cv::Mat(h,w/2+1,CV_32FC2);
        //Resize image to the optimal DFT size
        cv::resize(image,sizedimage,dftsize);
        //Allocate GPU output and upload the image to a GpuMat
        cv::gpu::GpuMat gputransform = cv::gpu::GpuMat(h,w/2+1,CV_32FC2);
        cv::gpu::GpuMat gpuimage;
        gpuimage.upload(sizedimage);
        
        for (int i=0;i<10;i++) {
                //DFT
                double t = (double)cv::getTickCount();
                cv::gpu::dft(gpuimage,gputransform,sizedimage.size());
                t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();
                std::cout<<"GPU-DFT Time: "<<t << std::endl;
        }

        //Download transformed image to CPU
        gputransform.download(transform);

        //CPU-DFT
        cv::Mat cputransform = cv::Mat(h,w/2+1,CV_32FC2);
        for (int i=0;i<10;i++) {
                double totalcputime = (double)cv::getTickCount();
                cv::dft(sizedimage,cputransform);
                totalcputime = ((double)cv::getTickCount() - totalcputime)/cv::getTickFrequency();
                std::cout<<"CPU-DFT Time: "<<totalcputime<<std::endl;
        }

        return 0;
}

How do you explain that time improvement?
And how can this be useful in practice if the DFT has to be run ten times?