Data transfer between CPU and GPU is an issue

I’m computing the FFT of an image using OpenCV’s CUDA module, but the time it takes to transfer the data from CPU to GPU is longer than the FFT itself, so transfer + FFT on the GPU ends up slower than the FFT on the CPU alone. I don’t see any advantage to using the GPU in this case. Am I right? How can this be solved? I’ve heard about shared memory. Has anyone had a similar experience or any ideas?
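
From what I understand, the “shared memory” option would be cv::cuda::HostMem allocated with the SHARED flag: on Tegra the CPU and GPU share physical memory, so one buffer can be mapped into both address spaces and the explicit upload goes away. A rough, untested sketch of what I think that would look like (the helper name is just a placeholder; the input is assumed to be CV_32FC1):

#include "opencv2/core.hpp"
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaarithm.hpp"

// Sketch only: allocate one buffer visible to both CPU and GPU (requires
// hardware support for mapped memory, which Tegra has), fill it on the CPU
// side, and run the DFT on a GpuMat header over the same memory.
cv::cuda::GpuMat dftWithoutUpload(const cv::Mat& img32f)
{
        cv::cuda::HostMem shared(img32f.rows, img32f.cols, CV_32FC1,
                                 cv::cuda::HostMem::SHARED);
        //Ideally the image would be produced straight into this buffer
        img32f.copyTo(shared.createMatHeader());
        //Same memory, seen from the GPU side - no upload
        cv::cuda::GpuMat gpuview = shared.createGpuMatHeader();
        cv::cuda::GpuMat spectrum;
        cv::cuda::dft(gpuview, spectrum, img32f.size());
        return spectrum;
}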

EDIT
This is a sample of my testing code:

#include <stdexcept>
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
#include "opencv2/cudaimgproc.hpp"
#include "opencv2/cudaarithm.hpp"

#include <iostream>

int main (int argc, char *argv[])
{
        cv::Mat image = cv::imread(argv[1], cv::IMREAD_GRAYSCALE);

        int height = image.rows;
        int width = image.cols;

        //Convert to 32-bit floating point
        image.convertTo(image,CV_32FC1);
        
        //GPU-DFT
        int device = cv::cuda::getCudaEnabledDeviceCount();
        std::cout << "Number of CUDA devices: " << device << std::endl;
        cv::cuda::setDevice(cv::cuda::getDevice());
        //Get the optimal DFT size
        int h = cv::getOptimalDFTSize(image.rows);
        int w = cv::getOptimalDFTSize(image.cols);
        cv::Size dftsize(w, h);   //note: cv::Size is (width, height)
        cv::Mat sizedimage;
        cv::Mat transform(h, w/2 + 1, CV_32FC2);
        //Resize image to the optimal DFT size
        cv::resize(image, sizedimage, dftsize);
        //Upload image to GpuMat
        cv::cuda::GpuMat gputransform(h, w/2 + 1, CV_32FC2);
        cv::cuda::GpuMat gpuimage;
        gpuimage.upload(sizedimage);
        //DFT
        for (int i = 0; i < 3; i++)
        {
                double t = (double)cv::getTickCount();
                cv::cuda::dft(gpuimage, gputransform, sizedimage.size());
                t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
                std::cout << "Total time for GPU-DFT: " << t << std::endl;
        }
        //Download transformed image to CPU
        gputransform.download(transform);

        //CPU-DFT
        cv::Mat cputransform(h, w/2 + 1, CV_32FC2);
        double totalcputime = (double)cv::getTickCount();
        cv::dft(sizedimage, cputransform);
        totalcputime = ((double)cv::getTickCount() - totalcputime) / cv::getTickFrequency();
        std::cout << "\nTotal time for CPU-DFT: " << totalcputime << std::endl;

        return 0;
}

And that’s what I get:

Number of CUDA devices: 1
Total time for GPU-DFT: 1.08107
Total time for GPU-DFT: 0.0637337
Total time for GPU-DFT: 0.0400113

Total time for CPU-DFT: 0.785276

Hi JaTxGPU,

Could you share the sample you are using? Are you using opencv4tegra?

@WayneWWW I’ve edited my question. No, I’m not using opencv4tegra. There was an issue with that library on the Jetson, and I just noticed that it was solved a few days ago. I’ll install it today.

Could you try the cuFFT samples from the CUDA 8.0 toolkit and see if the operations are still slow?

The one that uses random data?
As far as I know, the issue is that it is not easy to use OpenCV and CUDA in the same code!

I think one problem here is that we don’t guarantee FFT performance in OpenCV. We have the cuFFT library, and it is still maintained.
Did you raise the GPU clock to maximum?

So how can I use cuFFT with my image?
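
Would something along these lines be the right direction? A rough, untested sketch calling cuFFT directly on a CV_32FC1 image (cufftPlan2d / cufftExecR2C are the standard cuFFT entry points; the helper name is a placeholder, and error checking and plan reuse across frames are omitted):

#include <cuda_runtime.h>
#include <cufft.h>
#include "opencv2/core.hpp"

// Sketch: 2D real-to-complex FFT of a continuous CV_32FC1 image with cuFFT.
// The output has h rows and w/2+1 complex columns (cuFFT's R2C packing).
cv::Mat cufftR2C(const cv::Mat& img)
{
        const int h = img.rows, w = img.cols;

        float* d_in = nullptr;
        cufftComplex* d_out = nullptr;
        cudaMalloc(&d_in, h * w * sizeof(float));
        cudaMalloc(&d_out, h * (w/2 + 1) * sizeof(cufftComplex));

        //Host-to-device copy of the image data (img is assumed continuous)
        cudaMemcpy(d_in, img.ptr<float>(), h * w * sizeof(float), cudaMemcpyHostToDevice);

        //Plan and run the transform
        cufftHandle plan;
        cufftPlan2d(&plan, h, w, CUFFT_R2C);
        cufftExecR2C(plan, d_in, d_out);
        cudaDeviceSynchronize();

        //Device-to-host copy of the packed spectrum into a 2-channel float Mat
        cv::Mat spectrum(h, w/2 + 1, CV_32FC2);
        cudaMemcpy(spectrum.ptr<float>(), d_out,
                   h * (w/2 + 1) * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

        cufftDestroy(plan);
        cudaFree(d_in);
        cudaFree(d_out);
        return spectrum;
}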

Another thing to realize is that “throughput” and “latency” are different things.
In high-performance, high-throughput, GPU-accelerated computing, you will typically have multiple pieces of code working on multiple parts of the problem at the same time.
The GPU might be working on sample N.
The CPU should then be working on dealing with the results of sample N-1, as well as working on preparing sample N+1.
If your code is fully serial (prepare sample, GPU processes sample, work on results of sample; repeat), you will get much lower throughput, because each piece of the pipeline is forced to sit idle while the others are working, and there is no overlapping work to soak up the latency.
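
To make that concrete, here is a rough, untested sketch using cv::cuda::Stream and pinned HostMem buffers, so the copy of one frame overlaps the DFT of another while the CPU consumes a third. The function name and frame source are placeholders; every frame is assumed to be CV_32FC1 of size w x h.

#include <vector>
#include "opencv2/core.hpp"
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaarithm.hpp"

// Sketch: double-buffered DFT pipeline. Two streams plus pinned (page-locked)
// host buffers let the copy of frame N+1 overlap the DFT of frame N while the
// CPU consumes the already-downloaded result of frame N-1.
void pipelineDft(const std::vector<cv::Mat>& frames, int h, int w)
{
        cv::cuda::Stream stream[2];
        cv::cuda::HostMem pinnedIn[2]  = { cv::cuda::HostMem(h, w, CV_32FC1),
                                           cv::cuda::HostMem(h, w, CV_32FC1) };
        cv::cuda::HostMem pinnedOut[2] = { cv::cuda::HostMem(h, w/2 + 1, CV_32FC2),
                                           cv::cuda::HostMem(h, w/2 + 1, CV_32FC2) };
        cv::cuda::GpuMat gpuIn[2], gpuOut[2];

        for (size_t i = 0; i < frames.size(); i++)
        {
                const int s = i % 2;
                stream[s].waitForCompletion();   //make sure this buffer pair is free again

                frames[i].copyTo(pinnedIn[s].createMatHeader());             //stage into pinned memory
                gpuIn[s].upload(pinnedIn[s].createMatHeader(), stream[s]);   //async host-to-device copy
                cv::cuda::dft(gpuIn[s], gpuOut[s], cv::Size(w, h), 0, stream[s]);
                gpuOut[s].download(pinnedOut[s].createMatHeader(), stream[s]);  //async device-to-host copy

                //While stream[s] works on frame i, the CPU could wait on stream[1 - s]
                //and consume frame i-1's result from pinnedOut[1 - s] here.
        }
        stream[0].waitForCompletion();
        stream[1].waitForCompletion();
}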