Comparing CUDA FFT and MATLAB FFT

shinkee · January 23, 2008, 10:26am

Hi all, I’ve got CUDA (Quadro FX 1700) running on Fedora 8, and now I’m trying to get some evidence of a speedup by comparing CUFFT with MATLAB’s fft.

The MATLAB code and the simple CUDA code I use to get the timings are pasted below. So far I’m having trouble observing any real speedup from CUDA. Currently, when I call timing(2048*2048, 6), my output is

CUFFT: 

Elapsed time is 1.038155 seconds.

MATLAB FFT: 

Elapsed time is 1.596426 seconds.

which doesn’t seem so impressive…

So can anyone point out how I can get a speedup of maybe 10x in the FFT code, as mentioned in the white paper on this page? Thanks in advance!

timing.m (it can be called with, for example, “timing(2048, 4)”)

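function [a, b] = timing(datasize, batch)
% Fill each of the `batch` columns with the ramp 1..datasize.
d = zeros(datasize, batch);
for i = 1:datasize
    d(i,:) = i;
end
% Time the CUFFT MEX wrapper, then MATLAB's built-in fft, on the same data.
disp('CUFFT: '); tic; a = mexCUFFT(d); toc;
disp('MATLAB FFT: '); tic; b = fft(d); toc;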

mexCUFFT.cu (compiled with the command “$MATLAB_CUDA/nvmex -f $MATLAB_CUDA/nvopts.sh -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft -lcudart” where $MATLAB_CUDA is the path of the matlab plugin for cuda)

#include "cufft.h"
#include "cuda.h"
#include "mex.h"
#include "cuda_runtime.h"

void pack_r2c(cufftComplex *output_float,
              double *input_re,
              int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++)
    {
        output_float[i].x = input_re[i];
        output_float[i].y = 0.0f;
    }
}

void pack_c2c(cufftComplex *output_float,
              double *input_re,
              double *input_im,
              int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++)
    {
        output_float[i].x = input_re[i];
        output_float[i].y = input_im[i];
    }
}

void unpack_c2c(cufftComplex *input_float,
                double *output_re,
                double *output_im,
                int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++)
    {
        output_re[i] = input_float[i].x;
        output_im[i] = input_float[i].y;
    }
}

cufftComplex *runfft(cufftComplex *data, int m, int n);

// Compute the column-wise FFT of a MATLAB matrix using CUFFT.
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int m, n;
    double *inDataR, *inDataI, *outDataR, *outDataI;
    cufftComplex *data;

    if (nrhs < 1) mexErrMsgTxt("Input argument not defined.");
    if (!mxIsDouble(prhs[0])) mexErrMsgTxt("Input must be a double array.");

    m = mxGetM(prhs[0]);
    n = mxGetN(prhs[0]);

    /* Allocate host memory for the single-precision copy. */
    data = (cufftComplex *)mxMalloc(sizeof(cufftComplex) * n * m);

    inDataR = mxGetPr(prhs[0]);
    if (mxIsComplex(prhs[0]))
    {
        /* Complex input: pack the real and imaginary parts. */
        inDataI = mxGetPi(prhs[0]);
        pack_c2c(data, inDataR, inDataI, m * n);
    }
    else
    {
        /* Real input: the imaginary part is set to zero. */
        pack_r2c(data, inDataR, m * n);
    }

    data = runfft(data, m, n);

    /* Convert the result back to a complex double matrix. */
    plhs[0] = mxCreateDoubleMatrix(m, n, mxCOMPLEX);
    outDataR = mxGetPr(plhs[0]);
    outDataI = mxGetPi(plhs[0]);
    unpack_c2c(data, outDataR, outDataI, n * m);

    mxFree(data);
    return;
}

cufftComplex *runfft(cufftComplex *data, int m, int n)
{
    cufftComplex *d_data;
    cufftHandle plan;
    size_t bytes = sizeof(cufftComplex) * m * n;

    // Allocate device memory for data
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess)
        mexErrMsgTxt("cudaMalloc failed.");

    // Copy host memory to device
    cudaMemcpy(d_data, data, bytes, cudaMemcpyHostToDevice);

    // CUFFT plan: n batched 1D transforms of length m (one per column)
    if (cufftPlan1d(&plan, m, CUFFT_C2C, n) != CUFFT_SUCCESS)
    {
        cudaFree(d_data);
        mexErrMsgTxt("cufftPlan1d failed.");
    }

    // FFT execution (in place, forward transform)
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    // Copy result to host
    cudaMemcpy(data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Release the plan and the device memory
    cufftDestroy(plan);
    cudaFree(d_data);

    return data;
}

shinkee · January 24, 2008, 5:41am

So does anyone know anything about this?
What speedup did you guys get with CUFFT?

DenisR · January 24, 2008, 7:04am

Well, the MATLAB plugin comes with an example that also uses FFT, and I see great speedups with it. So you might want to check how much speedup you get there. If it gives you a large speedup, you can start looking at what differs between the two (FFT length, etc.).

strahl · February 5, 2008, 6:10pm

Hi shinkee :)

MATLAB’s fft() uses the libfftw3 library (FFTW stands for “Fastest Fourier Transform in the West” x)). I assume that the paper you are referring to compares CUFFT with a standard FFT implementation? Have a look at http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/ for the expected performance gain when using MATLAB.

:) stefan

tom_TUD · February 5, 2008, 6:45pm

I have the same question, look here: The Official NVIDIA Forums | NVIDIA

yahastu · February 10, 2008, 4:25pm

I am new to CUDA, but don’t forget that there is significant overhead involved in simply passing the data to the GPU and then retrieving it again. So perhaps you should first measure how long it takes to send the same amount of data to the GPU and back (with no CUDA processing), and then subtract that from the CUDA FFT runtime. That gives a better idea of how much time the processing itself takes, and hence how well it scales to larger problems.
