How can I get good performance from cuFFT?

Anders_G · June 7, 2016, 7:42pm

Hi!

I need to move some calculations to the GPU, where I will compute a batch of 32 2D FFTs, each of size 600 x 600. When I compare the performance of cuFFT with MATLAB's GPU fft, cuFFT is much slower, typically by a factor of 10 (after I have removed all overhead from things like plan creation). How is this possible? Is this what to expect from cuFFT, or is there any way to speed it up? (I would simply use MATLAB's fft if I could, but when I mix it with some iffts, sums and element-wise multiplications it becomes very slow in an unpredictable way.)

// The core of my code

mwSize ndim = mxGPUGetNumberOfDimensions(C_q);    
mwSize const * dimSize = mxGPUGetDimensions(C_q);


// FFT test
cufftHandle plan;
int dd[3];
dd[1] = (int)dimSize[0];
dd[0] = (int)dimSize[1];
dd[2] = (int)dimSize[2];

int Nq = dd[2];
dimSize = mxGPUGetDimensions(Phi_j);
int L = dimSize[2];


// Note: quite some overhead here. Uses the default settings for the memory layout. It seems to give the right answer. OK?
cufftPlanMany(&plan, 2, dd, NULL,0,0,NULL,0,0,CUFFT_C2C,Nq);


// Loop and sum over singular values
for (int i = 0; i<L; i++)
{
    // Do the fft
    cufftExecC2C(plan,(cufftComplex *) pS_q,(cufftComplex *) pC_q,CUFFT_FORWARD);

}
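
On the memory-layout question in the comment above, here is a minimal sketch in plain host-side C++ of why the sizes are passed in reverse order. It assumes MATLAB's column-major storage and cuFFT's default contiguous layout (the one used when `inembed` and `onembed` are `NULL`). The helper names `matlabOffset`, `cufftOffset` and `layoutsAgree` are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (r, c, b) in a MATLAB array of size rows x cols x batch
// (column-major: the first index varies fastest).
std::size_t matlabOffset(std::size_t r, std::size_t c, std::size_t b,
                         std::size_t rows, std::size_t cols) {
    return r + rows * (c + cols * b);
}

// Offset of element (i0, i1) of transform b in the default layout of a
// 2D plan with sizes n0 x n1 (row-major: the last index varies fastest).
std::size_t cufftOffset(std::size_t i0, std::size_t i1, std::size_t b,
                        std::size_t n0, std::size_t n1) {
    return b * n0 * n1 + i0 * n1 + i1;
}

// Returns true if passing n = {cols, rows} makes both layouts address
// the same memory location for every element.
bool layoutsAgree(std::size_t rows, std::size_t cols, std::size_t batch) {
    for (std::size_t b = 0; b < batch; ++b)
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                if (matlabOffset(r, c, b, rows, cols) !=
                    cufftOffset(c, r, b, cols, rows))
                    return false;
    return true;
}
```

For a square 600 x 600 input both orders produce the same plan sizes, so a swapped order would go unnoticed there. For non-square slices the order matters.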

/ Anders

Anders_G · June 7, 2016, 9:02pm

An update with a full example for someone to test:

% Matlab side code

% Compile using:
% >> mexcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\lib\x64" -lcufft abc.cu

A = gpuArray.randn(600,600,32,'single') + 1i*randn(600,600,32,'single');

tic
B = abc(A);
toc;

tic, for ii = 1:30, B = fft2(A); end; toc

AA = gather(A);
tic, for ii = 1:30, B = fft2(AA); end; toc

%% Output from a run

testabc
Elapsed time is 0.193155 seconds. % Mex file
Elapsed time is 0.004172 seconds. % Matlab fft2
Elapsed time is 1.455618 seconds. % Matlab CPU
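
A rough sanity check on these numbers, sketched in host-side C++. The function name `minBandwidthGBs` is made up here. The sketch assumes each out-of-place transform must read and write every element at least once, and it takes the sizes and timings from the run above (30 transforms of a 600 x 600 x 32 complex single array).

```cpp
#include <cassert>

// Lower bound on memory traffic, in GB/s (1 GB = 1e9 bytes), for `repeats`
// out-of-place batched single-precision C2C FFTs that finish in `seconds`.
double minBandwidthGBs(long long rows, long long cols, long long batch,
                       int repeats, double seconds) {
    const double bytesPerElement = 8.0;  // complex single = two 4-byte floats
    const double elements = static_cast<double>(rows) * cols * batch;
    // Each pass reads the input once and writes the output once.
    const double bytesMoved = 2.0 * elements * bytesPerElement * repeats;
    return bytesMoved / seconds / 1e9;
}
```

With the values above, the 0.004172 s `fft2` timing would need more than 1300 GB/s of memory traffic, which seems implausibly high for a single GPU. One possible explanation is that MATLAB GPU calls can return before the GPU has finished, so `tic`/`toc` without `wait(gpuDevice)` may capture only the launch time. The MEX timing, by contrast, includes the plan creation inside `abc`.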

// Mex-file code in the file abc.cu

#include "mex.h"
#include "gpu/mxGPUArray.h"
#include <cufft.h>

// Internal type for complex. Same as cufftComplex, just another name
typedef float2 Complex;

/*
 * Device code
 */

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{

char const * const errId = "parallel:gpu:mexGPUExample:InvalidInput";
char const * const errMsg = "Invalid input to MEX file.";


/* Declare all variables.*/
mxGPUArray const *A;
mxGPUArray *B;

Complex const *pA;

Complex *pB;




/* Initialize the MathWorks GPU API. */
mxInitGPU();

/* Throw an error unless there is exactly one input and it is a GPU array. */
if (nrhs != 1 || !mxIsGPUArray(prhs[0])) {
    mexErrMsgIdAndTxt(errId, errMsg);
}

A = mxGPUCreateFromMxArray(prhs[0]);



// Verify that the input is a complex single 3-D array before extracting the pointer.

if (mxGPUGetClassID(A) != mxSINGLE_CLASS ||
    mxGPUGetComplexity(A) != mxCOMPLEX ||
    mxGPUGetNumberOfDimensions(A) != 3)
{
    mexErrMsgIdAndTxt(errId, errMsg);
}


/* Get the pointer to the data */
pA = (Complex const *)(mxGPUGetDataReadOnly(A));



/* Create a GPUArray to hold the result and get its underlying pointer. */
B = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(A),
                        mxGPUGetDimensions(A),
                        mxGPUGetClassID(A),
                        mxGPUGetComplexity(A),
                        MX_GPU_DO_NOT_INITIALIZE);

pB = (Complex *)(mxGPUGetData(B));






// Now we can do work!  
mwSize const * dimSize = mxGPUGetDimensions(A);


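// FFT test
cufftHandle plan;
// MATLAB stores arrays column-major, so the fastest-varying dimension
// (dimSize[0]) must come last in the size array passed to cuFFT.
int dd[2];
dd[0] = (int) dimSize[1];
dd[1] = (int) dimSize[0];

int Nq = (int) dimSize[2];   // number of 2D transforms in the batch
int L = 30;                  // repeat count, used only for timing

// Create the plan once, outside the loop. NULL embed arrays select the
// default contiguous layout.
if (cufftPlanMany(&plan, 2, dd, NULL,0,0,NULL,0,0,CUFFT_C2C,Nq) != CUFFT_SUCCESS) {
    mexErrMsgIdAndTxt(errId, "cufftPlanMany failed.");
}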
for (int i = 0; i<L; i++)
{
    // Do the fft
    cufftExecC2C(plan,(cufftComplex *) pA,(cufftComplex *) pB,CUFFT_FORWARD);
}




/* Wrap the result up as a MATLAB gpuArray for return. */
plhs[0] = mxGPUCreateMxArrayOnGPU(B);

// Free resources
cufftDestroy(plan);
mxFree((void *) dimSize);   // the array returned by mxGPUGetDimensions must be freed

mxGPUDestroyGPUArray(A);
mxGPUDestroyGPUArray(B);

}

Robert_Crovella · June 8, 2016, 2:09am

Why are you doing the same FFT L times in a row?

Can you do it as a batch instead?

What GPU are you running this on?
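
One way to read the batching suggestion, as a hedged sketch in host-side C++. It assumes the L inputs are stored back to back as one rows x cols x Nq x L array, which the posted loop does not do (it reuses the same buffers), so that a single plan can cover all Nq*L transforms. The struct `BatchParams` and the function `foldLoopIntoBatch` are hypothetical names; only the resulting values would be handed to `cufftPlanMany`.

```cpp
#include <cassert>

// Hypothetical holder for the arguments cufftPlanMany would receive.
struct BatchParams {
    int n[2];    // transform sizes, slowest dimension first
    int batch;   // number of 2D transforms executed by one cufftExec call
};

// Fold a loop of L executions over Nq slices into one batch of Nq * L.
BatchParams foldLoopIntoBatch(int rows, int cols, int Nq, int L) {
    BatchParams p;
    p.n[0] = cols;   // MATLAB's second dimension is the slower one in a slice
    p.n[1] = rows;   // MATLAB's first dimension varies fastest
    p.batch = Nq * L;
    return p;
}

// Intended use (not compiled here, since it requires cuFFT and a GPU):
//   BatchParams p = foldLoopIntoBatch(600, 600, 32, 30);
//   cufftPlanMany(&plan, 2, p.n, NULL, 0, 0, NULL, 0, 0, CUFFT_C2C, p.batch);
//   cufftExecC2C(plan, in, out, CUFFT_FORWARD);   // one call, no loop
```

For the sizes in this thread that gives a batch of 960 transforms, so the input and output buffers would each need about 2.8 GB. Whether that fits depends on the GPU.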
