FFTW Vs CUFFT Performance

stuartlittle_80 · March 4, 2008, 9:54pm

Hello,

Can anyone help me with this?

Old Code: Inside Fortran

call sfftw_plan_dft_3d(plan,n1,n2,n3,cx,cx,ifset,64)
call sfftw_execute (plan)
call sfftw_destroy_plan (plan)

New Code: Inside Fortran
call tempfft(n1,n2,n3,cx,direction)

tempfft.cu
#include <stdio.h>
#include <cufft.h>
#include <cutil.h>
#include <cuComplex.h>
#include "cuda.h"

/* Called from Fortran, which passes every argument by reference,
   so direction is received as a pointer, like n1..n3. */
extern "C" void tempfft_(int *n1, int *n2, int *n3, cufftComplex *data, int *direction)
{
int Nx = *n1;
int Ny = *n2;
int Nz = *n3;
cufftComplex *d_data;

    CUT_DEVICE_INIT();
    CUDA_SAFE_CALL(cudaMalloc((void**) &d_data, sizeof(cufftComplex)*Nx*Ny*Nz));

    /* Copy the input volume to the device. */
    CUDA_SAFE_CALL(cudaMemcpy(d_data, data, Nx*Ny*Nz*sizeof(cufftComplex), cudaMemcpyHostToDevice));

    /* cuFFT is row-major, so the Fortran dimensions are passed in reverse order. */
    cufftHandle plan1;
    CUFFT_SAFE_CALL(cufftPlan3d(&plan1, Nz, Ny, Nx, CUFFT_C2C));

    /* In-place transform; a negative direction selects the forward FFT (FFTW_FORWARD is -1). */
    if(*direction<0)
            CUFFT_SAFE_CALL(cufftExecC2C(plan1, (cufftComplex *)d_data, (cufftComplex *)d_data, CUFFT_FORWARD));
    else
            CUFFT_SAFE_CALL(cufftExecC2C(plan1, (cufftComplex *)d_data, (cufftComplex *)d_data, CUFFT_INVERSE));

    /* Copy the result back to the host array. */
    CUDA_SAFE_CALL(cudaMemcpy(data, d_data, Nx*Ny*Nz*sizeof(cufftComplex), cudaMemcpyDeviceToHost));

    CUFFT_SAFE_CALL(cufftDestroy(plan1));
    cudaFree(d_data);
    return;

}

When I run the above code inside a big Fortran application, the FFTW version takes about 21 minutes for each step, while the CUDA version takes about 66 minutes for each step.

Is there any way I can improve the performance?

Thanks

MattWarmuth · March 6, 2008, 9:10pm

It would be better to set up the plan once, outside of this FFT call, and reuse it instead of creating a new one every time you want to do an FFT. This assumes, of course, that you’re doing the same size and type (C2C, C2R, etc.) of FFT every time.
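A minimal sketch of that idea, assuming the transform size stays the same from step to step. The names fft3d_cached, fft3d_release, tempfft_cached_ and tempfft_release_ are hypothetical, not part of cuFFT or of the original code. Keeping the device buffer around as well is an extra assumption in the same spirit, and the trailing-underscore wrapper names assume the same Fortran name mangling the original tempfft_ relies on.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical cached state: one plan and one device buffer kept between calls.
static cufftHandle   g_plan;
static cufftComplex *g_dev = NULL;
static int g_nx = 0, g_ny = 0, g_nz = 0;   // size the cached state was built for

// Free the cached plan and buffer (call once at the end of the run).
void fft3d_release(void)
{
    if (g_dev) {
        cufftDestroy(g_plan);
        cudaFree(g_dev);
        g_dev = NULL;
        g_nx = g_ny = g_nz = 0;
    }
}

// In-place 3D C2C FFT of a column-major nx*ny*nz host array.
// direction < 0 means forward, otherwise inverse (unnormalized).
// Returns 0 on success, nonzero on failure.
int fft3d_cached(int nx, int ny, int nz, cufftComplex *data, int direction)
{
    size_t bytes = sizeof(cufftComplex) * (size_t)nx * ny * nz;

    // Rebuild the plan and buffer only when the size changes.
    if (!g_dev || nx != g_nx || ny != g_ny || nz != g_nz) {
        fft3d_release();
        if (cudaMalloc((void **)&g_dev, bytes) != cudaSuccess) {
            g_dev = NULL;
            return 1;
        }
        // Fortran storage is column-major, so dimensions go in reverse order.
        if (cufftPlan3d(&g_plan, nz, ny, nx, CUFFT_C2C) != CUFFT_SUCCESS) {
            cudaFree(g_dev);
            g_dev = NULL;
            return 2;
        }
        g_nx = nx; g_ny = ny; g_nz = nz;
    }

    if (cudaMemcpy(g_dev, data, bytes, cudaMemcpyHostToDevice) != cudaSuccess)
        return 3;
    if (cufftExecC2C(g_plan, g_dev, g_dev,
                     direction < 0 ? CUFFT_FORWARD : CUFFT_INVERSE) != CUFFT_SUCCESS)
        return 4;
    if (cudaMemcpy(data, g_dev, bytes, cudaMemcpyDeviceToHost) != cudaSuccess)
        return 5;
    return 0;
}

// Fortran-callable wrappers: all arguments arrive by reference.
extern "C" void tempfft_cached_(int *n1, int *n2, int *n3,
                                cufftComplex *data, int *direction)
{
    fft3d_cached(*n1, *n2, *n3, data, *direction);
}

extern "C" void tempfft_release_(void)
{
    fft3d_release();
}
```

From Fortran this would be called as tempfft_cached(n1,n2,n3,cx,direction) on every step and tempfft_release() once at the end of the run. The copies to and from the card still happen on every call, so this only removes the repeated plan and allocation cost.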

However, the bigger issue here (which I’m guessing you can’t get away from) is that you’re moving the entire input and output of the FFT across the system bus on every call. Even if the FFT itself runs 10x faster on the card, getting the data on and off the card might take all of the time saved, plus more. In a graphics rendering situation, the data almost always flows one way (to the card), and lots of large data arrays (like textures) are already loaded on the card.
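One way to check how much of each call goes to bus transfers is to time the three phases separately with CUDA events. The sketch below is only an illustration: time_fft_phases is a hypothetical helper, it transforms a zero-filled array of the requested size, and the first cuFFT execution may include one-time setup cost, so the numbers should be read as rough.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: times the host-to-device copy, the in-place forward FFT,
// and the device-to-host copy for an nx*ny*nz single-precision complex array.
// Times are returned in milliseconds. Returns 0 on success, nonzero on failure.
int time_fft_phases(int nx, int ny, int nz,
                    float *ms_h2d, float *ms_fft, float *ms_d2h)
{
    size_t bytes = sizeof(cufftComplex) * (size_t)nx * ny * nz;
    cufftComplex *h = (cufftComplex *)calloc(1, bytes);   // zero-filled test input
    cufftComplex *d = NULL;
    cufftHandle plan;
    cudaEvent_t t0, t1, t2, t3;
    bool ok = true;

    if (!h) return 1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { free(h); return 2; }
    if (cufftPlan3d(&plan, nz, ny, nx, CUFFT_C2C) != CUFFT_SUCCESS) {
        cudaFree(d); free(h); return 3;
    }
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    // Record an event before and after each phase on the default stream.
    cudaEventRecord(t0, 0);
    ok = ok && cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    cudaEventRecord(t1, 0);
    ok = ok && cufftExecC2C(plan, d, d, CUFFT_FORWARD) == CUFFT_SUCCESS;
    cudaEventRecord(t2, 0);
    ok = ok && cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);   // wait until all recorded work has finished

    cudaEventElapsedTime(ms_h2d, t0, t1);
    cudaEventElapsedTime(ms_fft, t1, t2);
    cudaEventElapsedTime(ms_d2h, t2, t3);

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaEventDestroy(t2); cudaEventDestroy(t3);
    cufftDestroy(plan);
    cudaFree(d);
    free(h);
    return ok ? 0 : 4;
}
```

If the two copy times together are comparable to or larger than the FFT time, a faster FFT on the card will not help much unless more of the surrounding computation also moves to the GPU, so the data can stay there between steps.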
