Bad Performance of CUFFT library? compilation flags for optimizing fft performance

Hi,

I’m registering a very bad performance speed-up for FFT in 3 dimensions.

My problem boils down to fourier transform hundreds of different 64x64x64 voxel volumes in single precision. When I measure the average time, I see that this time (~5 ms) is 3x faster than the time on a Intel Core Duo (14ms).

This small speed-up is very surprising (and disapointing) to me. I’ve been looking for typical computatio times, but I’m still not sure if this poor performance is expected or if I am doing something wrong with the compilation flags. I compile my “hello-world”-like fft timer program as

nvcc -o cufft_time cufft_time.cu -lcudart -lcufft -L/usr/local/cuda/lib64 -I/usr/local/cuda/include

Should I use any special flag for better performance of the cufft?

Thanks in advance,
Daniel

Can you post your code? This would make it easier to guess whats happening. And what GPU are you using?

Ceearem

Hi Ceearem,

right. My GPU is a Tesla C1060.

And this would be the code:

int main()

{

cufftReal *d_data_real, ;

    cufftReal *h_data_real;

    cufftComplex *d_data_complex;

long n=64;

    long N=n*n*n;

    long NbytesReal=N*sizeof(cufftReal);

    long NbytesComplex=N*sizeof(cufftComplex);

cufftHandle plan;

     cufftPlan3d(&plan,n,n,n,CUFFT_R2C);

h_data_real=(cufftReal *)malloc(NbytesReal);

for (long i=0;i<N;i++)

     {

    	 h_data_real[i]=1.0;

     }

HANDLE_ERROR(cudaMalloc((void**)&d_data_real,NbytesReal));

     HANDLE_ERROR(cudaMalloc((void**)&d_data_complex,NbytesComplex));

     HANDLE_ERROR(cudaMemcpy(d_data_real,h_data_real,NbytesReal,cudaMemcpyHostToDevice));

// start timing

    cudaEvent_t start_loop, stop_loop;

    HANDLE_ERROR(cudaEventCreate(&start_loop));

    HANDLE_ERROR(cudaEventCreate(&stop_loop));

    HANDLE_ERROR(cudaEventRecord(start_loop,0));

// main loop 

    for (int i=0; i<100;i++)

    {

    	cufftExecR2C(plan, d_data_real, d_data_complex);

    }

// stop timing functions 

     HANDLE_ERROR(cudaEventRecord(stop_loop,0));

     HANDLE_ERROR(cudaEventSynchronize(stop_loop));

     float time_for_loop;

     HANDLE_ERROR(cudaEventElapsedTime(&time_for_loop,start_loop,stop_loop));

     printf("time for loop: %3.1f ms  \n",time_for_loop);

	

cufftDestroy(plan);

    free(h_data_real);

    cudaFree(d_data_real);

    cudaFree(d_data_complex);

    cudaEventDestroy(start_loop);

    cudaEventDestroy(stop_loop);

 }

Two questions:

(1) How are you measuring the timing of these FFTs?

(2) From the sample code, it appears that you’re using 100 separate (non-batched) FFT’s. Have you tried using a single batched 3D transform instead? The performance of batched transforms is often substantially better, especially when the separate non-batched transforms can’t fill the GPU.

–Cliff

Hi Cliff,

thanks for your answer.

I just edited the code above to show explicitly the used timing functions

Mh, interesting. I thought batching was only possible for 1D (or I never found a sample code with batching in 3D). In any case I’m not so optimistic, as this apparent underperformance repeats systematically all the way for all possible sizes, also comprising sizes that fill the GPU.

I’ve tried different compilation approachs, but results remain bad. I’m wondering if it can be related to the CUDA version 3.2. I’ve read other reports in this direction.

Does this mean you tried batched mode (cufftPlanMany) and it didn’t help? Can you post the version of the code you used for that as well?

There shouldn’t be any particular reason why CUFFT 3.2 would be hurting you here, so I’d like to do a bit more analysis to figure out why you’re seeing the performance you’re seeing.

Thanks,

Cliff

Hi Cliff,

well, I just found the error in my code: one of the subroutines invoked before the actual fft computation included a hidden call to the cudaSetDevice that placed the computation in the wrong GPU. When everything is in place, CUFFT 3.2 works fine. This thread should perhaps be deleted, as it could mislead other users.

Thanks for the update! Glad you found the issue and got it all sorted out.

I’m thinking we can leave it here; if anyone runs across an issue where multiple FFTs in a loop aren’t performing well for them, perhaps this thread will give them the idea to try batching the FFTs to get a performance boost.

–Cliff

Hi,

I am currently dealing with a similar problem. Given a 2D color image which I already batched, I want to perform an FFT via cufft, do some computations and perform an iFFT. This is done iteratively. I noticed a sudden decrease of performance in my computations of the cufft. Even if I boil down my program to a simple FFT toy example, I notice this decrease.

At first, my program shows very nice results.

0.02 ms is very good, however, after a few iterations this happens:

Of course, 9 ms is still adequate, however in my working code, this even may increase to 30 ms compared to 0.05 ms in the initial iterations which is devastating for real-time performance.

And this continues up until the end of my computations.

My question would be, whether there was some error in the code or whether this has to do with the graphics card itself. I am currently working on a GTX 460 in a Linux environment.

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include <time.h>

#include <cufft.h>

#include <sys/time.h>

#include <cutil_inline.h>

#include <complex.h>

#define MAX_THREADS 128

#define MAX_BLOCKS 1024

// perform a rescaling operation

static __global__ void rescale(cufftDoubleReal *d_Dataco, long N)

{

        int x = threadIdx.x + blockIdx.x * blockDim.x;

	int y = threadIdx.y + blockIdx.y * blockDim.y;

	int tid = x + y * blockDim.x * gridDim.x;

	while (tid < N)

	{

          d_Dataco[tid] = d_Dataco[tid]/(N);	

	  d_Dataco[tid+N] = d_Dataco[tid+N]/(N);

	  d_Dataco[tid+2*N] = d_Dataco[tid+2*N]/(N);

	  tid += blockDim.x * gridDim.x; 

	}

   __syncthreads(); 

}

static __global__ void setup(cufftDoubleReal *d_data, long N)

{

        int x = threadIdx.x + blockIdx.x * blockDim.x;

	int y = threadIdx.y + blockIdx.y * blockDim.y;

	int tid = x + y * blockDim.x * gridDim.x;

	while (tid < N)

	{

          d_data[tid] = tid;	

	  tid += blockDim.x * gridDim.x; 

	}

   __syncthreads(); 

}

int main()    

{

   const int dataH = 1024;                  /* double length of height  */

   const int dataW = 1024;                  /* double length of width   */

int l;                                   /* loop variables           */

   int num[2];                              /* vector for cufftManyPlan */

   int max_t  = 200; 

   int N      = dataH*dataW;                /* helper variable for size */

   unsigned int bytes = 3*N*sizeof(double); /* helper for allocation    */

   num[0]=dataH; num[1]=dataW;              /* vector for cufftManyPlan */

// vectors on the host (CPU)

   double *h_ResultGPUco;                   /* output vector (linear)   */

   // vectors on the device (GPU)

   cufftDoubleReal *d_data;                 /* intermediate results     */

   cufftDoubleComplex *d_DataSpectrum;      /* vector in Fourier domain */

   cufftHandle fftPlanFwd, fftPlanInv;      /* cufft Plans for FFT      */

   cufftCompatibility mode = CUFFT_COMPATIBILITY_NATIVE; /* Special Mode*/

h_ResultGPUco  = (double *)malloc(bytes);

   cudaMalloc((void **)&d_data,         3*N                    * sizeof(cufftDoubleReal));	

   cudaMalloc((void **)&d_DataSpectrum, 3*dataH * (dataW/2+1)  * sizeof(cufftDoubleComplex));

setup<<<MAX_BLOCKS,MAX_THREADS>>>(d_data,3*N);

cufftPlanMany(&fftPlanFwd, 2, num, NULL, 1, 0, NULL, 1, 0, CUFFT_D2Z,3);

   cufftPlanMany(&fftPlanInv, 2, num, NULL, 1, 0, NULL, 1, 0, CUFFT_Z2D,3);

cufftSetCompatibilityMode(fftPlanFwd, mode);

   cufftSetCompatibilityMode(fftPlanInv, mode);	

int p=0;

for(l=0;l<max_t;l++)

   {

     p++;

     // timer for inner loop

     unsigned int hTimer;

     cutCreateTimer(&hTimer) ;

     cutStartTimer(hTimer) ; 

cufftExecD2Z(fftPlanFwd, (cufftDoubleReal *)d_data, (cufftDoubleComplex *)d_DataSpectrum);

cufftExecZ2D(fftPlanInv ,(cufftDoubleComplex *)d_DataSpectrum, (cufftDoubleReal *)d_data);

cutStopTimer(hTimer) ;

     double gpuTime = cutGetTimerValue(hTimer);

     //gpuZeit += gpuTime;

     printf("Iteration No %i: %f MPix/s (%f ms)\n", p,(double)dataW * (double)dataH  *1e-6 / (gpuTime * 0.001), gpuTime);       	

     cutDeleteTimer(hTimer) ;

rescale<<<MAX_BLOCKS,MAX_THREADS>>>(d_data,N);

}

// copy result back to CPU

   // Lots of destruction and free-ing

   cufftDestroy(fftPlanInv);

   cufftDestroy(fftPlanFwd);

   cudaFree(d_DataSpectrum);

   cudaFree(d_data);

   free(h_ResultGPUco);

   return 1;

}

Your time measurement is not too correct. cufftExecZ2D returns before the fft procedure finished, like usual kernel call. If you add synchronizing before cutStopTimer(hTimer), the time can change dramatically.

I tried that out by using a cudaThreadSynchronize() call, however the times do change dramatically, but unfortunately in the wrong direction, i.e. 10 ms per cycle :(

It is the correct direction.