Bad Performance of CUFFT library? compilation flags for optimizing fft performance

arabarra · March 7, 2011, 5:34pm

Hi,

I’m registering a very bad performance speed-up for FFT in 3 dimensions.

My problem boils down to fourier transform hundreds of different 64x64x64 voxel volumes in single precision. When I measure the average time, I see that this time (~5 ms) is 3x faster than the time on a Intel Core Duo (14ms).

This small speed-up is very surprising (and disapointing) to me. I’ve been looking for typical computatio times, but I’m still not sure if this poor performance is expected or if I am doing something wrong with the compilation flags. I compile my “hello-world”-like fft timer program as

nvcc -o cufft_time cufft_time.cu -lcudart -lcufft -L/usr/local/cuda/lib64 -I/usr/local/cuda/include

Should I use any special flag for better performance of the cufft?

Thanks in advance,
Daniel

ceearem · March 7, 2011, 5:50pm

Can you post your code? This would make it easier to guess whats happening. And what GPU are you using?

Ceearem

arabarra · March 7, 2011, 6:05pm

Hi Ceearem,

right. My GPU is a Tesla C1060.

And this would be the code:

int main()

{

cufftReal *d_data_real, ;

    cufftReal *h_data_real;

    cufftComplex *d_data_complex;

long n=64;

    long N=n*n*n;

    long NbytesReal=N*sizeof(cufftReal);

    long NbytesComplex=N*sizeof(cufftComplex);

cufftHandle plan;

     cufftPlan3d(&plan,n,n,n,CUFFT_R2C);

h_data_real=(cufftReal *)malloc(NbytesReal);

for (long i=0;i<N;i++)

     {

    	 h_data_real[i]=1.0;

     }

HANDLE_ERROR(cudaMalloc((void**)&d_data_real,NbytesReal));

     HANDLE_ERROR(cudaMalloc((void**)&d_data_complex,NbytesComplex));

     HANDLE_ERROR(cudaMemcpy(d_data_real,h_data_real,NbytesReal,cudaMemcpyHostToDevice));

// start timing

    cudaEvent_t start_loop, stop_loop;

    HANDLE_ERROR(cudaEventCreate(&start_loop));

    HANDLE_ERROR(cudaEventCreate(&stop_loop));

    HANDLE_ERROR(cudaEventRecord(start_loop,0));

// main loop 

    for (int i=0; i<100;i++)

    {

    	cufftExecR2C(plan, d_data_real, d_data_complex);

    }

// stop timing functions 

     HANDLE_ERROR(cudaEventRecord(stop_loop,0));

     HANDLE_ERROR(cudaEventSynchronize(stop_loop));

     float time_for_loop;

     HANDLE_ERROR(cudaEventElapsedTime(&time_for_loop,start_loop,stop_loop));

     printf("time for loop: %3.1f ms  \n",time_for_loop);

	

cufftDestroy(plan);

    free(h_data_real);

    cudaFree(d_data_real);

    cudaFree(d_data_complex);

    cudaEventDestroy(start_loop);

    cudaEventDestroy(stop_loop);

 }

Cliff_Woolley · March 7, 2011, 10:44pm

Two questions:

(1) How are you measuring the timing of these FFTs?

(2) From the sample code, it appears that you’re using 100 separate (non-batched) FFT’s. Have you tried using a single batched 3D transform instead? The performance of batched transforms is often substantially better, especially when the separate non-batched transforms can’t fill the GPU.

–Cliff

arabarra · March 8, 2011, 7:10am

Hi Cliff,

thanks for your answer.

I just edited the code above to show explicitly the used timing functions

Mh, interesting. I thought batching was only possible for 1D (or I never found a sample code with batching in 3D). In any case I’m not so optimistic, as this apparent underperformance repeats systematically all the way for all possible sizes, also comprising sizes that fill the GPU.

I’ve tried different compilation approachs, but results remain bad. I’m wondering if it can be related to the CUDA version 3.2. I’ve read other reports in this direction.

Cliff_Woolley · March 17, 2011, 7:04pm

Does this mean you tried batched mode (cufftPlanMany) and it didn’t help? Can you post the version of the code you used for that as well?

There shouldn’t be any particular reason why CUFFT 3.2 would be hurting you here, so I’d like to do a bit more analysis to figure out why you’re seeing the performance you’re seeing.

Thanks,

Cliff

arabarra · March 21, 2011, 12:04pm

Hi Cliff,

well, I just found the error in my code: one of the subroutines invoked before the actual fft computation included a hidden call to the cudaSetDevice that placed the computation in the wrong GPU. When everything is in place, CUFFT 3.2 works fine. This thread should perhaps be deleted, as it could mislead other users.

Cliff_Woolley · March 21, 2011, 3:21pm

Thanks for the update! Glad you found the issue and got it all sorted out.

I’m thinking we can leave it here; if anyone runs across an issue where multiple FFTs in a loop aren’t performing well for them, perhaps this thread will give them the idea to try batching the FFTs to get a performance boost.

–Cliff

Arvel_Crynyd · February 16, 2012, 6:01pm

Hi,

I am currently dealing with a similar problem. Given a 2D color image which I already batched, I want to perform an FFT via cufft, do some computations and perform an iFFT. This is done iteratively. I noticed a sudden decrease of performance in my computations of the cufft. Even if I boil down my program to a simple FFT toy example, I notice this decrease.

At first, my program shows very nice results.

0.02 ms is very good, however, after a few iterations this happens:

Of course, 9 ms is still adequate, however in my working code, this even may increase to 30 ms compared to 0.05 ms in the initial iterations which is devastating for real-time performance.

And this continues up until the end of my computations.

My question would be, whether there was some error in the code or whether this has to do with the graphics card itself. I am currently working on a GTX 460 in a Linux environment.

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include <time.h>

#include <cufft.h>

#include <sys/time.h>

#include <cutil_inline.h>

#include <complex.h>

#define MAX_THREADS 128

#define MAX_BLOCKS 1024

// perform a rescaling operation

static __global__ void rescale(cufftDoubleReal *d_Dataco, long N)

{

        int x = threadIdx.x + blockIdx.x * blockDim.x;

	int y = threadIdx.y + blockIdx.y * blockDim.y;

	int tid = x + y * blockDim.x * gridDim.x;

	while (tid < N)

	{

          d_Dataco[tid] = d_Dataco[tid]/(N);	

	  d_Dataco[tid+N] = d_Dataco[tid+N]/(N);

	  d_Dataco[tid+2*N] = d_Dataco[tid+2*N]/(N);

	  tid += blockDim.x * gridDim.x; 

	}

   __syncthreads(); 

}

static __global__ void setup(cufftDoubleReal *d_data, long N)

{

        int x = threadIdx.x + blockIdx.x * blockDim.x;

	int y = threadIdx.y + blockIdx.y * blockDim.y;

	int tid = x + y * blockDim.x * gridDim.x;

	while (tid < N)

	{

          d_data[tid] = tid;	

	  tid += blockDim.x * gridDim.x; 

	}

   __syncthreads(); 

}

int main()    

{

   const int dataH = 1024;                  /* double length of height  */

   const int dataW = 1024;                  /* double length of width   */

int l;                                   /* loop variables           */

   int num[2];                              /* vector for cufftManyPlan */

   int max_t  = 200; 

   int N      = dataH*dataW;                /* helper variable for size */

   unsigned int bytes = 3*N*sizeof(double); /* helper for allocation    */

   num[0]=dataH; num[1]=dataW;              /* vector for cufftManyPlan */

// vectors on the host (CPU)

   double *h_ResultGPUco;                   /* output vector (linear)   */

   // vectors on the device (GPU)

   cufftDoubleReal *d_data;                 /* intermediate results     */

   cufftDoubleComplex *d_DataSpectrum;      /* vector in Fourier domain */

   cufftHandle fftPlanFwd, fftPlanInv;      /* cufft Plans for FFT      */

   cufftCompatibility mode = CUFFT_COMPATIBILITY_NATIVE; /* Special Mode*/

h_ResultGPUco  = (double *)malloc(bytes);

   cudaMalloc((void **)&d_data,         3*N                    * sizeof(cufftDoubleReal));	

   cudaMalloc((void **)&d_DataSpectrum, 3*dataH * (dataW/2+1)  * sizeof(cufftDoubleComplex));

setup<<<MAX_BLOCKS,MAX_THREADS>>>(d_data,3*N);

cufftPlanMany(&fftPlanFwd, 2, num, NULL, 1, 0, NULL, 1, 0, CUFFT_D2Z,3);

   cufftPlanMany(&fftPlanInv, 2, num, NULL, 1, 0, NULL, 1, 0, CUFFT_Z2D,3);

cufftSetCompatibilityMode(fftPlanFwd, mode);

   cufftSetCompatibilityMode(fftPlanInv, mode);	

int p=0;

for(l=0;l<max_t;l++)

   {

     p++;

     // timer for inner loop

     unsigned int hTimer;

     cutCreateTimer(&hTimer) ;

     cutStartTimer(hTimer) ; 

cufftExecD2Z(fftPlanFwd, (cufftDoubleReal *)d_data, (cufftDoubleComplex *)d_DataSpectrum);

cufftExecZ2D(fftPlanInv ,(cufftDoubleComplex *)d_DataSpectrum, (cufftDoubleReal *)d_data);

cutStopTimer(hTimer) ;

     double gpuTime = cutGetTimerValue(hTimer);

     //gpuZeit += gpuTime;

     printf("Iteration No %i: %f MPix/s (%f ms)\n", p,(double)dataW * (double)dataH  *1e-6 / (gpuTime * 0.001), gpuTime);       	

     cutDeleteTimer(hTimer) ;

rescale<<<MAX_BLOCKS,MAX_THREADS>>>(d_data,N);

}

// copy result back to CPU

   // Lots of destruction and free-ing

   cufftDestroy(fftPlanInv);

   cufftDestroy(fftPlanFwd);

   cudaFree(d_DataSpectrum);

   cudaFree(d_data);

   free(h_ResultGPUco);

   return 1;

}

L_F · February 16, 2012, 6:56pm

Your time measurement is not too correct. cufftExecZ2D returns before the fft procedure finished, like usual kernel call. If you add synchronizing before cutStopTimer(hTimer), the time can change dramatically.

Arvel_Crynyd · February 17, 2012, 12:43pm

I tried that out by using a cudaThreadSynchronize() call, however the times do change dramatically, but unfortunately in the wrong direction, i.e. 10 ms per cycle :(

L_F · February 17, 2012, 4:10pm

It is the correct direction.

Topic		Replies	Views
cufft doubt comparing r2c and c2c 2D FFTs CUDA Programming and Performance	28	13490	October 27, 2010
How to show CuFFT routines show higher performance than normal MATLAB fft() in terms of time taken. CUDA Programming and Performance	13	3164	July 10, 2014
FFT Computation Timing constraint on GPU. CUDA Programming and Performance	0	707	August 22, 2014
Poor CUFFT Performance? Am I doing something wrong? CUDA Programming and Performance	15	15486	May 4, 2010
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	5043	October 29, 2019
Batched 2D FFT implementation CUDA Programming and Performance	32	48125	January 13, 2011
optimizing FFT calculation? CUDA Programming and Performance	8	6496	May 26, 2008
CUFFT appears to give errors for vectors > 1024 CUDA Programming and Performance	6	8765	April 12, 2007
my speedy FFT 3x faster than CUFFT CUDA Programming and Performance	139	241043	November 16, 2011
CUFFT: calculation time CUDA Programming and Performance	6	2671	April 21, 2012

Bad Performance of CUFFT library? compilation flags for optimizing fft performance

Related topics