optimizing FFT calculation?

Hi, I have this function to benchmark my FFT:

extern "C" void cuFFT(int *src,int len){

   CUT_DEVICE_INIT();

    

	cufftComplex *srcHost;

	cudaMallocHost((void **)&srcHost, len*sizeof(cufftComplex));

	for(int i=0;i<len;i++){

  srcHost[i].x=src[i];

  srcHost[i].y=0;	

	}

	

	cufftHandle plan;

	CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, 1));

	

	cufftComplex *srcDevice;

	CUDA_SAFE_CALL(cudaMalloc((void**)&srcDevice, len*sizeof(cufftComplex)));

	cufftComplex *dstD;

	CUDA_SAFE_CALL(cudaMalloc((void**)&dstD, len*sizeof(cufftComplex)));

	int *dstDevice;

	CUDA_SAFE_CALL(cudaMalloc((void**)&dstDevice, len*sizeof(int)));

	CUDA_SAFE_CALL(cudaMemcpy(srcDevice, srcHost, len*sizeof(cufftComplex), cudaMemcpyHostToDevice));

	dim3 dimBlock(N_THREADS);

	dim3 dimGrid(len/N_THREADS);

	

	for(int i=0;i<100000;i++){

  CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD,CUFFT_FORWARD));

	

  //cuMag<<<dimGrid,dimBlock>>>(dstD,dstDevice,len);

	}

	

	//CUDA_SAFE_CALL(cudaMemcpy(src, dstDevice, len*sizeof(int), cudaMemcpyDeviceToHost));

	CUFFT_SAFE_CALL(cufftDestroy(plan));

	CUDA_SAFE_CALL(cudaFree(dstDevice));

	CUDA_SAFE_CALL(cudaFree(dstD));

	CUDA_SAFE_CALL(cudaFree(srcDevice));

	

	cudaFreeHost(srcHost);

	

}

This is taking more or less the same amount of time as the Intel IPP FFT function, also in a 100000-iteration loop (39 seconds). If I make it calculate the magnitude as well, it takes 3 or 4 seconds more, which is even worse, considering the IPP function is already calculating the magnitude…

Is it normal for the cuFFT function to take as much time as IPP? If not, what am I doing wrong? Is there anything I can optimize? The loop only contains the cufftExec call, so I'm lost as to what to optimize…

I still have to optimize the magnitude function, so I'd be much happier if the FFT time were better than IPP's, or else my study results would be disappointing :(
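For reference, the cuMag kernel is essentially a one-thread-per-element magnitude computation, something along these lines (simplified sketch, not the exact code; the cast to int matches the int* output buffer in the code above):

__global__ void cuMag(cufftComplex *in, int *out, int len){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < len)  // guard in case len is not a multiple of the block size
        out[i] = (int)sqrtf(in[i].x * in[i].x + in[i].y * in[i].y);
}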

Which hardware do you use?

Which FFT length are you calculating?

I'm seeing really good timings with FFTs of up to 8k points over a 1D data length of about 32 MB.

If you compute too few FFT points you might not use the full capability of your device, and so you won't reach better timings.

One other thing: you wrote that you are comparing your timings against the Intel IPP functions. Do you know where I could get performance information about those IPP procedures? I would like to compare my code as well.

thanks,

Vra

Thank you for answering.

I have an 8800 GT.
I'm using a 4096-sample array…
I wrote a C++ application for the IPP testing; I can share it with you if you'd like…
I'm using the same application core for CUDA, I just swapped in the FFT function, obviously.

I think 4k arrays should give good results, no?

My CPU is a P4 3.2 GHz, which makes my results even worse…
I get a small increase in speed if I increase the loop count, but it's nothing extraordinary.

The 4k FFT should show good behaviour…

I primarily use batched FFTs because they create more threads and so should be able to hide memory latencies. In my example the 4k FFT itself takes about 9 ms to execute a batch of 1024 windows (1024 windows × 4096 points × 8 bytes per cufftComplex = 32 MB of complex data). I'm using a Tesla C870 and therefore have 2 more multiprocessors, but I don't think that should account for such a big performance gap.

Have you tried the profiler to get timings for every single kernel call? In your case there might be a lot of overhead from the sheer number of kernel calls you are invoking from the host.

If I do 1000 FFTs of 4096 samples I also get under a second. (I'm not using millisecond measurements, although I could look into using them.)

Thing is, I need the results of the FFT for analysis, and I tried to batch it as 1024 points in 4 batches or 256 points in 16 batches, but that doesn't give correct results…

I don't know, I'm really lost here…

Care to elaborate?

Calling a 4096-point FFT in a loop is probably not taxing the card very much. See if this code improves your benchmark. I have not tested it, so hopefully I added numBatch everywhere it was required.

extern "C" void cuFFT(int *src,int len,int numBatch){

 //numBatch represents the number of times you want to compute a len-point

 //fft on consecutive vectors stored in *src

   CUT_DEVICE_INIT();

   

cufftComplex *srcHost;

cudaMallocHost((void **)&srcHost, numBatch*len*sizeof(cufftComplex));

for(int i=0;i<len*numBatch;i++){

 srcHost[i].x=src[i];

 srcHost[i].y=0;

}

cufftHandle plan;

CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, numBatch));

cufftComplex *srcDevice;

CUDA_SAFE_CALL(cudaMalloc((void**)&srcDevice, numBatch*len*sizeof(cufftComplex)));

cufftComplex *dstD;

CUDA_SAFE_CALL(cudaMalloc((void**)&dstD,numBatch* len*sizeof(cufftComplex)));

int *dstDevice;

CUDA_SAFE_CALL(cudaMalloc((void**)&dstDevice, numBatch*len*sizeof(int)));

CUDA_SAFE_CALL(cudaMemcpy(srcDevice, srcHost, numBatch*len*sizeof(cufftComplex), cudaMemcpyHostToDevice));

dim3 dimBlock(N_THREADS);

dim3 dimGrid((len*numBatch)/N_THREADS);

CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD,CUFFT_FORWARD));

//cuMag<<<dimGrid,dimBlock>>>(dstD,dstDevice,len);

//CUDA_SAFE_CALL(cudaMemcpy(src, dstDevice, numBatch*len*sizeof(int), cudaMemcpyDeviceToHost));

CUFFT_SAFE_CALL(cufftDestroy(plan));

CUDA_SAFE_CALL(cudaFree(dstDevice));

CUDA_SAFE_CALL(cudaFree(dstD));

CUDA_SAFE_CALL(cudaFree(srcDevice));

cudaFreeHost(srcHost);

}
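With this signature, calling cuFFT(src, 4096, 1024) should compute 1024 independent 4096-point FFTs over consecutive windows of src in a single cufftExecC2C call, instead of 1024 separate launches.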

The batching you mentioned doesn't give the correct results because batching means that you apply the same FFT independently to each window in a vector of windows; splitting one long transform into a batch of shorter ones computes something different.

So if you have to do more than one FFT on a chunk of data, it can be useful (in my case it is) to reorder the data into one big 1D vector and create an FFT plan for the whole vector.

So if you use some kind of:

cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, 1));
for(int i = 0; i < 100000; i++)
{
    CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD));
    srcDevice += len;
    dstD += len;
}

it would be much better to use:

cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, 100000));
CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD));

As for the profiler: there is a sticky thread in this forum, “CUDA Visual Profiler”.

The profiler shows runtime information for every single kernel call (but you should not rely too much on the timestamps, as they are buggy in the current version of CUDA; the bug will be fixed with the release of 2.0).

For other independent timings I looked at the bandwidthTest example; they use the CUDA on-board timers there.
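A minimal sketch of that kind of timing with CUDA events (standard event API; this assumes the plan/srcDevice/dstD variables from your code above, and is not the exact bandwidthTest code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for(int i = 0; i < 1000; i++)
    cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait for the GPU to finish before reading the timer

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("1000 FFTs: %f ms\n", ms);        // needs <stdio.h>

cudaEventDestroy(start);
cudaEventDestroy(stop);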

My objective is to do only one FFT.
I made the loop just to test timings, because a single 4k FFT takes less than a second…
I suppose then that my code is correct and should give good timings for an individual FFT?

I wondered if batching would increase the speed of one FFT, but it gives wrong results, so I can't use it…

Maybe a bigger FFT would give better timing results?

I'm open to suggestions.

Hello!

I'm trying to run the maximum number of 1D, 1024-point FFTs I can in one program.

Right now the maximum I reach is 95440 1k FFTs (thanks to your tips), computed in about 1826 ms, which is longer than the IPP 1k FFT…

When I try to run more than 95440 FFTs, I receive a “CUFFT_INVALID_VALUE”…
Any idea about that?
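For what it's worth, a back-of-the-envelope check (my own guess, nothing confirmed): 95440 × 1024 points × 8 bytes per cufftComplex is roughly 746 MB per buffer, so separate input and output buffers alone come to nearly 1.5 GB before CUFFT allocates its own workspace. Could I simply be hitting the card's memory limit? Checking the free memory on the device and the return codes of the cudaMalloc calls might rule that out.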