optimizing FFT calculation?

Hi, I have this function to benchmark my FFT:

extern "C" void cuFFT(int *src,int len){

   CUT_DEVICE_INIT();

    

	cufftComplex *srcHost;

	cudaMallocHost((void **)&srcHost, len*sizeof(cufftComplex));

	for(int i=0;i<len;i++){

  srcHost[i].x=src[i];

  srcHost[i].y=0;	

	}

	

	cufftHandle plan;

	CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, 1));

	

	cufftComplex *srcDevice;

	CUDA_SAFE_CALL(cudaMalloc((void**)&srcDevice, len*sizeof(cufftComplex)));

	cufftComplex *dstD;

	CUDA_SAFE_CALL(cudaMalloc((void**)&dstD, len*sizeof(cufftComplex)));

	int *dstDevice;

	CUDA_SAFE_CALL(cudaMalloc((void**)&dstDevice, len*sizeof(int)));

	CUDA_SAFE_CALL(cudaMemcpy(srcDevice, srcHost, len*sizeof(cufftComplex), cudaMemcpyHostToDevice));

	dim3 dimBlock(N_THREADS);

	dim3 dimGrid(len/N_THREADS);

	

	for(int i=0;i<100000;i++){

  CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD,CUFFT_FORWARD));

	

  //cuMag<<<dimGrid,dimBlock>>>(dstD,dstDevice,len);

	}

	

	//CUDA_SAFE_CALL(cudaMemcpy(src, dstDevice, len*sizeof(int), cudaMemcpyDeviceToHost));

	CUFFT_SAFE_CALL(cufftDestroy(plan));

	CUDA_SAFE_CALL(cudaFree(dstDevice));

	CUDA_SAFE_CALL(cudaFree(dstD));

	CUDA_SAFE_CALL(cudaFree(srcDevice));

	

	cudaFreeHost(srcHost);

	

}

This is taking more or less the same amount of time as the Intel IPP FFT function, also in a 100000-iteration loop (39 seconds). If I make it calculate the magnitude as well, it takes 3 or 4 seconds more, which is even worse, considering the IPP function is already calculating the magnitude…

Is it normal for the cuFFT function to take as much time as IPP? If not, what am I doing wrong? Is there anything I can optimize? The loop only contains the cufftExec call, so I'm lost as to what to optimize…

I still have to optimize the magnitude function, so I'd be much happier if the FFT time were better than IPP's, or else my study results would be disappointing :(
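For reference, the cuMag kernel is essentially a one-thread-per-element magnitude computation, something along these lines (simplified sketch, not the exact code; the cast to int matches the int* output buffer in the code above):

__global__ void cuMag(cufftComplex *in, int *out, int len){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < len)  // guard in case len is not a multiple of the block size
        out[i] = (int)sqrtf(in[i].x * in[i].x + in[i].y * in[i].y);
}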

Which hardware do you use?

Which FFT length are you calculating?

I'm seeing really good timings with FFTs of up to 8k points over a 1D data length of about 32 MB.

If you compute too few FFT points you might not use the full capability of your device, and so you won't reach better timings.

One other thing: you wrote that you are comparing your timings against the Intel IPP functions. Do you know where I could get performance information about those IPP procedures? I would like to compare my code as well.

thanks,

Vra

Thank you for answering.

I have an 8800 GT.
I'm using a 4096-sample array…
I wrote a C++ application for the IPP testing; I can share it with you if you'd like…
I'm using the same application core for CUDA, I just swapped in the FFT function, obviously.

I think 4k arrays should give good results, no?

My CPU is a P4 3.2 GHz, which makes my results even worse…
I get a small increase in speed if I increase the loop count, but it's nothing extraordinary.

The 4k FFT should show good behaviour…

I primarily use batched FFTs because they create more threads and so should be able to hide memory latencies. In my example the 4k FFT itself takes about 9 ms to execute a batch of 1024 windows (1024 windows × 4096 points × 8 bytes per cufftComplex = 32 MB of complex data). I'm using a Tesla C870 and therefore have 2 more multiprocessors, but I don't think that should account for such a big performance gap.

Have you tried the profiler to get timings for every single kernel call? In your case there might be a lot of overhead from the sheer number of kernel calls you are invoking from the host.

If I do 1000 FFTs of 4096 samples I also get under a second. (I'm not using millisecond measurements, although I could look into using them.)

Thing is, I need the results of the FFT for analysis, and I tried to batch it as 1024 points in 4 batches or 256 points in 16 batches, but that doesn't give correct results…

I don't know, I'm really lost here…

Care to elaborate?

Calling a 4096-point FFT in a loop is probably not taxing the card very much. See if this code improves your benchmark. I have not tested it, so hopefully I added numBatch everywhere it was required.

extern "C" void cuFFT(int *src,int len,int numBatch){

 //numBatch represents the number of times you want to compute a len-point

 //fft on consecutive vectors stored in *src

   CUT_DEVICE_INIT();

   

cufftComplex *srcHost;

cudaMallocHost((void **)&srcHost, numBatch*len*sizeof(cufftComplex));

for(int i=0;i<len*numBatch;i++){

 srcHost[i].x=src[i];

 srcHost[i].y=0;

}

cufftHandle plan;

CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, numBatch));

cufftComplex *srcDevice;

CUDA_SAFE_CALL(cudaMalloc((void**)&srcDevice, numBatch*len*sizeof(cufftComplex)));

cufftComplex *dstD;

CUDA_SAFE_CALL(cudaMalloc((void**)&dstD,numBatch* len*sizeof(cufftComplex)));

int *dstDevice;

CUDA_SAFE_CALL(cudaMalloc((void**)&dstDevice, numBatch*len*sizeof(int)));

CUDA_SAFE_CALL(cudaMemcpy(srcDevice, srcHost, numBatch*len*sizeof(cufftComplex), cudaMemcpyHostToDevice));

dim3 dimBlock(N_THREADS);

dim3 dimGrid((len*numBatch)/N_THREADS);

CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD,CUFFT_FORWARD));

//cuMag<<<dimGrid,dimBlock>>>(dstD,dstDevice,len);

//CUDA_SAFE_CALL(cudaMemcpy(src, dstDevice, numBatch*len*sizeof(int), cudaMemcpyDeviceToHost));

CUFFT_SAFE_CALL(cufftDestroy(plan));

CUDA_SAFE_CALL(cudaFree(dstDevice));

CUDA_SAFE_CALL(cudaFree(dstD));

CUDA_SAFE_CALL(cudaFree(srcDevice));

cudaFreeHost(srcHost);

}
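With this signature, calling cuFFT(src, 4096, 1024) should compute 1024 independent 4096-point FFTs over consecutive windows of src in a single cufftExecC2C call, instead of 1024 separate launches.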

The batching you mentioned doesn't give the correct results because batching means that you apply the same FFT independently to each window in a vector of windows; splitting one long transform into a batch of shorter ones computes something different.

So if you have to do more than one FFT on a chunk of data, it can be useful (in my case it is) to reorder the data into one big 1D vector and create an FFT plan for the whole vector.

So if you use some kind of:

cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, 1));
for(int i = 0; i < 100000; i++)
{
    CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD));
    srcDevice += len;
    dstD += len;
}

it would be much better to use:

cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, 100000));
CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD));

As for the profiler: there is a sticky thread in this forum, “CUDA Visual Profiler”.

The profiler shows runtime information for every single kernel call (but you should not rely too much on the timestamps, as they are buggy in the current version of CUDA; the bug will be fixed with the release of 2.0).

For other independent timings I looked at the bandwidthTest example; they use the CUDA on-board timers there.
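A minimal sketch of that kind of timing with CUDA events (standard event API; this assumes the plan/srcDevice/dstD variables from your code above, and is not the exact bandwidthTest code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for(int i = 0; i < 1000; i++)
    cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait for the GPU to finish before reading the timer

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("1000 FFTs: %f ms\n", ms);        // needs <stdio.h>

cudaEventDestroy(start);
cudaEventDestroy(stop);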

My objective is to do only one FFT.
I made the loop just to test timings, because a single 4k FFT takes less than a second…
I suppose then that my code is correct and should give good timings for an individual FFT?

I wondered if batching would increase the speed of one FFT, but it gives wrong results, so I can't use it…

Maybe a bigger FFT would give better timing results?

I'm open to suggestions.

Hello!

I'm trying to run the maximum number of 1D, 1024-point FFTs I can in one program.

Right now the maximum I reach is 95440 1k FFTs (thanks to your tips), computed in about 1826 ms, which is longer than the IPP 1k FFT…

When I try to run more than 95440 FFTs, I receive a “CUFFT_INVALID_VALUE”…
Any idea about that?
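For what it's worth, a back-of-the-envelope check (my own guess, nothing confirmed): 95440 × 1024 points × 8 bytes per cufftComplex is roughly 746 MB per buffer, so separate input and output buffers alone come to nearly 1.5 GB before CUFFT allocates its own workspace. Could I simply be hitting the card's memory limit? Checking the free memory on the device and the return codes of the cudaMalloc calls might rule that out.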