This is taking more or less the same amount of time as the Intel IPP FFT function, also in a 100,000-iteration loop (39 seconds). If I add the magnitude calculation it takes 3 or 4 seconds more, which is even worse, considering the IPP function is already calculating the magnitude…
Is it normal for the cuFFT function to take as much time as IPP? If not, what am I doing wrong? Is there anything I can optimize? The loop only contains the fftexec call, so I'm lost on what to optimize…
I still have to optimize the magnitude function, so I'd be much happier if the FFT time were better than IPP's, or else my study results would be disappointing :(
I'm experiencing really good timings with FFTs up to 8k points and a 1D length of about 32 MB of data.
If you calculate too few FFT points you might not use the whole capability of your device and so not reach better timings.
One other thing: you wrote that you are comparing your timings to the Intel IPP functions. Do you know where I could get performance information about those IPP procedures? I would like to compare my code too.
I have an 8800 GT.
I'm using a 4096-sample array…
I have written a C++ application for IPP testing; I can share it with you if you'd like…
I'm using the same application core for CUDA, I just changed the FFT function, obviously.
I think 4K arrays should give good results, no?
My CPU is a P4 3.2 GHz, which makes my results even worse…
I get a small speed increase if I increase the loop count, but it's nothing extraordinary.
I primarily use batched FFTs, since they create more threads and should be able to hide memory latencies. In my example the 4k FFT itself takes about 9 ms to execute a batch of 1024 windows (= 32 MB of complex data). I'm using a Tesla C870 and therefore have 2 more multiprocessors, but I don't think that's the reason for this big performance gap.
Have you tried the profiler to get timings for every single kernel call? In your case there might be a lot of overhead because of the many kernel calls you are invoking from the host.
Calling a 4096-point FFT in a loop is probably not taxing the card very much. See if this code improves your benchmark. I have not tested it, so hopefully I added numBatch everywhere it was required.
extern "C" void cuFFT(int *src, int len, int numBatch)
{
    // numBatch is the number of len-point FFTs to compute on
    // consecutive vectors stored in *src
    CUT_DEVICE_INIT();

    // stage the integer input as complex data in page-locked host memory
    cufftComplex *srcHost;
    cudaMallocHost((void **)&srcHost, numBatch * len * sizeof(cufftComplex));
    for (int i = 0; i < len * numBatch; i++) {
        srcHost[i].x = src[i];
        srcHost[i].y = 0;
    }

    // one plan covers the whole batch
    cufftHandle plan;
    CUFFT_SAFE_CALL(cufftPlan1d(&plan, len, CUFFT_C2C, numBatch));

    cufftComplex *srcDevice;
    CUDA_SAFE_CALL(cudaMalloc((void **)&srcDevice, numBatch * len * sizeof(cufftComplex)));
    cufftComplex *dstD;
    CUDA_SAFE_CALL(cudaMalloc((void **)&dstD, numBatch * len * sizeof(cufftComplex)));
    int *dstDevice;
    CUDA_SAFE_CALL(cudaMalloc((void **)&dstDevice, numBatch * len * sizeof(int)));

    CUDA_SAFE_CALL(cudaMemcpy(srcDevice, srcHost, numBatch * len * sizeof(cufftComplex), cudaMemcpyHostToDevice));

    // launch configuration for the (optional) magnitude kernel below
    dim3 dimBlock(N_THREADS);
    dim3 dimGrid((len * numBatch) / N_THREADS);

    CUFFT_SAFE_CALL(cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD));
    //cuMag<<<dimGrid, dimBlock>>>(dstD, dstDevice, len);
    //CUDA_SAFE_CALL(cudaMemcpy(src, dstDevice, numBatch*len*sizeof(int), cudaMemcpyDeviceToHost));

    CUFFT_SAFE_CALL(cufftDestroy(plan));
    CUDA_SAFE_CALL(cudaFree(dstDevice));
    CUDA_SAFE_CALL(cudaFree(dstD));
    CUDA_SAFE_CALL(cudaFree(srcDevice));
    cudaFreeHost(srcHost);
}
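The commented-out cuMag call would need a kernel roughly along these lines (just a sketch of what I'd expect it to look like; the rounding to int and the exact signature are my assumptions, matching the launch above, which covers exactly len*numBatch elements):

```cuda
// Hypothetical magnitude kernel for the cuMag<<<dimGrid, dimBlock>>> launch
// above: one thread per FFT bin, writing |z| rounded down to int.
// No bounds check is needed here only because the grid above is sized to
// cover exactly len * numBatch elements.
__global__ void cuMag(cufftComplex *in, int *out, int len)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = (int)sqrtf(in[idx].x * in[idx].x + in[idx].y * in[idx].y);
}
```

If the grid ever stops being an exact multiple of the data size, add an `if (idx < len * numBatch)` guard before the store.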
The batching you mentioned doesn't give the correct results, because batching means you do the same FFT over a vector of windows.
So if you have to do more than one FFT on a chunk of data, it can be useful (in my case it is) to reorder the data into one big 1D vector and create one FFT plan for the whole vector.
As for the profiler: there is a sticky thread in this forum, "CUDA Visual Profiler".
The profiler shows runtime information for every single kernel call (but you should not rely too much on the timestamps, as they are buggy in the current version of CUDA; the bug will be fixed with the release of 2.0).
For other independent timings I looked at the bandwidthTest example; they use CUDA onboard timers there.
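Besides the cutil timers used in bandwidthTest, the CUDA event API also gives device-side timings around a single call; something like this around the FFT execution (untested sketch; plan and buffer names borrowed from the code earlier in the thread):

```cuda
// Time one cufftExecC2C call with CUDA events (device-side timing,
// unaffected by host-side loop overhead).
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cufftExecC2C(plan, srcDevice, dstD, CUFFT_FORWARD);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the FFT has actually finished

cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Without the cudaEventSynchronize, the measurement would only capture the asynchronous launch, not the FFT itself.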
My objective is to do only one FFT.
I made the loop just to measure times, because a single 4k FFT takes less than a second…
So I suppose my code is correct and should give good timings for an individual FFT?
I wondered whether batching would increase the speed of one FFT, but it gives wrong results, so I can't use it…
Maybe if I try a bigger FFT it will give better times?