my speedy FFT 3x faster than CUFFT

N = the number of FFT points

What is the batch parameter? Is it the number of FFTs? Can you create an FFT library?

This is just a benchmark and comparison of two different FFT algorithms. There is no windowing, though it is easy enough to implement if you desire.

Batch is the number of FFTs executed. The FFTs are batched to group the memory into one transfer and to reduce the overhead associated with kernel launch. You have to be careful when comparing numbers from different benchmarks: in some cases the memory transfer is included, in others it’s not.
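To illustrate what batching means for memory, here is a minimal C sketch. It assumes the batch is stored as one contiguous array in which FFT number b occupies the N consecutive complex samples starting at b*N. The `complex32` type and both helper names are hypothetical and not taken from the posted code, whose actual data ordering may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUDA's float2: one single-precision complex sample. */
typedef struct { float x, y; } complex32;

/* Bytes needed to hold `batch` FFTs of `n` points each in one contiguous
   buffer, so that the whole batch can be moved in a single transfer. */
size_t batch_bytes(size_t n, size_t batch)
{
    return n * batch * sizeof(complex32);
}

/* Pointer to the first sample of FFT number `b` inside the batched buffer
   (assumed layout: signals stored back to back). */
complex32 *batch_signal(complex32 *data, size_t n, size_t batch, size_t b)
{
    assert(b < batch);
    return data + b * n;
}
```

With this layout, one allocation and one copy of `batch_bytes(N, batch)` bytes cover all signals, instead of `batch` separate copies.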

You can easily reuse this implementation: just download the code vvolkov provided in the first post. In some cases it requires an unusual ordering of the input data, which may or may not impose additional load depending on your application. You can also use the CUFFT library provided by NVIDIA (the code in the first post also calls CUFFT for comparison).

There’s a Hann window and a Hamming window (not sure which one you mean, probably the latter). Either way, you could easily write another small kernel to apply the windowing function of your choice to your input data before you run the FFT on it.

The Wikipedia article you link to says that “Hanning window” is sometimes used to refer to the Hann window (for maximum confusion).

Ah, didn’t even see that :">

It seems I am unable to run your FFT on Windows in MATLAB.
First I tried to compile with MATLAB. My usage was:

nvmex -f nvmexopts.bat -IG:\cuda\include -LG:\cuda\lib -lcufft -lcudart
It reports:
LINK : error LNK2001: unresolved external symbol mexFunction
G:\DOCUME~1\ADMINI~1.EXP\LOCALS~1\Temp\mex_100\templib.x : fatal error LNK1120: 1 unresolved externals
E:\MATLAB\R2008A\BIN\NVMEX.PL: Error: Link of ‘FFT512.mexw32’ failed.

So I loaded your project file into VC8 and built it.
Now I tried to link the obj file with mex, using:

mex FFT512.obj
It prints nearly the same:
LINK : error LNK2001: unresolved external symbol mexFunction
G:\DOCUME~1\ADMINI~1.EXP\LOCALS~1\Temp\mex_VQu1VM\templib.x : fatal error LNK1120: 1 unresolved externals
E:\MATLAB\R2008A\BIN\MEX.PL: Error: Link of ‘FFT512.mexw32’ failed.

Did you run mex -setup? This error could be caused by an invalid configuration of mex.

You need to link against the MATLAB library. I think adding -lmex will do that (not sure though, and it should not be necessary).

Argh, I totally missed that I need to make a MEX file out of the C code… :unsure:
Now I am trying to use mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) as a kind of main.
But I am very unsure about the usage of the function itself.
I will try FFT512_device<<< grid2D(1), 64 >>>( myArrayOnGeForce );

Well, I can’t work out how to handle the second dimension…
I only want to do a 1D FFT, but it only accepts float2. I don’t know how to say: only a 1D FFT, please :wacko:
I attached the modified

Is it possible to use it for 2D and 3D? 4D? :rolleyes:

It is for 2D, isn’t it?

Am I such a noob that I don’t even get this? :">

float2 is for complex numbers; it has nothing to do with 1D or 2D FFTs.
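To make that concrete, here is a minimal C sketch. It assumes the usual CUDA convention that the `.x` field of a float2 holds the real part and `.y` the imaginary part. A real-valued 1D signal can then be fed to a complex FFT by setting every imaginary part to zero. `complex32` and `pack_real` are illustrative names, not part of the posted code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUDA's float2: x = real part, y = imaginary part. */
typedef struct { float x, y; } complex32;

/* Copy n real samples into complex form: real part = sample, imag = 0. */
void pack_real(const float *in, complex32 *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i].x = in[i];
        out[i].y = 0.0f;
    }
}
```

The array stays one-dimensional; each element simply carries two floats instead of one.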

My problem is that the host transpose() function is needed to obtain the output in the same format as CUFFT, and using it loses the gain obtained from the speedy Volkov FFT (in my application I would have to transfer the data from device to host, transpose it, and transfer it back from host to device for more processing). How can I obtain a device version of the transpose function? Is it enough to replace malloc with cudaMalloc and memcpy with cudaMemcpy, or would a kernel be the solution? Can someone help me? Thanks!

But why do you need the same format as in CUFFT? What kind of processing do you do after FFT?

I process an image in different patches. For each patch I transform the data into the frequency domain using the FFT in batch mode, one FFT for each column of the patch, and apply a data correction that operates on columns. Then a time-domain adapted filter for each column is transformed into the frequency domain and multiplied with each column of my sub-image. An IFFT is used to bring the result back into the time domain. At the end, all patches together restore the original image. During the processing I use row-major indexing to access the data in the array.

So, you first do an FFT, then some kind of transformation, then an IFFT. In many similar cases the transformation is elementwise scaling, i.e. each data entry is independently multiplied by some number. Is that your case?

You do know that there is a transpose example in the SDK?
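For orientation, the index mapping that any transpose has to perform can be written as a CPU reference in plain C. This is only a sketch of the mapping, not the SDK sample and not the transpose() from the posted code; `transpose_ref` is a hypothetical name and row-major storage is assumed. In a device version, each iteration of the two loops would become one thread of a kernel working on arrays in device memory, so no round trip through the host is needed. Swapping malloc for cudaMalloc alone would not be enough, because host code cannot dereference device pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for CUDA's float2: x = real part, y = imaginary part. */
typedef struct { float x, y; } complex32;

/* Out-of-place transpose of a rows x cols row-major matrix.
   Element (r, c) of `in` ends up at element (c, r) of `out`,
   which is a cols x rows row-major matrix. */
void transpose_ref(const complex32 *in, complex32 *out,
                   size_t rows, size_t cols)
{
    assert(in != out);  /* in-place transposition is not supported here */
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            out[c * rows + r] = in[r * cols + c];
        }
    }
}
```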

No, in my transformation I focus energy from different points, so the data are modified using part of the energy in the other points along a column. I think my problem is both in the single-patch processing and in the image restoration. In the processing I could try to swap rows and columns in the data correction, but what about the patches? Each patch is written to a file at the end of the processing.

Yes, I know. I’ll try it on Monday too.

Can you create a DLL from your prototype FFT?

Is it possible to estimate FFTs per second from the flops figure for a given window size?
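A rough conversion is possible if the flops figure was computed with the common benchmark convention of 5 N log2 N floating-point operations per complex FFT of N points. That convention is an assumption here, so check how the benchmark in question counts operations and whether memory transfers are included in its timing. A minimal C sketch, with hypothetical function names and N restricted to powers of two:

```c
#include <assert.h>

/* log2 of n, for n a power of two. */
unsigned log2_pow2(unsigned long n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        ++k;
    }
    return k;
}

/* Nominal flop count of one complex n-point FFT: 5 * n * log2(n). */
double fft_flops(unsigned long n)
{
    return 5.0 * (double)n * (double)log2_pow2(n);
}

/* FFTs per second implied by a Gflop/s figure under that convention. */
double ffts_per_second(double gflops, unsigned long n)
{
    assert(n >= 2);  /* n = 1 would give a zero flop count */
    return gflops * 1e9 / fft_flops(n);
}
```

Under that assumption a 512-point FFT counts as 5 * 512 * 9 = 23,040 flops, so a reported 23.04 Gflop/s would correspond to about one million 512-point FFTs per second.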


Using the IFFT-011209 version, I had to add -lcudart to the Makefile in order to get it to link on Mac OS X. I just thought this might be helpful to others trying to use this code.