I’ve been struggling to get batched 1D FFTs to work using CUFFT. The relevant code looks like this (i.e., ignoring memory allocation, data transfers to/from the device, etc.):
cufftPlan1d(&fftPlan,fftSize,CUFFT_R2C,batch);
cufftExecR2C(fftPlan,(cufftReal *) bufIn, (cufftComplex *) bufOut);
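For reference, here is a minimal sketch of the buffer sizes those two calls expect, assuming an out-of-place transform with the contiguous layout that cufftPlan1d() uses. Each real-to-complex transform of length fftSize produces fftSize/2 + 1 complex values, so bufIn needs fftSize*batch reals and bufOut needs (fftSize/2 + 1)*batch complex values. The helper names are hypothetical and are not part of CUFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not part of CUFFT): element counts for a batched,
   out-of-place 1D R2C transform with contiguous batches. */

/* Number of real input samples: one block of fftSize reals per batch. */
size_t r2c_input_elems(size_t fftSize, size_t batch) {
    assert(fftSize > 0 && batch > 0);
    return fftSize * batch;
}

/* Number of complex output values: fftSize/2 + 1 bins per batch. */
size_t r2c_output_elems(size_t fftSize, size_t batch) {
    assert(fftSize > 0 && batch > 0);
    return (fftSize / 2 + 1) * batch;
}
```

With these, the device allocations would be r2c_input_elems(fftSize, batch) * sizeof(cufftReal) bytes for bufIn and r2c_output_elems(fftSize, batch) * sizeof(cufftComplex) bytes for bufOut.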
I was having trouble getting it to work if batch > 1 (so batch processing wasn't working). Finally, after lots of searching on the web, I found this paper:
www.cscamm.umd.edu/publications/GPUturb_CS-08-33.pdf
(High-performance Computation and Visualization of Plasma Turbulence on Graphics Processors,
by Stantchev, Juba, Dorland, and Varshney).
Deep in that paper, it says that if you use batch > 1 then the FFT size must be <= 16k.
This little factoid would be good to put right in the CUFFT Library documentation for cufftPlan1d(). Furthermore, if you try to set batch > 1 for an fftSize > 16k, then cufftPlan1d() should return some sort of error code. In CUDA v2.0 it's returning CUFFT_SUCCESS.
Whoops, wrote too soon: I’m still having problems. If I do a 16k FFT size, then I can get 3 FFTs done in a batch, but the 4th is bad. If I do an 8k FFT size I can get 6 done, but the 7th is bad. A 4k FFT size is good up through 12, but the 13th is bad. So perhaps there’s some relationship between fftSize*batch and the size of one of the memory banks?
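To make the pattern in those three cases concrete, here is a small sketch that multiplies each FFT size by the number of transforms that came out good. It assumes 16k, 8k and 4k mean 16384, 8192 and 4096 samples; the helper name is hypothetical. It only restates the observation and does not identify a cause.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total real samples covered by the transforms
   that produced good output. */
size_t good_samples(size_t fftSize, size_t goodBatches) {
    assert(fftSize > 0 && goodBatches > 0);
    return fftSize * goodBatches;
}
```

Under that assumption, good_samples(16384, 3), good_samples(8192, 6) and good_samples(4096, 12) all come to 49152 samples, so the output goes bad at the same total sample count in each case.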
EDIT #2: PLEASE ignore the above…the max FFT size for batch>1 was a red herring…sorry. I simplified my code and got fftsize>16k with a batch>1.