(Undocumented?) limit of fftSize for 1D batch FFTs: problem w/ CUFFT docs and w/ cufftPlan1d

I’ve been struggling to get batched 1D FFTs to work using CUFFT. The relevant code looks like this (i.e. ignoring memory allocation, data transfers to/from the device, etc.):

cufftPlan1d(&fftPlan,fftSize,CUFFT_R2C,batch);
cufftExecR2C(fftPlan,(cufftReal *) bufIn, (cufftComplex *) bufOut);
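
Since the memory allocation is left out above, here is a rough sketch of the buffer sizes a batched R2C transform needs. It relies on the fact that a real-to-complex transform of length fftSize produces fftSize/2 + 1 complex outputs per transform. The helper names are my own (hypothetical, not part of CUFFT), and it assumes cufftReal is a 4-byte float and cufftComplex is a pair of floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of CUFFT: element and byte counts
 * for a batched 1D real-to-complex (R2C) transform. */

/* Real input samples: fftSize samples for each transform in the batch. */
size_t r2c_input_elems(size_t fft_size, size_t batch)
{
    assert(fft_size > 0 && batch > 0);
    return fft_size * batch;
}

/* Complex output elements: fftSize/2 + 1 for each transform in the batch. */
size_t r2c_output_elems(size_t fft_size, size_t batch)
{
    assert(fft_size > 0 && batch > 0);
    return (fft_size / 2 + 1) * batch;
}

/* Input bytes, assuming cufftReal is a float. */
size_t r2c_input_bytes(size_t fft_size, size_t batch)
{
    return r2c_input_elems(fft_size, batch) * sizeof(float);
}

/* Output bytes, assuming cufftComplex is two floats (real, imaginary). */
size_t r2c_output_bytes(size_t fft_size, size_t batch)
{
    return r2c_output_elems(fft_size, batch) * 2 * sizeof(float);
}
```

These byte counts are what bufIn and bufOut would have to hold on the device. Note the output is not simply half the input: for fftSize = 16384 and batch = 3 it comes to 24579 complex values rather than 24576.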

I was having trouble getting it to work if batch > 1 (so batch processing wasn’t working). Finally, after lots of searching on the web, I found this paper:

www.cscamm.umd.edu/publications/GPUturb_CS-08-33.pdf

(High-performance Computation and Visualization of Plasma Turbulence on Graphics Processors,
by Stantchev, Juba, Dorland, and Varshney).

Deep in that paper, it says that if you use batch>1 then the fftsize must be <= 16k.

This little factoid would be good to put right in the CUFFT_Library documentation for cufftPlan1d(). Furthermore, if you try to set a batch > 1 for an fftSize > 16k, then cufftPlan1d() should return some sort of error code. In CUDA v2.0 it’s returning CUFFT_SUCCESS.

Whoops, wrote too soon: I’m still having problems. If I do a 16k FFT size, then I can get 3 FFTs done in batch, but the 4th is bad. If I do an 8k FFT size I can get 6 done, but the 7th is bad. A 4k FFT size is good up through 12, but the 13th is bad. So perhaps there’s some relationship between fftSize*batch and the size of one of the memory banks?
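
Working the numbers (assuming 16k, 8k and 4k mean 16384, 8192 and 4096 samples): the last good batch count in all three cases gives the same product, fftSize*batch = 49152. A small sketch of that arithmetic follows. The function names and the idea of a fixed sample budget are only illustrative; nothing here establishes what the actual limit is or where it comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Total real input samples handled by one batched call. */
size_t total_batch_samples(size_t fft_size, size_t batch)
{
    return fft_size * batch;
}

/* Largest batch count that fits in a hypothetical fixed budget of samples. */
size_t max_batch_for_budget(size_t fft_size, size_t sample_budget)
{
    assert(fft_size > 0);
    return sample_budget / fft_size;
}
```

With a budget of 49152 samples, max_batch_for_budget gives 3, 6 and 12 for the three FFT sizes, matching the observed cutoffs. That pattern would fit a fixed-size limit of some kind, for example a buffer sized for that many samples.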

EDIT #2: PLEASE ignore the above…the max FFT size for batch>1 was a red herring…sorry. I simplified my code and got fftSize>16k working with a batch>1.