CUFFT maximum 1D size with batch: CUDA error doing very large 1D FFTs

Hello all,

I am testing the maximum array length that I can use CUFFT on with my machine. As described in the CUFFT manual, I can go up to 2 ** 23 = 8388608 with a batch of 1.
The data is an array of real*4 (single-precision floats) with length N * BATCH. The transform is done in place, and I should have 4 GB of memory available on the device.

As the manual suggests, going higher fails: executing R2C with 2 ** 24 or 2 ** 25 gives cudaError: unspecified launch Error.
When I try 2 ** 26, I get cudaError: invalid configuration argument, so the error changes.
When I go to 2 ** 28, I get the following:

/home/buildmeister/build/rel/gpgpu/toolkit/r3.0/cufft/src/accel/sp/interface/SP_FFT_interface.cpp:265: void SP_fftSetup(void**, unsigned int*, unsigned int*, unsigned int): Assertion `fftStruct->N >= 1 && fftStruct->N <= (1U << 27)’ failed.

You may ask why I want to test the error messages beyond 2 ** 23.

The thing is, I get those error messages a lot earlier when using batch:

For N = 2 ** 23, BATCH = 2 ** 4: Exec R2C cudaError: invalid argument. (Strange that it doesn’t say cudaError: invalid configuration argument.)
For N = 2 ** 23, BATCH = 2 ** 5: /home/buildmeister/build/rel/gpgpu/toolkit/r3.0/cufft/src/accel/sp/interface/SP_FFT_interface.cpp:265: void SP_fftSetup … (same error as before)

I could also use any other combination: when N * BATCH reaches 2 ** 27 in total, I get the argument error, and at 2 ** 28 the other SP_fftSetup error. Judging by memory size alone, I could use a much higher batch (2 ** 29).
I guess it breaks down because of the 32-bit integer range. Is that true?
I suppose I have to split up my data?

Thanks for any help,


I have the following GPU:

CUDA Device #0
Major revision number: 1
Minor revision number: 3
Name: Quadro FX 5800
Total global memory: 4294246400
Total shared memory per block: 16384
Total registers per block: 16384
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 512
Maximum dimension 0 of block: 512
Maximum dimension 1 of block: 512
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 1
Clock rate: 1296000
Total constant memory: 65536
Texture alignment: 256
Concurrent copy and execution: Yes
Number of multiprocessors: 30
Kernel execution timeout: Yes