The GeForce GTX 580 (Fermi-class) I am working on seems to have trouble performing complex-to-complex FFTs with N greater than or equal to 32768, as the attached code demonstrates.
For background, I am reading single-byte samples from a file, packed in the following order: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). Each byte is a signed char, taking values between -128 and 127. I read these into a char4 array, use a custom function to copy them into two float2 arrays, one per signal, and perform a complex-to-complex FFT on each signal, one after the other. (I also tried cufftPlanMany(), but I get similarly erroneous output.)
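To make the byte layout concrete, here is a minimal host-side sketch of the unpacking step, assuming the packing order described above. It is not the code from testfft.cu: `Sample4`, `Complex`, and `unpackSignals` are hypothetical stand-ins for char4, float2, and the custom copy function, written in plain C++ so the layout can be checked on the CPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for CUDA's char4 and float2, so this builds without CUDA headers.
struct Sample4 { std::int8_t x, y, z, w; };  // Re(s1), Im(s1), Re(s2), Im(s2)
struct Complex { float x, y; };              // x = real part, y = imaginary part

// Split the interleaved 4-byte samples into one complex array per signal.
void unpackSignals(const std::vector<Sample4>& in,
                   std::vector<Complex>& sig1,
                   std::vector<Complex>& sig2)
{
    sig1.resize(in.size());
    sig2.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        sig1[i] = { static_cast<float>(in[i].x), static_cast<float>(in[i].y) };
        sig2[i] = { static_cast<float>(in[i].z), static_cast<float>(in[i].w) };
    }
}
```

Running something like this on the same input file and comparing against the float2 arrays copied back from the device would show whether the values are already wrong before the FFT is executed.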
The problem is that each time I run the program, some of the FFT output channels have different values from the previous run. The behaviour appears to be semi-deterministic: the bad sections of output tend to be ~32768 points (= N) long, but sometimes there are lone channels that are bad. For N < 32768 (I tried 16K, 4K, 1K), the program works flawlessly.
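One way to pin down the run-to-run differences is to dump the output of two runs and list the contiguous index ranges where they disagree. The following is only a sketch under assumptions: `findMismatchRuns` and the tolerance `tol` are hypothetical, and the outputs are assumed to have been copied back to the host as flat float arrays.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Return half-open [begin, end) index ranges where two runs differ by more than tol.
// A NaN in either run counts as a mismatch.
std::vector<std::pair<std::size_t, std::size_t>>
findMismatchRuns(const std::vector<float>& runA,
                 const std::vector<float>& runB,
                 float tol)
{
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    // Only compare the part both arrays have in common.
    const std::size_t n = runA.size() < runB.size() ? runA.size() : runB.size();
    auto differs = [&](std::size_t i) {
        return !(std::fabs(runA[i] - runB[i]) <= tol);  // true for NaN as well
    };
    std::size_t i = 0;
    while (i < n) {
        if (differs(i)) {
            const std::size_t begin = i;
            while (i < n && differs(i)) ++i;  // extend the run to its end
            runs.emplace_back(begin, i);
        } else {
            ++i;
        }
    }
    return runs;
}
```

In the output, a range whose length is close to N would correspond to the transform-sized bad sections described above, and a range of length 1 to a lone bad channel.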
I use CUDA 4.0 on RHEL 5.6. Details of the GPU that might be relevant:
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1535 MBytes
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Any help will be greatly appreciated! Thank you!
(I did try attaching a sample data file, but it seems I am not permitted to upload binary files.)
testfft.cu (3.13 KB)