CUFFT Erroneous Result for N >= 32768: random behaviour across multiple runs of the program

Hi,

The GeForce GTX 580 (Fermi-class) I am working on seems to have trouble performing complex-to-complex FFTs with N greater than or equal to 32768, as the attached code demonstrates.

Just to give you some background, I am reading single-byte samples from a file, packed in the following order: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array, use a custom function to copy them to two float2 arrays, one per signal, and perform a complex-to-complex FFT on each signal, one after the other. (I did try cufftPlanMany(), but I get similarly erroneous output.)
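
To make the data layout concrete, here is a minimal host-side sketch of the unpacking step, which can also serve as a CPU reference for checking what the GPU copy produces. It is not the code from testfft.cu: `Char4` and `Float2` are hypothetical stand-ins for CUDA's `char4` and `float2` (so the sketch builds without the CUDA headers), and `deinterleaveReference` is a name made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for CUDA's char4 and float2 (assumption: same field order).
struct Char4  { signed char x, y, z, w; };
struct Float2 { float x, y; };

// Split packed samples into one complex array per signal.
// Layout described in the post: x = Re(signal 1), y = Im(signal 1),
//                               z = Re(signal 2), w = Im(signal 2).
void deinterleaveReference(const std::vector<Char4>& packed,
                           std::vector<Float2>& signal1,
                           std::vector<Float2>& signal2)
{
    signal1.resize(packed.size());
    signal2.resize(packed.size());
    for (std::size_t i = 0; i < packed.size(); ++i) {
        // Each signed byte (-128..127) is widened to float unchanged.
        signal1[i].x = static_cast<float>(packed[i].x);
        signal1[i].y = static_cast<float>(packed[i].y);
        signal2[i].x = static_cast<float>(packed[i].z);
        signal2[i].y = static_cast<float>(packed[i].w);
    }
}
```

Comparing the two float2 arrays copied back from the device against this reference, before any FFT is run, would show whether the inputs to the FFT are already wrong.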

The problem is that each time I run the program, some of the FFT output channels differ in their values, compared to the previous run. The behaviour appears to be semi-deterministic. For example, the bad sections of output seem to be ~32768 points (= N) long, but sometimes there are lone channels that are bad. For N < 32768 (I tried 16K, 4K, 1K), the program works flawlessly.
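
One way to quantify the run-to-run differences is to dump the FFT output of two runs and list the contiguous index ranges where they disagree. The sketch below is only an illustration under assumptions: `Float2` again stands in for `float2`, `differingRanges` is a hypothetical helper, and it assumes that two runs on the same input, hardware and plan should match exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for CUDA's float2.
struct Float2 { float x, y; };

// Returns half-open [first, last) index ranges in which the two runs differ.
// Only the common prefix is compared if the lengths differ.
// Note: a NaN never compares equal, so NaN outputs are reported as differing.
std::vector<std::pair<std::size_t, std::size_t> >
differingRanges(const std::vector<Float2>& runA, const std::vector<Float2>& runB)
{
    std::vector<std::pair<std::size_t, std::size_t> > ranges;
    const std::size_t n = runA.size() < runB.size() ? runA.size() : runB.size();
    std::size_t i = 0;
    while (i < n) {
        if (runA[i].x == runB[i].x && runA[i].y == runB[i].y) {
            ++i;                      // channels agree, keep scanning
            continue;
        }
        const std::size_t start = i;  // first differing channel of this range
        while (i < n && (runA[i].x != runB[i].x || runA[i].y != runB[i].y)) {
            ++i;
        }
        ranges.push_back(std::make_pair(start, i));
    }
    return ranges;
}
```

A range whose length is close to N would correspond to a whole bad transform, while a range of length 1 would be one of the lone bad channels.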

I use CUDA 4.0 on RHEL 5.6. Details of the GPU that might be relevant:

CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1535 MBytes
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Any help will be greatly appreciated! Thank you!

(I did try attaching a sample data file, but it seems I am not permitted to upload binary files.)
testfft.cu (3.13 KB)

UPDATE: I tried the program on a GeForce GT 230M in a laptop (Compute Capability 1.2, running CUDA 3.0), with the number of threads per block changed accordingly, and it works. So could the problem I’m facing on the Fermi be a CUDA version issue, or a hardware issue? (It’s the Fermi that I need to run the program on, so this is still an open issue.)

I found out that it’s not a problem with CUFFT. Something’s going wrong with my CopyDataForFFT() kernel - each time I run the program, I get a different number of ‘Invalid global write of size 8’ errors, at seemingly random thread and block indices.
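
A write of size 8 matches a single float2 store (two 4-byte floats). Since the errors are reported per thread and block index, one quick check is to turn each reported pair into an element index and compare it with the array length. The kernel's actual indexing is not shown in this thread, so the sketch below assumes the usual one-dimensional scheme `blockIdx.x * blockDim.x + threadIdx.x`; all function names are hypothetical, and an out-of-range index is only one possible cause of such errors.

```cpp
#include <cassert>
#include <cstddef>

// Element index a thread would touch under the assumed 1D indexing scheme.
std::size_t linearIndex(std::size_t blockIdx, std::size_t blockDim, std::size_t threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}

// True if that thread's write lands inside an array of numElements entries.
bool writesInBounds(std::size_t blockIdx, std::size_t blockDim,
                    std::size_t threadIdx, std::size_t numElements)
{
    return linearIndex(blockIdx, blockDim, threadIdx) < numElements;
}

// Number of blocks needed to cover numElements, rounded up.
// When numElements is not a multiple of blockDim, the last block contains
// surplus threads that need an explicit bounds check in the kernel.
std::size_t blocksToCover(std::size_t numElements, std::size_t blockDim)
{
    return (numElements + blockDim - 1) / blockDim;
}
```

If every reported index turns out to be in range, the destination pointers and the sizes passed to the allocations would be the next things to look at.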

Perhaps the moderator could close this thread? Now that it’s been found not to be a CUFFT problem, I’m posting this as a new topic. Thanks!