CUDA 2.2 and failing CUFFT SDK example

My CUFFT-related code has stopped working since I installed CUDA 2.2. Any tips would be appreciated.

I installed the following two packages:

cudasdk_2.2_macos.pkg

cudatoolkit_2.2_macos_32.pkg

Most of the toolkit examples run OK, so I can’t tell whether the failures are CUFFT-related. Most of the CUFFT examples fail, but some pass (note that the failing ones report 0.00 MPix/s):

$ ./simpleCUFFT

Using device 0: GeForce 9600M GT

GPU time: 229.100006 msecs. //0.000000 MPix/s

Test FAILED

Press ENTER to exit...

$ ./simpleCUFFT2

Using device 0: GeForce 9600M GT

GPU time: 229.839996 msecs. //0.000000 MPix/s

Test FAILED

$ ./convolutionFFT2D 

Using device 0: GeForce 9600M GT

Input data size		   : 1000 x 1000

Convolution kernel size   : 7 x 7

Padded image size		 : 1006 x 1006

Aligned padded image size : 1024 x 1024

Allocating memory...

Generating random input data...

Creating FFT plan for 1024 x 1024...

Uploading to GPU and padding convolution kernel and input data...

...initializing padded kernel and data storage with zeroes...

...copying input data and convolution kernel from host to CUDA arrays

...binding CUDA arrays to texture references

...padding convolution kernel

...padding input data array

Transforming convolution kernel...

Running GPU FFT convolution...

GPU time: 42.544998 msecs. //23.504526 MPix/s

Reading back GPU FFT results...

Checking GPU results...

...running reference CPU convolution

...comparing the results

Max delta / CPU value 1.588891E-06

L2 norm: 1.902640E-07

TEST PASSED

Shutting down...

Press ENTER to exit...

My device is the first of the two listed below (Mac OS X 10.5.7, MacBook Pro):

$ ./deviceQuery 

There are 2 devices supporting CUDA

Device 0: "GeForce 9600M GT"

  Major revision number:						 1

  Minor revision number:						 1

  Total amount of global memory:				 536543232 bytes

  Number of multiprocessors:					 4

  Number of cores:							   32

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									0.78 GHz

  Concurrent copy and execution:				 No

Device 1: "GeForce 9400M"

  Major revision number:						 1

  Minor revision number:						 1

  Total amount of global memory:				 266010624 bytes

  Number of multiprocessors:					 2

  Number of cores:							   16

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									0.25 GHz

  Concurrent copy and execution:				 No

Test PASSED

Press ENTER to exit...

Removing the old installation (rm -rf on /Developer/CUDA and /usr/local/cuda) and reinstalling solved the problem.

The build of the SDK examples must have been broken by the installer overwriting some files from the previous version but not others.

After posting, I noticed that someone else had a similar problem with 2.1.

I increased the signal size in simpleCUFFT, for example:
#define SIGNAL_SIZE 1<<18
#define FILTER_KERNEL_SIZE 1<<8

With CUDA 2.2, the test failed.

With CUDA 2.1, the test passed, even with:
#define SIGNAL_SIZE 1<<22

Looks like a bug in CUFFT.
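One thing worth ruling out before blaming the library: the #define lines above are not parenthesized, so their value depends on the expression they are expanded into. Whether this affects simpleCUFFT depends on how the sample uses the macros, which the post does not show, so the following is only a minimal sketch of the general C pitfall. The macro names with _RAW and _SAFE suffixes and the two helper functions are hypothetical, introduced here for illustration.

```c
/* Unparenthesized, in the same form as the defines quoted in the post. */
#define SIGNAL_SIZE_RAW        1<<18
#define FILTER_KERNEL_SIZE_RAW 1<<8

/* Parenthesized variants (hypothetical names, for illustration only). */
#define SIGNAL_SIZE_SAFE        (1 << 18)
#define FILTER_KERNEL_SIZE_SAFE (1 << 8)

/* Expands to 1<<18 + 1<<8. Because + binds tighter than <<,
   C parses this as (1 << (18 + 1)) << 8, not as a sum. */
static long combined_size_raw(void)
{
    return SIGNAL_SIZE_RAW + FILTER_KERNEL_SIZE_RAW;
}

/* Expands to (1 << 18) + (1 << 8), the intended sum. */
static long combined_size_safe(void)
{
    return SIGNAL_SIZE_SAFE + FILTER_KERNEL_SIZE_SAFE;
}
```

Here combined_size_safe() gives 262400, while combined_size_raw() gives 134217728. Compilers typically warn about the unparenthesized form, and wrapping the macro body in parentheses avoids the ambiguity.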

With CUDA 2.2, I compared the CUFFT output against FFTW for a transform size of 1<<22, and the difference is very large (more than 100).
With CUDA 2.1, the difference is very small (less than 0.0001).

If the transform size is small, all CUDA versions are OK.

I want to use CUDA 2.2 because it fixes many bugs, but I need libcufft.so. Can anyone help?

Thanks in advance!
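The CUFFT-versus-FFTW comparison described above can be sketched roughly as follows, assuming both results have already been copied into host arrays of interleaved single-precision complex values. The complex_f type and both function names are hypothetical stand-ins, not part of CUFFT or FFTW. The NaN scan is included because a NaN would otherwise slip silently through a plain difference check.

```c
#include <stddef.h>

/* Stand-in for an interleaved single-precision complex value
   (hypothetical type; real code would use the library's own type). */
typedef struct { float x, y; } complex_f;

/* Counts elements whose real or imaginary part is NaN.
   A NaN is the only floating-point value that is not equal to itself. */
static size_t count_nans(const complex_f *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i].x != v[i].x || v[i].y != v[i].y) {
            ++count;
        }
    }
    return count;
}

/* Returns the largest absolute component-wise difference between a and b,
   or -1.0 if either array contains a NaN. */
static double max_abs_diff(const complex_f *a, const complex_f *b, size_t n)
{
    double worst = 0.0;
    if (count_nans(a, n) != 0 || count_nans(b, n) != 0) {
        return -1.0;
    }
    for (size_t i = 0; i < n; ++i) {
        double dx = (double)a[i].x - (double)b[i].x;
        double dy = (double)a[i].y - (double)b[i].y;
        if (dx < 0.0) dx = -dx;
        if (dy < 0.0) dy = -dy;
        if (dx > worst) worst = dx;
        if (dy > worst) worst = dy;
    }
    return worst;
}
```

An absolute threshold such as 0.0001 only makes sense relative to the magnitude of the data, so for large transforms a relative error (as the SDK samples report with their L2 norm) may be the more telling number.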

I’m having tons of trouble getting CUDA 2.2’s FFT support to work on my host. My situation is awkward because I’m calling the CUDA FFTs from a JNI shared object, so the few example programs available don’t help much.

When I cudaMemcpy the device memory back to the host after the cufftExecC2C(), the first half of the memory is full of NaNs and the second half is zero-filled.
This is true no matter what size I make the FFT input data set… it’s strange.

So are these problems in 2.2 serious enough that they might cause things like that? Maybe I should go back to 2.1?

Fedora 10 x86_64
Cuda 2.2
gcc 4.3.2
java version “1.6.0_0”

UPDATE: My problem is fixed. The cufftPlan1d() call I was using had the wrong transform size (the size in bytes instead of the number of complex elements in the FFT).
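The mix-up in the update is easy to make because the allocation and copy calls take sizes in bytes while the plan takes a count of elements. Below is a minimal sketch of the conversion, assuming single-precision complex data stored as two floats. The complex_f type and transform_size_from_bytes() are hypothetical stand-ins so the snippet compiles without the CUDA headers; the CUFFT calls appear only in a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the layout of an interleaved single-precision complex
   value (two floats). Hypothetical; real code would use cufftComplex. */
typedef struct { float x, y; } complex_f;

/* Converts a buffer size in bytes into the number of complex elements,
   which is what the plan's transform size is expressed in. */
static int transform_size_from_bytes(size_t buffer_bytes)
{
    /* A buffer that is not a whole number of elements indicates a bug. */
    assert(buffer_bytes % sizeof(complex_f) == 0);
    return (int)(buffer_bytes / sizeof(complex_f));
}

/* With CUFFT available, the calls would look roughly like this:
 *
 *   cufftHandle plan;
 *   int nx = transform_size_from_bytes(buffer_bytes);
 *   cudaMalloc((void **)&d_data, buffer_bytes);   // size in bytes
 *   cufftPlan1d(&plan, nx, CUFFT_C2C, 1);         // size in elements
 *   cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
 */
```

With two-float elements, a plan sized in bytes asks for a transform eight times longer than the buffer holds, so the FFT would run past the end of the allocation, which is one plausible source of garbage such as NaNs in the output.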