Hey folks,

I’m getting the strangest behavior with CUFFT in the following code:

cuda_test( cudaMemcpy( Wh, Wh_u, (N/2)*sizeof(float), cudaMemcpyHostToDevice ), "go_psd -> cudaMemcpy" )

cuda_test( cudaMemcpy( zz, b_u, Nh*(Nb+1)*sizeof(float), cudaMemcpyHostToDevice ), "go_psd -> cudaMemcpy( zz )" );

all_initial_expand<<<NhNb_g,NhNb_b>>>( Nh, Nb, zz, px, Wh );

basictwiddle<<<Nh_g,Nh_b>>>( Nh, Nb, -M_PI/N, px );

debugt( "Pointer addresses: %p\n", px );

cublasSdot( Nh, Wh, 1, Wh, 1 );

cufft_test( cufftExecC2C( planC2C, px, px, CUFFT_FORWARD ), "go_psd -> cufft" );

As you can see, the cublasSdot call is a no-op; its result is discarded. However, if I remove it, cufftexecC2C() returns CUFFT_UNALIGNED_DATA errors! Why that call would change the result of the CUFFT call is beyond me. The value of the pointer ‘px’ is 0x5c00000, so it’s aligned on 256-byte boundaries. The CUBLAS call still affects the CUFFT call positively if I insert it before my custom kernels.

Any ideas? I don’t suspect memory overwrites for several reasons. First of all, this is the first code that does any computation. Prior to this, I’m simply allocating space for the data and the FFT plan, using a bisection approach to determine the largest possible batch size Nb. Secondly, with the call in place, I can verify with CPU calculation that the computations are correct. And of course, like I said, the pointer is aligned, so it shouldn’t be giving me CUFFT_UNALIGNED_DATA errors anyway—that is, unless the planC2C itself is unaligned, but that’s not my fault!

I’m seeing this on Mac OSX 10.6, latest CUDA 3.2 distribution, GeForce 9600M GT. N = 65536, Nb = 150. Here’s the deviceQuery output:

Device 0: "GeForce 9600M GT"

  CUDA Driver Version:                           3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.1

  Total amount of global memory:                 536543232 bytes

  Multiprocessors x Cores/MP = Cores:            4 (MP) x 8 (Cores/MP) = 32 (Cores)

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 8192

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Clock rate:                                    1.25 GHz

  Concurrent copy and execution:                 Yes

  Run time limit on kernels:                     Yes

  Integrated:                                    No

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:                   No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

UPDATE: I’m getting the same CUFFT_UNALIGNED_DATA error on a Tesla C1060 on 64-bit CentOS 5.5, except it even occurs WITH the CUBLAS call.