I’m getting the strangest behavior with CUFFT in the following code:
cuda_test( cudaMemcpy( Wh, Wh_u, (N/2)*sizeof(float), cudaMemcpyHostToDevice ), "go_psd -> cudaMemcpy" ) cuda_test( cudaMemcpy( zz, b_u, Nh*(Nb+1)*sizeof(float), cudaMemcpyHostToDevice ), "go_psd -> cudaMemcpy( zz )" ); all_initial_expand<<<NhNb_g,NhNb_b>>>( Nh, Nb, zz, px, Wh ); basictwiddle<<<Nh_g,Nh_b>>>( Nh, Nb, -M_PI/N, px ); debugt( "Pointer addresses: %p\n", px ); cublasSdot( Nh, Wh, 1, Wh, 1 ); cufft_test( cufftExecC2C( planC2C, px, px, CUFFT_FORWARD ), "go_psd -> cufft" );
As you can see, the cublasSdot call is a no-op; its result is discarded. However, if I remove it, cufftexecC2C() returns CUFFT_UNALIGNED_DATA errors! Why that call would change the result of the CUFFT call is beyond me. The value of the pointer ‘px’ is 0x5c00000, so it’s aligned on 256-byte boundaries. The CUBLAS call still affects the CUFFT call positively if I insert it before my custom kernels.
Any ideas? I don’t suspect memory overwrites for several reasons. First of all, this is the first code that does any computation. Prior to this, I’m simply allocating space for the data and the FFT plan, using a bisection approach to determine the largest possible batch size Nb. Secondly, with the call in place, I can verify with CPU calculation that the computations are correct. And of course, like I said, the pointer is aligned, so it shouldn’t be giving me CUFFT_UNALIGNED_DATA errors anyway—that is, unless the planC2C itself is unaligned, but that’s not my fault!
I’m seeing this on Mac OSX 10.6, latest CUDA 3.2 distribution, GeForce 9600M GT. N = 65536, Nb = 150. Here’s the deviceQuery output:
Device 0: "GeForce 9600M GT" CUDA Driver Version: 3.20 CUDA Runtime Version: 3.20 CUDA Capability Major/Minor version number: 1.1 Total amount of global memory: 536543232 bytes Multiprocessors x Cores/MP = Cores: 4 (MP) x 8 (Cores/MP) = 32 (Cores) Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 2147483647 bytes Texture alignment: 256 bytes Clock rate: 1.25 GHz Concurrent copy and execution: Yes Run time limit on kernels: Yes Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously) Concurrent kernel execution: No Device has ECC support enabled: No Device is using TCC driver mode: No
UPDATE: I’m getting the same CUFFT_UNALIGNED_DATA error on a Tesla C1060 on 64-bit CentOS 5.5, except it even occurs WITH the CUBLAS call.