After installing CUDA v2.3, cannot execute a 16K FFT; with release v2.3, the function called after the FFT always fails

I installed CUDA v2.3 on Windows XP Professional 32-bit, with a GTX 260+. If nx=16384, functions called after cufftExecC2C always return CUDA_ERROR_LAUNCH_FAILED. If nx=8192, it works. With CUDA v2.2 there is no such issue. Code as follows:

[codebox]
#include "stdafx.h"
#include "windows.h"
#include "math.h"
#include "cuda.h"
#include "cufft.h"
#include <iostream>

using namespace std;

#pragma comment(lib, "cufft.lib")
#pragma comment(lib, "cuda.lib")

#define NX 16384
#define BATCH 1

int _tmain(int argc, _TCHAR* argv[])
{
    cufftHandle plan;
    cufftResult result = CUFFT_SUCCESS;
    CUresult ret = CUDA_SUCCESS;
    int nx = NX;
    int batch = BATCH;

    result = cufftPlan1d(&plan, nx, CUFFT_C2C, batch);

    CUdeviceptr idata_d;
    cufftComplex *idata_h = (cufftComplex*)malloc(nx * batch * sizeof(cufftComplex));
    cuMemAlloc(&idata_d, nx * batch * sizeof(cufftComplex));

    // fill the input with random values in [0, 1)
    srand(GetTickCount());
    for (int i = 0; i < nx * batch * 2; i++)
    {
        ((float*)idata_h)[i] = (double)rand() / (RAND_MAX + 1);
    }

    ret = cuMemcpyHtoD(idata_d, idata_h, nx * batch * sizeof(cufftComplex));

    // Sometimes the return value will be CUFFT_EXEC_FAILED
    result = cufftExecC2C(plan, (cufftComplex *)idata_d, (cufftComplex *)idata_d, CUFFT_FORWARD);

    // LOOK HERE: return value is always CUDA_ERROR_LAUNCH_FAILED when nx=16384
    ret = cuCtxSynchronize();

    result = cufftDestroy(plan);
    ret = cuMemFree(idata_d);
    free(idata_h);

    getchar();
    return 0;
}
[/codebox]

Is there anyone who can execute a 16384-point 1D C2C FFT with CUDA v2.3?
I suspect there is a bug in cuFFT v2.3.

I use a GTX 295, CUDA 2.3, WinXP Pro 64-bit, with VC2005, and tested a 1D FFT complex-to-complex (not your code)

for nx = 16384

it works for "float" and "double"

cufftExecZ2Z() and cufftExecC2C() always return CUFFT_SUCCESS

Thank you very much for your feedback!

I have done several tests. This problem is weird: after cufftExecC2C, in some cases cuCtxSynchronize always returns CUDA_ERROR_LAUNCH_FAILED.

Some cases I have tested are as follows (WinXP Pro 32-bit, VC2005, 1D FFT C2C with nx=16384, driver 190.38):

GTX 285 , cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_ERROR_LAUNCH_FAILED

GTX 280 , cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_ERROR_LAUNCH_FAILED

GTX 260+, cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_ERROR_LAUNCH_FAILED

GTX 260 , cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_ERROR_LAUNCH_FAILED

9800GX2 , cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

8800 GT , cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

NVS 290 , cuda 2.3, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

GTX 285 , cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

GTX 280 , cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

GTX 260+, cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

GTX 260 , cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

9800GX2 , cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

8800 GT , cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

NVS 290 , cuda 2.2, cufftExecC2C + cuCtxSynchronize–> CUDA_SUCCESS

GTX 285 , cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 280 , cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 260+, cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 260 , cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

9800GX2 , cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

8800 GT , cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

NVS 290 , cuda 2.3, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 285 , cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 280 , cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 260+, cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

GTX 260 , cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

9800GX2 , cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

8800 GT , cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

NVS 290 , cuda 2.2, cufftExecC2C + cudaThreadSynchronize–> cudaSuccess

In summary, it seems that cuCtxSynchronize cannot be called successfully after cufftExecC2C with CUDA v2.3 on GT200 GPUs.

LSChien, could you try adding cuCtxSynchronize after cufftExecC2C?

platform: GTX 295, CUDA 2.3, driver 190.38, WinXP Pro 64-bit, VC2005

the program works even with cuCtxSynchronize() added after cufftExecC2C()

code of my forward FFT is

[codebox]
#ifdef DO_DOUBLE
typedef cufftDoubleComplex Complex;
#else
typedef cufftComplex Complex;
#endif

// forward FFT (in place)
void lsc_1DFFTF( const unsigned int Nx, Complex *h_idata )
{
    cufftHandle plan;
    Complex    *d_idata;
    size_t      mem_size_Nx = sizeof(Complex)*Nx;
    cufftResult flag;

    // step 1: transfer data to device
    cutilSafeCall( cudaMalloc((void**)&d_idata, mem_size_Nx) );
    CUDA_SAFE_CALL( cudaMemcpy(d_idata, h_idata, mem_size_Nx, cudaMemcpyHostToDevice) );

    // step 2: create a 1D FFT plan
#if defined (DO_DOUBLE)
    cufftPlan1d(&plan, Nx, CUFFT_Z2Z, 1);
#else
    cufftPlan1d(&plan, Nx, CUFFT_C2C, 1);
#endif

    // step 3: use the CUFFT plan to transform the signal in place
#if defined (DO_DOUBLE)
    flag = cufftExecZ2Z( plan, d_idata, d_idata, CUFFT_FORWARD);
#else
    flag = cufftExecC2C( plan, d_idata, d_idata, CUFFT_FORWARD);
#endif
    if ( CUFFT_SUCCESS != flag ){
        printf("Error: cufftExecZ2Z or cufftExecC2C fails \n");
    }else{
        printf("forward cuFFT C2C in place is O.K. \n");
    }

    CUresult ret = cuCtxSynchronize();
    if ( CUDA_SUCCESS != ret ){  // compare a CUresult against CUDA_SUCCESS
        printf("Error: cuCtxSynchronize() fails \n");
    }else{
        printf("cuCtxSynchronize() works \n");
    }

    // step 4: copy data back to host
    CUDA_SAFE_CALL( cudaMemcpy(h_idata, d_idata, mem_size_Nx, cudaMemcpyDeviceToHost) );

    // destroy the CUFFT plan and free device memory
    cufftDestroy(plan);
    cudaFree(d_idata);
}
[/codebox]

besides, I tested your code and the error indeed occurs, so

I modified your code as follows:

[codebox]
//#include "stdafx.h"
//#include "windows.h"
#include "math.h"
#include "cuda.h"
#include "cufft.h"
#include <iostream>
#include <cutil.h>
#include <cutil_inline.h>

using namespace std;

#pragma comment(lib, "cufft.lib")
#pragma comment(lib, "cuda.lib")

#define NX 16384
#define BATCH 1

int main(int argc, char* argv[])
{
    cufftHandle plan;
    cufftResult result = CUFFT_SUCCESS;
    CUresult ret = CUDA_SUCCESS;
    int nx = NX;
    int batch = BATCH;

    int device = cutGetMaxGflopsDeviceId();
    cudaSetDevice( device );
    printf("test 1D CUFFT C2C with device %d\n", device);

    cout << "step 1: create 1D C2C plan" << endl;
    result = cufftPlan1d(&plan, nx, CUFFT_C2C, batch);

    cout << "step 2: allocate device memory" << endl;
//  CUdeviceptr idata_d;
    cufftComplex *idata_d;
    cufftComplex *idata_h = (cufftComplex*)malloc(nx * batch * sizeof(cufftComplex));
//  cuMemAlloc(&idata_d, nx * batch * sizeof(cufftComplex));
    cutilSafeCall( cudaMalloc((void**)&idata_d, nx * batch * sizeof(cufftComplex)) );

    cout << "step 3: random data" << endl;
    for (int i = 0; i < nx * batch * 2; i++) {
        ((float*)idata_h)[i] = (double)rand() / (RAND_MAX + 1);
    }

    cout << "step 4: host to device" << endl;
//  ret = cuMemcpyHtoD(idata_d, idata_h, nx * batch * sizeof(cufftComplex));
    CUDA_SAFE_CALL( cudaMemcpy(idata_d, idata_h, nx * batch * sizeof(cufftComplex), cudaMemcpyHostToDevice) );
    // Sometimes the return value will be CUFFT_EXEC_FAILED

    cout << "step 5: forward FFT" << endl;
    result = cufftExecC2C(plan, (cufftComplex *)idata_d, (cufftComplex *)idata_d, CUFFT_FORWARD);
    if ( CUFFT_SUCCESS != result ){
        printf("Error: cufftExecZ2Z or cufftExecC2C fails \n");
    }else{
        printf("forward cuFFT C2C in place is O.K. \n");
    }

    // LOOK HERE: return value is always CUDA_ERROR_LAUNCH_FAILED when nx=16384
    ret = cuCtxSynchronize();
    if ( CUDA_SUCCESS != ret ){  // compare a CUresult against CUDA_SUCCESS
        printf("Error: cuCtxSynchronize() fails \n");
    }else{
        printf("cuCtxSynchronize() works \n");
    }
/*
    cudaError_t flag = cudaThreadSynchronize();
    if ( cudaSuccess != flag ){
        printf("Error: cudaThreadSynchronize() fails \n");
    }else{
        printf("cudaThreadSynchronize() works \n");
    }
*/
    cout << "step 6: de-allocate" << endl;
    result = cufftDestroy(plan);
//  ret = cuMemFree(idata_d);
    cudaFree(idata_d);
    free(idata_h);

    return 0;
}
[/codebox]

then it works.

If I use cudaGetLastError() to test each statement, then after the FFT the error string is "invalid device pointer", even though the FFT itself returns CUFFT_SUCCESS.

I have no idea what causes that error.

Thank you, Lung Sheng!

Actually, I want to use the CUDA driver API in my program.
I found that if nx=16384 my program fails, while with nx=8192 it succeeds.
If nx is not a power of two, the program fails as well. If you use CUDA runtime functions, the behavior is different.

I attached the vc2005 solution files for testing this issue.

This issue appears when the GPU is a GT200 and the CUDA version is 2.3 on Windows.
cufftTest.zip (3.87 KB)

I wish NVIDIA engineers could give an explanation. Is it a bug or a misuse?

I have a modified version of simpleCUFFT that I have run on 2.3 (comparing numerical results against another library), running FFTs from 2^6 through 2^20:

points 64 cycles 1
dp time 4.00 us 480.00 MFLOPS
sp time 2.00 us 960.00 MFLOPS
RMSPP 0.0000000000
cuda 6 64
system 135.00 us 14.22 MFLOPS
gpu 28.00 us 68.57 MFLOPS
I/O 107.00 us 9.57 MB/s
1.48% faster than SP CPU
RMSPP 0.0000000000
points 128 cycles 1
dp time 4.00 us 1120.00 MFLOPS
sp time 2.00 us 2240.00 MFLOPS
RMSPP 0.0000000000
cuda 7 128
system 125.00 us 35.84 MFLOPS
gpu 29.00 us 154.48 MFLOPS
I/O 96.00 us 21.33 MB/s
1.60% faster than SP CPU
RMSPP 0.0000000000
points 256 cycles 1
dp time 7.00 us 1462.86 MFLOPS
sp time 5.00 us 2048.00 MFLOPS
RMSPP 0.0000000000
cuda 8 256
system 127.00 us 80.63 MFLOPS
gpu 27.00 us 379.26 MFLOPS
I/O 100.00 us 40.96 MB/s
3.94% faster than SP CPU
RMSPP 0.0000000000
points 512 cycles 1
dp time 12.00 us 1920.00 MFLOPS
sp time 10.00 us 2304.00 MFLOPS
RMSPP 0.0000000000
cuda 9 512
system 172.00 us 133.95 MFLOPS
gpu 81.00 us 284.44 MFLOPS
I/O 91.00 us 90.02 MB/s
5.81% faster than SP CPU
RMSPP 0.0000000000
points 1024 cycles 1
dp time 26.00 us 1969.23 MFLOPS
sp time 23.00 us 2226.09 MFLOPS
RMSPP 0.0000000000
cuda 10 1024
system 182.00 us 281.32 MFLOPS
gpu 83.00 us 616.87 MFLOPS
I/O 99.00 us 165.49 MB/s
12.64% faster than SP CPU
RMSPP 0.0000000000
points 2048 cycles 1
dp time 54.00 us 2085.93 MFLOPS
sp time 45.00 us 2503.11 MFLOPS
RMSPP 0.0000000000
cuda 11 2048
system 184.00 us 612.17 MFLOPS
gpu 82.00 us 1373.66 MFLOPS
I/O 102.00 us 321.25 MB/s
24.46% faster than SP CPU
RMSPP 0.0000000000
points 4096 cycles 1
dp time 122.00 us 2014.43 MFLOPS
sp time 107.00 us 2296.82 MFLOPS
RMSPP 0.0000000000
cuda 12 4096
system 200.00 us 1228.80 MFLOPS
gpu 88.00 us 2792.73 MFLOPS
I/O 112.00 us 585.14 MB/s
53.50% faster than SP CPU
RMSPP 0.0000000000
points 8192 cycles 1
dp time 309.00 us 1723.24 MFLOPS
sp time 230.00 us 2315.13 MFLOPS
RMSPP 0.0000000000
cuda 13 8192
system 241.00 us 2209.46 MFLOPS
gpu 104.00 us 5120.00 MFLOPS
I/O 137.00 us 956.73 MB/s
95.44% faster than SP CPU
RMSPP 0.0000000000
points 16384 cycles 1
dp time 659.00 us 1740.33 MFLOPS
sp time 557.00 us 2059.03 MFLOPS
RMSPP 0.0000000000
cuda 14 16384
system 303.00 us 3785.08 MFLOPS
gpu 106.00 us 10819.62 MFLOPS
I/O 197.00 us 1330.68 MB/s
183.83% faster than SP CPU
RMSPP 0.0000000000
points 32768 cycles 1
dp time 1395.00 us 1761.72 MFLOPS
sp time 1202.00 us 2044.59 MFLOPS
RMSPP 0.0000000000
cuda 15 32768
system 330.00 us 7447.27 MFLOPS
gpu 92.00 us 26713.04 MFLOPS
I/O 238.00 us 2202.89 MB/s
364.24% faster than SP CPU
RMSPP 0.0000000000
points 65536 cycles 1
dp time 2940.00 us 1783.29 MFLOPS
sp time 2545.00 us 2060.07 MFLOPS
RMSPP 0.0000000000
cuda 16 65536
system 448.00 us 11702.86 MFLOPS
gpu 94.00 us 55775.32 MFLOPS
I/O 354.00 us 2962.08 MB/s
568.08% faster than SP CPU
RMSPP 0.0000000000
points 131072 cycles 1
dp time 6411.00 us 1737.81 MFLOPS
sp time 5542.00 us 2010.31 MFLOPS
RMSPP 0.0000000000
cuda 17 131072
system 698.00 us 15961.49 MFLOPS
gpu 99.00 us 112536.57 MFLOPS
I/O 599.00 us 3501.09 MB/s
793.98% faster than SP CPU
RMSPP 0.0000000000
points 262144 cycles 1
dp time 14175.00 us 1664.41 MFLOPS
sp time 11466.00 us 2057.65 MFLOPS
RMSPP 0.0000000000
cuda 18 262144
system 1080.00 us 21845.33 MFLOPS
gpu 112.00 us 210651.43 MFLOPS
I/O 968.00 us 4332.96 MB/s
1061.67% faster than SP CPU
RMSPP 0.0000000000
points 524288 cycles 1
dp time 34685.00 us 1435.99 MFLOPS
sp time 27340.00 us 1821.78 MFLOPS
RMSPP 0.0000000000
cuda 19 524288
system 1824.00 us 27306.67 MFLOPS
gpu 139.00 us 358326.33 MFLOPS
I/O 1685.00 us 4978.40 MB/s
1498.90% faster than SP CPU
RMSPP 0.0000000000
points 1048576 cycles 1
dp time 68372.00 us 1533.63 MFLOPS
sp time 52996.00 us 1978.59 MFLOPS
RMSPP 0.0000000000
cuda 20 1048576
system 3473.00 us 30192.23 MFLOPS
gpu 200.00 us 524288.00 MFLOPS
I/O 3273.00 us 5125.94 MB/s
1525.94% faster than SP CPU
RMSPP 0.0000000000

deviceQuery output:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: “GeForce GTX 280”
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 1073020928 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit…

Thank you very much for your attention, redgatormc!

Yes, there is no problem when you run a cuFFT program like simpleCUFFT written with the CUDA runtime API.
The issue only happens when you use the CUDA driver API to allocate memory and then execute cuFFT with v2.3.

What confuses me is: if I misused the cuFFT functions, why does my test program (see my attachment) work correctly with cuFFT v2.2?