I installed CUDA v2.3 on Windows XP 32-bit Professional with a GTX 260+. If nx=16384, every call after cufftExecC2C returns CUDA_ERROR_LAUNCH_FAILED; with nx=8192 it is OK. With CUDA v2.2 there is no such issue. Code as follows:
if ( CUFFT_SUCCESS != flag ) {
    printf("Error: cufftExecZ2Z or cufftExecC2C fails\n");
} else {
    printf("forward cuFFT C2C in place is O.K.\n");
}

// Note: a CUresult should be checked against CUDA_SUCCESS, not CUFFT_SUCCESS
// (both happen to be 0, so the comparison works by accident).
CUresult ret = cuCtxSynchronize();
if ( CUDA_SUCCESS != ret ) {
    printf("Error: cuCtxSynchronize() fails\n");
} else {
    printf("cuCtxSynchronize() works\n");
}

// ret = cuMemcpyHtoD(idata_d, idata_h, (nx)*batch*sizeof(cufftComplex));
CUDA_SAFE_CALL(cudaMemcpy(idata_d, idata_h, (nx)*batch*sizeof(cufftComplex), cudaMemcpyHostToDevice));

// Sometimes the return value will be CUFFT_EXEC_FAILED
cout << "step 5: forward FFT" << endl;
result = cufftExecC2C(plan, (cufftComplex *)idata_d, (cufftComplex *)idata_d, CUFFT_FORWARD);
if ( CUFFT_SUCCESS != result ) {
    printf("Error: cufftExecZ2Z or cufftExecC2C fails\n");
} else {
    printf("forward cuFFT C2C in place is O.K.\n");
}

// LOOK HERE: return value will always be CUDA_ERROR_LAUNCH_FAILED when nx=16384
ret = cuCtxSynchronize();
if ( CUDA_SUCCESS != ret ) {
    printf("Error: cuCtxSynchronize() fails\n");
} else {
    printf("cuCtxSynchronize() works\n");
}
/*
cudaError_t flag = cudaThreadSynchronize();
if ( cudaSuccess != flag ) {
    printf("Error: cudaThreadSynchronize() fails\n");
} else {
    printf("cudaThreadSynchronize() works\n");
}
*/
Actually, I want to use the CUDA driver API in my program.
I found that the program fails when nx=16384 but succeeds when nx=8192.
It also fails when nx is not a power of two. With the CUDA runtime API the behavior is different.
I attached the VC2005 solution files for reproducing this issue.
The issue appears when the GPU is a GT200 and the CUDA version is 2.3 on Windows. cufftTest.zip (3.87 KB)
I have a modified version of simpleCUFFT that I have run on 2.3 (comparing numerical results against another library), running FFTs from 2^6 through 2^20:
points 64 cycles 1
dp time 4.00 us 480.00 MFLOPS
sp time 2.00 us 960.00 MFLOPS
RMSPP 0.0000000000
cuda 6 64
system 135.00 us 14.22 MFLOPS
gpu 28.00 us 68.57 MFLOPS
I/O 107.00 us 9.57 MB/s
1.48% faster than SP CPU
RMSPP 0.0000000000
points 128 cycles 1
dp time 4.00 us 1120.00 MFLOPS
sp time 2.00 us 2240.00 MFLOPS
RMSPP 0.0000000000
cuda 7 128
system 125.00 us 35.84 MFLOPS
gpu 29.00 us 154.48 MFLOPS
I/O 96.00 us 21.33 MB/s
1.60% faster than SP CPU
RMSPP 0.0000000000
points 256 cycles 1
dp time 7.00 us 1462.86 MFLOPS
sp time 5.00 us 2048.00 MFLOPS
RMSPP 0.0000000000
cuda 8 256
system 127.00 us 80.63 MFLOPS
gpu 27.00 us 379.26 MFLOPS
I/O 100.00 us 40.96 MB/s
3.94% faster than SP CPU
RMSPP 0.0000000000
points 512 cycles 1
dp time 12.00 us 1920.00 MFLOPS
sp time 10.00 us 2304.00 MFLOPS
RMSPP 0.0000000000
cuda 9 512
system 172.00 us 133.95 MFLOPS
gpu 81.00 us 284.44 MFLOPS
I/O 91.00 us 90.02 MB/s
5.81% faster than SP CPU
RMSPP 0.0000000000
points 1024 cycles 1
dp time 26.00 us 1969.23 MFLOPS
sp time 23.00 us 2226.09 MFLOPS
RMSPP 0.0000000000
cuda 10 1024
system 182.00 us 281.32 MFLOPS
gpu 83.00 us 616.87 MFLOPS
I/O 99.00 us 165.49 MB/s
12.64% faster than SP CPU
RMSPP 0.0000000000
points 2048 cycles 1
dp time 54.00 us 2085.93 MFLOPS
sp time 45.00 us 2503.11 MFLOPS
RMSPP 0.0000000000
cuda 11 2048
system 184.00 us 612.17 MFLOPS
gpu 82.00 us 1373.66 MFLOPS
I/O 102.00 us 321.25 MB/s
24.46% faster than SP CPU
RMSPP 0.0000000000
points 4096 cycles 1
dp time 122.00 us 2014.43 MFLOPS
sp time 107.00 us 2296.82 MFLOPS
RMSPP 0.0000000000
cuda 12 4096
system 200.00 us 1228.80 MFLOPS
gpu 88.00 us 2792.73 MFLOPS
I/O 112.00 us 585.14 MB/s
53.50% faster than SP CPU
RMSPP 0.0000000000
points 8192 cycles 1
dp time 309.00 us 1723.24 MFLOPS
sp time 230.00 us 2315.13 MFLOPS
RMSPP 0.0000000000
cuda 13 8192
system 241.00 us 2209.46 MFLOPS
gpu 104.00 us 5120.00 MFLOPS
I/O 137.00 us 956.73 MB/s
95.44% faster than SP CPU
RMSPP 0.0000000000
points 16384 cycles 1
dp time 659.00 us 1740.33 MFLOPS
sp time 557.00 us 2059.03 MFLOPS
RMSPP 0.0000000000
cuda 14 16384
system 303.00 us 3785.08 MFLOPS
gpu 106.00 us 10819.62 MFLOPS
I/O 197.00 us 1330.68 MB/s
183.83% faster than SP CPU
RMSPP 0.0000000000
points 32768 cycles 1
dp time 1395.00 us 1761.72 MFLOPS
sp time 1202.00 us 2044.59 MFLOPS
RMSPP 0.0000000000
cuda 15 32768
system 330.00 us 7447.27 MFLOPS
gpu 92.00 us 26713.04 MFLOPS
I/O 238.00 us 2202.89 MB/s
364.24% faster than SP CPU
RMSPP 0.0000000000
points 65536 cycles 1
dp time 2940.00 us 1783.29 MFLOPS
sp time 2545.00 us 2060.07 MFLOPS
RMSPP 0.0000000000
cuda 16 65536
system 448.00 us 11702.86 MFLOPS
gpu 94.00 us 55775.32 MFLOPS
I/O 354.00 us 2962.08 MB/s
568.08% faster than SP CPU
RMSPP 0.0000000000
points 131072 cycles 1
dp time 6411.00 us 1737.81 MFLOPS
sp time 5542.00 us 2010.31 MFLOPS
RMSPP 0.0000000000
cuda 17 131072
system 698.00 us 15961.49 MFLOPS
gpu 99.00 us 112536.57 MFLOPS
I/O 599.00 us 3501.09 MB/s
793.98% faster than SP CPU
RMSPP 0.0000000000
points 262144 cycles 1
dp time 14175.00 us 1664.41 MFLOPS
sp time 11466.00 us 2057.65 MFLOPS
RMSPP 0.0000000000
cuda 18 262144
system 1080.00 us 21845.33 MFLOPS
gpu 112.00 us 210651.43 MFLOPS
I/O 968.00 us 4332.96 MB/s
1061.67% faster than SP CPU
RMSPP 0.0000000000
points 524288 cycles 1
dp time 34685.00 us 1435.99 MFLOPS
sp time 27340.00 us 1821.78 MFLOPS
RMSPP 0.0000000000
cuda 19 524288
system 1824.00 us 27306.67 MFLOPS
gpu 139.00 us 358326.33 MFLOPS
I/O 1685.00 us 4978.40 MB/s
1498.90% faster than SP CPU
RMSPP 0.0000000000
points 1048576 cycles 1
dp time 68372.00 us 1533.63 MFLOPS
sp time 52996.00 us 1978.59 MFLOPS
RMSPP 0.0000000000
cuda 20 1048576
system 3473.00 us 30192.23 MFLOPS
gpu 200.00 us 524288.00 MFLOPS
I/O 3273.00 us 5125.94 MB/s
1525.94% faster than SP CPU
RMSPP 0.0000000000
deviceQuery output:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce GTX 280"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 1073020928 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Thank you very much for your attention, redgatormc!
Yes, there is no problem when you run a cuFFT program such as simpleCUFFT written with the CUDA runtime API.
The issue only happens when you use the CUDA driver API to allocate memory and execute cuFFT with v2.3.
What confuses me is this: if I had misused the cuFFT functions, why does my test program (see my attachment) work correctly on cuFFT v2.2?