# cuFFT DFT Performance question

I have a performance question about cuFFT, specifically a complex-to-complex (C2C) forward FFT on a 1D array. There are no errors or unexpected results; this is purely a performance question.

The observed performance of the cuFFT forward FFT drops significantly when the array length is 2 × 2,097,157 (4,194,314), while array lengths 2 × 1,048,576 (2,097,152) and 2 × 4,194,304 (8,388,608) perform as expected. Is this an issue with FFTs in general, i.e. an artifact of the FFT algorithms employed?

Example timings, where N is the array length:

| N | Device time (ms) |
|---|---|
| 2,097,152 | 0.96 |
| 4,194,314 | 253.50 |
| 8,388,608 | 2.15 |

I am using a single GPU (Titan V) with CUDA 10.1. A sample of the code I am using follows:

```
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// I left off CUDA timing and error handling as I just want to
// know if there is something I am doing wrong with calling the
// cuFFT library
int main(){
    // half the array length (the transform size is 2*N) - when N is
    // 2097157 the performance is significantly worse than with
    // either 1048576 or 4194304
    const unsigned int N = 2097157;

    cuComplex *darray, *harray, *result;
    harray = (cuComplex*)malloc(2*N*sizeof(cuComplex));
    result = (cuComplex*)malloc(2*N*sizeof(cuComplex));
    cudaMalloc((void**)&darray, 2*N*sizeof(cuComplex));

    // initialize the host input
    for(unsigned int i = 0; i < 2*N; ++i){
        harray[i].x = (float)i;
        harray[i].y = 1.0f;
    }

    // copy to DEVICE
    cudaMemcpy(darray, harray, 2*N*sizeof(cuComplex), cudaMemcpyHostToDevice);

    // Didn't wrap these calls in error macros
    // single (batch = 1) in-place forward C2C transform of length 2*N
    cufftHandle plan;
    cufftPlan1d(&plan, 2*N, CUFFT_C2C, 1);
    cufftExecC2C(plan, darray, darray, CUFFT_FORWARD);
    cufftDestroy(plan);

    // copy to HOST
    cudaMemcpy(result, darray, 2*N*sizeof(cuComplex), cudaMemcpyDeviceToHost);

    free(harray);
    free(result);
    cudaFree(darray);

    return 0;
}
```

Any ideas or hints as to why this behavior occurs would be great.

Thank you