cuFFT Performance for increasing array sizes

Please help,

First of all, I apologize for re-posting this question from another section of the forums. I think the question might be better asked here. I will place my question(s) properly next time.

I have a performance question about cuFFT when running a complex-to-complex forward FFT on a 1D array. There are no errors or unexpected data, just a performance question.

The observed performance of the cuFFT forward FFT drops significantly when the array length is 2*2,097,157 (4,194,314), while array sizes 2*1,048,576 (2,097,152) and 2*4,194,304 (8,388,608) perform as expected. Is this an issue with FFTs in general, an artifact of the algorithms the FFT employs?

Example timings, where N is the array size:

N – DEVICE Time (ms.)
2,097,152 – 0.96
4,194,314 – 253.50
8,388,608 – 2.15

I am using a single GPU (Titan V) with compute architecture 10.1. A sample of the code I am using follows:

#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free

// I left off CUDA timing and error handling as I just want to
// know if there is something I am doing wrong with calling the
// cuFFT library
int main(){
  // length of array - when N is 2097157 the performance is 
  // significantly worse than either 1048576 or 4194304
  const unsigned int N = 2097157;  

  cuComplex *darray, *harray, *result;
  harray = (cuComplex*)malloc(2*N*sizeof(cuComplex));
  result = (cuComplex*)malloc(2*N*sizeof(cuComplex));
  cudaMalloc((void**)&darray, 2*N*sizeof(cuComplex));

  // initialize 
  for(unsigned int i = 0; i < 2*N; ++i){
    harray[i].x = (float)i;
    harray[i].y = 1.0f;
  }

  // copy to DEVICE
  cudaMemcpy(darray, harray, 2*N*sizeof(cuComplex), cudaMemcpyHostToDevice);

  // Didn't wrap these calls in error macros
  cufftHandle plan;
  cufftPlan1d(&plan, 2*N, CUFFT_C2C, 1);
  cufftExecC2C(plan, darray, darray, CUFFT_FORWARD);

  // copy to HOST
  cudaMemcpy(result, darray, 2*N*sizeof(cuComplex), cudaMemcpyDeviceToHost);


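  // release the plan and the buffers
  cufftDestroy(plan);
  cudaFree(darray);
  free(harray);
  free(result);

  return 0;
}
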
Any ideas or hints as to why this behavior occurs would be great.

Thank you

cuFFT performance is sensitive to the prime factorization of the transform size. Power-of-2 sizes generally perform fastest. Sizes that contain large prime factors may perform much slower.

This is documented in the cuFFT docs.

considering this:

2,097,152 – power of 2
4,194,314 – not a power of 2
8,388,608 – power of 2

If you change 4194314 to 4194304, the timing will fall in line with expectations (slower than 2097152, faster than 8388608).
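
To check a size before creating a plan, something like the following host-side sketch may help. It factors a length by trial division and tests whether it has the form 2^a * 3^b * 5^c * 7^d, which the cuFFT documentation describes as the best-optimized case. The function names (largest_prime_factor, is_7_smooth, next_7_smooth) are my own, for illustration only; they are not part of cuFFT.

```cpp
#include <cstdint>

// Largest prime factor of n (for n >= 2), found by trial division.
// Hypothetical helper, not a cuFFT function.
std::uint64_t largest_prime_factor(std::uint64_t n) {
    std::uint64_t largest = 1;
    for (std::uint64_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {   // strip every copy of p
            largest = p;
            n /= p;
        }
    }
    if (n > 1) {
        largest = n;           // what is left over is itself a prime
    }
    return largest;
}

// True when n can be written as 2^a * 3^b * 5^c * 7^d.
bool is_7_smooth(std::uint64_t n) {
    if (n == 0) {
        return false;
    }
    const std::uint64_t radices[] = {2, 3, 5, 7};
    for (std::uint64_t p : radices) {
        while (n % p == 0) {
            n /= p;
        }
    }
    return n == 1;
}

// Smallest length >= n (for n >= 1) that is 7-smooth.
// A plain linear search, which is fine for a one-off check.
std::uint64_t next_7_smooth(std::uint64_t n) {
    while (!is_7_smooth(n)) {
        ++n;
    }
    return n;
}
```

For example, is_7_smooth(4194314) is false, because 4194314 = 2 * 2097157 and 2097157 is not divisible by 2, 3, 5 or 7, while largest_prime_factor(4194304) is 2. Keep in mind that a transform of a different length is a different transform, so whether you can change or pad the size depends on your application.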

Thank you for the information.