cuFFT DFT Performance question

Please help,

I have a performance question about cuFFT's complex-to-complex forward FFT on a 1D array. There are no errors or unexpected data; this is purely a performance question.

The observed performance of the cuFFT forward FFT drops significantly when the array length is 2*2,097,157 (4,194,314), while array lengths 2*1,048,576 (2,097,152) and 2*4,194,304 (8,388,608) perform as expected. Is this inherent to the FFT in general, or an artifact of the particular FFT algorithms employed?

Example timings, where N is the array length:

N – DEVICE time (ms)
2,097,152 – 0.96
4,194,314 – 253.50
8,388,608 – 2.15
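
To see how the three lengths differ structurally, here is a small host-side C sketch that factorizes an FFT length by trial division. It assumes, based on my reading of the cuFFT documentation, that the library is best optimized for lengths of the form 2^a * 3^b * 5^c * 7^d and may use a slower, more general algorithm for lengths with large prime factors. That assumption may or may not explain the timings above. The helper names largest_prime_factor and is_7_smooth are my own and are not part of cuFFT.

```c
#include <stdint.h>

/* Largest prime factor of n, found by trial division.
 * Returns 1 for n == 1 (which has no prime factors). */
uint64_t largest_prime_factor(uint64_t n) {
    uint64_t largest = 1;
    for (uint64_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {   /* strip every power of p */
            largest = p;
            n /= p;
        }
    }
    if (n > 1) {
        largest = n;           /* the remainder is itself a prime */
    }
    return largest;
}

/* Non-zero when n has no prime factor larger than 7,
 * i.e. n can be written as 2^a * 3^b * 5^c * 7^d. */
int is_7_smooth(uint64_t n) {
    return n >= 1 && largest_prime_factor(n) <= 7;
}
```

Both 2,097,152 and 8,388,608 are pure powers of two, whereas 4,194,314 = 2 * 2,097,157 contains at least one prime factor larger than 7, so is_7_smooth(4194314) returns 0 while the other two lengths return 1.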

I am using a single GPU (Titan V) with compute architecture 10.1. A sample of the code I am using follows:

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>

// I left off CUDA timing and error handling as I just want to
// know if there is something I am doing wrong with calling the
// cuFFT library
int main(){
  // length of array - when N is 2097157 the performance is 
  // significantly worse than either 1048576 or 4194304
  const unsigned int N = 2097157;  

  cuComplex *darray, *harray, *result;
  harray = (cuComplex*)malloc(2*N*sizeof(cuComplex));
  result = (cuComplex*)malloc(2*N*sizeof(cuComplex));
  cudaMalloc((void**)&darray, 2*N*sizeof(cuComplex));

  // initialize 
  for(unsigned int i = 0; i < 2*N; ++i){
    harray[i].x = (float)i;
    harray[i].y = 1.0f;
  }

  // copy to DEVICE
  cudaMemcpy(darray, harray, 2*N*sizeof(cuComplex), cudaMemcpyHostToDevice);

  // Didn't wrap these calls in error macros
  cufftHandle plan;
  cufftPlan1d(&plan, 2*N, CUFFT_C2C, 1);
  cufftExecC2C(plan, darray, darray, CUFFT_FORWARD);

  // copy to HOST
  cudaMemcpy(result, darray, 2*N*sizeof(cuComplex), cudaMemcpyDeviceToHost);


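  // release the plan and the device/host buffers
  cufftDestroy(plan);
  cudaFree(darray);
  free(harray);
  free(result);
  return 0;
}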

Any ideas or hints as to why this behavior occurs would be great.

Thank you