CUFFT question: confusing CUFFT times

I have a question about the CUFFT times that I measured for 1D arrays of different sizes. I recorded the times for transferring the memory in, creating the plan, executing the plan, and transferring the memory out, as well as the sum of those four times and the time from a single timer that covers all the steps. Here is the code that I wrote to get the times:

[codebox]unsigned int main_timer = 63;
CUT_SAFE_CALL(cutCreateTimer(&main_timer));
unsigned int sec_timer = 64;
CUT_SAFE_CALL(cutCreateTimer(&sec_timer));

//Create complex number array
int mem_size = sizeof(Complex) * size;
Complex* h_signal = (Complex*)malloc(mem_size);
for (int i = 0; i < size; i++)
{
    h_signal[i].x = in[i];
    h_signal[i].y = 0;
}

CUT_SAFE_CALL(cutStartTimer(main_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

//Transfer memory in
Complex* d_signal;
CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal, mem_size));
cudaMemcpy(d_signal, h_signal, mem_size, cudaMemcpyHostToDevice);
CUT_SAFE_CALL( cutStopTimer(sec_timer) );
time1 = cutGetTimerValue(sec_timer);
CUT_SAFE_CALL(cutResetTimer(sec_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

//Create plan
cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, size, CUFFT_C2C, 1));
CUT_SAFE_CALL( cutStopTimer(sec_timer) );
time2 = cutGetTimerValue(sec_timer);
CUT_SAFE_CALL(cutResetTimer(sec_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

//Execute
if(CUFFT_SAFE_CALL(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD)) == CUFFT_SUCCESS);
CUT_SAFE_CALL( cutStopTimer(sec_timer) );
time3 = cutGetTimerValue(sec_timer);
CUT_SAFE_CALL(cutResetTimer(sec_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

//Transfer memory out
CUDA_SAFE_CALL(cudaMemcpy(h_signal, d_signal, mem_size, cudaMemcpyDeviceToHost));
CUFFT_SAFE_CALL(cufftDestroy(plan));
CUT_SAFE_CALL( cutStopTimer(main_timer) );
CUT_SAFE_CALL( cutStopTimer(sec_timer) );
time4 = cutGetTimerValue(sec_timer);
time = cutGetTimerValue(main_timer);

CUT_SAFE_CALL( cutDeleteTimer(main_timer) );
CUT_SAFE_CALL( cutDeleteTimer(sec_timer) );[/codebox]

TIMES (ms):

FFT size  Mem in     Create plan  Exec       Mem out     Accum       Whole time
100000    58.644646  0.242568     2.588642   2.639214    64.11507    64.116829
200000    1.120684   0.139344     3.631237   4.711313    9.602578    9.603992
300000    1.665527   0.164472     7.664171   6.96609     16.460261   16.460789
400000    2.337419   0.194217     8.226549   17.603008   28.361193   28.362558
500000    3.078444   0.232129     6.629548   19.2332     29.173321   29.17485
600000    3.498785   0.238939     19.037582  14.595923   37.37123    37.372135
700000    4.245278   0.489798     30.174217  12.14481    47.054103   47.055618
800000    4.982587   0.277627     32.83313   13.027261   51.120604   51.122158
900000    5.217419   0.275692     33.990204  21.571005   61.05432    61.055973
1000000   7.640786   0.343416     44.607174  20.091688   72.683064   72.684746
1100000   6.989479   0.611454     65.546646  23.487913   96.635491   96.63726
1200000   7.308928   0.66276      54.078533  23.02948    85.079701   85.08136
1300000   7.480165   1.073727     0.351894   112.932503  121.838289  121.838943
1400000   9.015682   1.076787     0.297168   99.88903    110.278667  110.28006
1500000   8.371383   1.113619     0.310337   109.085182  118.880521  118.882622
1600000   10.088757  1.189116     0.283315   111.248756  122.809945  122.811195
1700000   9.107031   1.332154     0.265627   153.410782  164.115594  164.117477
1800000   11.009363  1.262533     0.338417   133.408844  146.019157  146.020569
1900000   10.056148  1.40488      0.336107   172.097229  183.894364  183.895035
2000000   11.442863  1.448575     0.295053   144.716827  157.903318  157.906052
2100000   12.568538  1.405342     0.28909    157.193024  171.455993  171.457336
2200000   13.987428  1.589422     0.268263   185.299759  201.144872  201.147293
2300000   12.249351  1.519636     0.264817   252.579285  266.613088  266.613708
2400000   14.57393   1.569075     0.286323   167.068329  183.497656  183.49826
2500000   14.369367  1.59222      0.289701   174.133591  190.384878  190.386261
2600000   14.170654  1.673252     0.270749   210.937668  227.052323  227.054276
2700000   16.279093  1.702637     0.29515    207.371368  225.648248  225.649582
2800000   16.064913  1.746207     0.283818   203.303268  221.398206  221.398911
2900000   17.074467  1.804372     0.411368   303.231903  322.52211   322.523651
3000000   18.088728  1.851202     0.326927   212.943298  233.210155  233.211624
3100000   17.000158  2.238409     0.266902   330.376923  349.882392  349.88382
3200000   18.424456  2.064671     0.317867   187.212082  208.019075  208.020752
3300000   17.69063   1.996305     0.284669   284.764893  304.736497  304.737915
3400000   18.342726  2.623175     0.372358   263.546295  284.884554  284.887329
3500000   19.659174  2.208664     0.285048   252.379379  274.532265  274.534088
3600000   20.125032  2.144327     0.296901   276.384521  298.950782  298.952087
3700000   19.799307  2.285204     0.276933   483.374664  505.736108  505.738647
3800000   22.398003  2.341562     0.267817   367.698395  392.705776  392.707275

As I expected, columns 6 and 7 (the accumulated sum and the whole time) are almost equal. Also, columns 2, 3, 6, and 7 increase “normally” as the FFT size increases. My question is about the execution stage. Its time increases “normally” until the FFT size reaches 1300000, and then it drops to near zero. From that size on, the memory transfer out seems to be the bottleneck. I cannot understand what that means. I even put an if statement around the execute call, thinking that the program was trying to transfer the memory out before the GPU was done. I know that should not happen, but I tried it anyway.
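To make the bookkeeping concrete, here is a small plain-C sketch that rechecks two rows of the table (sizes 1200000 and 1300000). It is not part of the timed program, and the names `TimingRow`, `stage_sum`, `unaccounted_ms`, `mem_out_share`, and `print_rows` are made up for this illustration. It only confirms that the four stage times add up to the accumulated column and shows how the share of the memory transfer out jumps between the two sizes; it does not explain why.

```c
#include <stdio.h>

/* One row of the timing table, all times in milliseconds. */
typedef struct {
    int    fft_size;
    double mem_in, create_plan, exec, mem_out; /* the four timed stages   */
    double accum;                               /* reported sum, column 6  */
    double whole;                               /* main_timer, column 7    */
} TimingRow;

/* Rows copied from the table for FFT sizes 1200000 and 1300000. */
const TimingRow rows[] = {
    {1200000, 7.308928, 0.66276,  54.078533, 23.02948,   85.079701,  85.08136},
    {1300000, 7.480165, 1.073727, 0.351894,  112.932503, 121.838289, 121.838943},
};

/* Sum of the four per-stage times; should match the accum column. */
double stage_sum(const TimingRow *r)
{
    return r->mem_in + r->create_plan + r->exec + r->mem_out;
}

/* Time seen by the whole-run timer but not by any stage timer. */
double unaccounted_ms(const TimingRow *r)
{
    return r->whole - stage_sum(r);
}

/* Fraction of the whole time spent in the "Mem out" stage. */
double mem_out_share(const TimingRow *r)
{
    return r->mem_out / r->whole;
}

/* Print the derived values for every row in the array above. */
void print_rows(void)
{
    size_t n = sizeof(rows) / sizeof(rows[0]);
    for (size_t i = 0; i < n; i++) {
        printf("%d: sum=%.6f unaccounted=%.6f mem_out_share=%.3f\n",
               rows[i].fft_size, stage_sum(&rows[i]),
               unaccounted_ms(&rows[i]), mem_out_share(&rows[i]));
    }
}
```

Calling `print_rows()` from a `main` function prints one line per row. The stage sum matches column 6 for both rows, while the share of the whole time spent in the transfer out goes from under a third at 1200000 to over nine tenths at 1300000.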

Can anyone help to explain this behavior?

Prelution

Let me add the Excel file with the numbers and a graph too.
post.xls (50 KB)

Does no one have a comment on this?