I have a question about the CUFFT times I measured for 1D arrays of different sizes. I recorded the time to
transfer the memory in, create the plan, execute the plan, and transfer the memory out, plus the sum of those
four times and the time from a separate timer that spans all the steps. Here is the code I wrote to get
the times:
[codebox]// Fragment: in, size, time1..time4 and time are assumed to be declared elsewhere.
// main_timer spans all steps; sec_timer is reset and reused for each stage.
unsigned int main_timer = 63;
CUT_SAFE_CALL(cutCreateTimer(&main_timer));
unsigned int sec_timer = 64;
CUT_SAFE_CALL(cutCreateTimer(&sec_timer));

// Create complex number array (real input, zero imaginary part)
int mem_size = sizeof(Complex) * size;
Complex* h_signal = (Complex*)malloc(mem_size);
for (int i = 0; i < size; i++)
{
    h_signal[i].x = in[i];
    h_signal[i].y = 0;
}

CUT_SAFE_CALL(cutStartTimer(main_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

// Transfer memory in (time1 includes the device allocation)
Complex* d_signal;
CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal, mem_size));
CUDA_SAFE_CALL(cudaMemcpy(d_signal, h_signal, mem_size, cudaMemcpyHostToDevice));
CUT_SAFE_CALL(cutStopTimer(sec_timer));
time1 = cutGetTimerValue(sec_timer);
CUT_SAFE_CALL(cutResetTimer(sec_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

// Create plan
cufftHandle plan;
CUFFT_SAFE_CALL(cufftPlan1d(&plan, size, CUFFT_C2C, 1));
CUT_SAFE_CALL(cutStopTimer(sec_timer));
time2 = cutGetTimerValue(sec_timer);
CUT_SAFE_CALL(cutResetTimer(sec_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

// Execute (in-place forward transform).
// The if has an empty body; it is the experiment described below the table.
if(CUFFT_SAFE_CALL(cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD)) == CUFFT_SUCCESS);
CUT_SAFE_CALL(cutStopTimer(sec_timer));
time3 = cutGetTimerValue(sec_timer);
CUT_SAFE_CALL(cutResetTimer(sec_timer));
CUT_SAFE_CALL(cutStartTimer(sec_timer));

// Transfer memory out (time4 also includes destroying the plan)
CUDA_SAFE_CALL(cudaMemcpy(h_signal, d_signal, mem_size, cudaMemcpyDeviceToHost));
CUFFT_SAFE_CALL(cufftDestroy(plan));
CUT_SAFE_CALL(cutStopTimer(main_timer));
CUT_SAFE_CALL(cutStopTimer(sec_timer));
time4 = cutGetTimerValue(sec_timer);
time = cutGetTimerValue(main_timer);

CUT_SAFE_CALL(cutDeleteTimer(main_timer));
CUT_SAFE_CALL(cutDeleteTimer(sec_timer));[/codebox]
Times (ms):
FFT size  Mem In     Create Plan  EXEC       Mem out     ACCUM       Whole time
100000    58.644646  0.242568     2.588642   2.639214    64.11507    64.116829
200000    1.120684   0.139344     3.631237   4.711313    9.602578    9.603992
300000    1.665527   0.164472     7.664171   6.96609     16.460261   16.460789
400000    2.337419   0.194217     8.226549   17.603008   28.361193   28.362558
500000    3.078444   0.232129     6.629548   19.2332     29.173321   29.17485
600000    3.498785   0.238939     19.037582  14.595923   37.37123    37.372135
700000    4.245278   0.489798     30.174217  12.14481    47.054103   47.055618
800000    4.982587   0.277627     32.83313   13.027261   51.120604   51.122158
900000    5.217419   0.275692     33.990204  21.571005   61.05432    61.055973
1000000   7.640786   0.343416     44.607174  20.091688   72.683064   72.684746
1100000   6.989479   0.611454     65.546646  23.487913   96.635491   96.63726
1200000   7.308928   0.66276      54.078533  23.02948    85.079701   85.08136
1300000   7.480165   1.073727     0.351894   112.932503  121.838289  121.838943
1400000   9.015682   1.076787     0.297168   99.88903    110.278667  110.28006
1500000   8.371383   1.113619     0.310337   109.085182  118.880521  118.882622
1600000   10.088757  1.189116     0.283315   111.248756  122.809945  122.811195
1700000   9.107031   1.332154     0.265627   153.410782  164.115594  164.117477
1800000   11.009363  1.262533     0.338417   133.408844  146.019157  146.020569
1900000   10.056148  1.40488      0.336107   172.097229  183.894364  183.895035
2000000   11.442863  1.448575     0.295053   144.716827  157.903318  157.906052
2100000   12.568538  1.405342     0.28909    157.193024  171.455993  171.457336
2200000   13.987428  1.589422     0.268263   185.299759  201.144872  201.147293
2300000   12.249351  1.519636     0.264817   252.579285  266.613088  266.613708
2400000   14.57393   1.569075     0.286323   167.068329  183.497656  183.49826
2500000   14.369367  1.59222      0.289701   174.133591  190.384878  190.386261
2600000   14.170654  1.673252     0.270749   210.937668  227.052323  227.054276
2700000   16.279093  1.702637     0.29515    207.371368  225.648248  225.649582
2800000   16.064913  1.746207     0.283818   203.303268  221.398206  221.398911
2900000   17.074467  1.804372     0.411368   303.231903  322.52211   322.523651
3000000   18.088728  1.851202     0.326927   212.943298  233.210155  233.211624
3100000   17.000158  2.238409     0.266902   330.376923  349.882392  349.88382
3200000   18.424456  2.064671     0.317867   187.212082  208.019075  208.020752
3300000   17.69063   1.996305     0.284669   284.764893  304.736497  304.737915
3400000   18.342726  2.623175     0.372358   263.546295  284.884554  284.887329
3500000   19.659174  2.208664     0.285048   252.379379  274.532265  274.534088
3600000   20.125032  2.144327     0.296901   276.384521  298.950782  298.952087
3700000   19.799307  2.285204     0.276933   483.374664  505.736108  505.738647
3800000   22.398003  2.341562     0.267817   367.698395  392.705776  392.707275
As I expected, columns 6 and 7 (ACCUM and Whole time) are almost equal. Columns 2, 3, 6, and 7 also grow
"normally" as the FFT size increases. My question is about the execution stage (EXEC). The execution time grows
"normally" up to an FFT size of 1200000. From 1300000 on it drops to near zero, and at the same point the
transfer out (Mem out) jumps from about 23 ms to about 113 ms, so it seems that the memory transfer out is now
the bottleneck. I cannot understand what that means. I even put an if statement around the execute call, thinking
that the program was trying to transfer the memory out before the GPU was done. I know that should not happen,
but I tried it anyway.
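A minimal sketch of a check that might separate "time until the call returns" from "time until the GPU is done", assuming that cufftExecC2C can return to the host before the transform has finished. This is a hypothetical standalone program, not the code that produced the table: it uses std::chrono instead of the cutil timers, a made-up ramp as input, and cudaDeviceSynchronize() (older toolkits used cudaThreadSynchronize()).

```cuda
// Build (assumed): nvcc fft_sync_check.cu -lcufft
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Milliseconds elapsed on the host since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int size = 1300000;  // first size where EXEC dropped in the table
    const size_t mem_size = sizeof(cufftComplex) * size;

    // Hypothetical input: a simple ramp with zero imaginary part.
    cufftComplex* h_signal = (cufftComplex*)malloc(mem_size);
    if (!h_signal) return 1;
    for (int i = 0; i < size; i++) {
        h_signal[i].x = (float)(i % 100);
        h_signal[i].y = 0.0f;
    }

    cufftComplex* d_signal = NULL;
    if (cudaMalloc((void**)&d_signal, mem_size) != cudaSuccess) return 1;
    if (cudaMemcpy(d_signal, h_signal, mem_size,
                   cudaMemcpyHostToDevice) != cudaSuccess) return 1;

    cufftHandle plan;
    if (cufftPlan1d(&plan, size, CUFFT_C2C, 1) != CUFFT_SUCCESS) return 1;

    auto t0 = std::chrono::steady_clock::now();
    if (cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD) != CUFFT_SUCCESS)
        return 1;
    double returned_ms = ms_since(t0);  // host time until the call returned

    // Block the host until all previously issued GPU work has completed.
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;
    double finished_ms = ms_since(t0);  // host time until the GPU was idle

    printf("exec call returned after %f ms\n", returned_ms);
    printf("GPU work finished after  %f ms\n", finished_ms);

    cufftDestroy(plan);
    cudaFree(d_signal);
    free(h_signal);
    return 0;
}
```

If the second number comes out much larger than the first, the missing execution time in my table may simply be showing up in the Mem out column, because the blocking cudaMemcpy would have to wait for the FFT to finish.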
Can anyone help to explain this behavior?
Prelution