CudaFFT decreasing performance

I’m doing some benchmarks and I’m noticing that when I execute 100 cufftExecC2C calls with a size of 2^20 I get
around 0.02 ms per transform, but as soon as I do 1000 cufftExecC2C calls the time per transform rises to 3 ms.
Does anyone know why?

Is the cufftExecC2C call synchronous or asynchronous? I didn’t see any mention of it in the documentation.

The control returns immediately to the CPU.
If you want to time them, insert a cudaThreadSynchronize:

time_start=wallclock();
cufftExecC2C(plan, c1_d,c1_d, CUFFT_FORWARD);
cudaThreadSynchronize();
time_end=wallclock();
printf("Total Time : %f\n",time_end-time_start);
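
For timing a whole batch of transforms, a minimal sketch along the same lines could use CUDA events instead of a host wall clock. It assumes the transforms run on the default stream (no cufftSetStream call) and uses cudaDeviceSynchronize, which replaces the deprecated cudaThreadSynchronize in newer toolkits. The names nx, reps and data are illustrative choices, not taken from the original code.

```cuda
// Build (assumed command line): nvcc fft_timing.cu -lcufft
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int nx = 1 << 20;   // transform size from the question (2^20)
    const int reps = 1000;    // number of transforms to time (illustrative)

    // Device buffer for an in-place complex-to-complex transform.
    cufftComplex *data = nullptr;
    if (cudaMalloc(&data, sizeof(cufftComplex) * nx) != cudaSuccess) return 1;
    cudaMemset(data, 0, sizeof(cufftComplex) * nx);

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, 1) != CUFFT_SUCCESS) return 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up run so one-time setup cost is not included in the measurement.
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    // The exec calls only queue work; the events are queued in the same
    // (default) stream, so they bracket the actual GPU execution.
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until all queued transforms finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average time per transform: %f ms\n", ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```

Without the final synchronization the host would mostly be measuring how long it takes to queue the calls, which would explain a per-transform time that looks far too small for short loops.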

That’s why I’m getting bogus times, then. So given that the call is asynchronous with respect to the host, should I
call cudaThreadSynchronize() before the memcpy of the result, or does the memcpy do an implicit sync?

The memcpy performs an implicit sync.
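
As a rough sketch of that pattern, assuming a plain blocking cudaMemcpy and transforms issued on the default stream, the result can be copied back right after the exec call without an explicit synchronize. The buffer names host and dev and the all-ones input are illustrative only.

```cuda
// Build (assumed command line): nvcc fft_copy.cu -lcufft
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int nx = 1 << 20;                     // size from the question
    const size_t bytes = sizeof(cufftComplex) * nx;

    // Illustrative input: all ones, so the DC bin of the unnormalized
    // forward transform is expected to equal nx.
    std::vector<cufftComplex> host(nx);
    for (int i = 0; i < nx; ++i) { host[i].x = 1.0f; host[i].y = 0.0f; }

    cufftComplex *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    if (cufftPlan1d(&plan, nx, CUFFT_C2C, 1) != CUFFT_SUCCESS) return 1;

    // Returns to the host before the transform has finished.
    cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);

    // Blocking copy on the default stream: it is ordered after the FFT and
    // does not return until the data is on the host, so no explicit
    // synchronize call is placed in between.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    printf("X[0] = (%f, %f)\n", host[0].x, host[0].y);

    cufftDestroy(plan);
    cudaFree(dev);
    return 0;
}
```

If the plan were bound to a non-default stream, or the copy were done with cudaMemcpyAsync, this implicit ordering should not be assumed and an explicit synchronization would be the safer choice.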