cudaMemcpy(Async) and cufftExec: Overlapping Transfer and Computation

Hello,
I have been experimenting with a GTX 460: batched 1D FFTs of NX=16384 samples with Nbatch=50 (this is the maximum my board supports).
In brief, I am doing the following:

  1. generate a host array of NX*Nbatch real samples
  2. cudaMemcpy it from Host to Device
  3. compute the R2C FFT
  4. cudaMemcpy the resulting array of complex numbers from Device to Host

As I am sure many of you have experienced, most of the time is spent in the cudaMemcpy calls. To be concrete, I ran a test with 1000x50 FFTs,
and the time per FFT is:

20 usec to upload the data Host -> Device
< 1 usec to perform the FFT
60 usec to download the data Device -> Host
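
For reference, the four steps and the per-stage timing above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the variable names (hIn, dIn, ev, ...) are made up for illustration, error checking is mostly omitted, and CUDA events are used for timing. Events are chosen because cufftExecR2C only queues the FFT and returns, so a host-side timer placed around it may report almost nothing and charge the FFT time to the download that follows.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int NX = 16384;   // samples per FFT (from the post)
    const int NBATCH = 50;  // FFTs per batch (from the post)
    const size_t inBytes  = sizeof(cufftReal)    * NX * NBATCH;
    const size_t outBytes = sizeof(cufftComplex) * (NX / 2 + 1) * NBATCH; // R2C keeps NX/2+1 bins

    // 1. Host arrays in ordinary pageable memory, filled with test data.
    cufftReal    *hIn  = (cufftReal *)malloc(inBytes);
    cufftComplex *hOut = (cufftComplex *)malloc(outBytes);
    for (int i = 0; i < NX * NBATCH; ++i)
        hIn[i] = (float)rand() / (float)RAND_MAX;

    cufftReal *dIn = NULL;
    cufftComplex *dOut = NULL;
    cudaMalloc((void **)&dIn, inBytes);
    cudaMalloc((void **)&dOut, outBytes);

    cufftHandle plan;
    if (cufftPlan1d(&plan, NX, CUFFT_R2C, NBATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        return 1;
    }

    // Four events on the default stream mark the stage boundaries.
    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&ev[i]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);    // 2. upload
    cudaEventRecord(ev[1], 0);
    cufftExecR2C(plan, dIn, dOut);                            // 3. FFT (queued, returns early)
    cudaEventRecord(ev[2], 0);
    cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost); // 4. download
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);

    // cudaEventElapsedTime reports milliseconds; convert to usec per FFT.
    float h2d, fft, d2h;
    cudaEventElapsedTime(&h2d, ev[0], ev[1]);
    cudaEventElapsedTime(&fft, ev[1], ev[2]);
    cudaEventElapsedTime(&d2h, ev[2], ev[3]);
    printf("per FFT: H2D %.2f us, FFT %.2f us, D2H %.2f us\n",
           1000.0f * h2d / NBATCH, 1000.0f * fft / NBATCH, 1000.0f * d2h / NBATCH);

    for (int i = 0; i < 4; ++i) cudaEventDestroy(ev[i]);
    cufftDestroy(plan);
    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return 0;
}
```

Something like `nvcc fft_sync.cu -lcufft` should build it (the file name is arbitrary).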

Now, has anybody used cudaMemcpyAsync (see section 3.1.2 of the CUDA C Best Practices Guide), and could it speed up my program a little?
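
To make the question concrete, here is a rough sketch of what the cudaMemcpyAsync variant might look like. It rests on several assumptions that are not from my current program: the batch is split across two streams (NSTREAMS = 2 is arbitrary), the host buffers are page-locked via cudaMallocHost (cudaMemcpyAsync needs pinned memory to be truly asynchronous), and each stream gets its own cuFFT plan bound with cufftSetStream. Whether and how much the copies actually overlap with the FFTs depends on the GPU's copy engines and on the order in which the work is issued, so this is not a guaranteed speed-up.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int NX = 16384;                 // samples per FFT (from the post)
    const int NBATCH = 50;                // FFTs per batch (from the post)
    const int NSTREAMS = 2;               // assumption: number of streams
    const int CHUNK = NBATCH / NSTREAMS;  // FFTs handled by each stream
    const size_t inElems  = (size_t)NX * CHUNK;            // reals per chunk
    const size_t outElems = (size_t)(NX / 2 + 1) * CHUNK;  // complex values per chunk

    // Page-locked host buffers, required for asynchronous copies.
    cufftReal *hIn = NULL;
    cufftComplex *hOut = NULL;
    cudaMallocHost((void **)&hIn,  inElems  * NSTREAMS * sizeof(cufftReal));
    cudaMallocHost((void **)&hOut, outElems * NSTREAMS * sizeof(cufftComplex));
    for (size_t i = 0; i < inElems * NSTREAMS; ++i)
        hIn[i] = (float)rand() / (float)RAND_MAX;

    cufftReal *dIn = NULL;
    cufftComplex *dOut = NULL;
    cudaMalloc((void **)&dIn,  inElems  * NSTREAMS * sizeof(cufftReal));
    cudaMalloc((void **)&dOut, outElems * NSTREAMS * sizeof(cufftComplex));

    // One stream and one plan per chunk; the plan is bound to its stream.
    cudaStream_t stream[NSTREAMS];
    cufftHandle plan[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        if (cufftPlan1d(&plan[s], NX, CUFFT_R2C, CHUNK) != CUFFT_SUCCESS) {
            fprintf(stderr, "cufftPlan1d failed\n");
            return 1;
        }
        cufftSetStream(plan[s], stream[s]);
    }

    // Queue upload, FFT and download for each chunk in its own stream.
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaMemcpyAsync(dIn + s * inElems, hIn + s * inElems,
                        inElems * sizeof(cufftReal),
                        cudaMemcpyHostToDevice, stream[s]);
        cufftExecR2C(plan[s], dIn + s * inElems, dOut + s * outElems);
        cudaMemcpyAsync(hOut + s * outElems, dOut + s * outElems,
                        outElems * sizeof(cufftComplex),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait until every stream has drained before touching hOut.
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamSynchronize(stream[s]);

    printf("first output bin: %f %+fi\n", hOut[0].x, hOut[0].y);

    for (int s = 0; s < NSTREAMS; ++s) {
        cufftDestroy(plan[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}
```

Even without any overlap, switching the host buffers to pinned memory may by itself shorten the two copies, since pageable transfers go through an extra staging copy.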

Thanks
JE

(The CPU is an i5, and I am running Ubuntu 10.04 with CUDA Toolkit 3.2.)
