I have been experimenting with a GTX 460: 1D FFTs of NX=16384 samples with Nbatch=50 (this is the maximum my board supports).
So, in brief, I am doing the following:
- generate a HOST array of NX*Nbatch real values
- cudaMemcpy from Host to Device
- compute the R2C FFT
- cudaMemcpy from Device to Host array of complex numbers.
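
For reference, the steps above can be sketched roughly as follows. This is a minimal sketch, not my exact program: the variable names (hIn, hOut, dIn, dOut) are made up for illustration, error checking is mostly omitted, and it times a single pass with CUDA events instead of averaging over many runs. Kernel launches are asynchronous, so a host-side timer without synchronization may under-report the FFT time; events recorded on the device side avoid that.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX     16384   // samples per transform
#define NBATCH 50      // transforms per cuFFT call

int main(void)
{
    const size_t nIn  = (size_t)NX * NBATCH;            // real input samples
    const size_t nOut = (size_t)(NX / 2 + 1) * NBATCH;  // R2C keeps NX/2+1 bins per transform

    // Step 1: host array of real data (pageable memory, hypothetical names)
    float        *hIn  = (float *)malloc(nIn * sizeof(float));
    cufftComplex *hOut = (cufftComplex *)malloc(nOut * sizeof(cufftComplex));
    for (size_t i = 0; i < nIn; ++i) hIn[i] = (float)rand() / RAND_MAX;

    float *dIn;
    cufftComplex *dOut;
    cudaMalloc((void **)&dIn,  nIn  * sizeof(float));
    cudaMalloc((void **)&dOut, nOut * sizeof(cufftComplex));

    cufftHandle plan;
    if (cufftPlan1d(&plan, NX, CUFFT_R2C, NBATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        return 1;
    }

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0, 0);
    // Step 2: Host -> Device
    cudaMemcpy(dIn, hIn, nIn * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    // Step 3: batched R2C FFT
    cufftExecR2C(plan, dIn, dOut);
    cudaEventRecord(t2, 0);
    // Step 4: Device -> Host
    cudaMemcpy(hOut, dOut, nOut * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cudaEventRecord(t3, 0);
    cudaEventSynchronize(t3);

    float up, fft, down;   // elapsed times in milliseconds
    cudaEventElapsedTime(&up,   t0, t1);
    cudaEventElapsedTime(&fft,  t1, t2);
    cudaEventElapsedTime(&down, t2, t3);
    printf("per FFT: upload %.2f usec, FFT %.2f usec, download %.2f usec\n",
           up * 1000.0f / NBATCH, fft * 1000.0f / NBATCH, down * 1000.0f / NBATCH);

    cufftDestroy(plan);
    cudaFree(dIn); cudaFree(dOut);
    free(hIn); free(hOut);
    return 0;
}
```

This should build with something like `nvcc fft_sync.cu -lcufft` (the file name is arbitrary).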
As I am sure many of you have experienced, most of the time is spent in the cudaMemcpy calls. To be concrete, I ran a test with 1000x50 FFTs,
and the time per FFT is
- 20 usec to upload the data Host -> Device
- < 1 usec to perform the FFT
- 60 usec to download the data Device -> Host.
Now, has anybody used cudaMemcpyAsync (see section 3.1.2 of the CUDA C Best Practices Guide), and could it speed up my program a little?
(The CPU is an i5 and I am running Ubuntu 10.04 with CUDA Toolkit 3.2.)
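
To make the question concrete, here is a rough sketch of what I understand the cudaMemcpyAsync approach to look like. It rests on several assumptions: the host buffers must be page-locked (allocated here with cudaMallocHost) for the copies to be truly asynchronous, the 50 transforms are split into two halves of 25 with one stream and one plan each, and all names are hypothetical. Whether the copies actually overlap with the FFT, and by how much, depends on the device, so I am not claiming any particular speedup.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX       16384
#define NBATCH   50
#define NSTREAMS 2
#define HALF     (NBATCH / NSTREAMS)   // 25 transforms per stream

int main(void)
{
    const size_t inPer  = (size_t)NX * HALF;            // reals per stream
    const size_t outPer = (size_t)(NX / 2 + 1) * HALF;  // complex bins per stream

    // Page-locked host buffers: required for cudaMemcpyAsync to overlap
    float        *hIn;
    cufftComplex *hOut;
    cudaMallocHost((void **)&hIn,  NSTREAMS * inPer  * sizeof(float));
    cudaMallocHost((void **)&hOut, NSTREAMS * outPer * sizeof(cufftComplex));
    for (size_t i = 0; i < NSTREAMS * inPer; ++i) hIn[i] = (float)(i % 17);

    cudaStream_t  stream[NSTREAMS];
    cufftHandle   plan[NSTREAMS];
    float        *dIn[NSTREAMS];
    cufftComplex *dOut[NSTREAMS];

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc((void **)&dIn[s],  inPer  * sizeof(float));
        cudaMalloc((void **)&dOut[s], outPer * sizeof(cufftComplex));
        if (cufftPlan1d(&plan[s], NX, CUFFT_R2C, HALF) != CUFFT_SUCCESS) {
            fprintf(stderr, "cufftPlan1d failed\n");
            return 1;
        }
        cufftSetStream(plan[s], stream[s]);   // run this plan's kernels in stream s
    }

    // Queue copy -> FFT -> copy in each stream; work in different
    // streams is allowed (not guaranteed) to overlap.
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaMemcpyAsync(dIn[s], hIn + s * inPer, inPer * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        cufftExecR2C(plan[s], dIn[s], dOut[s]);
        cudaMemcpyAsync(hOut + s * outPer, dOut[s], outPer * sizeof(cufftComplex),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for both streams before touching hOut on the host
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamSynchronize(stream[s]);
    printf("first bin of first transform: %f + %fi\n", hOut[0].x, hOut[0].y);

    for (int s = 0; s < NSTREAMS; ++s) {
        cufftDestroy(plan[s]);
        cudaFree(dIn[s]); cudaFree(dOut[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    return 0;
}
```

Even without streams, switching the host buffers from malloc to cudaMallocHost may already change the copy times, so it seems worth measuring that separately from the asynchronous version.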