I ran two FFT examples for my undergrad thesis:
one using zero-copy memory and one using plain cudaMalloc.
The weird thing is that the zero-copy version is faster even in the compute stage (which logically should be impossible, since global memory is faster than any access interleaved over the PCI bus).
Any ideas why this is happening?
I’m attaching the two files.
Thank you for your time.
Sorry I didn't post it earlier.
It seems there is an error in the code. I will try again and tell you how it goes.
Edit: I fixed the code, but the timings remain the same. (Maybe cuFFT does something behind the scenes? Or is the FFT calculation so fast that it makes no difference in time, error or not?)
Here is the new file for the zero copy :
As I said before, you’re not catching any CUFFT or CUDA errors. A quick test of your code showed that your cufftExecC2C does not return CUFFT_SUCCESS but CUFFT_INVALID_PLAN on my system.
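To illustrate what catching those errors could look like, here is a minimal sketch, not your actual code. The macro names `CUDA_CHECK` and `CUFFT_CHECK`, the transform size `N`, and the in-place 1D layout are all assumptions made for the example; only the CUDA runtime and cuFFT calls themselves are real API. If a call such as cufftExecC2C fails, it may return without doing the transform, in which case timing it would tell you nothing about the compute stage.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Hypothetical helper: abort if a cuFFT call returns anything other than
// CUFFT_SUCCESS (for example CUFFT_INVALID_PLAN).
#define CUFFT_CHECK(call)                                                 \
    do {                                                                  \
        cufftResult res_ = (call);                                        \
        if (res_ != CUFFT_SUCCESS) {                                      \
            fprintf(stderr, "cuFFT error code %d at %s:%d\n",             \
                    (int)res_, __FILE__, __LINE__);                       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main()
{
    const int N = 1024;  // assumed transform size, for illustration only

    // Device buffer for an in-place complex-to-complex transform.
    cufftComplex *d_data = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_data, sizeof(cufftComplex) * N));
    CUDA_CHECK(cudaMemset(d_data, 0, sizeof(cufftComplex) * N));

    // Create the plan and check that it is valid before using it.
    cufftHandle plan;
    CUFFT_CHECK(cufftPlan1d(&plan, N, CUFFT_C2C, 1));

    // Run the forward FFT; a bad plan would be reported here.
    CUFFT_CHECK(cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD));

    // Wait for the GPU to finish and surface any asynchronous errors.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUFFT_CHECK(cufftDestroy(plan));
    CUDA_CHECK(cudaFree(d_data));

    printf("FFT completed without errors\n");
    return 0;
}
```

Something like this can be built with `nvcc example.cu -lcufft`. Wrapping every call in both of your versions the same way should show whether the zero-copy path is really executing the FFT before you compare timings.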