I ran two FFT examples for my undergrad thesis:
one with zero copy and one with plain cudaMalloc.
The weird thing is that the zero-copy version is faster even in the compute stage (which logically should be impossible, since device global memory is faster than any access that goes over the PCI bus).
Any ideas why this is happening?
I’m attaching the two files.
Thank you for your time.
simpleCUFFT_zero_copy.cu (3.26 KB)
simpleCUFFT.cu (3.72 KB)
I re-ran my experiments and the results are still the same.
I don’t get why this is happening.
In your zero-copy code, I don’t see a call to cudaHostGetDevicePointer. Isn’t that needed?
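For reference, here is a minimal sketch of how mapped (zero-copy) memory is usually set up before handing it to cuFFT. It is not taken from the attached file; the function name runZeroCopyFft, the signal length and the fill pattern are made up for illustration. As far as I know, on systems with unified virtual addressing the pointer returned by cudaHostGetDevicePointer may be identical to the host pointer, but calling it is the portable way to get a device-side address.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: runs one in-place forward FFT on mapped, pinned host
// memory. Returns true only if the CUDA and cuFFT calls below report success.
bool runZeroCopyFft(int n) {
    // Older toolkits need this before the context exists; newer runtimes
    // enable mapping implicitly, so a failure here is not treated as fatal.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear any non-fatal error from the call above

    cufftComplex *h_signal = nullptr;  // host view of the pinned buffer
    cufftComplex *d_signal = nullptr;  // device view of the same buffer
    if (cudaHostAlloc((void **)&h_signal, n * sizeof(cufftComplex),
                      cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; ++i) {
        h_signal[i].x = (float)(i % 8);  // arbitrary test data
        h_signal[i].y = 0.0f;
    }

    // Ask the runtime for the device-side address of the mapped buffer.
    bool ok = cudaHostGetDevicePointer((void **)&d_signal, h_signal, 0) == cudaSuccess;

    cufftHandle plan;
    bool havePlan = ok && cufftPlan1d(&plan, n, CUFFT_C2C, 1) == CUFFT_SUCCESS;
    ok = havePlan &&
         cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD) == CUFFT_SUCCESS;
    // The transform runs asynchronously; wait before touching h_signal again.
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;

    if (havePlan) cufftDestroy(plan);
    cudaFreeHost(h_signal);
    return ok;
}

int main() {
    bool ok = runZeroCopyFft(1024);
    printf("zero-copy FFT %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

If the device pointer step is skipped, whether the host pointer still works depends on the platform, so I would not rely on it in a benchmark.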
Your experiment has a couple of problems:
You are not catching any errors the CUDA runtime might throw at you. I suspect the FFT routine returns with an error and does no calculation at all.
You are not validating the result. You should make sure what you calculate is correct, for example by comparing against a proven host FFT function.
When running benchmarks in a CUDA environment, you should not measure the first function call, as it might include one-time initialization overhead.
Furthermore you should provide us with the timer.h file you use so we can run your code ourselves.
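To make the validation and warm-up points concrete, here is a rough sketch of how I would structure such a test. It does not use your timer.h; it times with CUDA events instead. The helpers hostDft and maxAbsError, the sizes and the 1e-2 tolerance are my own choices for illustration, and the naive host DFT is only practical for small n. Most runtime error checks are left out to keep it short.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Naive O(n^2) host DFT with the same convention as CUFFT_FORWARD
// (unnormalized, exp(-2*pi*i*k*t/n)). Reference only, for small n.
void hostDft(const std::vector<cufftComplex> &in, std::vector<cufftComplex> &out) {
    const int n = (int)in.size();
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n; ++k) {
        double re = 0.0, im = 0.0;
        for (int t = 0; t < n; ++t) {
            const double ang = -2.0 * pi * k * t / n;
            re += in[t].x * cos(ang) - in[t].y * sin(ang);
            im += in[t].x * sin(ang) + in[t].y * cos(ang);
        }
        out[k].x = (float)re;
        out[k].y = (float)im;
    }
}

// Largest per-component absolute difference between two spectra.
float maxAbsError(const std::vector<cufftComplex> &a, const std::vector<cufftComplex> &b) {
    float worst = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        worst = fmaxf(worst, fabsf(a[i].x - b[i].x));
        worst = fmaxf(worst, fabsf(a[i].y - b[i].y));
    }
    return worst;
}

int main() {
    const int n = 256, reps = 100;  // hypothetical sizes
    const size_t bytes = n * sizeof(cufftComplex);
    std::vector<cufftComplex> h_in(n), h_out(n), h_ref(n);
    for (int i = 0; i < n; ++i) { h_in[i].x = sinf(0.1f * i); h_in[i].y = 0.0f; }

    cufftComplex *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) return 1;

    // Warm-up call: absorbs one-time setup cost and is not timed.
    if (cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD) != CUFFT_SUCCESS) return 1;
    cudaDeviceSynchronize();

    // Time only the repeated, out-of-place transforms.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Validate the GPU result against the host reference.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    hostDft(h_in, h_ref);
    const float err = maxAbsError(h_out, h_ref);
    printf("avg %.4f ms per FFT, max abs error %g\n", ms / reps, err);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    return err < 1e-2f ? 0 : 1;
}
```

The transform is done out of place so that every repetition sees the same input. A large error against the host reference would suggest the GPU transform never actually ran.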
Here is the timer.h:
timer.h (1.25 KB)
Sorry I didn’t post it earlier.
It seems like there is an error in the code; I will try it again and tell you how it goes.
Edit: fixed the code, but the timings remain the same. (Maybe cuFFT does something behind the scenes? Or the FFT calculation is so fast that it makes no difference in time, error or not.)
Here is the new file for the zero-copy version:
simpleCUFFT.cu (3.31 KB)
I timed the whole thing to be as fair as I could to both implementations.
As I said before, you’re not catching any CUFFT or CUDA errors. A quick test of your code showed that your cufftExecC2C does not return CUFFT_SUCCESS but CUFFT_INVALID_PLAN on my system.
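A small sketch of the kind of checking I mean, in case it helps. The macro names CUDA_CHECK and CUFFT_CHECK and the helper cufftResultName are my own, not part of CUDA or cuFFT, and the helper only names a handful of result codes. I can't say from here what causes CUFFT_INVALID_PLAN in your code; one thing worth ruling out is that the handle passed to cufftExecC2C really came from a plan call that returned CUFFT_SUCCESS.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: readable names for a few common cufftResult codes.
const char *cufftResultName(cufftResult r) {
    switch (r) {
        case CUFFT_SUCCESS:        return "CUFFT_SUCCESS";
        case CUFFT_INVALID_PLAN:   return "CUFFT_INVALID_PLAN";
        case CUFFT_ALLOC_FAILED:   return "CUFFT_ALLOC_FAILED";
        case CUFFT_INVALID_VALUE:  return "CUFFT_INVALID_VALUE";
        case CUFFT_INTERNAL_ERROR: return "CUFFT_INTERNAL_ERROR";
        case CUFFT_EXEC_FAILED:    return "CUFFT_EXEC_FAILED";
        case CUFFT_SETUP_FAILED:   return "CUFFT_SETUP_FAILED";
        case CUFFT_INVALID_SIZE:   return "CUFFT_INVALID_SIZE";
        default:                   return "other cufftResult";
    }
}

// Abort with file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Abort with file and line if a cuFFT call fails.
#define CUFFT_CHECK(call)                                             \
    do {                                                              \
        cufftResult res_ = (call);                                    \
        if (res_ != CUFFT_SUCCESS) {                                  \
            fprintf(stderr, "cuFFT error %s at %s:%d\n",              \
                    cufftResultName(res_), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int n = 1024;  // hypothetical signal length
    cufftComplex *d_signal = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&d_signal, n * sizeof(cufftComplex)));
    CUDA_CHECK(cudaMemset(d_signal, 0, n * sizeof(cufftComplex)));

    cufftHandle plan;
    CUFFT_CHECK(cufftPlan1d(&plan, n, CUFFT_C2C, 1));  // plan must exist before exec
    CUFFT_CHECK(cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD));
    CUDA_CHECK(cudaDeviceSynchronize());  // surfaces asynchronous execution errors

    CUFFT_CHECK(cufftDestroy(plan));
    CUDA_CHECK(cudaFree(d_signal));
    printf("all calls succeeded\n");
    return 0;
}
```

With every call wrapped like this, a failing plan or exec call stops the program at the offending line instead of silently producing a suspiciously fast timing.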
I think I copied and pasted the code directly from the SDK source, so it seems weird to me.
I will check it again.
Thank you for your effort.