My boss asked me to benchmark the FFT on a CUDA GPU, to find out whether it can be used for a real-time application (latency < 1 ms).
I did, and using the NVIDIA Visual Profiler I found a large gap in the timeline that I can’t identify.
This gap lasts 100 µs, while the FFT itself (including the memory copies) takes only 50 µs.
The gap appears only after the host-to-device memory copy and before the call to the FFT function (cufftExecC2C).
I’ve also tested it without the profiler (using cudaEvent and other timing functions) and see the same “issue” (maybe it’s normal, but I’d like to be sure)…
In fact, I turned to the profiler after finding that the processing time was higher than expected (maybe I expected too much).
To be clearer, here’s the simple process I’m measuring: