Time loss...


My boss asked me to benchmark the FFT on a CUDA GPU, to find out whether it can be used for a real-time application (latency < 1 ms).

I did, and the NVIDIA Visual Profiler shows a large gap in the timeline that I can’t identify.
The gap lasts 100 µs, while the FFT itself (with the memory copy) takes only 50 µs.

The gap appears only after the host-to-device memory copy and before the call to the FFT function (cufftExecC2C).

Here’s a screenshot of the timeline:


PS: My guess is that this is due to the call into an “external” library function, but I can’t be sure.

Thank you

Maybe I wasn’t precise enough.

I’m currently looking for a way to reduce this blank (if such a way exists).

Here’s what happens during this time:




The longest is the last one.

Is there any way to reduce the processing time of this FFT?

Before you invest too much time tackling this problem, make sure the gap also exists when your program runs without the profiler…

I’ve tested it without the profiler (using cudaEvent and other timing functions) and I see the same “issue” (maybe it’s normal, but I’d like to be sure)…

In fact, I only turned to the profiler after finding that the processing time was higher than expected (maybe I expected too much ;) ).
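
For anyone who wants to reproduce this kind of measurement without the profiler, here is a minimal sketch of timing a host-to-device copy, cufftExecC2C and device-to-host copy with cudaEvent. It is not the code measured in this thread: the transform size (1024 points), the helper name timeFftRoundTripMs, the pinned buffers and the untimed warm-up call are all assumptions made for illustration. The warm-up is only there so that any one-time setup cost is not counted; whether that explains the gap discussed here is not established.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: returns the elapsed time in milliseconds for
// H2D copy + forward C2C FFT + D2H copy, or -1.0f if setup fails.
float timeFftRoundTripMs(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(cufftComplex);

    cufftComplex *h_data = nullptr;
    cufftComplex *d_data = nullptr;
    // Pinned host memory (an assumption) so the async copies do not stage
    // through a pageable buffer.
    if (cudaMallocHost(&h_data, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) { h_data[i].x = 1.0f; h_data[i].y = 0.0f; }

    // Create the plan outside the timed region: only execution is measured.
    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return -1.0f;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Untimed warm-up run (assumption: excludes one-time setup costs).
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    // Timed region: everything is issued on the default stream, so the
    // two events bracket the copy, the FFT and the copy back.
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, 0);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ms;
}

int main() {
    const float ms = timeFftRoundTripMs(1024);  // assumed transform size
    if (ms < 0.0f) {
        fprintf(stderr, "CUDA/cuFFT setup failed\n");
        return 1;
    }
    printf("H2D + FFT + D2H: %.1f us\n", ms * 1000.0f);
    return 0;
}
```

This should build with something like `nvcc fft_timing.cu -lcufft` (the file name is arbitrary). Comparing the result with and without the warm-up call, or with an extra pair of events recorded around cufftExecC2C alone, would show which stage the extra time belongs to.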

To be clearer, here’s the simple process I’m measuring:


Thank you for the answer.