getElapsedTime vs Profiler

I used both getElapsedTime and the profiler to measure the elapsed time of my kernel.
But the profiler’s result is 10 times lower than getElapsedTime’s result.
Which one should I believe?

Also, after I changed the kernel and memcpy code, the profiler reports that the kernel was called 2 times, although the code calls it 16 times.
For example:

for (i = 0; i < 16; i++) {
    MEMCPY()
    KERNEL()
    MEMCPY()
}

=> KERNEL 16 TIMES!

But
for (i = 0; i < 16; i++) {
    MEMCPY()
    KERNEL()
}
for (i = 0; i < 16; i++) {
    MEMCPY()
}

=> KERNEL 2 TIMES!

I think this is caused by pipelining. Am I right? :(

The getElapsedTime() reported time will include host side overhead for the kernel launch. The profiler reported gputime does not include host side overhead - it is just the kernel execution time on the GPU. The profiler cputime should be closer to the time obtained using getElapsedTime().
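To see the two measurements side by side, here is a minimal sketch. It assumes getElapsedTime() is some host-side timer (the thread does not show its implementation), so std::chrono stands in for it, and CUDA events stand in for the GPU-side time. The kernel scale, the struct Timings and the function time_kernel are hypothetical names used only for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to time.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

struct Timings {
    float host_ms;  // wall-clock time seen by the CPU (launch overhead included)
    float gpu_ms;   // time between the two events on the GPU
};

Timings time_kernel(int n) {
    Timings t = {0.0f, 0.0f};
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Host timer: assumed stand-in for getElapsedTime().
    auto host_start = std::chrono::steady_clock::now();
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block the host until the kernel has finished
    auto host_stop = std::chrono::steady_clock::now();

    cudaEventElapsedTime(&t.gpu_ms, start, stop);
    t.host_ms = std::chrono::duration<float, std::milli>(host_stop - host_start).count();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return t;
}

int main() {
    Timings t = time_kernel(1 << 20);
    printf("host timer: %.3f ms, CUDA events: %.3f ms\n", t.host_ms, t.gpu_ms);
    return 0;
}
```

If the host timer is stopped without a synchronization call, it may measure only the launch and not the kernel itself, because kernel launches are asynchronous with respect to the host.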

The profiler output in the second case should also show 16 kernel launches. The output could be incomplete for some other reason. You can try adding a cudaDeviceSynchronize() call after the second loop.
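A rough sketch of the second case with the suggested cudaDeviceSynchronize() call added is shown below. The buffers, the kernel accumulate and the function run_loops are hypothetical, since the original MEMCPY() and KERNEL() bodies are not shown. The kernel adds an input of ones to a counter buffer, so the program itself can confirm how many launches actually ran, independently of what the profiler reports.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: adds the input to a running count, so that
// after k launches with an input of ones, every element equals k.
__global__ void accumulate(const int *in, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) count[i] += in[i];
}

int run_loops(int iterations) {
    const int n = 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    std::vector<int> h_in(n, 1), h_out(n, 0);

    int *d_in = nullptr, *d_count = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_count, n * sizeof(int));
    cudaMemset(d_count, 0, n * sizeof(int));

    // First loop: MEMCPY() then KERNEL()
    for (int i = 0; i < iterations; ++i) {
        cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        accumulate<<<blocks, threads>>>(d_in, d_count, n);
    }

    // Second loop: MEMCPY() only
    for (int i = 0; i < iterations; ++i) {
        cudaMemcpy(h_out.data(), d_count, n * sizeof(int), cudaMemcpyDeviceToHost);
    }

    // The call suggested above: wait for all outstanding GPU work.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_in);
    cudaFree(d_count);
    return h_out[0];  // number of kernel launches that took effect
}

int main() {
    printf("kernel launches observed: %d\n", run_loops(16));
    return 0;
}
```

If this reports 16 while the profiler still lists 2 launches, the kernel did run 16 times and the discrepancy lies in the profiler output rather than in the program.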

Thanks. I still don’t know why the profiler’s CPU time differs so much from the result of getElapsedTime() :(