What do GPU time and CPU time mean in the profiler? Is instrumentation overhead included?

What I'd like to know is whether the CPU time reported by the Visual Profiler is close to the actual time consumed when the app runs without profiling. If not, how large is the overhead?

Another curious thing: although my app as a whole runs slower under the profiler, the time I measure for certain kernels outside the profiler is considerably longer than the time the profiler reports for them. And yes, I added cudaThreadSynchronize() after the kernels when I did the timing.
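To make the comparison concrete, here is a minimal sketch of the kind of timing I mean outside the profiler. It is not my actual code: the kernel `scale`, the helper `time_kernel_ms`, and the problem size are hypothetical stand-ins. It times a single launch with CUDA events and does one untimed warm-up launch first, on the assumption that the first launch may carry one-time setup cost. It calls cudaDeviceSynchronize(), the newer name for the deprecated cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel being measured.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns the elapsed GPU time in milliseconds
// for one launch of `scale` on n elements, or -1.0f on any CUDA error.
float time_kernel_ms(int n)
{
    float *d_data = NULL;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Untimed warm-up launch (assumption: the first launch may include
    // one-time initialization that should not be counted).
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();  // replaces the deprecated cudaThreadSynchronize()

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both events are recorded in the same stream as the kernel, so the
    // elapsed time covers the kernel's execution on the GPU.
    cudaEventRecord(start, 0);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block the host until the kernel has finished

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);

    return (err == cudaSuccess) ? ms : -1.0f;
}

int main()
{
    float ms = time_kernel_ms(1 << 20);
    if (ms < 0.0f) {
        printf("CUDA error while timing the kernel\n");
        return 1;
    }
    printf("kernel time: %.3f ms\n", ms);
    return 0;
}
```

If I use a host wall-clock timer around the launch plus the synchronize call instead of events, the measurement also includes launch and synchronization latency on the CPU side, which is part of why I am unsure which number to compare against the profiler's.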

Can anyone explain to me how the profiler works?