I am seeing very erratic behaviour with one of my benchmark kernels. If I execute it within the CudaVisualProfiler, its performance doubles. The times measured by CudaVisualProfiler and by my own timing routine agree.
Does your program's total run time change by an appreciable amount, too?
Enabling profiling causes an implicit cudaThreadSynchronize() after every kernel call. So, in the following situation:
cudaThreadSynchronize()
mark time on wall clock
call kernel 1
mark time on wall clock
call kernel 2
cudaThreadSynchronize()
mark time on wall clock
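The time spent in kernel 2 would appear to decrease drastically when profiling is enabled, because the code has no thread synchronize between kernel 1 and the second time mark. Without profiling, the launch of kernel 1 is asynchronous and returns almost immediately, so most of kernel 1's run time is counted in the second interval together with kernel 2. With profiling, the implicit synchronize moves kernel 1's run time into the first interval. You could see if you get the same behavior when you enable the “sync after every kernel call” environment variable, too (sorry, don’t recall the exact env var: check the release notes).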
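To make the scenario above concrete, here is a minimal sketch of it. The kernel `busyKernel`, the helper `nowMs`, and the problem sizes are all made up for illustration; they are not from the original poster's benchmark. It uses `cudaDeviceSynchronize()`, the current name for the now-deprecated `cudaThreadSynchronize()`. The second pass adds an explicit synchronize after kernel 1, which roughly imitates what enabling the profiler does implicitly.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: just burns some arithmetic per element.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Host wall-clock time in milliseconds (hypothetical helper).
static double nowMs()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               clock::now().time_since_epoch()).count();
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const int iters = 2000;  // assumed workload size, tune for your GPU

    float *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization is not timed.
    busyKernel<<<blocks, threads>>>(d, n, 1);
    cudaDeviceSynchronize();

    for (int pass = 0; pass < 2; ++pass) {
        const bool syncBetween = (pass == 1);

        cudaDeviceSynchronize();
        double t0 = nowMs();
        busyKernel<<<blocks, threads>>>(d, n, iters);  // "kernel 1"
        if (syncBetween)
            cudaDeviceSynchronize();  // what profiling adds implicitly
        double t1 = nowMs();
        busyKernel<<<blocks, threads>>>(d, n, iters);  // "kernel 2"
        cudaDeviceSynchronize();
        double t2 = nowMs();

        printf("%s: interval 1 = %.3f ms, interval 2 = %.3f ms\n",
               syncBetween ? "sync after kernel 1" : "no sync between",
               t1 - t0, t2 - t1);
    }

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

If the explanation above applies, the sum of the two intervals should stay about the same in both passes while the second interval shrinks in the synchronized pass. If the total itself drops, something other than this timing artifact is going on.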
Anderson: Good suggestion. But as regards my problem (which may be different from the original post): yes, I do sync, and yes, the entire program speeds up.