CUDA Profiler performance boost?

Hi everybody,

posted this already here:

I am seeing very erratic behaviour with one of my benchmark kernels. If I run it inside the CudaVisualProfiler, its performance doubles. The time reported by CudaVisualProfiler and the time measured by my own timing routine agree.



I am seeing this today for two of my kernels. They run twice as fast under the profiler.

Did you ever resolve this issue?

Unfortunately not.

However, I have had no further occurrences with other codes.

Does your program's total run time change by an appreciable amount, too?

Enabling profiling causes an implicit cudaThreadSynchronize() after every kernel call. So, in the following situation:


mark time on wall clock
call kernel 1
mark time on wall clock
call kernel 2
mark time on wall clock

The time spent in kernel 2 would appear to decrease drastically when profiling is enabled, because without profiling the thread synchronize is missing: kernel launches are asynchronous, so the wall-clock marks do not bracket the actual kernel execution. You could check whether you get the same behavior when you enable the “sync after every kernel call” environment variable, too (sorry, I don’t recall the exact env var: check the release notes).
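For anyone who wants to try this, here is a rough sketch of the wall-clock pattern above, run once without and once with a synchronize after each launch. The kernel `busy_kernel`, the helpers `now_ms` and `time_two_kernels`, and the problem sizes are all made up for illustration; they are not from anyone's code in this thread. It uses `cudaDeviceSynchronize()`, which replaced the deprecated `cudaThreadSynchronize()` in later toolkits. As far as I know, the environment variable in question is `CUDA_LAUNCH_BLOCKING=1`, but do check the documentation for your toolkit version.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: does enough arithmetic that its run time
// is clearly larger than the launch overhead.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Wall-clock time in milliseconds.
static double now_ms()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double, std::milli>(
               clock::now().time_since_epoch()).count();
}

// Times two back-to-back launches with the wall clock.
// sync_each == false: launches return immediately, so the first two
//                     intervals mostly measure launch overhead.
// sync_each == true : mimics a synchronize after every kernel call, so each
//                     interval covers its own kernel.
static void time_two_kernels(float *d, int n, bool sync_each)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    double t0 = now_ms();
    busy_kernel<<<blocks, threads>>>(d, n, 20000);
    if (sync_each) cudaDeviceSynchronize();
    double t1 = now_ms();
    busy_kernel<<<blocks, threads>>>(d, n, 20000);
    if (sync_each) cudaDeviceSynchronize();
    double t2 = now_ms();
    cudaDeviceSynchronize();  // drain any outstanding work before the last mark
    double t3 = now_ms();

    printf("sync_each=%d  kernel 1: %.3f ms  kernel 2: %.3f ms  final wait: %.3f ms\n",
           sync_each ? 1 : 0, t1 - t0, t2 - t1, t3 - t2);
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    // Warm-up launch so context creation does not distort the first timing.
    busy_kernel<<<(n + 255) / 256, 256>>>(d, n, 1);
    cudaDeviceSynchronize();

    time_two_kernels(d, n, false);
    time_two_kernels(d, n, true);

    cudaFree(d);
    return 0;
}
```

If the per-kernel numbers only look sensible in the `sync_each=1` run, the wall-clock marks were measuring launches rather than execution. If the sum of all three intervals also differs between the two runs, something other than the missing synchronize is going on.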

I opened up a different thread that may or may not be related to the original problem posted here:…=0&#entry452562

Anderson: Good suggestion. But regarding my problem (which may be different from the original post): yes, I do sync, and yes, the entire program speeds up.