Code instrumentation overhead


I know that both the CUDA Visual Profiler and the OpenCL Profiler leverage the extensive performance instrumentation in the code and the hardware performance signals designed into NVIDIA GPUs to give developers insight into performance bottlenecks and opportunities for optimization.

My question is:

How large is the performance overhead of this instrumentation? In other words, how much does the instrumentation slow down the execution of CUDA/OpenCL applications?

Thank you!