Compute Visual Profiler # runs Compute Visual Profiler on CUDA

I hope this is the right forum for my question. If it is not, I apologize; please point me to the right one.

I am using Compute Visual Profiler to understand my CUDA program’s performance. It seems that an application launched within the profiler always runs 14 or 15 times before the profiler results are generated. Is there a way to control or customize how many times my application runs? Is running the application 14 or 15 times intended to produce an averaged profile and reduce noise?

Secondly, the application was developed on Linux and is built with embedded debug information. Can Visual Profiler trace where the CUDA kernels and API methods are called from? For instance, how can I find out which line in the source code is calling “memcpyDtoH” or “memcpyHtoD”, which are further abstractions of cudaMemcpy() that depend on the source and destination pointers? Some CUDA API calls, e.g. “cudaThreadSynchronize()”, “cudaBindTexture()”, “cudaGetLastError()”, are not reported even though I have turned on “CUDA API trace” and all the possible “Profiler Counters”.

Thanks a lot.

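To get only one run at a time, select the minimum number of profiler counters, usually just the ones for instructions and time. The number of runs depends on how many counters are enabled, since only a limited number of them can be collected in a single run.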

If you look at the report generated under “Analyse”, you will find a lot of information about how the program ran.
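On the question of which source line issues a given “memcpyHtoD” or similar copy: if the profiler output does not show the call site, one possible workaround is to wrap the copy call in a macro that records __FILE__ and __LINE__. The following is only a minimal host-side sketch of that idea, not a profiler feature. The names CopyRecord, copy_log, traced_copy and TRACED_COPY are hypothetical, and std::memcpy stands in for cudaMemcpy() so that the example builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record of one traced copy: where it was issued and its size.
struct CopyRecord {
    std::string file;
    int line;
    std::size_t bytes;
};

// Hypothetical global log of traced copies (not part of CUDA or the profiler).
inline std::vector<CopyRecord>& copy_log() {
    static std::vector<CopyRecord> log;
    return log;
}

// Records the call site, then performs the copy.
// ASSUMPTION: in a real CUDA build the std::memcpy below would be replaced by
// cudaMemcpy(dst, src, bytes, kind) and its error code would be returned.
inline int traced_copy(void* dst, const void* src, std::size_t bytes,
                       const char* file, int line) {
    copy_log().push_back({file, line, bytes});
    std::fprintf(stderr, "copy of %zu bytes issued at %s:%d\n", bytes, file, line);
    std::memcpy(dst, src, bytes);
    return 0;  // 0 plays the role of a success code in this sketch
}

// The macro captures the file and line of the caller, not of traced_copy itself.
#define TRACED_COPY(dst, src, bytes) \
    traced_copy((dst), (src), (bytes), __FILE__, __LINE__)

// Example: one traced copy between two host buffers.
inline void example_usage() {
    float host_src[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_dst[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int rc = TRACED_COPY(host_dst, host_src, sizeof(host_src));
    assert(rc == 0);
    assert(host_dst[3] == 4.0f);
    (void)rc;  // avoids an unused-variable warning when assertions are disabled
}
```

Each traced copy prints its call site to stderr, so the lines can be matched by size and order against the copies listed in the profiler output. The same wrapping approach could be applied to other API calls that do not appear in the trace.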

Thank you very much.

This is very much a trial-and-error approach. Are there documents showing which counters/events conflict with each other?