computeprof function granularity


I have a large application being ported from existing code. There are a total of 32 device functions. When I try to use computeprof, it only gives me information on my ‘kernel main’ function. Is there a way to get something like % execution time for each of these functions, as most profilers do (actually, I’ve seen line-by-line information), so that I can see what is causing the terrible performance I’m seeing?
I’m using CUDA 4.0 on a card with compute capability 2.1. I know that the compiler is going to try to inline as much as it can, but I was hoping for a little more granularity in the stats reporting.


You can turn off inlining with [font=“Courier New”]-Xopencc -noinline[/font] (unless you are using the CUDA 4.1 release candidate), although that is not going to change the per-kernel reporting of the profiler.

I’m afraid if you want more detailed performance info, you’ll probably have to instrument your code manually using [font=“Courier New”]clock()[/font] (or [font=“Courier New”]clock64()[/font] if times get that large) at the start and end of the sections of interest and atomic addition of the difference to global variables.
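A minimal sketch of that approach, assuming a made-up kernel and a made-up global counter named [font=“Courier New”]elapsed[/font] (64-bit [font=“Courier New”]atomicAdd[/font] on global memory needs compute capability 1.2 or higher, so it is fine on your 2.1 card):

```cuda
#include <cstdio>

// Global accumulator for total cycles spent in the timed section,
// summed over all threads. (Name and section are illustrative.)
__device__ unsigned long long elapsed = 0;

__global__ void kernel_main(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned long long start = clock64();   // begin timed section

    // ... section of interest, e.g. a call to one of your device functions ...
    data[i] = data[i] * data[i];

    unsigned long long stop = clock64();    // end timed section

    // Accumulate this thread's cycle count into the global total.
    atomicAdd(&elapsed, stop - start);
}
```

On the host you can read the total back with [font=“Courier New”]cudaMemcpyFromSymbol()[/font]. Keep in mind the counts are per-thread cycle counts taken on whichever SM the thread happened to run on, so they are useful for comparing sections relative to each other, not as wall-clock time.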


I see that the profiler API only has cudaProfilerInitialize, cudaProfilerStart and cudaProfilerStop. Isn’t there any way to ask it to instrument the individual functions for execution time?
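For reference, those start/stop calls only restrict which part of the run the profiler records; they don’t add per-function instrumentation. A minimal usage sketch (the kernel launch and its arguments are placeholders):

```cuda
#include <cuda_profiler_api.h>

// ... allocate and initialize d_data, grid, block ...

cudaProfilerStart();                         // begin collecting profiler data
kernel_main<<<grid, block>>>(d_data, n);
cudaDeviceSynchronize();                     // make sure the kernel finishes
cudaProfilerStop();                          // rest of the run is not profiled
```

This is mainly useful for cutting profiling overhead and output down to the region you care about, not for getting finer-grained timings within a kernel.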