I have now installed the newest CUDA + cudaprof on my 64-bit machine, and there the CPU time doesn't work either. The numbers are utter nonsense, and profiling the CPU time is not possible. Am I the only one with this problem?
Edit: It's not a problem of the Visual Profiler; the wrong CPU time is also shown with CLI profiling.
Yeah, this isn't a bug. The CPU is treated as if it waited for the kernel to finish, i.e. as if the launch weren't asynchronous. So basically what E.D. said is what is going on.
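To illustrate the asynchrony point: a kernel launch returns to the host immediately, so a CPU-side timer only matches the GPU time if it includes an explicit synchronize. A minimal sketch (the dummy kernel and timings below are illustrative assumptions, not the profiler's actual mechanism):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Dummy kernel that just burns some time (illustrative only).
__global__ void busyKernel(float *data, int iters) {
    float x = data[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.0000001f;
    data[threadIdx.x] = x;
}

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));

    // Naive host timing: the launch returns immediately, so this
    // measures little more than the launch overhead.
    auto t0 = std::chrono::high_resolution_clock::now();
    busyKernel<<<1, 256>>>(d, 1 << 20);
    auto t1 = std::chrono::high_resolution_clock::now();

    cudaDeviceSynchronize();  // drain the pending kernel first

    // Host timing with a synchronize: the CPU actually waits for
    // the kernel, so CPU time ≈ GPU time + launch overhead.
    auto t2 = std::chrono::high_resolution_clock::now();
    busyKernel<<<1, 256>>>(d, 1 << 20);
    cudaDeviceSynchronize();
    auto t3 = std::chrono::high_resolution_clock::now();

    using us = std::chrono::microseconds;
    printf("launch only: %lld us\n",
           (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    printf("with sync:   %lld us\n",
           (long long)std::chrono::duration_cast<us>(t3 - t2).count());
    cudaFree(d);
    return 0;
}
```

If the profiler reports the first kind of number as "CPU time", it will be far smaller than the GPU time, which would look exactly like the bogus values described above.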
I know that CPU time = GPU time + approx. 20 µs overhead, and that is exactly what I expect and what I get with the old CUDA version. My problem is that with the new version I only get this result when enabling additional signals.
If you look at the numbers in my first posting, e.g. a GPU time of 206162 and a CPU time of 4566, that can't be right.
You still wrote that the GPU time is smaller than the CPU time; that is what got us off on the wrong foot ;)
I have no idea, though, why this happens. Are you sure you don't have the columns swapped? I always use the Visual Profiler. It would be one explanation:
when not too many signals are selected, your program is run once → first kernel call overhead
when enough signals are selected, the profiler needs to run it two or three times, and I guess it takes the CPU time of one of the later runs.
I analyzed it a bit more: even if I enable just one additional signal in the Visual Profiler (like “gld uncoalesced”), it works, and that performs just one run. If I enable no additional signals (just the default timestamps), the CPU time is bogus.
Can someone explain what “gld [un]coalesced”, “gst [un]coalesced”, “local load”, “local store”, “branch”, “divergent branch”, “warp serialize” and “cta launched” mean, which values are good, and which indicate that there is room for optimization? Is there a manual for this somewhere? The memcpy time etc. is useful, but I think I could just as well have measured it with a timer…