Profiler - CPU Time

Here is a short excerpt from cudaprof:

function name                       calls  GPU time  CPU time  %
matrix_vector_multiply_generated    400    206162    4566      54.44
matrix_vector_multiply              400    54548.8   4495      14.4

What I find strange is that the GPU time is smaller than the CPU time. How can this happen?

(This is on my machine at work: GTX280, 32-bit, CUDA 2.0b environment. At home, with a 64-bit CUDA 1.1 environment on a GTX8800 512, the output is as expected: CPU time is approx. GPU time + 20 us overhead per call.)

Does anybody know what is happening?

I have now installed the newest CUDA + cudaprof on my 64-bit machine, and the CPU time doesn't work there either. The numbers are utter nonsense, and profiling CPU time is not possible. Am I the only one with this problem?

Edit: It's not a problem with the visual profiler; the wrong CPU time is also shown with command-line profiling.

I found a workaround that makes the buggy behaviour disappear: enable additional signals.

If I enable e.g. the gld signals, the reported CPU time seems to be correct:

CPU time = GPU time + overhead.

Yeah, this isn’t a bug. The profiler measures CPU time as if the kernel launch were synchronous, i.e. the CPU is treated as waiting for the kernel to finish. So basically what E.D. said is what is going on.

I know that CPU time = GPU time + approx. 20 us overhead; that is exactly what I expect, and what I get with the old CUDA version.
My problem is that with the new version I get this result only when additional signals are enabled.
Look at the numbers in my first post, e.g. GPU time 206162 and CPU time 4566, which can't be right.
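
To make that sanity check concrete, here is a minimal Python sketch. It assumes the two time columns are totals in microseconds over all calls (the excerpt does not state units) and uses the roughly 20 us per-call overhead seen with the old CUDA version. The helper names are made up for illustration and are not part of cudaprof.

```python
# Rows copied from the profiler excerpt in the first post:
# (function name, calls, GPU time, CPU time).
rows = [
    ("matrix_vector_multiply_generated", 400, 206162.0, 4566.0),
    ("matrix_vector_multiply", 400, 54548.8, 4495.0),
]

# Assumption: about 20 us of launch overhead per call.
OVERHEAD_PER_CALL_US = 20.0


def expected_cpu_time(calls, gpu_time, overhead_per_call=OVERHEAD_PER_CALL_US):
    """Expected CPU time, assuming the columns are totals in microseconds."""
    return gpu_time + calls * overhead_per_call


def implausible_rows(rows):
    """Return the names of rows whose reported CPU time is below the GPU time."""
    return [name for name, _calls, gpu, cpu in rows if cpu < gpu]


if __name__ == "__main__":
    for name, calls, gpu, cpu in rows:
        print(name, "reported CPU:", cpu,
              "expected about:", expected_cpu_time(calls, gpu))
    print("implausible:", implausible_rows(rows))
```

Under these assumptions both rows are flagged, since the reported CPU time is far below the GPU time in each.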

You did write that the GPU time is smaller than the CPU time; that is what got us off on the wrong foot ;)

I have no idea why this happens, though. Are you sure you don’t have the columns backwards? I always use the visual profiler. One possible explanation:

when only a few signals are selected, your program is run once, so the CPU time includes the first-kernel-call overhead;
when enough signals are selected, the profiler needs to run it two or three times, and I guess it reports the CPU time of one of the later runs.

My fault - I really meant CPU < GPU.

I analyzed it a bit more: even if I enable just one additional signal in the visual profiler (like “gld uncoalesced”), it works, and this performs just one run. If I enable no additional signal (just the default timestamps), the CPU time is bogus.

Can someone explain what “gld [un]coalesced”, “gst [un]coalesced”, “local load”, “local store”, “branch”, “divergent branch”, “warp serialize” and “cta launched” mean, which values are good, and which indicate that optimization is needed? Is there a manual for this somewhere? The memcpy time, etc. is useful, but I think I could just as well have measured it with a timer…