CUDA Visual Profiler

What do the terms “memcpy” and “CPU time” in the profiler output represent?
Is the “memcpy” a memcpy() executed by the CPU?
Does the execution time of a kernel equal the sum of “GPU time” and “CPU time”?

No. CPU time is the time a kernel call takes as seen from the CPU. GPU time is the time the GPU is actually busy (kernel call overhead is not included; that is counted in CPU time).
memcpy is the time the memcpy call takes (this is also CPU time).

So GPU time is the time between kernel start and return on the GPU, and CPU time is the time between the start and end of the kernel call on the CPU (kernel execution time not included). Do I understand you correctly?

CPU time is the time between the start and end of the kernel call on the CPU (kernel execution time included).

But isn’t a kernel call asynchronous? The programming guide says the host won’t wait for a __global__ kernel to complete.

And I got one profiling result showing a kernel with GPU time less than CPU time.

My mistake, I was too careless reading the output…

In normal code yes, but in the profiler no.

CPU time = GPU time + overhead, so GPU time should always be less than CPU time. If you have a profile where CPU time < GPU time I would be very surprised and you should file a bug report ;)
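
To illustrate the asynchronous launch behaviour in normal (non-profiled) code, here is a minimal sketch. The kernel `busyKernel`, the problem size, and the iteration count are made up for illustration; only the CUDA runtime calls are real. It shows that a host timer stopped right after the launch measures little more than the launch itself, while a timer stopped after `cudaDeviceSynchronize()` also covers the kernel execution.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose only purpose is to keep the GPU busy for a while.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;              // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time initialization is not measured.
    busyKernel<<<blocks, threads>>>(d_data, n, 1);
    cudaDeviceSynchronize();

    using host_clock = std::chrono::steady_clock;
    auto t0 = host_clock::now();
    busyKernel<<<blocks, threads>>>(d_data, n, 1000);
    auto t1 = host_clock::now();        // launch returned; kernel may still be running
    cudaDeviceSynchronize();            // block the host until the kernel has finished
    auto t2 = host_clock::now();

    double launchUs = std::chrono::duration<double, std::micro>(t1 - t0).count();
    double totalUs  = std::chrono::duration<double, std::micro>(t2 - t0).count();
    printf("host time until launch returned: %.1f us\n", launchUs);
    printf("host time until kernel finished: %.1f us\n", totalUs);

    cudaFree(d_data);
    return 0;
}
```

The actual numbers depend on the GPU and driver, so none are shown here.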

I know the CUDA Visual Profiler is one way to time the different parts of your program, but is it an accurate way to time your program, or is it better to use timers inside the program itself?

Well, you can accurately time your program between the first and last CUDA call, I guess. You can ask the profiler to save CPU time. Then you can see how much time has passed between your CUDA-related calls.

Enabling the profiler reduces the GPU clock, so your app would actually run slower!

Don’t use the profiler to time your code!

Read the “Release Notes” and search for “profiler”: if an application crashes while profiling is enabled, the GPU clocks remain reduced even for other GPU applications that don’t need profiling. You have to reboot to get it fixed.

I experienced it just now! I was getting 57x, and after rebooting I got 65x, with no change in inputs or anything. The profiler just clocks down the GPU.

Beware…

Yeah, definitely don’t time your code with the profiler. It is VERY useful for determining which parts of your program are worth optimizing, though :)

Hmm, that is good news indeed. My real performance will be even better than what I thought :)

CUDA events it shall be, I guess.
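
For reference, timing a kernel with CUDA events inside the program could look roughly like the sketch below. The kernel `scaleKernel` and the sizes are placeholders invented for this example; the event calls (`cudaEventCreate`, `cudaEventRecord`, `cudaEventSynchronize`, `cudaEventElapsedTime`) are the standard runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the code you want to time.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;              // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not measured.
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);          // enqueue start marker in the default stream
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);           // enqueue stop marker after the kernel
    cudaEventSynchronize(stop);         // wait until the stop marker has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because both events are recorded in the same stream as the kernel, the elapsed time is measured on the GPU side and the host does not need its own timer.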

I am facing the following problems with the CUDA profiler:

(1)
I have an application where, after each kernel call, I use memcpy to copy some data from the GPU to the CPU. The memcpy does a very simple 4-byte (integer) copy from the GPU to the CPU. This memcpy is called as many times as the kernel is called.

When I used the CUDA Visual Profiler, I found that it correctly reports the number of times the memcpy is called, but somehow does not report the amount of time spent in the memcpy correctly. You can also see in the profiler that, for any program, memcpy time is reported in GPU usec and not in CPU usec.

So my question is: does the size of the data transfer (and hence the time spent) influence the result displayed by the Visual Profiler for the memcpy operation time?

Yes, it does. As far as I know, the reason you only see GPU time is that the GPU is doing the memcpy.
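
One way to check the size dependence yourself, independently of the profiler, is to time copies of different sizes with CUDA events. This is only a sketch: the helper `timeCopyUs` and the 16 MB comparison size are assumptions made for the example, and it does not claim to reproduce the exact value the profiler shows in its memcpy column.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: time one device-to-host copy of `bytes` bytes
// with CUDA events and return the result in microseconds.
static float timeCopyUs(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f;                // milliseconds -> microseconds
}

int main()
{
    const size_t big = 16u << 20;       // 16 MB, assumed comparison size

    char *d_buf = nullptr;
    cudaMalloc(&d_buf, big);
    cudaMemset(d_buf, 0, big);
    char *h_buf = (char *)malloc(big);

    timeCopyUs(h_buf, d_buf, 4);        // warm-up copy, result ignored

    printf("4-byte copy: %.1f us\n", timeCopyUs(h_buf, d_buf, 4));
    printf("16 MB copy:  %.1f us\n", timeCopyUs(h_buf, d_buf, big));

    free(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

For a 4-byte copy, expect the measurement to be dominated by fixed per-call overhead rather than by the transfer itself.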