nvprof and difference in time reported

jmricher70 · September 16, 2017, 2:33am

Hi,
I am using CUDA 8.0 under 64-bit Ubuntu Linux with a GTX 1070.
I wrote a program that computes a simple vector sum, z[i] = x[i] + y[i],
and calls the computation 10000 times.

I used nvprof to check the time spent in each CUDA function, but I don’t understand
why the reported times differ. Here is the nvprof output:

Time(%)  Time      Calls  Avg       Min       Max       Name
65.96%   839.07ms  20000  41.953us  41.024us  44.161us  [CUDA memcpy HtoD]
31.47%   400.30ms  10000  40.029us  39.937us  40.512us  [CUDA memcpy DtoH]
2.57%    32.649ms  10000  3.2640us  3.0080us  9.3440us  kernel_sum(float*, float*, float*, int)

==28683== API calls:
Time(%)  Time      Calls  Avg       Min       Max       Name
92.09%   2.16663s  30000  72.220us  22.665us  867.18us  cudaMemcpy
5.11%    120.27ms  3      40.089ms  2.5130us  120.26ms  cudaMalloc
2.29%    53.918ms  10000  5.3910us  4.8550us  1.6708ms  cudaLaunch
0.29%    6.8177ms  40000  170ns     117ns     244.09us  cudaSetupArgument
0.10%    2.3756ms  10000  237ns     220ns     3.4030us  cudaConfigureCall
0.08%    1.7916ms  10000  179ns     151ns     2.7150us  cudaGetLastError
… the rest is in us (microseconds)

From the first 4 lines, if I sum all reported times I get:
0.839+0.400+0.032 s = 1.271s

But the API calls section tells me that I spend 2.166 s in cudaMemcpy alone, so there is
a difference of nearly 900 ms! Does anyone know why?

Regards,
JM

Robert_Crovella · September 16, 2017, 4:13am

You’d probably have to look at the timeline for gaps. Is your 1070 running a display?

The first set of numbers gives the measured times of the operations themselves. The second set gives the durations of the API calls.

For example, a kernel call is asynchronous. That means a cudaMemcpy issued after a kernel call will “begin” immediately, but since it is a synchronizing and blocking call, it will wait for previous activity to complete. Thus the API “duration” of the call will be longer than the actual transfer time, because it includes the wait for previous (kernel) activity to complete.

This doesn’t completely explain your numbers. However, the blocking behavior of a cudaMemcpy after a kernel call, along with other unspecified activity that creates gaps in the timeline, could be a factor. One such “outside” activity would be display updates. Other than that, you’d probably have to look at the timeline itself.
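
To put rough numbers on the blocking effect, here is a small Python sketch using the averages from the nvprof output above. The model is an assumption made for illustration (a blocking copy returns only after earlier device work plus its own transfer have finished); it is not how nvprof measures anything, and the 100 us kernel is a made-up case.

```python
# Figures copied from the nvprof output quoted in this thread.
KERNEL_AVG_US = 3.2640             # kernel_sum, average duration
KERNEL_CALLS = 10000
DTOH_AVG_US = 40.029               # [CUDA memcpy DtoH], average duration
COPY_DEVICE_MS = 839.07 + 400.30   # HtoD + DtoH, as measured on the device
COPY_API_MS = 2166.63              # cudaMemcpy, as measured on the host


def blocking_copy_api_us(transfer_us, pending_device_us):
    """Simplified model (an assumption): the host-side duration of a blocking
    copy is the earlier device work still pending plus the transfer itself."""
    return pending_device_us + transfer_us


# Hypothetical case: a 100 us kernel is still running when cudaMemcpy is called.
slow_case_us = blocking_copy_api_us(DTOH_AVG_US, 100.0)

# Worst case for this thread: the whole kernel is waited on inside cudaMemcpy.
worst_case_us = blocking_copy_api_us(DTOH_AVG_US, KERNEL_AVG_US)

# Upper bound on the total wait over all iterations, in milliseconds.
max_wait_ms = KERNEL_AVG_US * KERNEL_CALLS / 1000.0

# Host-side cudaMemcpy time that the transfers themselves do not account for.
copy_gap_ms = COPY_API_MS - COPY_DEVICE_MS

print(f"API duration with a 100 us kernel pending: {slow_case_us:.3f} us")
print(f"API duration with this thread's kernel:    {worst_case_us:.3f} us")
print(f"Upper bound on kernel wait: {max_wait_ms:.2f} ms")
print(f"cudaMemcpy gap to explain:  {copy_gap_ms:.2f} ms")
print(f"Share explained at most:    {100.0 * max_wait_ms / copy_gap_ms:.1f} %")
```

Under this model, waiting on the kernel can account for only a few percent of the extra cudaMemcpy time here, so most of the gap has to come from something else.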

jmricher70 · September 16, 2017, 4:18am

Thank you for your explanation. I am indeed also using the GTX to drive the display, so that could explain the difference.

njuffa · September 16, 2017, 5:31am

Maybe I am misunderstanding the question. I am not sure why one would expect the overall duration of these two event lists to match. The first list covers device-side activity, the second covers host-side activity. There can obviously be more host-side activity than device-side activity in any program. E.g. nothing happens on the device during a cudaMalloc() call, which manipulates host-side control structures that track GPU memory allocation.
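
As a rough illustration of how far apart the two lists are, the following Python sketch totals each list and compares the average cudaMemcpy API duration with the average copy time seen on the device. All figures are copied from the nvprof output above; the variable names are only for this example, and the API list is truncated in the original post, so the host-side total covers only the entries shown.

```python
# (total time in ms, number of calls), copied from the nvprof output above.
device_side = {
    "[CUDA memcpy HtoD]": (839.07, 20000),
    "[CUDA memcpy DtoH]": (400.30, 10000),
    "kernel_sum": (32.649, 10000),
}
host_side = {
    "cudaMemcpy": (2166.63, 30000),
    "cudaMalloc": (120.27, 3),
    "cudaLaunch": (53.918, 10000),
    "cudaSetupArgument": (6.8177, 40000),
    "cudaConfigureCall": (2.3756, 10000),
    "cudaGetLastError": (1.7916, 10000),
}

# Totals of each list, in milliseconds.
device_total_ms = sum(ms for ms, _ in device_side.values())
host_total_ms = sum(ms for ms, _ in host_side.values())

# Average duration of one copy as seen on the device, in microseconds.
copy_ms = device_side["[CUDA memcpy HtoD]"][0] + device_side["[CUDA memcpy DtoH]"][0]
copy_calls = device_side["[CUDA memcpy HtoD]"][1] + device_side["[CUDA memcpy DtoH]"][1]
device_copy_avg_us = copy_ms * 1000.0 / copy_calls

# Average duration of one cudaMemcpy call as seen on the host, in microseconds.
api_ms, api_calls = host_side["cudaMemcpy"]
api_copy_avg_us = api_ms * 1000.0 / api_calls

# Host-side time per cudaMemcpy call that is not the transfer itself.
per_call_overhead_us = api_copy_avg_us - device_copy_avg_us

print(f"device-side total: {device_total_ms:.3f} ms")
print(f"host-side total:   {host_total_ms:.3f} ms")
print(f"avg copy on device:     {device_copy_avg_us:.3f} us")
print(f"avg cudaMemcpy on host: {api_copy_avg_us:.3f} us")
print(f"host-side extra per cudaMemcpy call: {per_call_overhead_us:.3f} us")
```

The numbers alone do not say what the extra host-side time per call is spent on; they only show that each cudaMemcpy call lasts noticeably longer on the host than the transfer it triggers on the device.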

jmricher70 · September 16, 2017, 7:49am

Maybe it is obvious to you, but it was not obvious to me that the “API calls” section covers host-side activity; at the very least, I was expecting this host-side activity to be reported somewhere.
