nvprof and difference in time reported

I am using CUDA 8.0 under Linux Ubuntu 64 bits and a GTX 1070.
I wrote a program that does a simple sum of vectors z[i] = x[i] + y[i]
and call the computation 10000 times.

I have used nvprof to check the time passed in each CUDA function but I don’t understand
why I get a difference in times. Here is the result of nvprof:

Time(%) Time Calls Avg Min Max Name
65.96% 839.07ms 20000 41.953us 41.024us 44.161us [CUDA memcpy HtoD]
31.47% 400.30ms 10000 40.029us 39.937us 40.512us [CUDA memcpy DtoH]
2.57% 32.649ms 10000 3.2640us 3.0080us 9.3440us kernel_sum(float*, float*, float*, int)

==28683== API calls:
Time(%) Time Calls Avg Min Max Name
92.09% 2.16663s 30000 72.220us 22.665us 867.18us cudaMemcpy
5.11% 120.27ms 3 40.089ms 2.5130us 120.26ms cudaMalloc
2.29% 53.918ms 10000 5.3910us 4.8550us 1.6708ms cudaLaunch
0.29% 6.8177ms 40000 170ns 117ns 244.09us cudaSetupArgument
0.10% 2.3756ms 10000 237ns 220ns 3.4030us cudaConfigureCall
0.08% 1.7916ms 10000 179ns 151ns 2.7150us cudaGetLastError
… the rest is in us (micro seconds)

From the first 4 lines, if I sum all reported times I get:
0.839+0.400+0.032 s = 1.271s

But from the API calls it tells me that I spend 2.166s so there is nearly 900ms
difference ! Does any one know why ?


You’d probably have to look at the timeline for gaps. Is your 1070 running a display?

The first set of numbers are the measured times for the operations. The second set of numbers are the durations of the api activity.

For example, a kernel call is asynchronous. That means a cudaMemcpy after a kernel call will immediately “begin”, but since it is a synchronizing and blocking call, it will wait for previous activity to complete. Thus the api “duration” of the call will be longer than the actual transfer time since it is waiting for previous (kernel) activity to complete.

This doesn’t really explain completely your numbers, however the blocking behavior of a cudaMemcpy after a kernel call, along with other unspecified activity which creates gaps in the timeline, could be a factor. One such “outside” activity would be display updates. Other than that, you’d probably have to look at the timeline itself.

Thank you for your explanation, I am indeed using the GTX as display too, so that could explain the difference.

Maybe I am misunderstanding the question. I am not sure why one would expect the overall duration of these two event lists to match. The first list covers device-side activity, the second covers host-side activity. There can obviously be more host-side activity than device-side activity in any program. E.g. nothing happens on the device during a cudaMalloc() call, which manipulates host-side control structures that track GPU memory allocation.

Maybe it is obvious for you, but it was not for me that the “API calls” was taking into account the host-side activity, or at least I was expecting to see this host-side activity to appear somewhere.