How to understand the output of nvprof?

I have run a benchmark to measure power. What I don't understand is whether the result covers BlackScholesGPU plus the CUDA memcpy DtoH and HtoD (that is, the power shown below), or whether it is the sum over all of the API calls made. In short, I want to know the total time the program ran for and what the power was.

==10883== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 87.94%  336.21ms       512  656.66us  651.41us  663.28us  BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int)
  7.02%  26.842ms         2  13.421ms  13.390ms  13.452ms  [CUDA memcpy DtoH]
  5.04%  19.282ms         3  6.4272ms  6.2572ms  6.7348ms  [CUDA memcpy HtoD]

==10883== System profiling result:
Device "Tesla K40c (0)"
                        Count       Avg        Min         Max
      SM Clock (MHz)       71    651.55     324.00      666.00
  Memory Clock (MHz)       71   2890.76     324.00     3004.00
     Temperature (C)      141     38.79      37.00       40.00
          Power (mW)      141  71300.77   20469.00   161870.00
             Fan (%)       71     23.00      23.00       23.00
Device "GeForce 8800 GTS 512 (1)"
                        Count       Avg        Min         Max
     Temperature (C)      141     62.00      62.00       62.00
             Fan (%)       71     37.00      37.00       37.00

==10883== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 40.50%  328.12ms         2  164.06ms  321.18us  327.80ms  cudaDeviceSynchronize
 38.45%  311.48ms         5  62.297ms  211.16us  310.59ms  cudaMalloc
 13.01%  105.42ms         1  105.42ms  105.42ms  105.42ms  cudaDeviceReset
  6.11%  49.538ms         5  9.9077ms  6.4856ms  14.759ms  cudaMemcpy
  0.72%  5.8577ms       512  11.440us  10.967us  49.541us  cudaLaunch
  0.58%  4.7136ms         5  942.73us  905.14us  1.0302ms  cudaGetDeviceProperties
  0.25%  1.9876ms       168  11.831us     322ns  427.58us  cuDeviceGetAttribute
  0.17%  1.4023ms      4096     342ns     306ns  2.2600us  cudaSetupArgument
  0.09%  760.95us         5  152.19us  120.46us  269.24us  cudaFree
  0.03%  221.90us         2  110.95us  107.11us  114.79us  cuDeviceTotalMem
  0.03%  209.69us       512     409ns     381ns  2.8250us  cudaConfigureCall
  0.02%  200.25us       512     391ns     374ns     456ns  cudaGetLastError
  0.02%  191.88us         2  95.939us  78.852us  113.03us  cuDeviceGetName
  0.00%  8.6430us         1  8.6430us  8.6430us  8.6430us  cudaSetDevice
  0.00%  7.5760us         2  3.7880us  1.9640us  5.6120us  cuDeviceGetPCIBusId
  0.00%  6.6570us        11     605ns     314ns  2.0430us  cuDeviceGet
  0.00%  3.8210us         4     955ns     588ns  1.5870us  cuDeviceGetCount
  0.00%  3.3340us         2  1.6670us     363ns  2.9710us  cudaGetDeviceCount

The GPU power consumption will usually vary during program execution, and the power reported by nvprof is a sampled measurement:

[url]http://docs.nvidia.com/cuda/profiler-users-guide/index.html#system-profiling[/url]
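In your output, the Power (mW) row is a summary of 141 such samples taken over the whole profiled run (min about 20.5 W, avg about 71.3 W, max about 161.9 W), not a per-kernel number. If you want samples you can correlate with your own timeline, a minimal sketch along these lines polls the NVML power counter directly (assumptions: Linux, device index 0 is the Tesla K40c, a 10 ms polling interval, and linking with -lnvidia-ml; this is just an illustration, not what nvprof itself does):

// power_poll.c - poll GPU power via NVML (sketch, assumes device 0 is the K40c)
// build (path to nvml.h may differ): gcc power_poll.c -o power_poll -lnvidia-ml
#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);      // assumption: device 0 is the GPU of interest

    // take ~100 samples at ~10 ms spacing (~1 s); run your benchmark in parallel
    for (int i = 0; i < 100; ++i) {
        unsigned int mw = 0;                  // NVML reports power in milliwatts
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            printf("sample %3d: %.2f W\n", i, mw / 1000.0);
        usleep(10000);
    }

    nvmlShutdown();
    return 0;
}

Averaging only the samples that fall inside the kernel/memcpy window gives a rough per-region power figure, but it is still a sampled approximation.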

If you add --print-gpu-trace to your profiler command line (and perhaps drop the --print-api-trace option), you can get an idea of the overall program execution flow and timing.
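
For example, something like this (assuming your binary is the CUDA BlackScholes sample and that you keep system profiling enabled so the power samples are still collected):

nvprof --system-profiling on --print-gpu-trace ./BlackScholes

The GPU trace lists every kernel launch and memcpy with its start time and duration, so you can see that the GPU activity itself spans roughly 382 ms (336.21 ms of BlackScholesGPU plus about 46 ms of memcpy), while the power samples cover the longer wall-clock time of the whole run, including the cudaMalloc and cudaDeviceReset time shown in the API-calls section.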