How to understand the output of nvprof?

I have run a benchmark to measure power. What I don’t understand is: does the result reflect BlackScholesGPU plus the CUDA memcpy DtoH and HtoD transfers (i.e. the power shown below), or is it the sum of all the API calls made? In short, I want to know the total time the program ran and what the power was.

==10883== Profiling result:
Time(%) Time Calls Avg Min Max Name
87.94% 336.21ms 512 656.66us 651.41us 663.28us BlackScholesGPU(float*, float*, float*, float*, float*, float, float, int)
7.02% 26.842ms 2 13.421ms 13.390ms 13.452ms [CUDA memcpy DtoH]
5.04% 19.282ms 3 6.4272ms 6.2572ms 6.7348ms [CUDA memcpy HtoD]

==10883== System profiling result:
Device “Tesla K40c (0)”
Count Avg Min Max
SM Clock (MHz) 71 651.55 324.00 666.00
Memory Clock (MHz) 71 2890.76 324.00 3004.00
Temperature (C) 141 38.79 37.00 40.00
Power (mW) 141 71300.77 20469.00 161870.00
Fan (%) 71 23.00 23.00 23.00
Device “GeForce 8800 GTS 512 (1)”
Count Avg Min Max
Temperature (C) 141 62.00 62.00 62.00
Fan (%) 71 37.00 37.00 37.00

==10883== API calls:
Time(%) Time Calls Avg Min Max Name
40.50% 328.12ms 2 164.06ms 321.18us 327.80ms cudaDeviceSynchronize
38.45% 311.48ms 5 62.297ms 211.16us 310.59ms cudaMalloc
13.01% 105.42ms 1 105.42ms 105.42ms 105.42ms cudaDeviceReset
6.11% 49.538ms 5 9.9077ms 6.4856ms 14.759ms cudaMemcpy
0.72% 5.8577ms 512 11.440us 10.967us 49.541us cudaLaunch
0.58% 4.7136ms 5 942.73us 905.14us 1.0302ms cudaGetDeviceProperties
0.25% 1.9876ms 168 11.831us 322ns 427.58us cuDeviceGetAttribute
0.17% 1.4023ms 4096 342ns 306ns 2.2600us cudaSetupArgument
0.09% 760.95us 5 152.19us 120.46us 269.24us cudaFree
0.03% 221.90us 2 110.95us 107.11us 114.79us cuDeviceTotalMem
0.03% 209.69us 512 409ns 381ns 2.8250us cudaConfigureCall
0.02% 200.25us 512 391ns 374ns 456ns cudaGetLastError
0.02% 191.88us 2 95.939us 78.852us 113.03us cuDeviceGetName
0.00% 8.6430us 1 8.6430us 8.6430us 8.6430us cudaSetDevice
0.00% 7.5760us 2 3.7880us 1.9640us 5.6120us cuDeviceGetPCIBusId
0.00% 6.6570us 11 605ns 314ns 2.0430us cuDeviceGet
0.00% 3.8210us 4 955ns 588ns 1.5870us cuDeviceGetCount
0.00% 3.3340us 2 1.6670us 363ns 2.9710us cudaGetDeviceCount

The GPU's power consumption will usually vary during program execution, and the power figures in the system profiling section are sampled measurements, not a continuous reading:
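As a back-of-the-envelope check (a sketch only, using the numbers from the profile above): multiplying an average sampled power by a duration gives an energy estimate. Note that the power samples cover the entire run, including idle time and API-call overhead, so pairing the average power with just the GPU-activity time is only a rough approximation:

```python
# Rough energy estimate from the sampled profiler data above.
# The samples span the whole run, so this is an approximation only.
avg_power_w = 71300.77 / 1000.0               # average Power sample (mW) -> W
gpu_time_s = (336.21 + 26.842 + 19.282) / 1000.0  # kernel + memcpy time (ms) -> s
energy_j = avg_power_w * gpu_time_s
print(f"~{energy_j:.2f} J over {gpu_time_s * 1000:.1f} ms of GPU activity")
```

A more faithful estimate would integrate the individual power samples over their timestamps rather than using a single average.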

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#system-profiling

If you add --print-gpu-trace to your profiler command line (and perhaps drop the --print-api-trace option), you can get an idea of the overall program execution flow and timing.
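For example, the invocation might look like this (a sketch only; "BlackScholes" is a placeholder for your actual executable name):

```shell
# Print a chronological trace of GPU activity (kernels and memcpys)
# alongside the sampled system data (power, clocks, temperature).
nvprof --system-profiling on --print-gpu-trace ./BlackScholes
```

The GPU trace shows each kernel launch and memcpy with its start time and duration, which makes it easier to see how long the program's GPU work actually ran and how that lines up with the power samples.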