Running label_image example from tensorflow takes very little time to execute as you can see below
$ ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
2019-01-19 16:49:17.894710: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-19 16:49:17.896497: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally
2019-01-19 16:49:17.981149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-19 16:49:17.981647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties:
name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
pciBusID: 0000:26:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-01-19 16:49:17.981667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:17.982281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:17.982293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-19 16:49:17.982300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-19 16:49:17.982453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:18.102724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:18.102787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:18.102796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-19 16:49:18.102804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-19 16:49:18.102947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:20.393744: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
2019-01-19 16:49:22.336192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:22.336251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:22.336268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-19 16:49:22.336284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-19 16:49:22.336439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:22.339939: I tensorflow/examples/label_image/main.cc:251] space shuttle (813): 0.27126
2019-01-19 16:49:22.339976: I tensorflow/examples/label_image/main.cc:251] ski (796): 0.172458
2019-01-19 16:49:22.339986: I tensorflow/examples/label_image/main.cc:251] jigsaw puzzle (612): 0.0209833
2019-01-19 16:49:22.339996: I tensorflow/examples/label_image/main.cc:251] ballplayer (982): 0.0197843
2019-01-19 16:49:22.340004: I tensorflow/examples/label_image/main.cc:251] parachute (702): 0.0134006
Please note that the time differences between steps are really small.
However, when I use nvprof with just one metric, the runtime gets extremely slow.
$ nvprof --concurrent-kernels on --profile-api-trace all \
--profile-from-start on --system-profiling off --unified-memory-profiling \
per-process-device --track-memory-allocations off --device-buffer-size 8 \
--device-cdp-buffer-size 8 --openacc-profiling on --devices 0 \
--aggregate-mode on --metrics flop_sp_efficiency -f -o run1.nvvp \
--log-file-verbose run1.nv.log \
./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
2019-01-19 17:48:03.081089: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-19 17:48:03.084178: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally
nvprof log: /home/mahmood/tf/tensor/run1.nv.log
2019-01-19 17:48:03.388034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-19 17:48:03.388553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties:
name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
pciBusID: 0000:26:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-01-19 17:48:03.388575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 17:48:03.389218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 17:48:03.389230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-19 17:48:03.389241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-19 17:48:03.389417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3599 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 17:48:03.508531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 17:48:03.508597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 17:48:03.508603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-19 17:48:03.508610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-19 17:48:03.508760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3599 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 17:48:50.326734: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
^C^C^C^Cnvprof log: /home/mahmood/tf/tensor/run1.nv.log
That is Realllly slow and after one hour, I pressed ^c to terminate that. I see that some processes are running on GPU and CPU. So I think the profiler has not been freezed.
I have to say that I have profiled the same metric for some other programs that run for 1-2 minutes and they were fine.
Any idea?