nvprof with tensorflow is suspiciously slow

Running label_image example from tensorflow takes very little time to execute as you can see below

$ ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
2019-01-19 16:49:17.894710: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-19 16:49:17.896497: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally
2019-01-19 16:49:17.981149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-19 16:49:17.981647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: 
name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
pciBusID: 0000:26:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-01-19 16:49:17.981667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:17.982281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:17.982293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 16:49:17.982300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 16:49:17.982453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:18.102724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:18.102787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:18.102796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 16:49:18.102804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 16:49:18.102947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:20.393744: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
2019-01-19 16:49:22.336192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:22.336251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:22.336268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 16:49:22.336284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 16:49:22.336439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:22.339939: I tensorflow/examples/label_image/main.cc:251] space shuttle (813): 0.27126
2019-01-19 16:49:22.339976: I tensorflow/examples/label_image/main.cc:251] ski (796): 0.172458
2019-01-19 16:49:22.339986: I tensorflow/examples/label_image/main.cc:251] jigsaw puzzle (612): 0.0209833
2019-01-19 16:49:22.339996: I tensorflow/examples/label_image/main.cc:251] ballplayer (982): 0.0197843
2019-01-19 16:49:22.340004: I tensorflow/examples/label_image/main.cc:251] parachute (702): 0.0134006

Please note that the time differences between steps are really small.

However, when I use nvprof with just one metric, the runtime gets extremely slow.

$ nvprof  --concurrent-kernels on --profile-api-trace all \
       --profile-from-start on --system-profiling off --unified-memory-profiling \
       per-process-device --track-memory-allocations off --device-buffer-size 8 \
       --device-cdp-buffer-size 8 --openacc-profiling on --devices 0 \
       --aggregate-mode on --metrics flop_sp_efficiency -f -o run1.nvvp \
       --log-file-verbose run1.nv.log \
       ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
2019-01-19 17:48:03.081089: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-19 17:48:03.084178: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally
nvprof log: /home/mahmood/tf/tensor/run1.nv.log
2019-01-19 17:48:03.388034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-19 17:48:03.388553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: 
name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
pciBusID: 0000:26:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-01-19 17:48:03.388575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 17:48:03.389218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 17:48:03.389230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 17:48:03.389241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 17:48:03.389417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3599 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 17:48:03.508531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 17:48:03.508597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 17:48:03.508603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 17:48:03.508610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 17:48:03.508760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3599 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 17:48:50.326734: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
^C^C^C^Cnvprof log: /home/mahmood/tf/tensor/run1.nv.log

That is Realllly slow and after one hour, I pressed ^c to terminate that. I see that some processes are running on GPU and CPU. So I think the profiler has not been freezed.

I have to say that I have profiled the same metric for some other programs that run for 1-2 minutes and they were fine.

Any idea?

Sometimes when things that should be really quick take a long time its because I am running low on host memory.

Also make sure you are not triggering a JIT-compile of libraries.

Is that a feature from nvprof command or tensorflow configure&make procedure?
Is there any way to speedup that? I am talking about nvprof options. Since I have to check about 20-30 metrics, that will be a huge waiting time and I expected to take for months!

I don’t understand the questions. They don’t seem to refer to anything I said.

I have no nvprof options to suggest to speed it up.

In addition to the points raised by Robert Crovella in #2, make sure that there aren’t any issues with writing to the log file:

/home/mahmood/tf/tensor/run1.nv.log

Is this on a network-mounted volume by any chance?

I thought that disabling JIT is a nvprof related thing. It seems that I have to disable that at the configure step of TensorFLow. However, I didn’t find an option for that. If someone knows that, please let me know.

Regarding the log file, there is nothing important in it.

==15250== NVPROF is profiling process 15250, command: ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
==15250== Generated result file: /home/mahmood/tf/tensor/run1.nvvp

When it is finished after a long time. Otherwise it just writes that program has been terminated by signal (^C).

Generally speaking, to avoid JIT at runtime, you have to compile your application for the specific GPU architecture you are running on.

https://devblogs.nvidia.com/cuda-pro-tip-understand-fat-binaries-jit-caching/

If you did a binary install of tensorflow e.g. via pip, then you don’t really have control over this, but I think most TF wheels from reputable sources should be compiled for both sm_50 and sm_52 which should cover your GPU. (But for completeness, this should probably be confirmed/verified).

If you built TF yourself from sources (maybe you did since you are in a bazel directory), then you would want to make sure the compile configuration included the specific switches to match your GPU architecture. You can discover your GPU architecture (compute major and minor versions) using the deviceQuery sample code.

(You would also want to make sure that you are using a new enough CUDA version so that the libraries like CUDNN/CUBLAS include the binaries for your GPU. This is almost certainly not the issue on a Maxwell device.)

I’m not at all convinced this is the source of the slowdown, however. I don’t really know what the issue is.

I would run top while all this is going on, and see what the host memory situation is.

I wasn’t talking about the content of the log file. File operations can become very slow in various circumstances, slowing an application that operates on those files.

I am just brainstorming here. No strong conclusions can be drawn from your description of the situation. You would want to eliminate possible sources of slow-down one by one. Robert Crovella pointed at the most obvious / likely ones.

Trying to debug such issues by conversing on the internet is about as efficient and as effective as trying to diagnose car trouble over the phone.