nvprof with tensorflow is suspiciously slow

mahmood.nt · January 19, 2019, 2:23pm

Running label_image example from tensorflow takes very little time to execute as you can see below

$ ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
2019-01-19 16:49:17.894710: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-19 16:49:17.896497: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally
2019-01-19 16:49:17.981149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-19 16:49:17.981647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: 
name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
pciBusID: 0000:26:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-01-19 16:49:17.981667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:17.982281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:17.982293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 16:49:17.982300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 16:49:17.982453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:18.102724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:18.102787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:18.102796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 16:49:18.102804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 16:49:18.102947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:20.393744: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
2019-01-19 16:49:22.336192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 16:49:22.336251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 16:49:22.336268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 16:49:22.336284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 16:49:22.336439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3601 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 16:49:22.339939: I tensorflow/examples/label_image/main.cc:251] space shuttle (813): 0.27126
2019-01-19 16:49:22.339976: I tensorflow/examples/label_image/main.cc:251] ski (796): 0.172458
2019-01-19 16:49:22.339986: I tensorflow/examples/label_image/main.cc:251] jigsaw puzzle (612): 0.0209833
2019-01-19 16:49:22.339996: I tensorflow/examples/label_image/main.cc:251] ballplayer (982): 0.0197843
2019-01-19 16:49:22.340004: I tensorflow/examples/label_image/main.cc:251] parachute (702): 0.0134006

Please note that the time differences between steps are really small.

However, when I use nvprof with just one metric, the runtime gets extremely slow.

$ nvprof  --concurrent-kernels on --profile-api-trace all \
       --profile-from-start on --system-profiling off --unified-memory-profiling \
       per-process-device --track-memory-allocations off --device-buffer-size 8 \
       --device-cdp-buffer-size 8 --openacc-profiling on --devices 0 \
       --aggregate-mode on --metrics flop_sp_efficiency -f -o run1.nvvp \
       --log-file-verbose run1.nv.log \
       ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
2019-01-19 17:48:03.081089: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-01-19 17:48:03.084178: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally
nvprof log: /home/mahmood/tf/tensor/run1.nv.log
2019-01-19 17:48:03.388034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-19 17:48:03.388553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: 
name: Quadro M2000 major: 5 minor: 2 memoryClockRate(GHz): 1.1625
pciBusID: 0000:26:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-01-19 17:48:03.388575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 17:48:03.389218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 17:48:03.389230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 17:48:03.389241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 17:48:03.389417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3599 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 17:48:03.508531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-19 17:48:03.508597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-19 17:48:03.508603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991]      0 
2019-01-19 17:48:03.508610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0:   N 
2019-01-19 17:48:03.508760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3599 MB memory) -> physical GPU (device: 0, name: Quadro M2000, pci bus id: 0000:26:00.0, compute capability: 5.2)
2019-01-19 17:48:50.326734: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
^C^C^C^Cnvprof log: /home/mahmood/tf/tensor/run1.nv.log

That is Realllly slow and after one hour, I pressed ^c to terminate that. I see that some processes are running on GPU and CPU. So I think the profiler has not been freezed.

I have to say that I have profiled the same metric for some other programs that run for 1-2 minutes and they were fine.

Any idea?

Robert_Crovella · January 19, 2019, 3:50pm

Sometimes when things that should be really quick take a long time its because I am running low on host memory.

Also make sure you are not triggering a JIT-compile of libraries.

mahmood.nt · January 19, 2019, 4:27pm

Is that a feature from nvprof command or tensorflow configure&make procedure?
Is there any way to speedup that? I am talking about nvprof options. Since I have to check about 20-30 metrics, that will be a huge waiting time and I expected to take for months!

Robert_Crovella · January 19, 2019, 6:52pm

I don’t understand the questions. They don’t seem to refer to anything I said.

I have no nvprof options to suggest to speed it up.

njuffa · January 19, 2019, 7:54pm

In addition to the points raised by Robert Crovella in #2, make sure that there aren’t any issues with writing to the log file:

/home/mahmood/tf/tensor/run1.nv.log

Is this on a network-mounted volume by any chance?

mahmood.nt · January 19, 2019, 8:23pm

I thought that disabling JIT is a nvprof related thing. It seems that I have to disable that at the configure step of TensorFLow. However, I didn’t find an option for that. If someone knows that, please let me know.

Regarding the log file, there is nothing important in it.

==15250== NVPROF is profiling process 15250, command: ./bazel-bin/tensorflow/examples/label_image/label_image --image=../spaceman.jpg
==15250== Generated result file: /home/mahmood/tf/tensor/run1.nvvp

When it is finished after a long time. Otherwise it just writes that program has been terminated by signal (^C).

Robert_Crovella · January 19, 2019, 8:37pm

Generally speaking, to avoid JIT at runtime, you have to compile your application for the specific GPU architecture you are running on.

[url]https://devblogs.nvidia.com/cuda-pro-tip-understand-fat-binaries-jit-caching/[/url]

If you did a binary install of tensorflow e.g. via pip, then you don’t really have control over this, but I think most TF wheels from reputable sources should be compiled for both sm_50 and sm_52 which should cover your GPU. (But for completeness, this should probably be confirmed/verified).

If you built TF yourself from sources (maybe you did since you are in a bazel directory), then you would want to make sure the compile configuration included the specific switches to match your GPU architecture. You can discover your GPU architecture (compute major and minor versions) using the deviceQuery sample code.

(You would also want to make sure that you are using a new enough CUDA version so that the libraries like CUDNN/CUBLAS include the binaries for your GPU. This is almost certainly not the issue on a Maxwell device.)

I’m not at all convinced this is the source of the slowdown, however. I don’t really know what the issue is.

I would run top while all this is going on, and see what the host memory situation is.

njuffa · January 19, 2019, 9:06pm

I wasn’t talking about the content of the log file. File operations can become very slow in various circumstances, slowing an application that operates on those files.

I am just brainstorming here. No strong conclusions can be drawn from your description of the situation. You would want to eliminate possible sources of slow-down one by one. Robert Crovella pointed at the most obvious / likely ones.

Trying to debug such issues by conversing on the internet is about as efficient and as effective as trying to diagnose car trouble over the phone.

Topic		Replies	Views
Why would code run 1.7x faster when run with nvprof than without? CUDA Programming and Performance	35	3690	December 28, 2017
nvprof is too slow Visual Profiler and nvprof	12	5000	January 25, 2022
nvprof: Internal profiling error 4277:5 on Tesla P100, but not on GTX 1070 Visual Profiler and nvprof	11	4166	November 27, 2018
nvprof fails with TensorRT inference task Visual Profiler and nvprof	1	919	August 21, 2018
nvprof seems to make inference slower, no tensor cores being used Jetson AGX Xavier	3	1075	January 7, 2020
nvprof --print-api-trace - puzzling outputs. Visual Profiler and nvprof	0	673	January 7, 2020
nvprof: incompatible CUDA driver version on TX2 Jetson TX2	11	3390	March 8, 2018
Profiling deadloop (replay kernel) with nvprof on deep neural network Visual Profiler and nvprof	8	3465	August 24, 2017
NVPROF power consumption Jetson Nano power	1	598	June 1, 2020
Magic of nvprof --profile-api-trace none Visual Profiler and nvprof	3	977	March 13, 2023

nvprof with tensorflow is suspiciously slow

Related topics