tensorflow-gpu not using gpu?

Hi all, I’ve recently started working on a TX2 to see if I can use it to accelerate a Keras object detection program I’ve been working on. The code is a group project written in Python 3 and originally developed on Windows machines.

I understand that TensorRT is the best path forward, but since development is ongoing on Windows I was hoping to stick with Python 3 (which isn’t supported by TensorRT on aarch64 yet, afaik).

I built a TensorFlow 1.6 wheel for Python 3.6 with CUDA 9 support enabled and fired up my code. I was happy to find that TensorFlow detected the GPU (log posted below), BUT our code still runs painfully slow.

tegrastats shows GPU usage of only 0–12% while the Keras program is running, so I assume it is not in fact using the GPU?
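To put a number on that, the tegrastats output can be parsed with a few lines of Python (a generic sketch, not from the thread; the sample line below is made up, and the GPU field is printed as GR3D on older L4T releases and GR3D_FREQ on newer ones, so the pattern matches both):

```python
import re

# Hypothetical helper (not part of tegrastats): pull the GPU utilization
# percentage out of one line of tegrastats output.
def gpu_utilization(line):
    m = re.search(r'GR3D(?:_FREQ)? (\d+)%', line)
    return int(m.group(1)) if m else None

# Made-up line, shaped like typical TX2 tegrastats output:
sample = "RAM 3555/7846MB (lfb 690x4MB) CPU [12%@2035,off,off,10%@2035] GR3D_FREQ 12%@1300"
print(gpu_utilization(sample))  # -> 12
```

Piping tegrastats through a filter like this makes it easy to log utilization over a whole inference run rather than eyeballing it.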

I also ran
sudo ./jetson_clocks.sh
and
sudo nvpmodel -m 0
with no difference

Any insights would be much appreciated, thanks!

Using TensorFlow backend.
2018-03-14 10:08:15.229050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-14 10:08:15.229193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.57GiB
2018-03-14 10:08:15.229249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-14 10:08:16.792383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-14 10:08:16.792468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-14 10:08:16.792495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-14 10:08:16.792683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3946 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)

Hi,

1. Please run nvpmodel and jetson_clocks in order.

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

nvpmodel rewrites the min/max frequencies first, and jetson_clocks then locks the clocks at the maximum.

2. Could you profile your TensorFlow script with nvprof and share with us?

nvprof -o res.nvvp python [tf program].py

By the way, here is some information about TensorFlow on Jetson for your reference:
https://github.com/NVIDIA-Jetson/tf_to_trt_image_classification

Thanks.

Hi, thanks for your help!

Actually, that was my mistake: I have been running nvpmodel and jetson_clocks in the correct order; I just wrote them in the wrong order in my post.

I tried to use nvprof, but I keep receiving “Application returned non-zero code -1”, and since I can’t find an issue with my application I haven’t been able to profile it this way.

Through some benchmarking I found that the Jetson is actually bottlenecked most by CPU-intensive OpenCV tasks such as Munkres assignment and blurring, and I believe this is why overall performance is so low.
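The kind of benchmarking described above can be sketched with Python’s built-in cProfile module, which works even when nvprof doesn’t (a generic illustration; the workload function is a made-up stand-in for the real OpenCV calls):

```python
import cProfile
import io
import pstats

def cpu_heavy_step():
    # Made-up stand-in for a CPU-bound stage
    # (e.g. a pure-Python Munkres loop or a blur on large frames).
    total = 0
    for i in range(200000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
cpu_heavy_step()
profiler.disable()

# Print the three most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(3)
print(buf.getvalue())
```

Sorting by cumulative time makes it obvious which Python-side stages dominate the frame time, independent of what the GPU is doing.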

Do you have any suggestions, by chance? Thanks again!

Hi,

  1. Please remember to run nvpmodel with sudo.

  2. Most OpenCV functions are implemented on the CPU. Alternatives are:

  • Use the OpenCV GPU functions instead: check this page for the available functions
  • Use the VisionWorks API instead: check this page for details
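For the Munkres step mentioned earlier, one further option (my suggestion, assuming SciPy is available on the board) is to replace a pure-Python Munkres implementation with SciPy’s C-backed Hungarian solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative cost matrix: rows could be tracked objects,
# columns could be new detections.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

# Solve the assignment problem (Munkres / Hungarian algorithm).
row_ind, col_ind = linear_sum_assignment(cost)
print(col_ind)                        # -> [1 0 2]
print(cost[row_ind, col_ind].sum())   # -> 5
```

For the small matrices typical of per-frame track-to-detection matching, the compiled solver is much faster than a pure-Python loop, which can shave time off exactly the CPU stage that was measured as the bottleneck.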

Thanks.