Freeze while executing TensorFlow in a Docker container on the TX2

Hi,

Thanks to Open Horizon, I was able to install Docker with GPU support and run DIGITS in a container.
As a next step, I wanted to run a simple TensorFlow script (thanks furkankalinsaz! https://devtalk.nvidia.com/default/topic/1030603/jetson-tx2/tensorflow-1-6-for-jetson-tx2/) in such a container.

But it looks like the base container from Open Horizon https://github.com/open-horizon/cogwerx-jetson-tx2/blob/master/Dockerfile.cudabase is missing something that the JetPack 3.2 installer provides, as my TensorFlow application freezes at initialization inside the container.

Outside the container:

(tf) nvidia@tegra-ubuntu:~/projects/tensorflow$ python hello.py
2018-03-07 09:39:05.245311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 09:39:05.245522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 156.65MiB
2018-03-07 09:39:05.245575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 09:39:06.814190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 09:39:06.814309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-07 09:39:06.814343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-07 09:39:06.814689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'

Inside the container:

nvidia@tegra-ubuntu:~/projects/realift/src/aml$ docker run --privileged --name tf -it tensorflow:tx2 /bin/bash
root@e02ee5b67a48:/app# python3 hello.py
2018-03-07 15:00:11.427965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 15:00:11.428192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 1.28GiB
2018-03-07 15:00:11.428266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0

Then the container hangs.

I had the exact same issue when manually installing the CUDA and cuDNN packages on the TX2. Running the JetPack 3.2 installer solved it.

Can somebody tell me what is missing in the base image https://github.com/open-horizon/cogwerx-jetson-tx2/blob/master/Dockerfile.cudabase ?

Thanks !

Here is my test Dockerfile:

FROM openhorizon/aarch64-tx2-cudabase:JetPack3.2-RC
ENV ARCH=aarch64
RUN apt-get update && apt-get install -y --no-install-recommends --no-install-suggests python3-minimal python3-pip libpython3.5-dev
# Custom layers

# install ubuntu python releases
RUN apt-get install -y --no-install-recommends --no-install-suggests build-essential
RUN apt-get install -y --no-install-recommends --no-install-suggests python3-setuptools python3-all-dev python3-dev

# get precompiled TF 1.6 for JetPack 3.2 RC
RUN apt-get install -y --no-install-recommends --no-install-suggests wget
RUN wget https://github.com/openzeka/Tensorflow-for-Jetson-TX2/raw/master/Jetpack-3.2/1.6/tensorflow-1.6.0rc1-cp35-cp35m-linux_aarch64.whl
RUN pip3 install tensorflow-1.6.0rc1-cp35-cp35m-linux_aarch64.whl

WORKDIR /app
# hello.py is the TF validation script
ADD hello.py /app/
CMD ["/usr/bin/python3", "/app/hello.py"]
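For context, hello.py itself is not reproduced in this thread. A typical TF 1.x validation script that would produce the b'Hello, TensorFlow!' output seen in the logs looks roughly like this (an assumption on my side, not the author's exact file; it requires a Jetson TensorFlow 1.x build to run):

```python
# Minimal TensorFlow 1.x sanity check: building a constant op and running
# it in a session forces GPU device initialization, which is exactly the
# step where the container was hanging.
import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
with tf.Session() as sess:
    # On Python 3 this prints the bytes literal b'Hello, TensorFlow!',
    # matching the last line of the logs above.
    print(sess.run(hello))
```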

OK, I inspected what JetPack does, and it looks like it also installs:

libgomp1 libfreeimage-dev libopenmpi-dev openmpi-bin

Maybe that will solve the problem…

After some tests, it turns out it can simply take A LOT of time, the FIRST time:

root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:21:58.038844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:21:58.039099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 688.65MiB
2018-03-07 16:21:58.039164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:29:16.860720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:29:16.860824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-07 16:29:16.860865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-07 16:29:16.861126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:06.814518: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:06.814760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 51.09MiB
2018-03-07 16:43:06.814821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:08.441989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:08.442148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-07 16:43:08.442201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-07 16:43:08.442411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 41 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:27.350149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:27.350347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 140.93MiB
2018-03-07 16:43:27.350414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:28.848243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:28.848583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-07 16:43:28.848648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-07 16:43:28.848884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:36.037751: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:36.038490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 343.00MiB
2018-03-07 16:43:36.038572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:37.462295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:37.462490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-07 16:43:37.462531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-07 16:43:37.462699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 143 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:57.689202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:57.689411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 307.52MiB
2018-03-07 16:43:57.689484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:59.125303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:59.125421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-07 16:43:59.125461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-07 16:43:59.125714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 208 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'

8 minutes between "Adding visible gpu devices: 0" and "Device interconnect StreamExecutor with strength 1 edge matrix". All subsequent executions were immediate.

Any idea why?
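One plausible explanation (my assumption, not confirmed anywhere in this thread): if the wheel ships only PTX and no prebuilt cubins for compute capability 6.2, CUDA JIT-compiles all kernels on first use and caches the result under ~/.nv/ComputeCache. A fresh container starts with an empty cache, so the first run pays the full compilation cost, while later runs in the same container are instant, which matches the behaviour above. Persisting the cache across container runs might then avoid the one-time cost; a sketch, with a hypothetical host path:

```shell
# Persist the CUDA JIT compute cache across container runs so that only
# the very first run ever pays the PTX compilation cost.
# /data/nv-cache is a hypothetical host directory; /root/.nv is the
# default CUDA cache location for the root user inside the container.
# CUDA_CACHE_MAXSIZE optionally enlarges the cache (value in bytes).
docker run --privileged \
  -v /data/nv-cache:/root/.nv \
  -e CUDA_CACHE_MAXSIZE=2147483648 \
  -it tensorflow:tx2 python3 /app/hello.py
```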

I just got a build of TF from NVIDIA and the lag disappeared.
Problem solved!

Cool!! It worked for me as well. Thanks for sharing it @matthieu.boujonnier, but that TF .whl file is an RC version. Is there a production version of TF without this issue? Did you come across such a wheel file?

Hi saikishor,

Here are some public TensorFlow wheels for Jetson for your reference:

https://devtalk.nvidia.com/default/topic/1031300

Thanks.

Could you please provide the exact links to the wheel file that you’ve used? I downloaded the wheel files from https://nvidia.app.box.com/v/TF180-Py27-wTRT , but that doesn’t do the trick for me… My TensorFlow program hangs at “Adding visible gpu devices: 0”. It takes about 8 minutes to start; see the following logs:

2018-06-07 08:11:56.680367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-06-07 08:11:56.680703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.18GiB
2018-06-07 08:11:56.680797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-07 08:19:16.380270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-07 08:19:16.380341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-06-07 08:19:16.380366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-06-07 08:19:16.380579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 2356 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-06-07 08:19:16.930939: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8080, 1 -> worker-1.default.svc.cluster.local:8080, 2 -> worker-2.default.svc.cluster.local:8080}
2018-06-07 08:19:16.931884: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:8080
2018-06-07 08:19:16.932249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-07 08:19:16.932327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-07 08:19:16.932356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-06-07 08:19:16.932382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-06-07 08:19:16.932515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 2356 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-06-07 08:19:24.259923: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 25988990f5aa60ac with config: gpu_options { per_process_gpu_memory_fraction: 0.3 }

Do you have any more insights that could help here? When executing the same code on the TX2 bare metal, these steps complete instantly, and my containers have access to a full CPU and 3 GiB of RAM, so I don’t think resources are the actual bottleneck.
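As an aside, the gpu_options { per_process_gpu_memory_fraction: 0.3 } fragment in the last log line above corresponds to a TF 1.x session config along these lines (a sketch of the standard API, not the poster's actual code, and it requires a TF 1.x build to run):

```python
import tensorflow as tf

# Cap this process at roughly 30% of GPU memory instead of letting
# TensorFlow grab nearly all of it, which matters on a shared 8 GB
# Tegra where several containers compete for the same device.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3

with tf.Session(config=config) as sess:
    pass  # build and run the graph here
```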

@dr3dd Try installing TensorFlow from this repo: https://github.com/NVIDIA-Jetson/tf_to_trt_image_classification/tree/master#install . It does not create those freezing issues. I don’t know why this keeps happening, but it occurs with almost all of the .whl files available in most of the repos.

The link you provided points to a rather old TensorFlow 1.5 wheel file. Do you happen to know if there’s a 1.8 wheel file that doesn’t cause this lag? Or how was that 1.5 wheel file built? I don’t mind building it myself :) Thanks!

Hello!! @dr3dd,

I am not sure how they built the wheel file. I would suggest building your own TensorFlow wheel using a Bazel build; take a look at their website, under the section on installing from sources.

Hi all, just an update on the topic. I compiled the wheel myself (for TensorFlow 1.8) without TensorRT support (as I was not really using it) and with GDR and VERBS support, and the delay is gone. You can grab it from here.

All the .whl files that I’ve found across the Internet had TensorRT support enabled. Might that be the key difference and the cause of the delay? Would that make sense?
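For anyone wanting to reproduce such a build: the TF 1.8 configure script reads its answers from environment variables, so a non-interactive configuration without TensorRT might look roughly like this (a sketch from memory; verify the exact variables against the official "installing from sources" guide):

```shell
# Configure TensorFlow 1.8 for CUDA on the TX2, with TensorRT disabled
# and GDR/VERBS enabled, matching the build described above.
# Compute capability 6.2 matches the Tegra X2 reported in the logs.
cd tensorflow
export TF_NEED_CUDA=1
export TF_CUDA_COMPUTE_CAPABILITIES=6.2
export TF_NEED_TENSORRT=0
export TF_NEED_GDR=1
export TF_NEED_VERBS=1
./configure

# Build the pip package (this takes many hours on the TX2 itself).
bazel build --config=opt --config=cuda \
    //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl
```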

Thanks for your kind update @dr3dd. I would like to know one thing regarding this .whl file: what is the advantage of building TensorFlow with TensorRT support?

@saikishor, I’m not a deep learning expert, but using TensorRT for inference enables some optimizations for NVIDIA GPUs. See this for more information.

@dr3dd Thanks for the info and the link; this is exactly the kind of information I was looking for.