Thanks to Open Horizon, I was able to install Docker with GPU support and run DIGITS in a container.
As a next step, I wanted to run a simple TensorFlow script in such a container (thanks furkankalinsaz! Tensorflow 1.6 for Jetson TX2 - Jetson TX2 - NVIDIA Developer Forums).
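The hello.py used throughout this thread is not shown; it is presumably something like the canonical TF 1.x smoke test. A minimal sketch, assuming the TensorFlow 1.x API:

```python
# hello.py -- assumed minimal TF 1.x smoke test (the actual script from
# the thread is not shown). Creating the Session triggers GPU device
# enumeration, which is where the delay discussed below occurs.
import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
with tf.Session() as sess:   # device creation happens here
    print(sess.run(hello))   # prints b'Hello, TensorFlow!' on Python 3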
(tf) nvidia@tegra-ubuntu:~/projects/tensorflow$ python hello.py
2018-03-07 09:39:05.245311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 09:39:05.245522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 156.65MiB
2018-03-07 09:39:05.245575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 09:39:06.814190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 09:39:06.814309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-07 09:39:06.814343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-07 09:39:06.814689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 60 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
Inside the container:
nvidia@tegra-ubuntu:~/projects/realift/src/aml$ docker run --privileged --name tf -it tensorflow:tx2 /bin/bash
root@e02ee5b67a48:/app# python3 hello.py
2018-03-07 15:00:11.427965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 15:00:11.428192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 1.28GiB
2018-03-07 15:00:11.428266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
Then the container gets stuck.
I had the exact same issue when installing the CUDA and cuDNN packages manually on the TX2. Applying the JetPack 3.2 installer solved the issue.
After some more tests, it seems it can in fact take A LOT of time, but only the FIRST time:
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:21:58.038844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:21:58.039099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 688.65MiB
2018-03-07 16:21:58.039164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:29:16.860720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:29:16.860824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-07 16:29:16.860865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-07 16:29:16.861126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 133 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:06.814518: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:06.814760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 51.09MiB
2018-03-07 16:43:06.814821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:08.441989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:08.442148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-07 16:43:08.442201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-07 16:43:08.442411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 41 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:27.350149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:27.350347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 140.93MiB
2018-03-07 16:43:27.350414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:28.848243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:28.848583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-07 16:43:28.848648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-07 16:43:28.848884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:36.037751: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:36.038490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 343.00MiB
2018-03-07 16:43:36.038572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:37.462295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:37.462490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-07 16:43:37.462531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-07 16:43:37.462699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 143 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
root@7922e0755c22:/app# python3 hello.py
2018-03-07 16:43:57.689202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-07 16:43:57.689411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 307.52MiB
2018-03-07 16:43:57.689484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-07 16:43:59.125303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-07 16:43:59.125421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-03-07 16:43:59.125461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-03-07 16:43:59.125714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 208 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
b'Hello, TensorFlow!'
8 minutes passed between “Adding visible gpu devices: 0” and “Device interconnect StreamExecutor with strength 1 edge matrix”. All subsequent executions were immediate.
Cool!! It worked for me as well. Thanks for sharing it @matthieu.boujonnier, but that TensorFlow .whl file is an RC version. Is there a production version of TensorFlow without this issue? Did you come across such a wheel file?
Could you please provide the exact links to the wheel file that you’ve used? I downloaded the wheel files from Box, but that doesn’t do the trick for me… My TensorFlow program hangs at “Adding visible gpu devices: 0”. It takes about 8 minutes for my TensorFlow program to start, see the following logs:
2018-06-07 08:11:56.680367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-06-07 08:11:56.680703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.18GiB
2018-06-07 08:11:56.680797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-07 08:19:16.380270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-07 08:19:16.380341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-06-07 08:19:16.380366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-06-07 08:19:16.380579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 2356 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-06-07 08:19:16.930939: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8080, 1 -> worker-1.default.svc.cluster.local:8080, 2 -> worker-2.default.svc.cluster.local:8080}
2018-06-07 08:19:16.931884: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:332] Started server with target: grpc://localhost:8080
2018-06-07 08:19:16.932249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-07 08:19:16.932327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-07 08:19:16.932356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-06-07 08:19:16.932382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-06-07 08:19:16.932515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 2356 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-06-07 08:19:24.259923: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 25988990f5aa60ac with config: gpu_options { per_process_gpu_memory_fraction: 0.3 }
Do you have any more insights that could be of help here? When executing the same code on the TX2 bare metal, these steps complete instantly, and my containers have access to a full CPU and 3 GiB of RAM, so I don’t think resources are the actual bottleneck here.
The link you provided seems to point to a rather old TensorFlow 1.5 wheel file. Do you happen to know of a 1.8 wheel file that doesn’t cause this lag? Or how was that 1.5 wheel file built? I don’t mind building it myself :) Thanks!
I am not sure how they built the wheel file. I would suggest building your own TensorFlow wheel with Bazel; take a look at their website under the section on installing from sources.
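(For context: the `per_process_gpu_memory_fraction: 0.3` seen in the master-session log above comes from the session configuration. A minimal TF 1.x sketch of how that option is typically set; everything except the fraction is illustrative:)

```python
# Limiting how much GPU memory a TF 1.x session may grab up front.
# This produces the "per_process_gpu_memory_fraction: 0.3" line seen
# in the session-config log; useful on the TX2's shared 8 GB memory.
import tensorflow as tf

config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.3))

with tf.Session(config=config) as sess:
    pass  # run the graph here
```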
Hi all, just an update on the topic. I compiled the wheel myself (for TensorFlow 1.8) without TensorRT support (as I was not really using it) and with GDR and VERBS support, and I got rid of the delay. You can grab it from here.
All the .whl files that I’ve found across the Internet had TensorRT support enabled. Might this be the key difference and the cause of the delay? Would that make sense?
Thanks for your kind update @dr3dd. I would like to know one thing regarding this .whl file: what is the advantage of building TensorFlow with TensorRT support?
@saikishor, I’m not a deep learning expert, but using TensorRT for inference enables some optimizations for NVIDIA GPUs. See this for more information.
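For reference, a sketch of the build described above (TF 1.8 from source, TensorRT off, GDR and VERBS on), assuming the stock `configure` script; the exact flags and versions dr3dd used are not shown in the thread:

```shell
# Sketch: build the TF 1.8 wheel from source on the TX2 (assumed steps).
git clone -b r1.8 https://github.com/tensorflow/tensorflow.git
cd tensorflow

# Answer the configure prompts non-interactively via env vars:
TF_NEED_CUDA=1 TF_CUDA_COMPUTE_CAPABILITIES=6.2 \
TF_NEED_TENSORRT=0 TF_NEED_GDR=1 TF_NEED_VERBS=1 ./configure

bazel build --config=opt --config=cuda \
    //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-1.8.*.whl
```

Note `TF_CUDA_COMPUTE_CAPABILITIES=6.2`: the TX2's GPU is compute capability 6.2, so building for that capability avoids extra JIT work at first run.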