TensorFlow C-library with CUDA support gets stuck

This thread reports the behavior of the TensorFlow C library on Jetson Nano. In particular, the library built with CUDA support seems to hang at runtime. I built TensorFlow v1.12.0 as a C library, first without CUDA support and then with CUDA support enabled. I am comparing the output of the two builds using a simple matrix calculation benchmark (written in Go).

Using TF C-library built without CUDA support
In this case the matrix allocation and computation output is as expected, with no issues. As you can see, a 100x100 matrix is allocated in about 14ms and its inverse is computed in about 11ms, and so on.

$ matrix-inversion-benchmark-tf
matrix allocation: [100 100] 13.733535ms
  inv computation: [100 100] 10.961744ms
matrix allocation: [200 200] 3.695703ms
  inv computation: [200 200] 18.391433ms
matrix allocation: [500 500] 21.158223ms
  inv computation: [500 500] 194.820018ms
matrix allocation: [1000 1000] 79.596653ms
  inv computation: [1000 1000] 1.332078944s

Using TF C-library built with CUDA support
In this case we see a NUMA message and a couple of other log lines, and then the program hangs with no further output.

$ matrix-inversion-benchmark-tf
2019-04-20 05:45:21.795288: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-04-20 05:45:21.795471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.87GiB freeMemory: 1.72GiB
2019-04-20 05:45:21.795528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0

Other details
Output of ldd command on the TF C-library built using CUDA support:

$ ldd /usr/local/lib/libtensorflow.so
	linux-vdso.so.1 (0x0000007f95b04000)
	libtensorflow_framework.so => /usr/local/lib/libtensorflow_framework.so (0x0000007f84c3d000)
	libcublas.so.10.0 => /usr/local/cuda/lib64/libcublas.so.10.0 (0x0000007f8030c000)
	libcusolver.so.10.0 => /usr/local/cuda/lib64/libcusolver.so.10.0 (0x0000007f7723b000)
	libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x0000007f771ca000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007f77192000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007f77166000)
	libgomp.so.1 => /usr/lib/aarch64-linux-gnu/libgomp.so.1 (0x0000007f77129000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007f7706f000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007f77058000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007f76ec3000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007f76e9f000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f76d46000)
	/lib/ld-linux-aarch64.so.1 (0x0000007f95ad9000)
	libcuda.so.1 => /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (0x0000007f75e22000)
	libcudnn.so.7 => /usr/lib/aarch64-linux-gnu/libcudnn.so.7 (0x0000007f6180d000)
	libcufft.so.10.0 => /usr/local/cuda/lib64/libcufft.so.10.0 (0x0000007f598fe000)
	libcurand.so.10.0 => /usr/local/cuda/lib64/libcurand.so.10.0 (0x0000007f556ff000)
	libnvrm_gpu.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_gpu.so (0x0000007f556bc000)
	libnvrm.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm.so (0x0000007f5567a000)
	libnvrm_graphics.so => /usr/lib/aarch64-linux-gnu/tegra/libnvrm_graphics.so (0x0000007f5565b000)
	libnvidia-fatbinaryloader.so.32.1.0 => /usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.32.1.0 (0x0000007f555fd000)
	libnvos.so => /usr/lib/aarch64-linux-gnu/tegra/libnvos.so (0x0000007f555df000)

Output of ldd command on TF C-library built without CUDA support:

$ ldd ./libtensorflow.so
	linux-vdso.so.1 (0x0000007f80311000)
	libtensorflow_framework.so (0x0000007f7c63c000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007f7c604000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007f7c5d8000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007f7c51e000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007f7c507000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007f7c372000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007f7c34e000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f7c1f5000)
	/lib/ld-linux-aarch64.so.1 (0x0000007f802e6000)

I'd appreciate any help resolving this issue. As a side note, I tried building TF v1.13.1, but the build failed due to an apparent gcc issue reported here: https://github.com/tensorflow/tensorflow/issues/27931


May I know which compute capability you used when building the library?
Please note that the Nano is sm=5.3 (compute capability 5.3).

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2] 5.3
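
As a quick sanity check, the capability list given to the configure prompt can be compared with what the device reports in the TensorFlow log ("major: 5 minor: 3" earlier in this thread). Below is a minimal sketch in Go; `deviceCapability` and `builtFor` are hypothetical helper names, not part of TensorFlow or CUDA, and the check is deliberately simplistic: it only tests for an exact match and says nothing about whether a build for another capability could still run on the device.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// deviceCapRe matches the "major: 5 minor: 3" fragment of the TensorFlow
// "Found device" log output quoted in this thread.
var deviceCapRe = regexp.MustCompile(`major:\s*(\d+)\s+minor:\s*(\d+)`)

// deviceCapability extracts "major.minor" from a TensorFlow device log line.
func deviceCapability(logLine string) (string, bool) {
	m := deviceCapRe.FindStringSubmatch(logLine)
	if m == nil {
		return "", false
	}
	return m[1] + "." + m[2], true
}

// builtFor reports whether the comma-separated capability list entered at
// the configure prompt contains the device capability exactly.
func builtFor(configured, device string) bool {
	for _, c := range strings.Split(configured, ",") {
		if strings.TrimSpace(c) == device {
			return true
		}
	}
	return false
}

func main() {
	line := "name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216"
	dev, ok := deviceCapability(line)
	if !ok {
		fmt.Println("no compute capability found in log line")
		return
	}
	// Compare the device against two candidate configure answers.
	for _, configured := range []string{"3.5,7.0", "5.3"} {
		fmt.Printf("device %s, built for %q: exact match = %v\n",
			dev, configured, builtFor(configured, dev))
	}
}
```

For the log line from this thread, the sketch reports no exact match for a "3.5,7.0" build and an exact match for a "5.3" build.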


Hi @AastaLLL, I used the default compute capabilities (3.5,7.0) for the previous build. I have started a new build with 5.3.

Does an NCCL version of 1.3 look good?

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 1.3

I am also wondering if you could share the TF v1.13.1 build steps used by Nvidia. I could not build TF v1.13.1 with gcc 7.3.

Thank you.


NCCL only supports PCIe-based GPUs, and the Nano's GPU is integrated. Please turn NCCL off when building TensorFlow.

For example:

$ bazel build --config=opt --local_resources 2048,3.0,1.0 --config=cuda --config=nonccl //tensorflow/tools/pip_package:build_pip_package


Hi AastaLLL,

Thank you so much for your input. The GPU build came out fine! Everything seems to be working, but I'll start another build with the bazel CLI options you listed above.