TensorFlow 1.5 on TX2 Errors

After initially following https://developer.nvidia.com/embedded/linux-tegra to install the R28.2 version of Linux for Tegra, I manually grabbed the Debian packages from the JetPack 3.2 developer preview (I couldn’t get JetPack to work) and installed cuDNN 7 and CUDA 9.0 (9.0.252) onto the Jetson TX2.

I then used the Python wheel provided here (GitHub - peterlee0127/tensorflow-nvJetson: TensorFlow for NVIDIA Jetson, also include patch and script for building.) to install TensorFlow 1.5 on the Jetson and was able to start basic TensorFlow sessions.

I am now testing an implementation of a deep neural network for object detection. I have successfully tested the network on a desktop workstation running CUDA 9.0 (9.0.176) and cuDNN 7. When I ran the same network on the Jetson, it worked only intermittently: TensorFlow sporadically begins throwing errors when inference is performed on an image.

2018-02-13 22:45:04.701424: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-13 22:45:04.701537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.66GiB
2018-02-13 22:45:04.701583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-13 22:45:05.898156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2018-02-13 22:45:12.327744: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:650] failed to record completion event; therefore, failed to create inter-stream dependency
2018-02-13 22:45:12.327835: E tensorflow/stream_executor/event.cc:40] could not create CUDA event: CUDA_ERROR_UNKNOWN
Segmentation fault (core dumped)

The first error, relating to the NUMA node, occurs on every run and appears to have little effect on the program. The second error (the failed CUDA event creation), however, kills the program with a segmentation fault. I have only been able to get the program to run again after rebooting the Jetson several times.

Any insight into the cause of this error would be greatly appreciated. I do not believe that this is a problem with TensorFlow, as this program works on both my laptop (without GPU support) and on a desktop NVIDIA machine using a GTX 1080 Ti. However, if it does appear to be a TensorFlow problem, I will take my concerns to the TensorFlow GitHub issues instead.

As an update, I was able to get image inference running on TensorFlow 1.5 by using the wheel (GitHub - peterlee0127/tensorflow-nvJetson: TensorFlow for NVIDIA Jetson, also include patch and script for building.) provided in this post (Available: TensorFlow 1.5 for Jetson TX2 - Jetson TX2 - NVIDIA Developer Forums). Note that the author of that wheel has recompiled TensorFlow 1.5 against CUDA 8 and cuDNN 6, not CUDA 9 and cuDNN 7 as mentioned on the GitHub page.

It appears that the error is within the CUDA 9 or cuDNN 7 libraries, as I have continued to use L4T R28.2 with the downgraded CUDA libraries with success. Hopefully this helps someone else get TensorFlow 1.5 inference running on the Jetson!

After running the above-mentioned solution for a day, the original errors popped up again. It appears that an internal CUDA call in TensorFlow is returning an error. cuEventCreate() is returning CUDA_ERROR_UNKNOWN. Is this an issue with the drivers available for the Jetson TX2?

After some discussion on the GitHub TensorFlow models issues list, it was discovered that these errors are likely due to the Jetson TX2 running out of memory.

Please see object_detection: Trained SSD-Inception-v2 Inference Errors on Jetson TX2 · Issue #3390 · tensorflow/models · GitHub for more information. Limiting the amount of memory available to the GPU in a TensorFlow session avoids the CUDA errors by ensuring that the Linux system and the CUDA system both have the memory they require. There are code samples at the provided link describing how to do this in TensorFlow.
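
For anyone who lands here first, this is a rough sketch of how I understand the memory cap, not a copy of the samples in the linked issue. It assumes the TensorFlow 1.x session API (tf.ConfigProto and its gpu_options fields). The helper names gpu_memory_fraction and make_session, and the 4.0 GiB reserve, are made up for illustration; the 7.67 GiB total is the totalMemory value from the log above. You will probably need to tune the reserve for your own model.

```python
def gpu_memory_fraction(total_gib, reserve_for_system_gib):
    """Return the share of total memory TensorFlow may claim for the GPU.

    On the TX2 the CPU and GPU share the same physical memory, so whatever
    is left for Linux has to be subtracted from the total first.
    """
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    usable = max(total_gib - reserve_for_system_gib, 0.0)
    return usable / total_gib


def make_session(fraction):
    """Create a TF 1.x session whose GPU allocator is capped at `fraction`."""
    import tensorflow as tf  # imported lazily; assumes a TensorFlow 1.x install

    config = tf.ConfigProto()
    # Upper bound on the share of device memory this process may allocate.
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    # Allocate memory on demand instead of grabbing the whole cap at startup.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)


if __name__ == "__main__":
    # 7.67 GiB is totalMemory from the log; the 4.0 GiB reserve is a guess.
    fraction = gpu_memory_fraction(7.67, 4.0)
    print("Capping TensorFlow at %.0f%% of memory" % (fraction * 100))
    sess = make_session(fraction)
```

The session returned by make_session would then be used for inference in place of a plain tf.Session().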

Hi,

Thanks for keeping us updated.

Here is a TF-1.5 with CUDA 9.0 package for your reference:

Thanks.

thanks for the updates