Running TensorFlow 1.3 on TX2 gets stuck

Hi, I’m running TensorFlow with Python 3.5 on a TX2, but it seems unstable. It runs normally only the first time I launch my Python script; every time after that I get messages like the ones below and it hangs.

2017-12-12 06:02:47.064075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-12 06:02:47.064203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 4.18GiB
2017-12-12 06:02:47.064255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-12 06:02:47.064279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-12 06:02:47.064310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
2017-12-12 06:04:09.279612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-12 06:04:09.279745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 4.33GiB
2017-12-12 06:04:09.279795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-12 06:04:09.279830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-12 06:04:09.279868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)

The GPU information is printed twice; it should appear only once when the script runs normally.
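As a quick way to confirm the duplicate initialization, the log can be scanned for repeated device-creation lines (a small sketch I wrote for illustration; the marker string matches the logs above):

```python
# Count how many times TensorFlow reports creating the GPU device.
# More than one occurrence in a single run suggests the device is
# being initialized twice (e.g. by two processes or two sessions).

MARKER = "Creating TensorFlow device"

def count_device_creations(log_text):
    """Return the number of GPU device-creation lines in a TF log."""
    return sum(1 for line in log_text.splitlines() if MARKER in line)

log = """\
2017-12-12 06:02:47: I ... Creating TensorFlow device (/gpu:0) -> ...
2017-12-12 06:04:09: I ... Creating TensorFlow device (/gpu:0) -> ...
"""
print(count_device_creations(log))  # -> 2 for the log above
```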

I rebooted the TX2 just now and got an error message like this:

2017-12-12 06:21:32.375742: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-12 06:21:32.375870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 5.09GiB
2017-12-12 06:21:32.375923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-12 06:21:32.376007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-12 06:21:32.376039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
2017-12-12 06:22:14.858684: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2017-12-12 06:22:14.858769: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0xaedda10: CUDA_ERROR_LAUNCH_FAILED
2017-12-12 06:22:14.858799: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0xaedda10: CUDA_ERROR_LAUNCH_FAILED
2017-12-12 06:22:14.858956: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2045] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
2017-12-12 06:23:02.713872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-12 06:23:02.713999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 3.72GiB
2017-12-12 06:23:02.714054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-12 06:23:02.714079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-12 06:23:02.714105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)

Hi,

Which TensorFlow build do you use?

Usually, we use this public build:

We can launch TensorFlow correctly with JetPack3.1.
Could you also give it a try?

Thanks.

I built my TensorFlow according to https://syed-ahmed.gitbooks.io/nvidia-jetson-tx2-recipes/content/first-question.html and it runs OK.

python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
>>> print(tf.__version__)
1.3.0
>>>

Hi, my problem remains even though TensorFlow itself seems to work fine. Can you help me?

Hi,

Could you try TensorFlow 1.3.0 or the wheel shared in comment #4?
Based on this issue, the CUDA_ERROR_LAUNCH_FAILED error went away after upgrading the environment to TensorFlow 1.3.0 and cuDNN v6.

Thanks.

The build at https://github.com/peterlee0127/tensorflow-tx2 says it is only for Python 2.7, not Python 3.5. I’ve built TensorFlow 1.3.0 from source and also used the Python 3.5 build at https://github.com/jetsonhacks/installTensorFlowJetsonTX; both give the same errors as above.

Note: for me it does appear to work with Python 2.7 using the build from https://github.com/peterlee0127/tensorflow-tx2; I’m just not sure why it doesn’t work with Python 3.5.

Here are similar problems reported by other users:

https://dev-videos.com/videos/V51IO7kNXCg/TensorFlow-Install-on-NVIDIA-Jetson-TX2

https://github.com/tensorflow/tensorflow/issues/15075

Someone suggested reducing the batch size, but my script does inference, not training: https://stackoverflow.com/questions/47116203/tensorflow-cuda-fails-with-error-failed-to-enqueue-convolution-on-stream-cudnn

Hi,

CUDA_ERROR_LAUNCH_FAILED usually comes from incorrect CUDA version/driver or GPU architecture.

Here is another public TensorFlow build for Python 3.5:

Could you reflash the TX2 with JetPack 3.1 and give this wheel a try?
Thanks.

Yes, I did flash my TX2 with JetPack 3.1, and I just uninstalled and reinstalled TensorFlow as you recommended, but the error remains the same. Thank you for your help.

Hi,

Thanks for your feedback.
We will check this issue and get back to you with more information later.

Hi,

We can run TensorFlow correctly with python 3.5:

nvidia@tegra-ubuntu:/media/nvidia/NVIDIA$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-12-15 03:22:31.509179: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-15 03:22:31.509304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 369.01MiB
2017-12-15 03:22:31.509358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-15 03:22:31.509383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-15 03:22:31.509406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
>>> print(sess.run(hello))
b'Hello, TensorFlow!'

Here are our steps:
1. Flash TX2 with JetPack3.1
2. Upgrade cuDNNv7 via this package
3. Install TensorFlow

$ sudo apt-get install -y python3-pip python3-dev
$ pip3 install tensorflow-1.3.0-cp35-cp35m-linux_aarch64.whl
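To confirm that step 2 actually upgraded cuDNN, one option is to read the version macros from cudnn.h (an illustrative sketch; the header usually lives under /usr/include or /usr/include/aarch64-linux-gnu depending on the install, and the helper name is mine):

```python
import re

def cudnn_major_version(header_text):
    """Extract CUDNN_MAJOR from the contents of cudnn.h, or None."""
    m = re.search(r"#define\s+CUDNN_MAJOR\s+(\d+)", header_text)
    return int(m.group(1)) if m else None

# Example usage against the real header (path may differ per install):
# with open("/usr/include/cudnn.h") as f:
#     print(cudnn_major_version(f.read()))  # expect 7 after the upgrade

sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 0\n"
print(cudnn_major_version(sample))  # -> 7
```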

Could you follow our steps and check if the issue remains?
If yes, please help to test a CUDA sample for GPU functionality.

$ /usr/local/cuda-8.0/bin/cuda-install-samples-8.0.sh .
$ cd NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd
$ make && ./vectorAdd

Thanks, and please let us know the results.

Updating cuDNN via the tar file does not seem to work; the output of the check is:

sudo dpkg -l | grep TensorRT
[sudo] password for nvidia: 
ii  libnvinfer-dev                        3.0.2-1+cuda8.0      arm64        TensorRT development libraries and headers
ii  libnvinfer3                           3.0.2-1+cuda8.0        arm64        TensorRT runtime libraries
ii  tensorrt-2.1.2                        3.0.2-1+cuda8.0        arm64        Meta package of TensorRT

Whereas installing via the deb file seems to work better:

sudo dpkg -l | grep TensorRT

ii  libnvinfer-dev                                              4.0.0-1+cuda8.0                                       arm64        TensorRT development libraries and headers
ii  libnvinfer-samples                                          4.0.0-1+cuda8.0                                       arm64        TensorRT samples and documentation
ii  libnvinfer3                                                 3.0.2-1+cuda8.0                                       arm64        TensorRT runtime libraries
ii  libnvinfer4                                                 4.0.0-1+cuda8.0                                       arm64        TensorRT runtime libraries
ii  tensorrt                                                    3.0.0-1+cuda8.0                                       arm64        Meta package of TensorRT
ii  tensorrt-2.1.2                                              3.0.2-1+cuda8.0                                       arm64        Meta package of TensorRT

Copying the test files:

/usr/local/cuda-8.0/bin/cuda-install-samples-8.0.sh .
Copying samples to ./NVIDIA_CUDA-8.0_Samples now...
Finished copying samples.

Running test code:

make && ./vectorAdd
/usr/local/cuda-8.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_62,code=compute_62 -o vectorAdd.o -c vectorAdd.cu
/usr/local/cuda-8.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_62,code=compute_62 -o vectorAdd vectorAdd.o 
mkdir -p ../../bin/aarch64/linux/release
cp vectorAdd ../../bin/aarch64/linux/release
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

The TensorFlow test script runs fine, but note the reported memory: “Total memory: 7.67GiB, Free memory: 369.01MiB”. I ran my inference script and the problem remains:

2017-12-15 05:55:47.193361: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-15 05:55:47.193494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 3.67GiB
2017-12-15 05:55:47.193548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-15 05:55:47.193576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-15 05:55:47.193603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
2017-12-15 05:57:12.935098: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-12-15 05:57:12.935343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 3.82GiB
2017-12-15 05:57:12.935439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-12-15 05:57:12.935483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-12-15 05:57:12.935531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)

I’m testing a Faster R-CNN (ResNet v2) inference script; the same script runs fine on a small PC box (Intel i5, 4 GB RAM, no GPU) at 44 seconds per image (600x800).

I checked my script’s running state: every time I launch my script message_server.py, two processes are running:

ps -aux | grep python

nvidia   2945   39.6  2.0 1800792   165368   pts/7  Sl+  06:23   0:07    python3   message_server.py
nvidia   3021   95.5  8.2 1713920   662340   pts/7  R+   06:23   0:13    python3   message_server.py
nvidia   3034   0.0   0.0 5560      604      pts/2  S+   06:23   0:00    grep     --color=auto message_server.py
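For reference, the same `ps | grep` check can be done in pure Python by scanning /proc (a Linux-only sketch; the helper name is mine):

```python
import os

def count_processes_matching(needle):
    """Count processes whose command line contains `needle` (Linux /proc)."""
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or permission denied
        if needle in cmdline:
            count += 1
    return count

# Two results here would mirror the duplicate message_server.py entries above.
print(count_processes_matching("message_server.py"))
```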

I tested another script, test_tensorflow.py, whose content is:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

and only one process appears.

So could the problem be caused by two processes competing for the GPU? But how did this happen?
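If two processes really are sharing the GPU, one general TF 1.x configuration worth trying (a sketch based on the standard TensorFlow 1.x session options, not something suggested earlier in this thread; it needs a working TF install to run) is to stop each session from mapping nearly all GPU memory up front:

```python
import tensorflow as tf

# By default TF 1.x maps almost all free GPU memory when a session is
# created, which starves any second process sharing the same GPU.
# Either grow allocations on demand, or cap each process to a fraction.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, a hard cap per process:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
```

On a shared-memory board like the TX2 (where CPU and GPU draw from the same 8 GB), this also leaves more memory for the rest of the system.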

After upgrading to cuDNNv7, it works with Python 3.5 for me. Thanks AastaLLL.

Sorry, my problem remains, but updating to cuDNN v7 worked according to garrett.floft’s reply. I’ll close this topic.

Actually, I ran into a situation where TensorFlow on a TX1 gets stuck and runs very slowly with both Python 3.5 and Python 2.7. My TX1 has R28.1.
Does anyone know how to update to cuDNN v7?

I fixed my TensorFlow hang by rebuilding the Tegra R28.1 kernel and creating a swap file.

And the NUMA warning does not really affect things.
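On the swap-file point: after creating and enabling the swap file, /proc/meminfo can confirm it is active (a Linux-only sketch; the helper name is mine):

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from the contents of /proc/meminfo, or 0."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return 0

# Example usage on a live system:
# with open("/proc/meminfo") as f:
#     print(swap_total_kb(f.read()))  # > 0 once the swap file is enabled

sample = "MemTotal: 8039168 kB\nSwapTotal: 4194300 kB\n"
print(swap_total_kb(sample))  # -> 4194300
```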