Tensorflow on TX2 GPU sync error

austin.podoll · March 19, 2018, 8:57pm

I have installed tensorflow on my TX2 dev board using the instructions from:

For the most part everything works; I even installed Keras on top of it, which makes development much faster and easier. However, I’ve started getting a strange intermittent error when executing my neural networks.

TensorFlow detects the GPU, but during the ‘Creating TensorFlow device’ step it fails with the error:
GPU sync failed.

Has anyone else come across this error before? I’ve tried reinstalling tensorflow a few times but the error persists.

AastaLLL · March 20, 2018, 2:26am

Hi,

A common cause is an incompatible CUDA library or driver version.
Please remember that the wheel file shared here requires a JetPack 3.2 (CUDA 9) environment.

If you are already on JetPack 3.2, could you share the detailed error log with us?

Thanks.

austin.podoll · March 20, 2018, 12:24pm

I do have JetPack 3.2, but I had the developer preview before upgrading to the production release. So I’m going to try and uninstall JetPack from my host machine and reinstall everything fresh as opposed to upgrading. Below is my error log:

/home/nvidia/.local/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
2018-03-20 12:21:16.070911: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-03-20 12:21:16.071098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.60GiB
2018-03-20 12:21:16.071150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Adding visible gpu device 0
2018-03-20 12:21:17.330181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:987] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5115 MB memory) → physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-03-20 12:21:18.116623: E tensorflow/stream_executor/cuda/cuda_driver.cc:1110] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: *** Begin stack trace ***
perftools::gputools::cuda::CUDADriver::SynchronizeContext(perftools::gputools::cuda::CudaContext*)
perftools::gputools::StreamExecutor::SynchronizeAllActivity()
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)
*** End stack trace ***

Traceback (most recent call last):
File “Python2FastBoatFinder.py”, line 18, in <module>
model = load_model(‘…/…/models/Gen5_Ship_Classifier.h5’)
File “/home/nvidia/.local/lib/python2.7/site-packages/keras/models.py”, line 246, in load_model
topology.load_weights_from_hdf5_group(f[‘model_weights’], model.layers)
File “/home/nvidia/.local/lib/python2.7/site-packages/keras/engine/topology.py”, line 3382, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File “/home/nvidia/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py”, line 2373, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File “/home/nvidia/.local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py”, line 192, in get_session
[tf.is_variable_initialized(v) for v in candidate_vars])
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 895, in run
run_metadata_ptr)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1128, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1344, in _do_run
options, run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

austin.podoll · March 20, 2018, 5:19pm

I have an update.

I reinstalled JetPack 3.2 on my host and re-flashed my TX2. The problem persists, but it may be related to the amount of freeMemory that TensorFlow reports. So far, every time the freeMemory is over 5GiB the GPU sync error occurs. It sometimes occurs when the freeMemory is in the 4GiB range, although sometimes other memory errors occur instead. I have not seen the error when the freeMemory is in the 3GiB range.
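
To compare failures against free memory, one option is to record the available system memory just before the model is loaded. The following is a minimal sketch, not something from the original posts: the helper names are hypothetical, and it assumes that `MemAvailable` in `/proc/meminfo` is a reasonable proxy for the freeMemory value TensorFlow logs, which may not match exactly.

```python
import re


def mem_available_gib(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text in GiB, or None if absent."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        return None
    # /proc/meminfo reports kB (1024-byte units); convert to GiB.
    return int(match.group(1)) / (1024.0 * 1024.0)


def read_mem_available_gib(path="/proc/meminfo"):
    """Read the current MemAvailable value from the running Linux system."""
    with open(path) as f:
        return mem_available_gib(f.read())
```

Calling `read_mem_available_gib()` right before `load_model(...)` and logging the result alongside any error would show whether the failures track the 3GiB / 4GiB / 5GiB ranges described above.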

austin.podoll · March 22, 2018, 3:21pm

Well I have a solution. I am using the following code to reduce the amount of GPU memory that tensorflow grabs:

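import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Cap TensorFlow at 30% of GPU memory instead of letting it claim nearly all of it.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
# Make Keras use a session created with this config.
set_session(tf.Session(config=config))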

I can put the GPU memory fraction as high as 0.5, but that only works sometimes. 0.3 has yet to error on me.

I think the problem is that tensorflow by default wants to allocate all the GPU memory available. But in the case of the TX2 the GPU memory is also the CPU memory so maybe the OS or other processes won’t allow tensorflow to allocate the memory it wants. I guess really what I need to do is learn CUDA so I can take full advantage of the TX2!
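
As a rough worked example of what those fractions mean in absolute terms, the sketch below converts between a memory fraction and a budget in GiB. It assumes the fraction is applied to the total device memory (the 7.67GiB totalMemory in the log above), which is my understanding of TensorFlow 1.x behaviour rather than something confirmed in this thread; the function names are hypothetical.

```python
TOTAL_GIB = 7.67  # totalMemory reported in the TX2 log above


def budget_gib(fraction, total_gib=TOTAL_GIB):
    """Approximate memory cap in GiB for a given per-process memory fraction."""
    return fraction * total_gib


def fraction_for(target_gib, total_gib=TOTAL_GIB):
    """Fraction to request for a target budget in GiB, clamped to at most 1.0."""
    return min(1.0, target_gib / total_gib)


for fraction in (0.3, 0.5):
    print("fraction %.1f -> about %.1f GiB" % (fraction, budget_gib(fraction)))
```

Under that assumption, 0.3 corresponds to roughly 2.3 GiB and 0.5 to roughly 3.8 GiB, which lines up with the earlier observation that failures start to appear around the 4GiB range.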

AastaLLL · March 23, 2018, 7:47am

Hi,

Thanks for sharing your status with us.
This is a known issue for TensorFlow on Jetson.

It’s recommended to limit the amount of memory TensorFlow requests via this configuration:

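import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of reserving it all up front.
config.gpu_options.allow_growth = True

session = tf.Session(config=config)  # pass any other Session arguments as usual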

Per Tensor RT documentation,
------
by default it will try to allocate all the available GPU memory.
------

The available memory on the TX2 can sometimes be very high (up to 6.2 GB).
In an iGPU environment, such a large allocation will generally fail because the host and GPU share the same memory.
The workaround restricts the amount of memory that is allocated.
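
The two settings mentioned in this thread (`allow_growth` and `per_process_gpu_memory_fraction`) can be combined in one config. The helper below is a hypothetical sketch for the TensorFlow 1.x API used here, not an official recipe; `build_gpu_config` is an invented name, and the `tf` module is passed in as an argument so the helper has no import-time dependency.

```python
def build_gpu_config(tf, allow_growth=True, memory_fraction=None):
    """Build a TF 1.x ConfigProto that limits how much GPU memory is claimed.

    tf: the imported tensorflow module (TF 1.x style API assumed).
    allow_growth: allocate GPU memory on demand rather than all at once.
    memory_fraction: optional upper cap, as a fraction in (0, 1].
    """
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = allow_growth
    if memory_fraction is not None:
        if not 0.0 < memory_fraction <= 1.0:
            raise ValueError("memory_fraction must be in (0, 1]")
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


# Usage with Keras on TF 1.x (requires tensorflow and keras to be installed):
#   import tensorflow as tf
#   from keras.backend.tensorflow_backend import set_session
#   set_session(tf.Session(config=build_gpu_config(tf, memory_fraction=0.3)))
```

With both options set, memory should grow on demand up to the cap; whether 0.3 or another value is appropriate depends on what else is running on the board.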

You can check this topic for more information.
https://devtalk.nvidia.com/default/topic/1029742/jetson-tx2/tensorflow-1-6-not-working-with-jetpack-3-2/

Thanks.
