GPU sync failed on TX2 when running TensorFlow

Hi *,
Has anyone else come across this issue when running TensorFlow?

2018-03-10 15:10:00.637536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-03-10 15:10:02.024532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0. Your kernel may not have been built with NUMA support.
2018-03-10 15:10:02.647497: E tensorflow/stream_executor/cuda/cuda_driver.cc:1110] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: No stack trace available
Traceback (most recent call last):
File "examples/dqn_cartpole.py", line 41, in <module>
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
File "/usr/local/lib/python2.7/dist-packages/rl/agents/dqn.py", line 160, in compile
self.target_model = clone_model(self.model, self.custom_model_objects)
File "/usr/local/lib/python2.7/dist-packages/rl/util.py", line 15, in clone_model
clone.set_weights(model.get_weights())
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 699, in get_weights
return self.model.get_weights()
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2011, in get_weights
return K.batch_get_value(weights)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2320, in batch_get_value
return get_session().run(ops)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 189, in get_session
[tf.is_variable_initialized(v) for v in candidate_vars])
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 895, in run
run_metadata_ptr)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1128, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1344, in _do_run
options, run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed

The issue is intermittent but happens pretty much every day. Whenever I see this error, I power the system off and on again and reinstall TensorFlow, and after some time it works again. I have no logical explanation for this behavior.

I installed JetPack 3.2 and the CUDA version that comes with it, and built TensorFlow 1.5 from source. If anyone has figured out why this is happening, please let me know!

Thanks
Sri

Hi,

This is a known issue with TensorFlow on Jetson.
We found that it often occurs when TensorFlow wants to allocate more than 5 GB of GPU memory.

Try this configuration, which may help:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

session = tf.Session(config=config, ...)
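
If allow_growth alone is not enough, TF 1.x also lets you cap the fraction of GPU memory the process may allocate. Whether this avoids the sync failure on the TX2 is an assumption on my part, but it is one way to keep the allocation below the ~5 GB point mentioned above:

import tensorflow as tf

# Sketch (assumed, not verified on TX2): limit TensorFlow to roughly half of
# the GPU memory rather than letting the allocation grow toward the full device.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
session = tf.Session(config=config)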

Thanks.

Hi,

I tried the configuration above, and it does typically overcome the GPU sync failure.

However, the change in configuration slows the entire process down.

A task that previously reached 6.87 G now takes 7.57 G with the new configuration (which I enabled after going over the limit and hitting the GPU sync failure).

Essentially, it is significantly slower, and I wanted to know if there is another fix for this GPU sync failure, or a way to almost "flush" the memory.
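
For example, something along these lines between tasks is what I have in mind (just a sketch; I am not certain that Keras's clear_session() actually returns the GPU allocation to the system on the TX2):

from keras import backend as K

# Drop the current graph and session once a task finishes so the next task
# starts clean (assumption: this releases the Python-side references; the
# GPU memory itself may or may not be given back).
K.clear_session()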

Thanks!

As a follow-up, I tried the config = tf.ConfigProto() code again, and this time it did NOT fix the GPU sync failure.

Any other advice on how to fix this issue?
Thanks!

Hi,

There is currently no fix available for this issue. Here is some information:

We are working on building an official TensorFlow package for Jetson.
It will include several Jetson-specific changes and complete testing.

This package will provide users with a more stable environment, and we hope your issue will also be fixed then.
Please watch for our announcement for the latest information.

Thanks.

Hi, I had the same problem and fixed it by installing libhdf5-dev and python-h5py.
You can try it:

sudo apt-get install libhdf5-dev
sudo apt-get install python-h5py

and setting GPU memory allow_growth:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Let TensorFlow allocate GPU memory on demand, then register this session
# as the one Keras uses.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
set_session(sess)

I hope it helps.

Thanks for sharing. :)