Hi *,
Has anyone else come across this issue when running TensorFlow?
2018-03-10 15:10:00.637536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-03-10 15:10:02.024532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0. Your kernel may not have been built with NUMA support.
2018-03-10 15:10:02.647497: E tensorflow/stream_executor/cuda/cuda_driver.cc:1110] could not synchronize on CUDA context: CUDA_ERROR_UNKNOWN :: No stack trace available
Traceback (most recent call last):
  File "examples/dqn_cartpole.py", line 41, in <module>
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])
  File "/usr/local/lib/python2.7/dist-packages/rl/agents/dqn.py", line 160, in compile
    self.target_model = clone_model(self.model, self.custom_model_objects)
  File "/usr/local/lib/python2.7/dist-packages/rl/util.py", line 15, in clone_model
    clone.set_weights(model.get_weights())
  File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 699, in get_weights
    return self.model.get_weights()
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2011, in get_weights
    return K.batch_get_value(weights)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2320, in batch_get_value
    return get_session().run(ops)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 189, in get_session
    [tf.is_variable_initialized(v) for v in candidate_vars])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
The issue seems to be intermittent, but it happens pretty much every day. Whenever I see this error, I power off the system, boot it again, and reinstall TensorFlow. After some time it works again. I have no logical explanation for this behavior.
I installed JetPack 3.2 with the CUDA version that comes with it, and I built TensorFlow 1.5 from source. If anyone has figured out why this is happening, please let me know!
Thanks
Sri