Trouble with TensorFlow on TX2

Hi, I'm running TensorFlow 1.4.0 with Python 3.5 on a TX2, but it seems unstable.
When I run the classify_image.py script from the TensorFlow tutorials, most of the time (but not every time) I get the following errors:

nvidia@tegra-ubuntu:~/classify_image$ python3 classify_image.py

2018-02-19 14:06:50.212300: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:856] ARM64 does not support NUMA - returning NUMA node zero
2018-02-19 14:06:50.212441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.68GiB
2018-02-19 14:06:50.212503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-19 14:06:52.777955: W tensorflow/core/framework/op_def_util.cc:334] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2018-02-19 14:07:06.326244: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2018-02-19 14:07:06.326328: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x2f09f40: CUDA_ERROR_LAUNCH_FAILED
2018-02-19 14:07:06.326373: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x2f09f40: CUDA_ERROR_LAUNCH_FAILED
2018-02-19 14:07:06.326530: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2279] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape([1,80,73,73]) filter shape([3,3,80,192])
         [[Node: conv_4/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv_3, conv_4/conv2d_params)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "classify_image.py", line 227, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "classify_image.py", line 193, in main
    run_inference_on_image(image)
  File "classify_image.py", line 157, in run_inference_on_image
    {'DecodeJpeg/contents:0': image_data})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape([1,80,73,73]) filter shape([3,3,80,192])
         [[Node: conv_4/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv_3, conv_4/conv2d_params)]]

Caused by op 'conv_4/Conv2D', defined at:
  File "classify_image.py", line 227, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "classify_image.py", line 193, in main
    run_inference_on_image(image)
  File "classify_image.py", line 144, in run_inference_on_image
    create_graph()
  File "classify_image.py", line 127, in create_graph
    _ = tf.import_graph_def(graph_def, name='')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 313, in import_graph_def
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): cuDNN launch failure : input shape([1,80,73,73]) filter shape([3,3,80,192])
         [[Node: conv_4/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv_3, conv_4/conv2d_params)]]

2018-02-19 14:07:06.726066: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x2f09f40: CUDA_ERROR_LAUNCH_FAILED

System Information:

  • Jetson TX2
  • JetPack 3.1
  • Python 3.5.2
  • TensorFlow 1.4.0 https://github.com/lukejocz/tensorflow-1.4.0-cp35-cp35m-linux_aarch64
  • Python script: "classify_image.py" from the TensorFlow tutorials https://www.tensorflow.org/tutorials/image_recognition https://github.com/tensorflow/models/tree/master/tutorials/image/imagenet/classify_image.py

Thanks.

Hi,

Could you limit the amount of GPU memory allocation and give it a try?

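import tensorflow as tf

# Let TensorFlow grow its GPU memory use on demand instead of reserving most of it at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Pass the config along with any other arguments you already give to tf.Session.
session = tf.Session(config=config)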
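
If it helps, here is a rough sketch of the same idea wrapped in a small helper. The name make_session_config and the memory_fraction parameter are made up for illustration; allow_growth and per_process_gpu_memory_fraction are the standard GPUOptions fields. The optional fraction is an extra knob beyond the snippet above that caps how much GPU memory the process may take, and any value you pick would need tuning on your board.

```python
import tensorflow as tf

# TF 1.4 exposes ConfigProto directly on `tf`; later releases keep the
# 1.x API under tf.compat.v1, so fall back to that when it is present.
tf1 = getattr(tf.compat, "v1", tf)


def make_session_config(memory_fraction=None):
    """Hypothetical helper: session config that avoids grabbing all GPU memory."""
    config = tf1.ConfigProto()
    # Allocate GPU memory on demand rather than up front.
    config.gpu_options.allow_growth = True
    if memory_fraction is not None:
        # Optional hard cap, as a fraction (0.0 to 1.0) of total GPU memory.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config
```

Assuming classify_image.py opens its session with a plain tf.Session() inside run_inference_on_image, the change there would be to pass config=make_session_config() to that call.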

Thanks.