S3 Connectivity issues when using Tensorflow with AWS P3 EC2 Nvidia Volta images

albertlim322 · April 4, 2019, 4:28am

Hey folks, this likely isn’t an issue with the Nvidia image itself, but I’m looking for some help.

We’re spinning up P3 instances and following Nvidia’s guidelines, using Docker to pull nvcr.io/nvidia/tensorflow:19.01-py3

Everything works except when we run TensorFlow and save checkpoints to S3. There seem to be connectivity issues, as shown in this stack trace:

File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 789, in _export_saved_model_for_mode
    strip_default_attrs=strip_default_attrs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 907, in _export_all_saved_models
    mode=model_fn_lib.ModeKeys.PREDICT)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1011, in _add_meta_graph_for_mode
    graph_saver = estimator_spec.scaffold.saver or saver.Saver(sharded=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 786, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
    tensors)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): : Unable to connect to endpoint
         [[node save/SaveV2 (defined at /data/ml/kpi_tensorflow/TFModel.py:145)  = SaveV2[dtypes=[DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, global_step/_7)]]

We’ve tried a few things, including setting S3_REQUEST_TIMEOUT_MSEC to a large value, but the connectivity issue still occurs. Has anyone run into similar problems?
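For anyone trying to reproduce this, here is a minimal sketch of how the S3-related environment variables can be set from Python before TensorFlow touches S3. The variable names are the ones TensorFlow 1.x's S3 filesystem reads, as far as I know. Every value (region, endpoint, timeouts) is an illustrative placeholder rather than our real configuration, and `apply_s3_settings` is a hypothetical helper, not part of TensorFlow.

```python
import os

# Environment variables read by TensorFlow's S3 filesystem (TF 1.x).
# All values are illustrative placeholders; adjust them to your bucket.
S3_SETTINGS = {
    "AWS_REGION": "us-west-2",                    # assumed bucket region
    "S3_ENDPOINT": "s3.us-west-2.amazonaws.com",  # assumed regional endpoint
    "S3_USE_HTTPS": "1",
    "S3_VERIFY_SSL": "1",
    "S3_CONNECT_TIMEOUT_MSEC": "60000",
    "S3_REQUEST_TIMEOUT_MSEC": "600000",
}


def apply_s3_settings(settings, environ=os.environ):
    """Copy settings into the environment and return the keys that changed."""
    changed = []
    for key, value in settings.items():
        if environ.get(key) != value:
            environ[key] = value
            changed.append(key)
    return changed


# Apply the settings before the first S3 access; doing it before the
# TensorFlow import is the safe option.
apply_s3_settings(S3_SETTINGS)
# import tensorflow as tf
```

Since training runs inside the Docker container, the variables need to be present in the container's environment (for example, passed with `docker run -e`), not only on the host.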
