Hey folks, this likely isn’t an Nvidia image issue, but I’m looking for some help.
We’re spinning up P3 instances and following Nvidia’s guidelines, using Docker to pull nvcr.io/nvidia/tensorflow:19.01-py3.
Everything works except when we run TensorFlow and save checkpoints to S3. There seems to be a connectivity issue, as shown in this stack trace:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 789, in _export_saved_model_for_mode
    strip_default_attrs=strip_default_attrs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 907, in _export_all_saved_models
    mode=model_fn_lib.ModeKeys.PREDICT)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1011, in _add_meta_graph_for_mode
    graph_saver = estimator_spec.scaffold.saver or saver.Saver(sharded=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 786, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
    tensors)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): : Unable to connect to endpoint
  [[node save/SaveV2 (defined at /data/ml/kpi_tensorflow/TFModel.py:145) = SaveV2[dtypes=[DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, global_step/_7)]]
We’ve tried a few things, including setting S3_REQUEST_TIMEOUT_MSEC to a large value, but the connectivity issue still occurs. Has anyone run into a similar problem?
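
For anyone trying to reproduce this, here is a rough sketch of the S3-related environment settings that seem relevant to an "Unable to connect to endpoint" error. Only S3_REQUEST_TIMEOUT_MSEC comes from our actual setup. The other variables (AWS_REGION, S3_ENDPOINT, S3_USE_HTTPS, S3_VERIFY_SSL) are, as far as I understand, the ones read by the TensorFlow 1.x S3 filesystem. The region, endpoint, and timeout values are placeholders, and the helper names are hypothetical. The settings would need to be in place before TensorFlow touches any s3:// path, or be passed into the container with `docker run -e`.

```python
import os


def s3_env_settings(region, endpoint=None, request_timeout_msec=600000):
    """Build S3-related environment variables as a dict of strings.

    Hypothetical helper: the values below are placeholders, not a
    confirmed fix for the error in the traceback.
    """
    settings = {
        "AWS_REGION": region,        # region the bucket lives in
        "S3_USE_HTTPS": "1",         # talk to S3 over HTTPS
        "S3_VERIFY_SSL": "1",        # keep certificate verification on
        "S3_REQUEST_TIMEOUT_MSEC": str(request_timeout_msec),
    }
    if endpoint is not None:
        # Host name only, without a scheme (assumed format).
        settings["S3_ENDPOINT"] = endpoint
    return settings


def apply_settings(settings, environ=os.environ):
    """Copy the settings into the given environment mapping."""
    environ.update(settings)
    return environ


# Placeholder region and endpoint; substitute the bucket's real region.
settings = s3_env_settings("us-west-2", endpoint="s3.us-west-2.amazonaws.com")
apply_settings(settings)
```

If the bucket is not in the default region the S3 client assumes, a region or endpoint mismatch might produce this kind of error even when the timeout is generous, which is why those two variables are included here.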