S3 Connectivity issues when using Tensorflow with AWS P3 EC2 Nvidia Volta images

Hey folks, this likely isn’t an Nvidia image issues, but just looking for some help.

So we’re spinning up P3 instances, and following Nvidias guidelines; using Docker to pull nvcr.io/nvidia/tensorflow:19.01-py3

Everything is great except when we run Tensorflow and save checkpoints to S3. There seems to be connectivity issues as per stacktrace:

File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 789, in _export_saved_model_for_mode
    strip_default_attrs=strip_default_attrs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 907, in _export_all_saved_models
    mode=model_fn_lib.ModeKeys.PREDICT)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1011, in _add_meta_graph_for_mode
    graph_saver = estimator_spec.scaffold.saver or saver.Saver(sharded=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 786, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
    tensors)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): : Unable to connect to endpoint
         [[node save/SaveV2 (defined at /data/ml/kpi_tensorflow/TFModel.py:145)  = SaveV2[dtypes=[DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, global_step/_7)]]

We’ve done a few things including setting S3_REQUEST_TIMEOUT_MSEC to a large value, but the connectivity issue is still happening. Wondering if anyone has run into similar problems?