ResourceExhaustedError - Unable to run Tensorflow DeepLab model demo local_test.sh

L4T 32.2.0
Ubuntu 18.04.3
Kernel Version: 4.9.140-tegra
Python: 3.6.8
CUDA 10.0.326
Xavier PWR Mode: MAXN
Tensorflow: v1.14.0
Model: DeepLab

model_test.py runs successfully, but local_test.sh fails when run as follows
(from the tensorflow/models/research/deeplab directory):

$  sh local_test.sh

Error:

2019-08-27 17:02:47.700148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 120 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2019-08-27 17:03:05.437609: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4B (rounded to 256).  Current allocation summary follows.
2019-08-27 17:03:05.437756: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (256): 	Total Chunks: 96, Chunks in use: 96. 24.0KiB allocated for chunks. 24.0KiB in use in bin. 384B client-requested in use in bin.
2019-08-27 17:03:05.437818: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-08-27 17:03:05.437868: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (1024): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

Eventually followed by:

2019-08-27 17:03:05.439940: I tensorflow/core/common_runtime/bfc_allocator.cc:780] Bin for 256B was 256B, Chunk State: 
2019-08-27 17:03:05.440025: I tensorflow/core/common_runtime/bfc_allocator.cc:793] Next region of size 125837312
2019-08-27 17:03:05.440110: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x209e99000 next 1 of size 256
2019-08-27 17:03:05.440194: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x209e99100 next 2 of size 3072
2019-08-27 17:03:05.440277: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x209e99d00 next 3 of size 3072

Eventually followed by:

2019-08-27 17:03:05.461770: I tensorflow/core/common_runtime/bfc_allocator.cc:809]      Summary of in-use Chunks by size: 
2019-08-27 17:03:05.461800: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 96 Chunks of size 256 totalling 24.0KiB
2019-08-27 17:03:05.461820: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 480 Chunks of size 3072 totalling 1.41MiB
2019-08-27 17:03:05.461841: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 48 Chunks of size 26368 totalling 1.21MiB
2019-08-27 17:03:05.461861: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 47 Chunks of size 2119936 totalling 95.02MiB
2019-08-27 17:03:05.461880: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 23435520 totalling 22.35MiB
2019-08-27 17:03:05.461936: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 120.01MiB
2019-08-27 17:03:05.461963: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 125837312 memory_limit_: 125837312 available bytes: 0 curr_region_allocation_bytes_: 251674624
2019-08-27 17:03:05.462013: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: 
Limit:                   125837312
InUse:                   125837312
MaxInUse:                125837312
NumAllocs:                     672
MaxAllocSize:             23435520

2019-08-27 17:03:05.462086: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ********************************************************************************************xxxxxxxx
2019-08-27 17:03:05.462150: W tensorflow/core/framework/op_kernel.cc:1479] OP_REQUIRES failed at constant_op.cc:77 : Resource exhausted: OOM when allocating tensor of shape [] and type float
2019-08-27 17:03:05.462213: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [] and type float
	 [[{{node xception_65/exit_flow/block2/unit_1/xception_module/separable_conv3_pointwise/weights/Initializer/truncated_normal/stddev}}]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [] and type float
	 [[{{node xception_65/exit_flow/block2/unit_1/xception_module/separable_conv3_pointwise/weights/Initializer/truncated_normal/stddev}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/NVMe500/workspace/models/research/deeplab/train.py", line 513, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/mnt/NVMe500/workspace/models/research/deeplab/train.py", line 505, in main
    hooks=[stop_hook]) as sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1200, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1205, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 871, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [] and type float
	 [[node xception_65/exit_flow/block2/unit_1/xception_module/separable_conv3_pointwise/weights/Initializer/truncated_normal/stddev (defined at mnt/NVMe500/workspace/models/research/deeplab/core/xception.py:180) ]]

Original stack trace for 'xception_65/exit_flow/block2/unit_1/xception_module/separable_conv3_pointwise/weights/Initializer/truncated_normal/stddev':
  File "mnt/NVMe500/workspace/models/research/deeplab/train.py", line 513, in <module>
    tf.app.run()

I'm new to TF and would appreciate any advice on how to resolve this error.

Thank you in advance!

Hi,

Based on the error message, the device is running out of memory:

ResourceExhaustedError: OOM when allocating tensor of shape [] and type float

Note that TensorFlow was only able to claim about 120 MB of GPU memory at startup ("Created TensorFlow device ... with 120 MB memory"); since the Xavier's GPU shares physical RAM with the CPU, this suggests most of the system memory was already in use before training started.

Could you reboot the system and try it again?
It's a known issue that memory may not be freed correctly by a TensorFlow session in some unexpected scenarios.
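
If the OOM comes back even after a reboot, a common mitigation on TF 1.x is to let the GPU allocator grow on demand instead of reserving a fixed region up front. Below is a minimal sketch, assuming you can thread a session config into wherever deeplab/train.py builds its session; it is not code shipped with DeepLab:

import tensorflow as tf

# Let the BFC allocator request GPU memory on demand rather than
# pre-allocating one fixed region at session-creation time.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

# tf.train.MonitoredTrainingSession (TF 1.x) accepts a `config` argument.
with tf.train.MonitoredTrainingSession(config=session_config) as sess:
    pass  # the training loop would run here

Lowering the training load (for example via train.py's --train_batch_size or --train_crop_size flags) can also help on a memory-constrained device.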

If the error keeps showing, would you mind monitoring the system status at the same time and sharing the log with us?

sudo tegrastats

(tegrastats prints RAM, swap, CPU, and GPU utilization roughly once per second, so the log will show how much memory is free when the OOM occurs.)

Thanks.

This worked swimmingly, thank you for your help!