TensorFlow crash when running inference on Jetson Nano

Hi, I have a Jetson Nano. I downloaded the SD card image from jetson-nano-sd-card-image with JetPack 4.4 and created a Docker base image with the following Dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.4.3

WORKDIR /

RUN apt-get update && apt-get install -y --fix-missing make g++

RUN apt-get install -y --fix-missing python3-pip

RUN apt-get install -y python3-h5py

RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata

RUN apt-get install -y python3-opencv

RUN apt-get install -y python3-scipy

RUN apt-get install -y python3-dev

RUN pip3 install numpy cython

RUN apt-get install -y libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev liblapack-dev libblas-dev gfortran

RUN pip3 install -U pip testresources setuptools

RUN pip3 install -U numpy==1.16.1 future==0.18.2 mock==3.0.5 keras_preprocessing==1.1.1 keras_applications==1.0.8 gast==0.2.2 futures protobuf pybind11

RUN pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 'tensorflow==1.15.3'

RUN pip3 install Keras==2.3.1

RUN apt-get install -y python3-opencv unzip autoconf build-essential libtool

The goal is to infer the class of an image with a pretrained VGG19 TensorFlow classification model optimized with TensorRT.

When I start my Docker container like this:

docker run -it --gpus all --shm-size=4g --ulimit memlock=-1 inferencecontainer

my script loads the frozen graph from the given path, creates a Session with the flag tf_config.gpu_options.allow_growth = True, and defines the input and output tensors by looking them up by name with tf_sess.graph.get_tensor_by_name().
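
For reference, this is roughly what the loading and inference code does (a minimal sketch; the model path, tensor names, and input size below are placeholders, not the real values from my project):

import numpy as np
import tensorflow as tf

# Load the TensorRT-optimized frozen graph from disk (placeholder path).
trt_graph = tf.GraphDef()
with tf.gfile.GFile("model/vgg19_trt.pb", "rb") as f:
    trt_graph.ParseFromString(f.read())
tf.import_graph_def(trt_graph, name="")

# Let TensorFlow grow GPU memory on demand instead of pre-allocating it all.
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_sess = tf.Session(config=tf_config)

# Fetch the input and output tensors by name (placeholder names).
input_tensor = tf_sess.graph.get_tensor_by_name("input_1:0")
output_tensor = tf_sess.graph.get_tensor_by_name("predictions/Softmax:0")

# Run inference on one preprocessed image of the expected input size
# (a zero array stands in for a real image here).
image = np.zeros((1, 224, 224, 3), dtype=np.float32)
preds = tf_sess.run(output_tensor, feed_dict={input_tensor: image})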

This is the log of the TensorFlow device creation step:

2020-09-25 21:09:32.042986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 65 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)

(Note that it assigns only 65 MB of memory.)

When I run the Session with tf_sess.run(output_tensor, feed_dict), feeding a loaded image of the expected input size, it crashes with the following trace:

2020-09-25 21:10:28.983061: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 59.51MiB
2020-09-25 21:10:28.983097: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 66060288 memory_limit_: 68411392 available bytes: 2351104 curr_region_allocation_bytes_: 67108864
2020-09-25 21:10:28.983141: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                    68411392
InUse:                    62403584
MaxInUse:                 62403584
NumAllocs:                      26
MaxAllocSize:             14680064

2020-09-25 21:10:28.983191: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *********xx********____***********************xxx********************************************xxxxxxx
2020-09-25 21:10:28.983454: W tensorflow/core/framework/op_kernel.cc:1628] OP_REQUIRES failed at constant_op.cc:77 : Resource exhausted: OOM when allocating tensor of shape [3,3,512,512] and type float
2020-09-25 21:10:28.983619: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [3,3,512,512] and type float
	 [[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,3,512,512] and type float
	 [[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/app/main.py", line 17, in <module>
    prediction = predictor.predict_frame(image)
  File "/app/app/Predictor.py", line 81, in predict_frame
    preds = self.tf_sess.run(self.output_tensor, feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,3,512,512] and type float
	 [[node vgg19/block5_conv2/Conv2D/ReadVariableOp (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'vgg19/block5_conv2/Conv2D/ReadVariableOp':
  File "usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "app/app/main.py", line 10, in <module>
    predictor = Predictor(trt_model_path,class_labels,image_size)
  File "app/app/Predictor.py", line 26, in __init__
    tf.import_graph_def(trt_graph, name="")
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 517, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3561, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3561, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3451, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Any idea what is causing the problem?

Thanks!

Based on the log, the error is caused by running out of memory.

Please note that TensorFlow may occupy more memory when running inference.
This memory is used for the input, output, and some intermediate tensors.

Have you tried the same model on a desktop before?
If so, would you mind checking the peak memory usage of TensorFlow there first?
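
One way to read that peak from TensorFlow itself is the memory_stats op; a minimal sketch, assuming a TF 1.x build where tf.contrib.memory_stats is available and reusing the session (tf_sess) from your script:

import tensorflow as tf

# Graph node that reports the peak GPU memory TensorFlow has used so far.
# (tf.contrib is TF 1.x only; it was removed in TF 2.0.)
with tf.device("/gpu:0"):
    max_bytes_in_use = tf.contrib.memory_stats.MaxBytesInUse()

# Evaluate it after running your normal inference in the same session.
peak = tf_sess.run(max_bytes_in_use)
print("Peak GPU memory in use: %.1f MiB" % (peak / (1024.0 * 1024.0)))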

Thanks.