Hi, I have a Jetson Nano. I downloaded the SD card image (jetson-nano-sd-card-image) with JetPack 4.4 and created a Docker base image with the following Dockerfile:
FROM nvcr.io/nvidia/l4t-base:r32.4.3
WORKDIR /
RUN apt-get update && apt-get install -y --fix-missing make g++
RUN apt-get install -y --fix-missing python3-pip
RUN apt-get install -y python3-h5py
RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata
RUN apt-get install -y python3-opencv
RUN apt-get install -y python3-scipy
RUN apt-get install -y python3-dev
RUN pip3 install numpy cython
RUN apt-get install -y libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev liblapack-dev libblas-dev gfortran
RUN pip3 install -U pip testresources setuptools
RUN pip3 install -U numpy==1.16.1 future==0.18.2 mock==3.0.5 keras_preprocessing==1.1.1 keras_applications==1.0.8 gast==0.2.2 futures protobuf pybind11
RUN pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 'tensorflow==1.15.3'
RUN pip3 install Keras==2.3.1
RUN apt-get install -y python3-opencv unzip autoconf build-essential libtool
The goal is to classify an image with a pretrained VGG19 classification TensorFlow model optimized with TensorRT.
I start my Docker container like this:
docker run -it --gpus all --shm-size=4g --ulimit memlock=-1 inferencecontainer
My script loads the frozen graph from the given path, creates a Session with the flag tf_config.gpu_options.allow_growth = True, and obtains the input and output tensors by name with tf_sess.graph.get_tensor_by_name().
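A minimal sketch of the loading steps described above, assuming TensorFlow 1.15 and its tf.compat.v1 API. This is not the actual Predictor.py: the function name load_frozen_graph and the default tensor names input_1:0 and predictions/Softmax:0 are placeholders chosen for illustration, and the real graph may use different names.

```python
def load_frozen_graph(pb_path,
                      input_name="input_1:0",               # hypothetical tensor name
                      output_name="predictions/Softmax:0"):  # hypothetical tensor name
    """Load a frozen GraphDef and return (session, input_tensor, output_tensor)."""
    # Imported inside the function so that merely defining it does not require TensorFlow.
    import tensorflow as tf

    # Parse the serialized frozen graph from disk.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    # Import the graph without a name prefix so tensor names stay unchanged.
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")

    # Let the GPU allocator grow on demand instead of reserving memory up front.
    tf_config = tf.compat.v1.ConfigProto()
    tf_config.gpu_options.allow_growth = True
    tf_sess = tf.compat.v1.Session(graph=graph, config=tf_config)

    # Look up the input and output tensors by name.
    input_tensor = tf_sess.graph.get_tensor_by_name(input_name)
    output_tensor = tf_sess.graph.get_tensor_by_name(output_name)
    return tf_sess, input_tensor, output_tensor
```

Inference would then be a call such as tf_sess.run(output_tensor, feed_dict={input_tensor: image_batch}), where image_batch is an array of the expected input size.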
This is the log from the TensorFlow device creation step:
2020-09-25 21:09:32.042986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 65 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
(Only 65 MB of memory is assigned to the GPU device.)
When I run the session with tf_sess.run(output_tensor, feed_dict), passing a loaded image of the expected input size in feed_dict, it crashes with the following trace:
2020-09-25 21:10:28.983061: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 59.51MiB
2020-09-25 21:10:28.983097: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 66060288 memory_limit_: 68411392 available bytes: 2351104 curr_region_allocation_bytes_: 67108864
2020-09-25 21:10:28.983141: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 68411392
InUse: 62403584
MaxInUse: 62403584
NumAllocs: 26
MaxAllocSize: 14680064
2020-09-25 21:10:28.983191: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *********xx********____***********************xxx********************************************xxxxxxx
2020-09-25 21:10:28.983454: W tensorflow/core/framework/op_kernel.cc:1628] OP_REQUIRES failed at constant_op.cc:77 : Resource exhausted: OOM when allocating tensor of shape [3,3,512,512] and type float
2020-09-25 21:10:28.983619: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [3,3,512,512] and type float
[[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,3,512,512] and type float
[[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/app/app/main.py", line 17, in <module>
prediction = predictor.predict_frame(image)
File "/app/app/Predictor.py", line 81, in predict_frame
preds = self.tf_sess.run(self.output_tensor, feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,3,512,512] and type float
[[node vgg19/block5_conv2/Conv2D/ReadVariableOp (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'vgg19/block5_conv2/Conv2D/ReadVariableOp':
File "usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "app/app/main.py", line 10, in <module>
predictor = Predictor(trt_model_path,class_labels,image_size)
File "app/app/Predictor.py", line 26, in __init__
tf.import_graph_def(trt_graph, name="")
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
producer_op_list=producer_op_list)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 517, in _import_graph_def_internal
_ProcessNewOps(graph)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 243, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3561, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3561, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3451, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
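As a rough sanity check on the numbers in the log above, the small sketch below compares the size of the tensor that failed to allocate with the memory the allocator reports. The byte values are copied from the bfc_allocator lines; the helper name tensor_bytes is made up for illustration, and 4 bytes per element is assumed because the log reports type float.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4   # assumption: "type float" in the log means 32-bit floats
MIB = 1024 * 1024

def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the number of bytes needed to store a dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element

# Values copied from the allocator log lines.
memory_limit = 68411392            # memory_limit_ / Limit
in_use = 62403584                  # InUse
total_region_allocated = 66060288  # total_region_allocated_bytes_

needed = tensor_bytes([3, 3, 512, 512])            # the block5_conv2 kernel that failed
headroom = memory_limit - in_use                   # limit minus memory already in use
unreserved = memory_limit - total_region_allocated # matches "available bytes" in the log

print("tensor needs      : %d bytes (%.2f MiB)" % (needed, needed / MIB))
print("limit - in use    : %d bytes (%.2f MiB)" % (headroom, headroom / MIB))
print("limit - allocated : %d bytes (%.2f MiB)" % (unreserved, unreserved / MIB))
print("fits in headroom  : %s" % (needed <= headroom))
```

With these values the single convolution kernel needs about 9 MiB, which is more than what remains under the roughly 65 MiB limit, so the allocation cannot succeed regardless of fragmentation.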
Any idea what is causing the problem?
Thanks!