Resource exhausted: OOM when allocating tensor with shape[5,512,50,158] .../task:0/device:GPU:0 by allocator GPU_0_bfc

I’m running tlt-train on a default DSSD dataset using the Docker container.

I’m encountering the following error:

2021-01-04 18:52:40.302479: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
2021-01-04 18:52:40.302542: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_grad_input_ops.cc:1063 : Resource exhausted: OOM when allocating tensor with shape[5,512,50,158] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 45, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 248, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 185, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[5,512,50,158] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node training_1/SGD/gradients/ssd_loc_0/convolution_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

To lower memory usage, I’ve already reduced the eval_config batch_size to 1 and the training_config batch_size_per_gpu to 5, but I still get the out-of-memory error when training runs in the Docker container.
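
For reference, the relevant parts of my training spec now look roughly like this (a trimmed sketch; field names follow the default DSSD spec file, and all other settings are left at their defaults):

    training_config {
      batch_size_per_gpu: 5   # lowered from the default to reduce GPU memory use
      ...
    }
    eval_config {
      batch_size: 1           # evaluation batch size lowered to 1
      ...
    }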

My local machine has 16 GB of RAM, but I don’t control how much memory is allocated to the Docker container. How can I resolve the OOM error?
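
In case it matters, this is how I’m checking GPU memory from inside the container (assuming nvidia-smi is available there):

    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv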