Migrate SSD MobileNet v1 from TF1 to TF2

Hi everyone!!

We are migrating the object detection part of a project from TF1 to TF2, to run on an NVIDIA Jetson TX2. We can now train an SSD MobileNet v1 model [this] using the official “model_main_tf2.py” script, but only after reducing the batch size considerably, down to 32. We have a Quadro P4000 GPU (8 GB).
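For reference, we launch training with the standard command for that script (the paths below are placeholders for our actual ones):

python models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=/path/to/pipeline.config \
    --model_dir=/path/to/model_dir \
    --alsologtostderr

and the batch size is set in the train_config block of pipeline.config:

train_config {
  batch_size: 32
  ...
}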

We get warnings like the following, but training still completes:

2022-02-16 13:06:30.622479: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 475.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
..
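One mitigation we are considering for these warnings (not something we had enabled originally) is letting TensorFlow grow its GPU allocation on demand instead of reserving nearly all memory up front; a minimal sketch:

import tensorflow as tf

# Must run before any GPU is initialized. This reduces the up-front
# reservation and fragmentation warnings, but does not add physical memory.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)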

When we increase the batch size to 64, we get this error (after more of the warnings above):

(1) RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[64,64,320,320] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node ssd_mobile_net_v1_fpn_keras_feature_extractor/model/conv_pw_1_bn/FusedBatchNormV3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
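Sanity-checking the numbers from that message: the single activation tensor it failed to allocate is already about 1.6 GiB on its own:

# Tensor from the OOM message: shape [64, 64, 320, 320], dtype float (float32)
size_bytes = 64 * 64 * 320 * 320 * 4   # 4 bytes per float32 element
print(size_bytes / 2**30)              # ~1.56 GiB for one intermediate activation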

We understand that the GPU has no more free memory to allocate the complete model. However, we have previously trained a “similar” (we thought) model [this] on TF1 (1.14.0) with a maximum batch size of 100 on the same GPU, using the “legacy/train.py” script.
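For comparison, the TF1 run used the legacy trainer roughly like this (again with placeholder paths):

python models/research/object_detection/legacy/train.py \
    --logtostderr \
    --pipeline_config_path=/path/to/tf1_pipeline.config \
    --train_dir=/path/to/train_dir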

Digging into the issue, we discovered that TF2 uses eager execution by default, whereas TF1 defaults to graph mode. We read that graph mode is more efficient and robust, while eager mode is easier to use. So we switched the TF2 mode to graph using:

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
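(The supported public spelling of the same call, as far as we know, is tf.compat.v1.disable_eager_execution().)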

However, some parts of the Object Detection API 2 are not compatible with graph mode; the first error is:

File "/home/sts/.local/lib/python3.8/site-packages/tensorflow/python/distribute/input_lib.py", line 1383, in **iter**
raise RuntimeError(" **iter** () is only supported inside of tf.function "
RuntimeError: **iter** () is only supported inside of tf.function or when eager execution is enabled.
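As far as we can tell, the failing pattern is a plain Python loop over a distributed dataset; a minimal sketch of our own (not OD API code) that hits the same constraint:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
dataset = tf.data.Dataset.range(8).batch(4)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

# In eager mode (the TF2 default) this plain loop works:
for batch in dist_dataset:
    print(batch)

# With eager execution disabled, the same loop raises the RuntimeError above;
# iteration would have to happen inside a @tf.function instead.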

So, is there a known problem between the Object Detection API 2 and TF2? Why does it use so much more memory for the same model? Can we switch to graph mode to solve this, and how?

System:

  • Ubuntu 20.04
  • Python 3.8
  • NVIDIA driver 510
  • CUDA 11.2
  • cuDNN 8.1
  • TensorFlow 2.8
  • Object Detection API 2 (master, commit 9c8cbd0)

Thanks so much for helping us!

Hi,

It seems you are facing some issues with TensorFlow in a desktop environment.
We'd recommend checking with the TensorFlow team for better support instead.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.