TAO 5.0 EfficientDet: Error converting dataset to TFRecord

• Hardware: A6000
• Network Type: efficientdet_tf1, efficientdet_tf2
• TLT Version: TAO 5.0.0
format_version: 3.0
toolkit_version: 5.0.0
published_date: 07/14/2023

I am testing object detection models after the TAO 5.0 update. I successfully trained Deformable DETR with my custom dataset in COCO format. Similarly, I tried to train an EfficientDet model with the same training data, but the following error occurred during dataset_convert to TFRecords.

I followed the EfficientDet notebook from tao-getting-started.

The cell where the error occurred is shown below.

# convert training data to TFRecords
!tao model efficientdet_tf2 dataset_convert -e $SPECS_DIR/spec_train.yaml \
    dataset_convert.results_dir=$DATA_DOWNLOAD_DIR

The error is as follows:

2023-07-31 22:22:58,832 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 22:22:58,928 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
2023-07-31 22:22:58,978 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-07-31 13:23:00.044540: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1690809782.736606] [dgx-a100:44   :f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '������������������������������': Invalid argument
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-ui7ofeui because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
<frozen cv.efficientdet.scripts.dataset_convert>:345: UserWarning: 
'spec_train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
Dataset_convert results will be saved at: /workspace/tao-experiments/data
Log file already exists at /workspace/tao-experiments/data/status.json
Starting efficientdet data conversion.
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/train
writing to output path: /workspace/tao-experiments/data/train
INFO:tensorflow:Building bounding box index.
Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
0 images are missing bboxes.
segmentation groundtruth is missing in object: 5.
Error executing job with overrides: ['dataset_convert.results_dir=/workspace/tao-experiments/data']
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 218, in _pool_create_tf_example
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 134, in create_tf_example
ValueError: segmentation groundtruth is missing in object: 5.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 341, in main
  File "<frozen common.decorators>", line 88, in _func
  File "<frozen common.decorators>", line 61, in _func
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 325, in run_conversion
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 282, in _create_tf_record_from_coco_annotations
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 218, in _pool_create_tf_example
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 134, in create_tf_example
ValueError: segmentation groundtruth is missing in object: 5.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/efficientdet/scripts/dataset_convert.py>", line 3, in <module>
  File "<frozen cv.efficientdet.scripts.dataset_convert>", line 345, in <module>
  File "<frozen common.hydra.hydra_runner>", line 99, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError

The error seems to require segmentation data in the annotations, but my dataset is for object detection, so I only have bbox information. Why does EfficientDet ask for segmentation data?

Below is an excerpt from my annotation JSON file.

{
  "annotations": [
    {
      "id": 5,
      "image_id": 4,
      "category_id": 1,
      "bbox": [
        57,
        37,
        48,
        46
      ],
      "area": 0,
      "iscrowd": 0
    },
    {
      "id": 6,
      "image_id": 4,
      "category_id": 2,
      "bbox": [
        72.15,
        3.08,
        29.24,
        33.54
      ],
      "area": 0,
      "iscrowd": 0
    }
  ]
}

Below is the dataset_convert section of spec_train.yaml.

dataset_convert:
  image_dir: '/workspace/tao-experiments/data/raw-data/image/train/'
  annotations_file: '/workspace/tao-experiments/data/raw-data/annotations/train.json'
  results_dir: '/workspace/tao-experiments/data'
  tag: 'train'
  num_shards: 256
  include_masks: True

You can set include_masks to False and retry.
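
For reference, a minimal sketch of the adjusted dataset_convert section; everything is unchanged from the spec above except the last flag, and with include_masks set to False the converter should no longer look for segmentation entries:

dataset_convert:
  image_dir: '/workspace/tao-experiments/data/raw-data/image/train/'
  annotations_file: '/workspace/tao-experiments/data/raw-data/annotations/train.json'
  results_dir: '/workspace/tao-experiments/data'
  tag: 'train'
  num_shards: 256
  include_masks: False   # bbox-only dataset, so do not require segmentation groundtruth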

Can I ask one more question?
The dataset_convert error is fixed, but when I train with the dataset downloaded by the notebook example, as well as with my own custom data, I get an AP of 0.0 for all classes.

Below is the result of evaluate after retraining.

2023-08-01 19:24:26,128 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-08-01 19:24:26,229 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
2023-08-01 19:24:26,273 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-08-01 10:24:27.466183: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1690885470.204502] [dgx-a100:47   :f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '������������������������������': Invalid argument
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-wk_acetx because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[1690885480.940384] [dgx-a100:340  :f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '������������������������������': Invalid argument
<frozen cv.efficientdet.scripts.evaluate>:142: UserWarning: 
'spec_retrain.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
Evaluate results will be saved at: /workspace/tao-experiments/efficientdet_tf2/experiment_dir_retrain/evaluate
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA A100-SXM4-40GB, compute capability 8.0
Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA A100-SXM4-40GB, compute capability 8.0
Starting efficientdet evaluation.
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f854f75c8b0> and will run it as-is.
Cause: Unable to locate the source code of <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f854f75c8b0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f854f75c8b0> and will run it as-is.
Cause: Unable to locate the source code of <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x7f854f75c8b0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ca60> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ca60>: no matching AST found among candidates:

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ca60> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ca60>: no matching AST found among candidates:

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ce50> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ce50>: no matching AST found among candidates:

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ce50> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x7f854f75ce50>: no matching AST found among candidates:

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x7f83f8b25820>> and will run it as-is.
Cause: Unable to locate the source code of <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x7f83f8b25820>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x7f83f8b25820>> and will run it as-is.
Cause: Unable to locate the source code of <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x7f83f8b25820>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x7f83f8b25ac0>> and will run it as-is.
Cause: Unable to locate the source code of <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x7f83f8b25ac0>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x7f83f8b25ac0>> and will run it as-is.
Cause: Unable to locate the source code of <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x7f83f8b25ac0>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function run_experiment.<locals>.eval_model_fn at 0x7f81b873faf0> and will run it as-is.
Cause: Unable to locate the source code of <function run_experiment.<locals>.eval_model_fn at 0x7f81b873faf0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function run_experiment.<locals>.eval_model_fn at 0x7f81b873faf0> and will run it as-is.
Cause: Unable to locate the source code of <function run_experiment.<locals>.eval_model_fn at 0x7f81b873faf0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
use max_nms_inputs for pre-nms topk.
62/63 [============================>.] - ETA: 1sloading annotations into memory...
Done (t=0.59s)
creating index...
index created!
Loading and preparing results...
Converting ndarray to lists...
(50400, 7)
0/50400
DONE (t=0.35s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=4.36s).
Accumulating evaluation results...
DONE (t=0.71s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
=============
Per class AP 
=============
AP_person: 0.000
AP_bicycle: 0.000
AP_car: 0.000
AP_motorcycle: 0.000
AP_airplane: 0.000
AP_bus: 0.000
AP_train: 0.000
AP_truck: 0.000
AP_boat: 0.000
AP_traffic light: 0.000
AP_fire hydrant: 0.000
AP_stop sign: 0.000
AP_parking meter: 0.000
AP_bench: 0.000
AP_bird: 0.000
AP_cat: 0.000
AP_dog: 0.000
AP_horse: 0.000
AP_sheep: 0.000
AP_cow: 0.000
AP_elephant: 0.000
AP_bear: 0.000
AP_zebra: 0.000
AP_giraffe: 0.000
AP_backpack: 0.000
AP_umbrella: 0.000
AP_handbag: 0.000
AP_tie: 0.000
AP_suitcase: 0.000
AP_frisbee: 0.000
AP_skis: 0.000
AP_snowboard: 0.000
AP_sports ball: 0.000
AP_kite: 0.000
AP_baseball bat: 0.000
AP_baseball glove: 0.000
AP_skateboard: 0.000
AP_surfboard: 0.000
AP_tennis racket: 0.000
AP_bottle: 0.000
AP_wine glass: 0.000
AP_cup: 0.000
AP_fork: 0.000
AP_knife: 0.000
AP_spoon: 0.000
AP_bowl: 0.000
AP_banana: 0.000
AP_apple: 0.000
AP_sandwich: 0.000
AP_orange: 0.000
AP_broccoli: 0.000
AP_carrot: 0.000
AP_hot dog: 0.000
AP_pizza: 0.000
AP_donut: 0.000
AP_cake: 0.000
AP_chair: 0.000
AP_couch: 0.000
AP_potted plant: 0.000
AP_bed: 0.000
AP_dining table: 0.000
AP_toilet: 0.000
AP_tv: 0.000
AP_laptop: 0.000
AP_mouse: 0.000
AP_remote: 0.000
AP_keyboard: 0.000
AP_cell phone: 0.000
AP_microwave: 0.000
AP_oven: 0.000
AP_toaster: 0.000
AP_sink: 0.000
AP_refrigerator: 0.000
AP_book: 0.000
AP_clock: 0.000
AP_vase: 0.000
AP_scissors: 0.000
AP_teddy bear: 0.000
AP_hair drier: -1.000
AP_toothbrush: 0.000
Evaluation finished successfully.
Sending telemetry data.
Execution status: PASS
2023-08-01 19:26:43,232 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

What should I be looking at?

Please run the default notebook against the default dataset only. Also, please evaluate the model after training.

I ran the default notebook with the default dataset and still get an AP of 0.0.
Here are the notebook and spec file I used:
spec_train.yaml (2.1 KB)
efficientdet.ipynb (19.8 MB)

There is one thing that makes me suspicious.
When converting to TFRecords, train_warnings.json and val_warnings.json are generated. Their contents are below, followed by a small sketch for tallying them.

train_warnings.json

{
  "22051": {
    "box": [
      660563
    ],
    "mask": []
  },
  "60054": {
    "box": [
      2056510
    ],
    "mask": []
  },
  "77039": {
    "box": [
      1852762,
      2087312
    ],
    "mask": []
  },
  "80281": {
    "box": [
      1400800
    ],
    "mask": []
  },
  "81768": {
    "box": [
      2079773
    ],
    "mask": []
  },
  "87649": {
    "box": [
      1839125
    ],
    "mask": []
  },
  "88835": {
    "box": [
      1808447
    ],
    "mask": []
  },
  "100226": {
    "box": [
      2008730
    ],
    "mask": []
  },
  "111930": {
    "box": [
      2201465
    ],
    "mask": []
  },
  "114629": {
    "box": [
      2203355
    ],
    "mask": []
  },
  "120162": {
    "box": [
      2092190
    ],
    "mask": []
  },
  "124907": {
    "box": [
      300223
    ],
    "mask": []
  },
  "141426": {
    "box": [
      1637247
    ],
    "mask": []
  },
  "153920": {
    "box": [
      1482147
    ],
    "mask": []
  },
  "158292": {
    "box": [
      2216837
    ],
    "mask": []
  },
  "168890": {
    "box": [
      1977900
    ],
    "mask": []
  },
  "171360": {
    "box": [
      2126045
    ],
    "mask": []
  },
  "183181": {
    "box": [
      1480783
    ],
    "mask": []
  },
  "183338": {
    "box": [
      1864550
    ],
    "mask": []
  },
  "191188": {
    "box": [
      1839790
    ],
    "mask": []
  },
  "200365": {
    "box": [
      918
    ],
    "mask": []
  },
  "217989": {
    "box": [
      1893538
    ],
    "mask": []
  },
  "254449": {
    "box": [
      1816247,
      1816254
    ],
    "mask": []
  },
  "256896": {
    "box": [
      1720260
    ],
    "mask": []
  },
  "259687": {
    "box": [
      1853108
    ],
    "mask": []
  },
  "340038": {
    "box": [
      2091682
    ],
    "mask": []
  },
  "343639": {
    "box": [
      2203397
    ],
    "mask": []
  },
  "372117": {
    "box": [
      2179972
    ],
    "mask": []
  },
  "375219": {
    "box": [
      2123652
    ],
    "mask": []
  },
  "376491": {
    "box": [
      2200824
    ],
    "mask": []
  },
  "389655": {
    "box": [
      2073979
    ],
    "mask": []
  },
  "390267": {
    "box": [
      1852584
    ],
    "mask": []
  },
  "402248": {
    "box": [
      1856369
    ],
    "mask": []
  },
  "405964": {
    "box": [
      1849197
    ],
    "mask": []
  },
  "465196": {
    "box": [
      2201522
    ],
    "mask": []
  },
  "480752": {
    "box": [
      2047149
    ],
    "mask": []
  },
  "483442": {
    "box": [
      1813938
    ],
    "mask": []
  },
  "499198": {
    "box": [
      2032133
    ],
    "mask": []
  },
  "504034": {
    "box": [
      2084026
    ],
    "mask": []
  },
  "528201": {
    "box": [
      1864197
    ],
    "mask": []
  },
  "545566": {
    "box": [
      2064476
    ],
    "mask": []
  },
  "550395": {
    "box": [
      2206849
    ],
    "mask": []
  },
  "552832": {
    "box": [
      2144380
    ],
    "mask": []
  },
  "564557": {
    "box": [
      1984796
    ],
    "mask": []
  },
  "569433": {
    "box": [
      650932
    ],
    "mask": []
  }
}

val_warnings.json

{"361919": {"box": [2202383], "mask": []}}

You are running 8 GPUs. Could you set a lower batch_size and retry?
For example, batch_size: 2.

From EfficientDet (TF2) - NVIDIA Docs:
batch_size: The batch size for each GPU, so the effective batch size is batch_size_per_gpu * num_gpus.
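
For example, a minimal sketch of the change, assuming batch_size sits under the train section of your spec_train.yaml as it does in the notebook's spec:

train:
  batch_size: 2   # per-GPU batch size; effective batch size = 2 * num_gpus

Alternatively, it can be passed as a command-line override in the same style as the dataset_convert.results_dir override used earlier (the train.batch_size key is assumed to match your spec):

!tao model efficientdet_tf2 train -e $SPECS_DIR/spec_train.yaml \
    train.batch_size=2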

Yes, it definitely depends on the batch_size. When the batch_size is high, it does not learn at all; training just ends and reports that it is done. When the batch_size is 1, it seems to learn normally, but 20 epochs is not enough. Why is there a problem when the batch_size is high? Is there a way to find the optimal batch_size to speed up training?

You can run experiments with different batch_size values while using part of the training dataset.
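
A sketch of such a sweep, run from a shell (or with each command prefixed by ! in a notebook cell). The train.batch_size and train.results_dir override keys and $USER_EXPERIMENT_DIR are assumptions based on the notebook conventions, so adjust them to match your spec:

for bs in 1 2 4; do
    tao model efficientdet_tf2 train -e $SPECS_DIR/spec_train.yaml \
        train.batch_size=$bs \
        train.results_dir=$USER_EXPERIMENT_DIR/bs_sweep_$bs
done

Training each configuration for only a few epochs on a subset of the data, then comparing the evaluation AP, keeps the sweep cheap before committing to a full run.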
