TAO Classification TF2 AssertionError: Only .hdf5, .tlt, .tltb are supported

I am trying to train EfficientNet-B0 for classification using TAO Toolkit (toolkit_version: 4.0.0). In the provided Jupyter notebook, I first list the available pretrained models:

!ngc registry model list nvidia/tao/pretrained_classification_tf2:*
+-------+-------+-------+-------+-------+-------+------+-------+-------+
| Versi | Accur | Epoch | Batch | GPU   | Memor | File | Statu | Creat |
| on    | acy   | s     | Size  | Model | y Foo | Size | s     | ed    |
|       |       |       |       |       | tprin |      |       | Date  |
|       |       |       |       |       | t     |      |       |       |
+-------+-------+-------+-------+-------+-------+------+-------+-------+
| effic |       |       |       |       |       | 45.6 | UPLOA | Dec   |
| ientn |       |       |       |       |       | MB   | D_COM | 08,   |
| et_b0 |       |       |       |       |       |      | PLETE | 2022  |
+-------+-------+-------+-------+-------+-------+------+-------+-------+

Then I pull it using the command

!ngc registry model download-version nvidia/tao/pretrained_classification_tf2:efficientnet_b0 --dest $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0
Downloaded 38.14 MB in 19s, Download speed: 2.01 MB/s                 
--------------------------------------------------------------------------------
   Transfer id: pretrained_classification_tf2_vefficientnet_b0
   Download status: Completed
   Downloaded local path: /media/dl/SoftwareStack/classification/classification_tf2/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0-1
   Total files downloaded: 4
   Total downloaded size: 38.14 MB
   Started at: 2023-01-21 02:28:59.105839
   Completed at: 2023-01-21 02:29:18.131364
   Duration taken: 19s
----------------------------

But I see the folder has the following files:

├── keras_metadata.pb
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00001
    └── variables.index

In the spec.yaml file, I set the model path as follows:

train:
  qat: False
  pretrained_model_path: '/workspace/tao-experiments/classification_tf2/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0/saved_model.pb'
  batch_size_per_gpu: 64
  num_epochs: 200
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  arch: 'efficientnet-b0'
  input_image_size: [3,256,256]
  input_image_depth: 8

But running training with
!tao classification_tf2 train -e $SPECS_DIR/spec.yaml
gives the following error:

Starting classification training.
Found 1685440 images belonging to 11 classes.
Processing dataset (train): /workspace/tao-experiments/data/train
Found 57512 images belonging to 11 classes.
Processing dataset (validation): /workspace/tao-experiments/data/val
Only .hdf5, .tlt, .tltb are supported.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.classification.scripts.train>", line 408, in main
  File "<frozen common.decorators>", line 76, in _func
  File "<frozen common.decorators>", line 49, in _func
  File "<frozen cv.classification.scripts.train>", line 319, in run_experiment
  File "<frozen cv.classification.utils.helper>", line 364, in load_model
AssertionError: Only .hdf5, .tlt, .tltb are supported.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.classification.scripts.train>", line 412, in <module>
  File "<frozen common.hydra.hydra_runner>", line 87, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-21 02:37:28,672 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

As the error shows, training expects the pretrained model to have a .hdf5, .tlt, or .tltb extension, but none of the downloaded files has such an extension.
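
To illustrate the mismatch, here is a minimal sketch of an extension check like the one the error message implies. The list of extensions is copied from the assertion text; the real check inside TAO's `load_model` is not visible in the traceback, so the function name `has_supported_suffix` and its logic are assumptions for illustration only.

```python
from pathlib import Path

# Extensions copied from the error message. The actual TAO check is not
# shown in the traceback, so this is only an approximation of it.
SUPPORTED_SUFFIXES = {".hdf5", ".tlt", ".tltb"}


def has_supported_suffix(path):
    """Return True if the last path component ends in a listed extension."""
    return Path(path).suffix in SUPPORTED_SUFFIXES


# The downloaded file from the question fails this kind of check.
print(has_supported_suffix("pretrained_efficientnet_b0/saved_model.pb"))
```

None of the four downloaded files (`keras_metadata.pb`, `saved_model.pb`, and the two files under `variables`) would pass a check of this kind.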

I will check further and update you if I find anything.

@Morganh I haven't had any update for the last 4 days. Please check the issue at your earliest convenience.

Sorry for the late reply. I am still syncing with the internal team about this issue. Apologies for the inconvenience.

Hi,
Please use the workaround below.
Rename the checkpoint directory. Right now, the directory name is efficientnet_b0; you just need to rename it to efficientnet_b0.hdf5. Then training will work.
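
The rename can also be done from a notebook cell. The following is a minimal sketch, not an official TAO utility: the helper name `add_hdf5_suffix` is hypothetical, and the demo runs on a throwaway temporary directory rather than on the real download location.

```python
import tempfile
from pathlib import Path


def add_hdf5_suffix(model_dir):
    """Rename a model directory so its name ends in .hdf5; return the new path."""
    src = Path(model_dir)
    if src.suffix == ".hdf5":
        return src  # already renamed, nothing to do
    dst = src.with_name(src.name + ".hdf5")
    src.rename(dst)
    return dst


# Demo on a temporary directory that mimics the downloaded layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "efficientnet_b0"
    (demo / "variables").mkdir(parents=True)
    renamed = add_hdf5_suffix(demo)
    demo_ok = renamed.is_dir() and not demo.exists()
    print(renamed.name)  # efficientnet_b0.hdf5
```

On a real setup you would call the helper on the downloaded model directory. Presumably `pretrained_model_path` in the spec should then point at the renamed directory (as seen from inside the container) rather than at `saved_model.pb`, though the thread does not spell this out.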