I am trying to train EfficientNet-B0 for classification using TAO Toolkit 4.0.0. In the provided Jupyter notebook, when I run
!ngc registry model list nvidia/tao/pretrained_classification_tf2:*
+-------+-------+-------+-------+-------+-------+------+-------+-------+
| Versi | Accur | Epoch | Batch | GPU | Memor | File | Statu | Creat |
| on | acy | s | Size | Model | y Foo | Size | s | ed |
| | | | | | tprin | | | Date |
| | | | | | t | | | |
+-------+-------+-------+-------+-------+-------+------+-------+-------+
| effic | | | | | | 45.6 | UPLOA | Dec |
| ientn | | | | | | MB | D_COM | 08, |
| et_b0 | | | | | | | PLETE | 2022 |
+-------+-------+-------+-------+-------+-------+------+-------+-------+
Then I download it with
!ngc registry model download-version nvidia/tao/pretrained_classification_tf2:efficientnet_b0 --dest $LOCAL_EXPERIMENT_DIR/pretrained_efficientnet_b0
Downloaded 38.14 MB in 19s, Download speed: 2.01 MB/s
--------------------------------------------------------------------------------
Transfer id: pretrained_classification_tf2_vefficientnet_b0
Download status: Completed
Downloaded local path: /media/dl/SoftwareStack/classification/classification_tf2/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0-1
Total files downloaded: 4
Total downloaded size: 38.14 MB
Started at: 2023-01-21 02:28:59.105839
Completed at: 2023-01-21 02:29:18.131364
Duration taken: 19s
----------------------------
But the downloaded folder contains the following files:
├── keras_metadata.pb
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00001
    └── variables.index
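These files are a TensorFlow SavedModel, not an .hdf5/.tlt/.tltb checkpoint. The extension check that later fails can be reproduced with a small stdlib sketch (the helper name is_supported_by_tao is mine, not part of the toolkit):

```python
from pathlib import Path

# Extensions accepted by the trainer, per the assertion message
# "Only .hdf5, .tlt, .tltb are supported."
SUPPORTED_EXTENSIONS = {'.hdf5', '.tlt', '.tltb'}

def is_supported_by_tao(model_path: str) -> bool:
    """Mimic the extension check that raises the AssertionError."""
    return Path(model_path).suffix in SUPPORTED_EXTENSIONS

# The file referenced in spec.yaml is a SavedModel graph, so the check fails
print(is_supported_by_tao('saved_model.pb'))        # False
print(is_supported_by_tao('efficientnet_b0.hdf5'))  # True
```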
In the spec.yaml file, I set the pretrained model path as follows:
train:
  qat: False
  pretrained_model_path: '/workspace/tao-experiments/classification_tf2/pretrained_efficientnet_b0/pretrained_classification_tf2_vefficientnet_b0/saved_model.pb'
  batch_size_per_gpu: 64
  num_epochs: 200
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  arch: 'efficientnet-b0'
  input_image_size: [3,256,256]
  input_image_depth: 8
But running the training command
!tao classification_tf2 train -e $SPECS_DIR/spec.yaml
gives this error:
Starting classification training.
Found 1685440 images belonging to 11 classes.
Processing dataset (train): /workspace/tao-experiments/data/train
Found 57512 images belonging to 11 classes.
Processing dataset (validation): /workspace/tao-experiments/data/val
Only .hdf5, .tlt, .tltb are supported.
Error executing job with overrides: []
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "<frozen cv.classification.scripts.train>", line 408, in main
File "<frozen common.decorators>", line 76, in _func
File "<frozen common.decorators>", line 49, in _func
File "<frozen cv.classification.scripts.train>", line 319, in run_experiment
File "<frozen cv.classification.utils.helper>", line 364, in load_model
AssertionError: Only .hdf5, .tlt, .tltb are supported.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
File "<frozen cv.classification.scripts.train>", line 412, in <module>
File "<frozen common.hydra.hydra_runner>", line 87, in wrapper
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
run_and_report(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-01-21 02:37:28,672 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
As the error indicates, training only supports pretrained models with a .hdf5, .tlt, or .tltb extension, but none of the downloaded files has such an extension. How can I use this pretrained model for training?