TAO re_identification export failure

• Hardware: A6000
• Network Type: re_identification
• TAO version: 5.0.0
• Training spec file:

model:
  backbone: resnet_50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/workspace/tao-experiments/models/resnet50_market1501_aicity156.tlt"
  input_channels: 3
  input_width: 128
  input_height: 256
  neck: bnneck
  feat_dim: 256
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
dataset:
  train_dataset_dir: "/workspace/tao-experiments/data/bounding_box_train"
  test_dataset_dir: "/workspace/tao-experiments/data/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/data/query"
  num_classes: 60
  batch_size: 64
  val_batch_size: 128
  num_workers: 1
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instances: 4
re_ranking:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3
train:
  optim:
    name: Adam
    lr_monitor: val_loss
    steps: [10, 20]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  num_epochs: 30
  checkpoint_interval: 10

• How to reproduce the issue?

tao model re_identification export -e any.yaml

Hi,
I used the TAO Toolkit 5.0.0 to retrain the re_identification model.

tao model re_identification train -e /workspace/tao-experiments/specs/experiment_spec_file.yaml -r /workspace/tao-experiments/results -k nvidia_tao

The volume mapping with the ~/.tao_mounts.json file also works fine. The training completed successfully, and I now have a custom model.tlt file in my results. I want to export this model to an ONNX file so I can run it in my DeepStream pipeline's SGIE, just like I do with the deployable version of the model.

However, evaluate, export, and inference all fail with missing-file errors.

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/re_identification/scripts/export.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.export>", line 150, in <module>
  File "<frozen core.hydra.hydra_runner>", line 107, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.export>", line 71, in main
  File "<frozen cv.re_identification.scripts.export>", line 58, in main
  File "<frozen core.utilities>", line 68, in update_results_dir
  File "/usr/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)

Since nvidia_tao_pytorch is encrypted, I cannot determine what exactly causes this. However, it looks like the Hydra schema used to validate the export config is not found. This can also be reproduced by simply calling

tao model re_identification export -e any.yaml

The error occurs as soon as an existing spec file is found; the provided spec file can even be empty, and all other mandatory parameters can be dropped.

error.log (3.8 KB)

Did you ever run https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/re_identification_net/reidentificationnet.ipynb successfully?
You can also get the same notebook by downloading it via the steps below.

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/5.0.0/zip -O getting_started_v5.0.0.zip
unzip -u getting_started_v5.0.0.zip  -d ./getting_started_v5.0.0 && rm -rf getting_started_v5.0.0.zip && cd ./getting_started_v5.0.0

Please refer to the export command at the bottom of the notebook.

Also, please refer to the yaml file in https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/re_identification_net/specs/experiment_market1501.yaml

For example, try adding results_dir to your yaml.
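A rough sketch of what that could look like, following the layout of the referenced experiment_market1501.yaml (the paths are placeholders for your own setup, and the export field names are my reading of that spec, so please verify them against the notebook):

results_dir: "/workspace/tao-experiments/results"   # top-level entry the export script looks for
export:
  checkpoint: "/workspace/tao-experiments/results/train/model.tlt"    # placeholder: path to your trained .tlt
  onnx_file: "/workspace/tao-experiments/results/export/model.onnx"   # placeholder: desired ONNX output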

Hi Morganh,

thanks for the fast response. The error is indeed caused by the missing results_dir in the yaml file.
This is not mentioned in the documentation. The documentation's yaml file also contains other errors, such as train_config or val_dataset_dir.
https://docs.nvidia.com/tao/tao-toolkit/text/re_identification/re_identification.html#exporting-the-model
Since results_dir is already a command-line parameter, it is very unintuitive to also have to add it to the yaml file.
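For anyone else hitting this traceback: in my case the only change needed was this single top-level line, mirroring the path I already pass with -r (adjust it to your own results directory):

results_dir: "/workspace/tao-experiments/results"   # same path as the -r argument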

It would be great if the returned error were more helpful (as it was for wrong values in the yaml for the train task) and if the documentation were updated.

Regards,
Bastian

Thanks for the info. We will improve the documentation.
