Missing Value Error in Training Re-Identification Net with TAO Toolkit

I am currently facing an issue while training a re-identification network with the NVIDIA TAO Toolkit: a “Missing mandatory value” error for the dataset_config.train_dataset_dir key, even though that key is present in my spec file.

I have verified the correctness of the spec file, double-checked the file paths, and ensured the dataset directory structure is appropriate.

• Hardware: NVIDIA GeForce RTX 4050 Laptop GPU
• NVIDIA GPU Driver Version: 525.125.06
• Network Type: ReIdentificationNet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): I’m using tao-toolkit 4.0.0-pyt
• Training spec file: the default spec file from the documentation, with only the paths changed.

model_config:
  backbone: resnet50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/workspace/tao-experiments/models/resnet50_market1501_aicity156.tlt"
  input_channels: 3
  input_size: [256, 128]
  neck: bnneck
  feat_dim: 256
  num_classes: 751
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
train_config:
  optim:
    name: Adam
    lr_monitor: "val_loss"
    steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  epochs: 120
  checkpoint_interval: 10
dataset_config:
  train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
  val_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/Dataset/query"
  batch_size: 64
  val_batch_size: 128
  workers: 8
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instance: 4
re_ranking_config:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3

• How to reproduce the issue ?:

  • Command:
    tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.txt

  • Log:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
Created a temporary directory at /tmp/tmph383uviz
Writing /tmp/tmph383uviz/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 67, in _build_model
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 377, in __getitem__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 873, in format_and_raise
    _raise(ex, cause)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 771, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 371, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 453, in _get_impl
    return self._resolve_with_default(
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 96, in _resolve_with_default
    raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: dataset_config.train_dataset_dir
    full_key: dataset_config.train_dataset_dir
    reference_type=ReIDDatasetConfig
    object_type=ReIDDatasetConfig
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

To narrow down the issue, how about passing the path directly as a command-line override (overrides take precedence over the spec file):
tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.txt dataset_config.train_dataset_dir="/workspace/tao-experiments/Dataset/bounding_box_train"

Thank you for answering.
I tried the command you suggested and I’m getting a new error, so that’s progress, I guess.

This is the new error:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
Created a temporary directory at /tmp/tmpvcgucnwj
Writing /tmp/tmpvcgucnwj/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao', 'dataset_config.train_dataset_dir=/workspace/tao-experiments/Dataset/bounding_box_train']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 68, in _build_model
  File "<frozen cv.re_identification.model.pl_reid_model>", line 256, in __process_dir
AttributeError: 'NoneType' object has no attribute 'groups'
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

I see. Please use a .yaml file instead of a .txt file.
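For example, keeping the same arguments and only changing the spec file extension:

tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.yaml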

I’ve tried using a .yaml spec file and it gave me this error:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-07-07 06:33:07 nemo_logging:349] <frozen cv.re_identification.scripts.train>:91: UserWarning: 
    'experiment.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Error merging 'experiment.yaml' with schema
Key 'val_dataset_dir' not in 'ReIDDatasetConfig'
    full_key: dataset_config.val_dataset_dir
    reference_type=ReIDDatasetConfig
    object_type=ReIDDatasetConfig

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

I’ve already fixed this: instead of val_dataset_dir I need to use test_dataset_dir. The confusion happened because the documentation is inconsistent; the example uses val while the explanation uses test.
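In other words, the dataset_config block now starts like this (only the key renamed, everything else unchanged):

dataset_config:
  train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
  test_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/Dataset/query"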

After fixing it this is the new error that I’m getting:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-07-07 06:37:47 nemo_logging:349] <frozen cv.re_identification.scripts.train>:91: UserWarning: 
    'experiment.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmpz52y20p7
Writing /tmp/tmpz52y20p7/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao', 'dataset_config.train_dataset_dir=/workspace/tao-experiments/Dataset/bounding_box_train']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 68, in _build_model
  File "<frozen cv.re_identification.model.pl_reid_model>", line 256, in __process_dir
AttributeError: 'NoneType' object has no attribute 'groups'
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

By the way, using .yaml instead of .txt fixed the Missing Value error; I can now run the command without the dataset_config.train_dataset_dir override.

Are there .jpg files inside your train_dataset_dir (/workspace/tao-experiments/Dataset/bounding_box_train)?

Yes, there are.

Can you share the result of
! tao re_identification run ll -rlt /workspace/tao-experiments/Dataset/bounding_box_train

usage: re_identification [-h] [-r RESULTS_DIR] [-k KEY] [-e EXPERIMENT_SPEC_FILE] {evaluate,export,inference,train}
re_identification: error: argument subtask: invalid choice: 'run' (choose from 'evaluate', 'export', 'inference', 'train')

Can you share your ~/.tao_mounts.json file?

I don’t have any file named tao_mounts.json.
By the way, I’m working in a dev container based on the tao 4.0.0-pyt image, if that’s helpful.

You are running in a Jupyter notebook, right?
Is it the container from NVIDIA NGC (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC)?

There is a tao_mounts.json file to map your local files into docker.
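If you were using the TAO launcher, a minimal ~/.tao_mounts.json would look something like this (the source path is a placeholder for your local folder):

{
    "Mounts": [
        {
            "source": "/path/to/your/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}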

I’m not running in a Jupyter notebook; I’m running on a laptop with a dGPU, and I have mounted my local files into Docker.

This is how I mount into the dev-container:

    "image": "train-dgpu:tao-tf",
    "remoteUser": "trainer",
    "workspaceMount": "type=bind,source=${localWorkspaceFolder},target=/workspace/tao-experiments",
    "workspaceFolder": "/workspace/tao-experiments",
    "runArgs": [
        "--rm",
        "--network=host",
        "--gpus",
        "all",
        "--privileged",
        "-v",
        "/dev/shm:/dev/shm",
        "--cap-add=SYSLOG",
        "-e",
        "DISPLAY=${localEnv:DISPLAY}",
        "-e",
        "CUDA_CACHE_DISABLE=0",
        "-e",
        "CUDA_VISIBLE_DEVICES=0"
    ],

Please follow ReIdentificationNet - NVIDIA Docs. Also, the .jpg file names should match the pattern below.
pattern = re.compile(r'([-\d]+)_c(\d)')

For example, 0002_c1.jpg
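If you want to check your own files against that pattern, here is a quick sketch (assuming the pattern above and your bounding_box_train path):

import os
import re

# Pattern quoted above; it expects a person ID followed by "_c<digit>"
pattern = re.compile(r'([-\d]+)_c(\d)')
train_dir = "/workspace/tao-experiments/Dataset/bounding_box_train"  # adjust per split

for name in sorted(os.listdir(train_dir)):
    if not name.endswith(".jpg"):
        continue
    match = pattern.search(name)
    if match is None:
        # Names like 0001_L09C02_00000001.jpg land here (no "_c<digit>" part),
        # which is consistent with the "'NoneType' object has no attribute 'groups'" error above.
        print("NO MATCH:", name)
    else:
        person_id, camera_id = match.groups()
        print(name, "-> person", person_id, ", camera", camera_id)

Every training image should match; any "NO MATCH" line points to a file that needs renaming.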

Aren’t you missing the frame information??

The documentation says the following:

The root directory of the dataset contains sub-directories for training, testing, and query. Each sub-directory has the cropped images of different identities. For example, the image 0001_c1s1_01_00.jpg is from the first sequence s1 of camera c1. 01 indicates the first frame in the sequence c1s1 . 0001 is the unique ID assigned to the object. The contents after the third _ are ignored.

This is an example from my bounding_box_train:
0001_L09C02_00000001.jpg
0001 → unique ID
L09C02 → name of camera
00000001 → frame

Do I need to rename the camera to the format cXsY (e.g. c1s1)?

The c prefix is needed.

For your case, either of the following would be OK:
0001_c09C02_00000001.jpg
or
0001_c1s1L09C02_00000001.jpg
etc.

I tried 0001_c09C02_00000001.jpg and I got this error: AssertionError: The number of camera IDs should be between 0 and 6.

Then I tried using the exact format from the documentation (e.g. 0001_c1s1_00000001.jpg) and got this error:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-07-07 09:30:54 nemo_logging:349] <frozen cv.re_identification.scripts.train>:91: UserWarning: 
    'experiment.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp_1qvwiax
Writing /tmp/tmp_1qvwiax/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/retrained_models/2023-07-07_09-30-43/', 'encryption_key=nvidia_tao']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 71, in _build_model
  File "<frozen cv.re_identification.model.build_nn_model>", line 17, in build_model
  File "<frozen cv.re_identification.model.baseline>", line 92, in __init__
  File "<frozen cv.re_identification.model.resnet>", line 230, in load_param
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 734, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 1071, in _load
    result = unpickler.load()
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 1064, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'ExperimentConfig' on <module 'nvidia_tao_pytorch.cv.re_identification.config.default_config' from '/opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/config/default_config.py'>
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

The range is limited to the file names below (a renaming sketch follows the list).
0001_c1xxxx_xxxxx.jpg
0002_c1xxxx_xxxxx.jpg

1500_c1xxxx_xxxxx.jpg
1501_c1xxxx_xxxxx.jpg
0001_c2xxxx_xxxxx.jpg
0002_c2xxxx_xxxxx.jpg

1500_c2xxxx_xxxxx.jpg
1501_c2xxxx_xxxxx.jpg
0001_c3xxxx_xxxxx.jpg
0002_c3xxxx_xxxxx.jpg


1500_c6xxxx_xxxxx.jpg
1501_c6xxxx_xxxxx.jpg
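A renaming sketch along those lines, purely illustrative (it assigns c1 to c6 to the original camera tokens in order of first appearance and keeps the rest of the name; try it on a copy of the data and verify the mapping before touching the originals):

import os
import re

src_dir = "/workspace/tao-experiments/Dataset/bounding_box_train"  # adjust per split
cam_map = {}  # original camera token (e.g. "L09C02") -> "c1" .. "c6"

for name in sorted(os.listdir(src_dir)):
    m = re.match(r'(\d+)_([^_]+)_(.+)\.jpg$', name)  # e.g. 0001_L09C02_00000001.jpg
    if m is None:
        continue
    pid, cam_token, rest = m.groups()
    if cam_token not in cam_map:
        if len(cam_map) >= 6:
            raise RuntimeError("More than 6 cameras found; they will not fit into c1..c6")
        cam_map[cam_token] = f"c{len(cam_map) + 1}"
    new_name = f"{pid}_{cam_map[cam_token]}{cam_token}_{rest}.jpg"
    os.rename(os.path.join(src_dir, name), os.path.join(src_dir, new_name))
    print(name, "->", new_name)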

For the 2nd issue, could you please run the default notebook to check if it can be reproduced?

I can’t seem to find the notebook for ReIdentificationNet. Where can I find it?

There has been no update from you for a period, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Refer to TAO Toolkit Quick Start Guide - NVIDIA Docs
GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC

or

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/4.0.2/zip -O getting_started_v4.0.2.zip
unzip -u getting_started_v4.0.2.zip  -d ./getting_started_v4.0.2 && rm -rf getting_started_v4.0.2.zip && cd ./getting_started_v4.0.2