Missing Value Error in Training Re-Identification Net with TAO Toolkit

I am currently facing an issue while training a re-identification network with the NVIDIA TAO Toolkit: a “Missing mandatory value” error for the dataset_config.train_dataset_dir key, even though that key is present in my spec file.

I have verified the correctness of the spec file, double-checked the file paths, and ensured the dataset directory structure is appropriate.

• Hardware: NVIDIA GeForce RTX 4050 Laptop GPU
• NVIDIA GPU Driver Version: 525.125.06
• Network Type: ReIdentificationNet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): I’m using tao-toolkit 4.0.0-pyt
• Training spec file: the default spec file from the documentation, with only the paths changed.

model_config:
  backbone: resnet50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/workspace/tao-experiments/models/resnet50_market1501_aicity156.tlt"
  input_channels: 3
  input_size: [256, 128]
  neck: bnneck
  feat_dim: 256
  num_classes: 751
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
train_config:
  optim:
    name: Adam
    lr_monitor: "val_loss"
    steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  epochs: 120
  checkpoint_interval: 10
dataset_config:
  train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
  val_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/Dataset/query"
  batch_size: 64
  val_batch_size: 128
  workers: 8
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instance: 4
re_ranking_config:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3

• How to reproduce the issue ?:

  • Command:
    tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.txt

  • Log:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
Created a temporary directory at /tmp/tmph383uviz
Writing /tmp/tmph383uviz/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 67, in _build_model
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 377, in __getitem__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 873, in format_and_raise
    _raise(ex, cause)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 771, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 371, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 453, in _get_impl
    return self._resolve_with_default(
  File "/opt/conda/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 96, in _resolve_with_default
    raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: dataset_config.train_dataset_dir
    full_key: dataset_config.train_dataset_dir
    reference_type=ReIDDatasetConfig
    object_type=ReIDDatasetConfig
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

To narrow down the issue, how about passing the path directly as a command-line override (overrides take precedence over the spec file):
tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.txt dataset_config.train_dataset_dir="/workspace/tao-experiments/Dataset/bounding_box_train"

Thank you for answering.
I tried the command you suggested and I’m getting a new error, so that’s progress, I guess.

This is the new error:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
Created a temporary directory at /tmp/tmpvcgucnwj
Writing /tmp/tmpvcgucnwj/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao', 'dataset_config.train_dataset_dir=/workspace/tao-experiments/Dataset/bounding_box_train']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 68, in _build_model
  File "<frozen cv.re_identification.model.pl_reid_model>", line 256, in __process_dir
AttributeError: 'NoneType' object has no attribute 'groups'
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

I see. Please use a .yaml file instead of a .txt file.
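For example, keeping the same arguments and only changing the spec file extension:

tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.yaml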

I’ve tried using a .yaml spec file and it gave me this error:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-07-07 06:33:07 nemo_logging:349] <frozen cv.re_identification.scripts.train>:91: UserWarning: 
    'experiment.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Error merging 'experiment.yaml' with schema
Key 'val_dataset_dir' not in 'ReIDDatasetConfig'
    full_key: dataset_config.val_dataset_dir
    reference_type=ReIDDatasetConfig
    object_type=ReIDDatasetConfig

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

I’ve already fixed this: instead of val_dataset_dir I need to use test_dataset_dir. The confusion happened because the documentation is inconsistent; the example uses val while the explanation uses test.
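In other words, the dataset_config block now starts like this (only the key renamed, everything else unchanged):

dataset_config:
  train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
  test_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/Dataset/query"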

After fixing it this is the new error that I’m getting:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-07-07 06:37:47 nemo_logging:349] <frozen cv.re_identification.scripts.train>:91: UserWarning: 
    'experiment.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmpz52y20p7
Writing /tmp/tmpz52y20p7/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao', 'dataset_config.train_dataset_dir=/workspace/tao-experiments/Dataset/bounding_box_train']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 68, in _build_model
  File "<frozen cv.re_identification.model.pl_reid_model>", line 256, in __process_dir
AttributeError: 'NoneType' object has no attribute 'groups'
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

By the way, using .yaml instead of .txt fixed the Missing Value error; I can now run the command without the dataset_config.train_dataset_dir override.

Are there .jpg files inside your train_dataset_dir (/workspace/tao-experiments/Dataset/bounding_box_train)?

Yes, there are.

Can you share the result of
! tao re_identification run ll -rlt /workspace/tao-experiments/Dataset/bounding_box_train

usage: re_identification [-h] [-r RESULTS_DIR] [-k KEY] [-e EXPERIMENT_SPEC_FILE] {evaluate,export,inference,train}
re_identification: error: argument subtask: invalid choice: 'run' (choose from 'evaluate', 'export', 'inference', 'train')

Can you share your ~/.tao_mounts.json file?

I don’t have any file named tao_mounts.json.
By the way, I’m working in a dev container based on the tao 4.0.0-pyt image, if that’s helpful.

You are running in a Jupyter notebook, right?
Is it the container from NVIDIA NGC (GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC)?

There is a tao_mounts.json file to map your local files into docker.
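If you were using the TAO launcher, a minimal ~/.tao_mounts.json would look something like this (the source path is a placeholder for your local folder):

{
    "Mounts": [
        {
            "source": "/path/to/your/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}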

I’m not running in a Jupyter notebook; I’m running on a laptop with a dGPU, and I have mounted my local files into Docker.

This is how I mount into the dev-container:

    "image": "train-dgpu:tao-tf",
    "remoteUser": "trainer",
    "workspaceMount": "type=bind,source=${localWorkspaceFolder},target=/workspace/tao-experiments",
    "workspaceFolder": "/workspace/tao-experiments",
    "runArgs": [
        "--rm",
        "--network=host",
        "--gpus",
        "all",
        "--privileged",
        "-v",
        "/dev/shm:/dev/shm",
        "--cap-add=SYSLOG",
        "-e",
        "DISPLAY=${localEnv:DISPLAY}",
        "-e",
        "CUDA_CACHE_DISABLE=0",
        "-e",
        "CUDA_VISIBLE_DEVICES=0"
    ],

Please follow ReIdentificationNet - NVIDIA Docs. Also, the .jpg file names should match the pattern below.
pattern = re.compile(r'([-\d]+)_c(\d)')

For example, 0002_c1.jpg
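If you want to check your own files against that pattern, here is a quick sketch (assuming the pattern above and your bounding_box_train path):

import os
import re

# Pattern quoted above; it expects a person ID followed by "_c<digit>"
pattern = re.compile(r'([-\d]+)_c(\d)')
train_dir = "/workspace/tao-experiments/Dataset/bounding_box_train"  # adjust per split

for name in sorted(os.listdir(train_dir)):
    if not name.endswith(".jpg"):
        continue
    match = pattern.search(name)
    if match is None:
        # Names like 0001_L09C02_00000001.jpg land here (no "_c<digit>" part),
        # which is consistent with the "'NoneType' object has no attribute 'groups'" error above.
        print("NO MATCH:", name)
    else:
        person_id, camera_id = match.groups()
        print(name, "-> person", person_id, ", camera", camera_id)

Every training image should match; any "NO MATCH" line points to a file that needs renaming.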

Aren’t you missing the frame information??

The documentation says the following:

The root directory of the dataset contains sub-directories for training, testing, and query. Each sub-directory has the cropped images of different identities. For example, the image 0001_c1s1_01_00.jpg is from the first sequence s1 of camera c1. 01 indicates the first frame in the sequence c1s1 . 0001 is the unique ID assigned to the object. The contents after the third _ are ignored.

This is an example from my bounding_box_train:
0001_L09C02_00000001.jpg
0001 → unique ID
L09C02 → name of camera
00000001 → frame

Do I need to rename the camera to the format cXsY (e.g. c1s1)?

The c prefix is needed.

For your case, either of the following would be OK:
0001_c09C02_00000001.jpg
or
0001_c1s1L09C02_00000001.jpg
etc.

I tried 0001_c09C02_00000001.jpg and I got this error: AssertionError: The number of camera IDs should be between 0 and 6.

Then I tried using the exact format from the documentation (e.g. 0001_c1s1_00000001.jpg) and got this error:

ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
[NeMo W 2023-07-07 09:30:54 nemo_logging:349] <frozen cv.re_identification.scripts.train>:91: UserWarning: 
    'experiment.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
Created a temporary directory at /tmp/tmp_1qvwiax
Writing /tmp/tmp_1qvwiax/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/retrained_models/2023-07-07_09-30-43/', 'encryption_key=nvidia_tao']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 252, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
  File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.re_identification.scripts.train>", line 85, in main
  File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
  File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
  File "<frozen cv.re_identification.model.pl_reid_model>", line 71, in _build_model
  File "<frozen cv.re_identification.model.build_nn_model>", line 17, in build_model
  File "<frozen cv.re_identification.model.baseline>", line 92, in __init__
  File "<frozen cv.re_identification.model.resnet>", line 230, in load_param
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 734, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 1071, in _load
    result = unpickler.load()
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 1064, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'ExperimentConfig' on <module 'nvidia_tao_pytorch.cv.re_identification.config.default_config' from '/opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/config/default_config.py'>
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

The range is limited to the file names below (a renaming sketch follows the list).
0001_c1xxxx_xxxxx.jpg
0002_c1xxxx_xxxxx.jpg

1500_c1xxxx_xxxxx.jpg
1501_c1xxxx_xxxxx.jpg
0001_c2xxxx_xxxxx.jpg
0002_c2xxxx_xxxxx.jpg

1500_c2xxxx_xxxxx.jpg
1501_c2xxxx_xxxxx.jpg
0001_c3xxxx_xxxxx.jpg
0002_c3xxxx_xxxxx.jpg


1500_c6xxxx_xxxxx.jpg
1501_c6xxxx_xxxxx.jpg
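A renaming sketch along those lines, purely illustrative (it assigns c1 to c6 to the original camera tokens in order of first appearance and keeps the rest of the name; try it on a copy of the data and verify the mapping before touching the originals):

import os
import re

src_dir = "/workspace/tao-experiments/Dataset/bounding_box_train"  # adjust per split
cam_map = {}  # original camera token (e.g. "L09C02") -> "c1" .. "c6"

for name in sorted(os.listdir(src_dir)):
    m = re.match(r'(\d+)_([^_]+)_(.+)\.jpg$', name)  # e.g. 0001_L09C02_00000001.jpg
    if m is None:
        continue
    pid, cam_token, rest = m.groups()
    if cam_token not in cam_map:
        if len(cam_map) >= 6:
            raise RuntimeError("More than 6 cameras found; they will not fit into c1..c6")
        cam_map[cam_token] = f"c{len(cam_map) + 1}"
    new_name = f"{pid}_{cam_map[cam_token]}{cam_token}_{rest}.jpg"
    os.rename(os.path.join(src_dir, name), os.path.join(src_dir, new_name))
    print(name, "->", new_name)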

For the 2nd issue, could you please run the default notebook to check if it can be reproduced?

I can’t seem to find the notebook for ReIdentificationNet. Where can I find it?

There has been no update from you for a period, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Refer to TAO Toolkit Quick Start Guide - NVIDIA Docs
GPU-optimized AI, Machine Learning, & HPC Software | NVIDIA NGC

or

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/4.0.2/zip -O getting_started_v4.0.2.zip
unzip -u getting_started_v4.0.2.zip  -d ./getting_started_v4.0.2 && rm -rf getting_started_v4.0.2.zip && cd ./getting_started_v4.0.2