I am currently facing an issue while training a re-identification network with the NVIDIA TAO Toolkit: the run fails with a "Missing mandatory value" error for the dataset_config.train_dataset_dir key, even though that key is present in my spec file.
I have verified the spec file, double-checked the file paths, and confirmed that the dataset directory structure matches the expected Market-1501-style layout.
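For completeness, this is the kind of check I mean (a minimal sketch, assuming PyYAML; the embedded excerpt mirrors the dataset_config section of my spec, and the same check can be run against the real file with open()):

```python
import yaml

# Excerpt mirroring the dataset_config section of my spec (illustrative).
spec = """
dataset_config:
  train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
  val_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/Dataset/query"
"""

cfg = yaml.safe_load(spec)
# The key the error reports as missing is present after plain YAML parsing.
print(cfg["dataset_config"]["train_dataset_dir"])
```

So at the plain-YAML level the key resolves; the error must come from how Hydra/OmegaConf maps the file onto its structured config.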
• Hardware: NVIDIA GeForce RTX 4050 Laptop GPU
• NVIDIA GPU Driver Version: 525.125.06
• Network Type: ReIdentificationNet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): I’m using tao-toolkit 4.0.0-pyt
• Training spec file: the default spec file from the documentation, with only the paths changed:
model_config:
  backbone: resnet50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/workspace/tao-experiments/models/resnet50_market1501_aicity156.tlt"
  input_channels: 3
  input_size: [256, 128]
  neck: bnneck
  feat_dim: 256
  num_classes: 751
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
train_config:
  optim:
    name: Adam
    lr_monitor: "val_loss"
    steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  epochs: 120
  checkpoint_interval: 10
dataset_config:
  train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
  val_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_test"
  query_dataset_dir: "/workspace/tao-experiments/Dataset/query"
  batch_size: 64
  val_batch_size: 128
  workers: 8
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instance: 4
re_ranking_config:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3
• How to reproduce the issue?
Command:
tao re_identification train -r /workspace/tao-experiments/results/ -k nvidia_tao -e /workspace/tao-experiments/experiment.txt
Log:
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
ANTLR runtime and generated code versions disagree: 4.8!=4.9.3
Created a temporary directory at /tmp/tmph383uviz
Writing /tmp/tmph383uviz/_remote_module_non_scriptable.py
Error executing job with overrides: ['output_dir=/workspace/tao-experiments/results/', 'encryption_key=nvidia_tao']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/cv/re_identification/scripts/train.py>", line 3, in <module>
File "<frozen cv.re_identification.scripts.train>", line 91, in <module>
File "<frozen cv.super_resolution.scripts.configs.hydra_runner>", line 99, in wrapper
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 294, in run_and_report
raise ex
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "<frozen cv.re_identification.scripts.train>", line 85, in main
File "<frozen cv.re_identification.scripts.train>", line 37, in run_experiment
File "<frozen cv.re_identification.model.pl_reid_model>", line 44, in __init__
File "<frozen cv.re_identification.model.pl_reid_model>", line 67, in _build_model
File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 377, in __getitem__
self._format_and_raise(key=key, value=None, cause=e)
File "/opt/conda/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
format_and_raise(
File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 873, in format_and_raise
_raise(ex, cause)
File "/opt/conda/lib/python3.8/site-packages/omegaconf/_utils.py", line 771, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 371, in __getitem__
return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
File "/opt/conda/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 453, in _get_impl
return self._resolve_with_default(
File "/opt/conda/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 96, in _resolve_with_default
raise MissingMandatoryValue("Missing mandatory value: $FULL_KEY")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: dataset_config.train_dataset_dir
full_key: dataset_config.train_dataset_dir
reference_type=ReIDDatasetConfig
object_type=ReIDDatasetConfig
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
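For context on what can trigger this error even when the key looks present: the spec is parsed as plain YAML before Hydra/OmegaConf maps it onto the ReIDDatasetConfig structured config, so an indentation slip can move train_dataset_dir out of dataset_config entirely, leaving that field MISSING. A small illustrative sketch of this failure mode, using plain PyYAML (the mis-indented excerpt is hypothetical, not my actual file):

```python
import yaml

# Hypothetical mis-indented spec: train_dataset_dir is not nested
# under dataset_config, so that section ends up empty.
bad = """
train_config:
  epochs: 120
dataset_config:
train_dataset_dir: "/workspace/tao-experiments/Dataset/bounding_box_train"
"""

cfg = yaml.safe_load(bad)
print(cfg["dataset_config"])       # None -> the section parsed as empty
print("train_dataset_dir" in cfg)  # True -> the key landed at the top level
```

In that situation the structured config's dataset_config.train_dataset_dir field would stay MISSING and OmegaConf would raise exactly the MissingMandatoryValue seen in the log. Is there a way to make TAO dump the merged config it actually received, so I can rule this kind of thing out?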