TAO Toolkit version 5 gives an error when it comes to the training part.

Could you share the exact reason, so I can get it working? Thanks a lot.

I think the exact reason is this training command: tao model dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/
I followed the command you gave, dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/, without "tao", and it works only inside this docker image, nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt, as you mentioned.

Yes, when running inside the docker container, please run the command without "tao" at the beginning.

I am getting this error now; could you please tell me the reason?

Did you run the default notebook with the default dataset mentioned in the notebook?

I ran it in the notebook as well, but with a custom dataset; the same problem happens there.

From https://github.com/NVIDIA/tao_pytorch_backend/blob/e5010af08121404dfb696152248467eee85ab3a7/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py#L143 and https://github.com/NVIDIA/tao_pytorch_backend/blob/e5010af08121404dfb696152248467eee85ab3a7/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py#L146C46-L146C46, there seems to be a mismatch in the dataset dict.
Please double-check that the dataset exists and that the spec file is set correctly.
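One quick way to do that double-check is to validate the annotation dict before training. Below is a minimal sketch; check_coco_dict is a hypothetical helper of mine, not part of TAO, and it only verifies the cross-references the serialized dataset loader relies on:

```python
def check_coco_dict(coco):
    """Sanity-check a COCO-style annotation dict.

    Hypothetical helper (not TAO code): verifies the top-level COCO sections
    exist and that every annotation points at a known image and category.
    Returns a list of problem descriptions (empty list means no issue found).
    """
    problems = []
    for key in ("images", "annotations", "categories"):
        if key not in coco:
            problems.append(f"missing top-level key: {key}")
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for ann in coco.get("annotations", []):
        if ann.get("image_id") not in image_ids:
            problems.append(
                f"annotation {ann.get('id')} references unknown image_id {ann.get('image_id')}")
        if ann.get("category_id") not in category_ids:
            problems.append(
                f"annotation {ann.get('id')} references unknown category_id {ann.get('category_id')}")
    return problems
```

Running it on the training annotation JSON (after json.load) should surface any id mismatch before the dataloader hits it.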

OK, thank you. I will double-check this part and update.

Hello Morganh, I have a dataset with annotations in KITTI format, which I used for YOLOv4 training with TAO Toolkit version 4. Now I am trying to convert that dataset to the COCO format that DINO expects. I used the dino convert command from DINO - NVIDIA Docs, but I am getting the error below.

root@afe6d276fa8d:/opt/nvidia/tools# dino convert -e /opt/nvidia/tools/tao-with-meta/Training/spec.yaml
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning:
'spec.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
'spec.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
Error merging 'spec.yaml' with schema
Key 'output_dir' not in 'DINODatasetConvertConfig'
full_key: output_dir
object_type=DINODatasetConvertConfig

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL

Can you share the spec.yaml?

Please change output_dir to results_dir and retry.

Refer to
https://github.com/NVIDIA/tao_pytorch_backend/blob/main/nvidia_tao_pytorch/cv/dino/config/default_config.py#L28
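For reference, a convert spec along these lines should pass the schema check. This is a hedged sketch: the paths are placeholders, and only results_dir (replacing output_dir) is confirmed by the error message and the linked default_config.py; verify the remaining field names against that file:

```yaml
# Hypothetical KITTI-to-COCO convert spec; all paths are placeholders.
input_source: /workspace/tao-experiments/data/train_list.txt
results_dir: /workspace/tao-experiments/dino/convert  # the schema expects results_dir, not output_dir
image_dir_name: images
label_dir_name: labels
```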

I have converted the dataset to a suitable format, but I am getting many CUDA-related issues. Could you please tell me what the problem could be? I am attaching the error log file.
Dino_training_error (38.1 KB)

Hi Morganh, any updates?

Please set the correct backbone, since you are using fan_hybrid_tiny_nvimagenetv2.pth.tar.
Please change the backbone to fan_tiny.
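In the train spec this lives under the model section. A hedged sketch (the path is a placeholder; field names follow the DINO train.yaml in the tao_tutorials repo):

```yaml
model:
  backbone: fan_tiny  # must match the pretrained checkpoint family
  pretrained_backbone_path: /workspace/tao-experiments/dino/fan_hybrid_tiny_nvimagenetv2.pth.tar  # placeholder path
```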

I suggest you download the notebook and run it as a getting-started exercise. TAO Toolkit Quick Start Guide - NVIDIA Docs
More info can be found in https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml

train.yaml (1.3 KB)
Sorry, I already updated to "fan_tiny" and the same problem occurs. I also downloaded the notebook and trained there, but the notebook gets stuck during the "dino train" command. Here is the training spec.

Please share the full log as well. Thanks.

Hi Morganh, I have attached the full training log from the notebook; please kindly check.
nbtrain_error (51.3 KB)

The error log is similar to DINO training gives error about insufficient shared memory (shm) - #9 by paul.doucet
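If it is indeed the shared-memory issue, the usual fix when using the TAO launcher is to raise shm in ~/.tao_mounts.json. A hedged sketch (paths are placeholders; DockerOptions keys per the TAO launcher documentation):

```json
{
  "Mounts": [
    {
      "source": "/home/user/tao-experiments",
      "destination": "/workspace/tao-experiments"
    }
  ],
  "DockerOptions": {
    "shm_size": "16G",
    "ulimits": {
      "memlock": -1,
      "stack": 67108864
    }
  }
}
```

When launching the container directly, the equivalent is passing --shm-size=16g to docker run.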

Could you share the annotation file?

Hi Morganh, I also solved that issue by starting the category IDs from 0.

Hi,
Please hold on. Below is the correct way.

If users train for n classes, then please set num_classes to n+1.

Also, still make sure the category IDs start from 1.
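The two rules above can be sketched together: re-index the COCO categories to start at 1 and derive num_classes as n + 1. This is my own illustrative helper, not TAO code:

```python
def remap_categories_from_one(coco):
    """Re-index COCO category ids to 1..n in place and return the value to
    set as num_classes in the DINO train spec (n + 1).

    Illustrative sketch of the advice above, not TAO code. Mutates the
    "categories" and "annotations" sections of the given dict.
    """
    old_ids = sorted(cat["id"] for cat in coco["categories"])
    # Map old ids (which may start at 0 or be sparse) to contiguous 1..n.
    id_map = {old: new for new, old in enumerate(old_ids, start=1)}
    for cat in coco["categories"]:
        cat["id"] = id_map[cat["id"]]
    for ann in coco["annotations"]:
        ann["category_id"] = id_map[ann["category_id"]]
    return len(old_ids) + 1  # num_classes = n + 1
```

For example, a two-class dataset whose ids start at 0 would be remapped to ids 1 and 2, and the function would return 3 for num_classes.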