TAO Toolkit version 5 gives an error when it comes to the training part.

Could you share the exact reason, so I can get it working? Thanks a lot.

I think the exact reason is this training command: tao model dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/
I followed the command you gave, dino train -e $SPECS_DIR/train.yaml results_dir=$RESULTS_DIR/, without "tao", and it works only inside this docker image, nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt, as you mentioned.

Yes, when running inside the docker container, please run the command without "tao" at the beginning.

I am getting this error now; could you please tell me the reason?

Did you run the default notebook with the default dataset mentioned in the notebook?

I ran it in the notebook as well, but with a custom dataset; the same problem happens there.

From https://github.com/NVIDIA/tao_pytorch_backend/blob/e5010af08121404dfb696152248467eee85ab3a7/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py#L143 and https://github.com/NVIDIA/tao_pytorch_backend/blob/e5010af08121404dfb696152248467eee85ab3a7/nvidia_tao_pytorch/cv/deformable_detr/dataloader/serialized_dataset.py#L146C46-L146C46, there seems to be a mismatch in the dataset dict.
Please double-check that the dataset exists and that the spec file is set correctly.
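One quick way to do that double-check is to validate the annotation dict before training. Below is a minimal sketch; check_coco_dict is a hypothetical helper of mine, not part of TAO, and it only verifies the cross-references the serialized dataset loader relies on:

```python
def check_coco_dict(coco):
    """Sanity-check a COCO-style annotation dict.

    Hypothetical helper (not TAO code): verifies the top-level COCO sections
    exist and that every annotation points at a known image and category.
    Returns a list of problem descriptions (empty list means no issue found).
    """
    problems = []
    for key in ("images", "annotations", "categories"):
        if key not in coco:
            problems.append(f"missing top-level key: {key}")
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for ann in coco.get("annotations", []):
        if ann.get("image_id") not in image_ids:
            problems.append(
                f"annotation {ann.get('id')} references unknown image_id {ann.get('image_id')}")
        if ann.get("category_id") not in category_ids:
            problems.append(
                f"annotation {ann.get('id')} references unknown category_id {ann.get('category_id')}")
    return problems
```

Running it on the training annotation JSON (after json.load) should surface any id mismatch before the dataloader hits it.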

OK, thank you. I will double-check this part and update.

Hello Morganh, I have a dataset with annotations in KITTI format, which I used for YOLOv4 training with TAO Toolkit version 4. Now I am trying to convert that dataset to the COCO format that DINO expects. I used the dino convert command from DINO - NVIDIA Docs, but I am getting the error below.

root@afe6d276fa8d:/opt/nvidia/tools# dino convert -e /opt/nvidia/tools/tao-with-meta/Training/spec.yaml
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning:
'spec.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
:107: UserWarning:
'spec.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
Error merging 'spec.yaml' with schema
Key 'output_dir' not in 'DINODatasetConvertConfig'
full_key: output_dir
object_type=DINODatasetConvertConfig

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL

Can you share the spec.yaml?

Please change output_dir to results_dir and retry.

Refer to
https://github.com/NVIDIA/tao_pytorch_backend/blob/main/nvidia_tao_pytorch/cv/dino/config/default_config.py#L28
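For reference, a convert spec along these lines should pass the schema check. This is a hedged sketch: the paths are placeholders, and only results_dir (replacing output_dir) is confirmed by the error message and the linked default_config.py; verify the remaining field names against that file:

```yaml
# Hypothetical KITTI-to-COCO convert spec; all paths are placeholders.
input_source: /workspace/tao-experiments/data/train_list.txt
results_dir: /workspace/tao-experiments/dino/convert  # the schema expects results_dir, not output_dir
image_dir_name: images
label_dir_name: labels
```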

I have converted the dataset to a suitable format, but I am getting many CUDA-related issues. Could you please tell me what the problem could be? I am attaching the error log file.
Dino_training_error (38.1 KB)

Hi Morganh, any updates?

Please set the correct backbone, since you are using fan_hybrid_tiny_nvimagenetv2.pth.tar.
Please change the backbone to fan_tiny.
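In the train spec this lives under the model section. A hedged sketch (the path is a placeholder; field names follow the DINO train.yaml in the tao_tutorials repo):

```yaml
model:
  backbone: fan_tiny  # must match the pretrained checkpoint family
  pretrained_backbone_path: /workspace/tao-experiments/dino/fan_hybrid_tiny_nvimagenetv2.pth.tar  # placeholder path
```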

I suggest you download the notebook and run it as a getting-started exercise. TAO Toolkit Quick Start Guide - NVIDIA Docs
More info can be found in https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml

train.yaml (1.3 KB)
Sorry, I already updated to "fan_tiny" and the same problem occurs. I also downloaded the notebook and trained there, but the notebook gets stuck during the "dino train" command. Here is the training spec.

Please share the full log as well. Thanks.

Hi Morganh, I have attached the full training log from the notebook; please kindly check.
nbtrain_error (51.3 KB)

The error log is similar to DINO training gives error about insufficient shared memory (shm) - #9 by paul.doucet
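If it is indeed the shared-memory issue, the usual fix when using the TAO launcher is to raise shm in ~/.tao_mounts.json. A hedged sketch (paths are placeholders; DockerOptions keys per the TAO launcher documentation):

```json
{
  "Mounts": [
    {
      "source": "/home/user/tao-experiments",
      "destination": "/workspace/tao-experiments"
    }
  ],
  "DockerOptions": {
    "shm_size": "16G",
    "ulimits": {
      "memlock": -1,
      "stack": 67108864
    }
  }
}
```

When launching the container directly, the equivalent is passing --shm-size=16g to docker run.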

Could you share the annotation file?

Hi Morganh, I also solved that issue by starting the category IDs from 0.

Hi,
Please hold on. Below is the correct way.

If users train for n classes, then please set num_classes to n+1.

Also, still make sure the category IDs start from 1.
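The two rules above can be sketched together: re-index the COCO categories to start at 1 and derive num_classes as n + 1. This is my own illustrative helper, not TAO code:

```python
def remap_categories_from_one(coco):
    """Re-index COCO category ids to 1..n in place and return the value to
    set as num_classes in the DINO train spec (n + 1).

    Illustrative sketch of the advice above, not TAO code. Mutates the
    "categories" and "annotations" sections of the given dict.
    """
    old_ids = sorted(cat["id"] for cat in coco["categories"])
    # Map old ids (which may start at 0 or be sparse) to contiguous 1..n.
    id_map = {old: new for new, old in enumerate(old_ids, start=1)}
    for cat in coco["categories"]:
        cat["id"] = id_map[cat["id"]]
    for ann in coco["annotations"]:
        ann["category_id"] = id_map[ann["category_id"]]
    return len(old_ids) + 1  # num_classes = n + 1
```

For example, a two-class dataset whose ids start at 0 would be remapped to ids 1 and 2, and the function would return 3 for num_classes.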