Cannot run Dino with tao-5.3.0

Robert_Hoang · May 2, 2024, 10:40am

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): RTX 3080 Ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): DINO
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): 5.3.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am trying to train DINO on my custom dataset, following the documentation from NGC and the TAO docs. After spending a whole day on it, I still get errors like the one below. Please help me check it.

Specs

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/train/images
      json_file: /ws/tao_trainer/data/dino/train/train.json
  val_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/valid/images
      json_file: /ws/tao_trainer/data/dino/valid/valid.json
  num_classes: 6
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

Reproduce

docker run -it --rm --gpus all -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt dino train -e /ws/tao_trainer/dino/train.yml -r /ws/tao_trainer/dino/training_models -k threat_detection --gpus 1

===========================
=== TAO Toolkit PyTorch ===
===========================

NVIDIA Release 5.3.0-PyT (build 76438008)
TAO Toolkit Version 5.3.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

WARNING: CUDA Minor Version Compatibility mode ENABLED.
  Using driver version 530.41.03 which has support for CUDA 12.1.  This container
  was built with CUDA 12.3 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/usr/local/lib/python3.10/dist-packages/hydra/plugins/config_source.py:124: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
  deprecation_warning(
Could not override 'results_dir'.
To append to your config use +results_dir=/ws/tao_trainer/dino/training_models
Key 'results_dir' is not in struct
    full_key: results_dir
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL

Robert_Hoang · May 2, 2024, 3:53pm

Hi @Morganh, I followed your advice from another topic, but I still get the same error in tao-5.0.0-pyt:

 docker run --runtime=nvidia -it -v /home/shaj/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash
dino train -e /ws/tao_trainer/dino/train.yml results_dir=/ws/tao_trainer/dino/training_models -k threat_detection

Could you please help me check it? A sample annotation is attached (file extension changed from .json to .txt).
sample.txt (24.7 KB)

Morganh · May 2, 2024, 4:27pm

Could you use .yaml instead and retry?

Robert_Hoang · May 2, 2024, 4:35pm

Do you mean the command will look like below?

dino train -e .yml results_dir=/ws/tao_trainer/dino/training_models

The log is below:

INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
ERROR: The indicated experiment spec file `.yml` doesn't exist!

Morganh · May 2, 2024, 4:36pm

No, I mean you can change .yml to .yaml.

i.e.,
-e /ws/tao_trainer/dino/train.yaml
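
For anyone landing here with the same error: the suggestion above amounts to giving the spec file a .yaml extension and passing that path to -e. The Hydra warning in the first log says .yml support is deprecated, and the "Could not override 'results_dir'" failure that follows is presumably a consequence of the spec not being loaded. A minimal shell sketch of the rename, assuming the spec sits in the host directory that is mounted at /ws; the helper name to_yaml_spec is hypothetical and not part of TAO:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TAO): given a spec path, make sure a
# copy with the .yaml extension exists and print the path to pass to -e.
to_yaml_spec() {
  local spec="$1"
  case "$spec" in
    *.yml)
      local fixed="${spec%.yml}.yaml"
      # Copy rather than move so the original file is left untouched.
      cp -- "$spec" "$fixed"
      printf '%s\n' "$fixed"
      ;;
    *)
      # Any other extension is passed through unchanged.
      printf '%s\n' "$spec"
      ;;
  esac
}
```

Run on the host against tao_trainer/dino/train.yml, this prints the path of the new train.yaml; inside the container the same file is then passed as -e /ws/tao_trainer/dino/train.yaml, with the rest of the command unchanged.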

Robert_Hoang · May 2, 2024, 4:41pm

Hi @Morganh, thanks for your help. My bad. I still have other issues, but it works now.

Robert_Hoang · May 3, 2024, 1:49am

Hi @Morganh, I can now train DINO with my custom dataset. However, I got an error during training.
I have created another topic for it and am looking forward to your help. Thanks.

system · May 17, 2024, 1:50am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
