Cannot run Dino with tao-5.3.0

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): RTX 3080 Ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): DINO
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): 5.3.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am trying to train DINO on my custom dataset, following the documentation from NGC and the TAO docs. After spending a whole day on it, I still get errors like the ones below. Please help me check it.

Specs

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/train/images
      json_file: /ws/tao_trainer/data/dino/train/train.json
  val_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/valid/images
      json_file: /ws/tao_trainer/data/dino/valid/valid.json
  num_classes: 6
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

Reproduce

docker run -it --rm --gpus all -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt dino train -e /ws/tao_trainer/dino/train.yml -r /ws/tao_trainer/dino/training_models -k threat_detection --gpus 1

===========================
=== TAO Toolkit PyTorch ===
===========================

NVIDIA Release 5.3.0-PyT (build 76438008)
TAO Toolkit Version 5.3.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

WARNING: CUDA Minor Version Compatibility mode ENABLED.
  Using driver version 530.41.03 which has support for CUDA 12.1.  This container
  was built with CUDA 12.3 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/usr/local/lib/python3.10/dist-packages/hydra/plugins/config_source.py:124: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
  deprecation_warning(
Could not override 'results_dir'.
To append to your config use +results_dir=/ws/tao_trainer/dino/training_models
Key 'results_dir' is not in struct
    full_key: results_dir
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL
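
For reference, the log above contains two hints: Hydra warns that .yml spec files are deprecated, and the container recommends extra shared-memory flags for docker run. Below is a minimal Python sketch that assembles a command following both hints. It only reuses flags and arguments that appear in this thread; the function name and its parameters are hypothetical, and the Hydra-style results_dir=... override is an assumption that has not been verified against the 5.3.0 container.

```python
import shlex

# Image tag taken from the command in the post above.
DEFAULT_IMAGE = "nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt"


def build_dino_train_command(host_dir, spec, results_dir, key, image=DEFAULT_IMAGE):
    """Return a docker command string for `dino train` (hypothetical helper).

    Assumptions: the spec path is the path inside the container, and
    results_dir is passed as a Hydra-style override instead of `-r`.
    """
    if not spec.endswith(".yaml"):
        # The Hydra warning in the log asks for the .yaml extension.
        raise ValueError(f"spec file should end in .yaml, got: {spec}")
    args = [
        "docker", "run", "-it", "--rm", "--gpus", "all",
        # Flags recommended by the SHMEM note in the container banner.
        "--ipc=host", "--ulimit", "memlock=-1", "--ulimit", "stack=67108864",
        "-v", f"{host_dir}:/ws",
        image,
        "dino", "train",
        "-e", spec,
        f"results_dir={results_dir}",
        "-k", key,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_dino_train_command(
        host_dir="/home/tmp/Documents",
        spec="/ws/tao_trainer/dino/train.yaml",
        results_dir="/ws/tao_trainer/dino/training_models",
        key="threat_detection",
    ))
```

Printing the command instead of running it makes it easy to review before pasting it into a shell.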

Hi @Morganh, I followed your advice from another topic, but I still get the same error with tao-5.0.0-pyt.

docker run --runtime=nvidia -it -v /home/shaj/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt /bin/bash
dino train -e /ws/tao_trainer/dino/train.yml results_dir=/ws/tao_trainer/dino/training_models -k threat_detection

Could you please help me check it? A sample annotation is attached (file extension changed from .json to .txt).
sample.txt (24.7 KB)

Could you use .yaml instead and retry?

Do you mean the command should look like the one below?

dino train -e .yml results_dir=/ws/tao_trainer/dino/training_models

The log is below:

INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
ERROR: The indicated experiment spec file `.yml` doesn't exist!

No, I mean you can change .yml to .yaml.

i.e.,
-e /ws/tao_trainer/dino/train.yaml
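
A minimal sketch of the rename step, assuming the spec file lives on the host and is mounted into the container: it copies train.yml to train.yaml next to the original, so the old file stays untouched. The helper name copy_spec_as_yaml is hypothetical and not part of TAO.

```python
import shutil
import sys
from pathlib import Path


def copy_spec_as_yaml(spec_path):
    """Copy a .yml spec to a sibling .yaml file and return the new path."""
    src = Path(spec_path)
    if src.suffix == ".yaml":
        return src  # already has the extension Hydra expects
    if src.suffix != ".yml":
        raise ValueError(f"expected a .yml or .yaml spec, got: {src.name}")
    dst = src.with_suffix(".yaml")
    shutil.copy2(src, dst)  # copy rather than move, keeping the original
    return dst


if __name__ == "__main__":
    # Usage: python copy_spec.py <path to train.yml on the host>
    print(copy_spec_as_yaml(sys.argv[1]))
```

After that, point -e at the new .yaml path as shown above.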


Hi @Morganh, thanks for your help. My bad. It works now, although I have run into another issue.

Hi @Morganh, I can now train DINO with my custom dataset. However, I got an error during training, so I created another topic for it. Looking forward to your help. Thanks.
